CN112528637A - Text processing model training method and device, computer equipment and storage medium - Google Patents

Info

Publication number
CN112528637A
Authority
CN
China
Prior art keywords
model
text
pinyin
trained
stroke
Prior art date
Legal status
Granted
Application number
CN202011447964.2A
Other languages
Chinese (zh)
Other versions
CN112528637B (en)
Inventor
吴天博
王健宗
程宁
Current Assignee
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202011447964.2A, granted as CN112528637B
Publication of CN112528637A
Priority to PCT/CN2021/096582 (published as WO2022121251A1)
Application granted
Publication of CN112528637B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/274Converting codes to words; Guess-ahead of partial word inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/02Input arrangements using manually operated switches, e.g. using keyboards or dials
    • G06F3/023Arrangements for converting discrete items of information into a coded form, e.g. arrangements for interpreting keyboard generated codes as alphanumeric codes, operand codes or instruction codes
    • G06F3/0233Character input methods
    • G06F3/0237Character input methods using prediction or retrieval techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The present application relates to the technical field of artificial intelligence, and in particular to a text processing model training method and apparatus, a computer device and a storage medium. The method comprises the following steps: acquiring a first text sample set to be trained; respectively performing model training based on the first text sample set to be trained to obtain a five-stroke (Wubi) word vector model and a pinyin word vector model corresponding to different input methods; acquiring a second text sample set to be trained and a pre-trained language model; respectively extracting coded data corresponding to the second text sample set to be trained based on the language model, the five-stroke word vector model and the pinyin word vector model; and performing model training according to the coded data to obtain a text processing model. In addition, the present application also relates to blockchain technology, and private information such as the coded data can be stored in a blockchain. By adopting the method, the training precision of the text processing model can be improved, and high-quality target text data can be obtained according to the text processing model.

Description

Text processing model training method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method and an apparatus for training a text processing model, a computer device, and a storage medium.
Background
Chinese error correction is a basic task in natural language processing and often affects the accuracy of upstream tasks. Readily available, low-cost text data often contains various Chinese errors, and for a language as nuanced as Chinese, changing just a few characters may change the semantics. Chinese error correction is therefore often used as an underlying module to provide higher-quality text for upstream tasks.
In the conventional art, BERT is currently the mainstream pre-trained language model. Because its MLM pre-training task replaces 15% × 10% of the tokens with noise, BERT has a certain error detection capability; but because only 15% × 10% noise is introduced, its text error detection capability is often weak, and it is difficult to obtain high-quality text data with it alone.
Disclosure of Invention
In view of the above, it is necessary to provide a text processing model training method, apparatus, computer device and storage medium capable of obtaining a high-accuracy text processing model.
A method for training a text processing model, the method comprising:
acquiring a first text sample set to be trained;
respectively executing model training based on the first text sample set to be trained to obtain five-stroke word vector models and pinyin word vector models corresponding to different input methods;
acquiring a second text sample set to be trained and a pre-trained language model;
respectively extracting coded data corresponding to a second text sample set to be trained based on the language model, the five-stroke word vector model and the pinyin word vector model;
and performing model training according to the encoded data to obtain a text processing model.
In one embodiment, the method for obtaining five-stroke word vector models and pinyin word vector models corresponding to different input methods by respectively performing model training based on a first text sample set to be trained includes:
converting a first text sample set to be trained into corresponding pinyin coding vectors, sequentially traversing the pinyin coding vectors according to a pre-configured sliding window, taking the traversed pinyin coding vectors as current pinyin vectors to be processed, predicting the pinyin coding vectors at preset positions in the current pinyin vectors to be processed based on a current word vector model corresponding to current pinyin model parameters, determining target pinyin model parameters according to the predicted pinyin coding vectors and real pinyin coding vectors, and obtaining a pinyin word vector model according to the determined target pinyin model parameters;
converting a first text sample set to be trained into corresponding five-stroke coding vectors, sequentially traversing the five-stroke coding vectors according to a pre-configured sliding window, taking the traversed five-stroke coding vectors as the current five-stroke vectors to be processed, predicting the five-stroke coding vectors at preset positions in the current five-stroke vectors to be processed based on a current word vector model corresponding to the current five-stroke model parameters, determining target five-stroke model parameters according to the predicted five-stroke coding vectors and the real five-stroke coding vectors, and obtaining a five-stroke word vector model according to the determined target five-stroke model parameters.
In one embodiment, the extracting the encoded data corresponding to the second text sample set to be trained based on the language model, the five-stroke word vector model, and the pinyin word vector model includes:
extracting five-stroke coding data from a second text sample set to be trained based on a pre-trained five-stroke word vector model;
extracting pinyin encoding data from a second text sample set to be trained based on a pre-trained pinyin word vector model;
acquiring a pre-trained language model, and extracting multi-dimensional language coding data from a second sample set to be trained on the basis of the language model;
performing model training according to the encoded data to obtain a text processing model, comprising:
and taking the five-stroke coded data, the pinyin coded data and the multidimensional language coded data as input data, and performing model training according to the input data to obtain a text processing model.
In one embodiment, taking five-stroke coded data, pinyin coded data and multidimensional language coded data as input data, and performing model training according to the input data to obtain a text processing model, including:
splicing the five-stroke coded data, the pinyin coded data and the multi-dimensional language coded data to obtain spliced coded data;
predicting the spliced coded data based on the language model to obtain the corresponding prediction probability of each position;
determining an initial prediction text at a corresponding position according to the prediction probability;
and adjusting initial model parameters of the initial text processing model based on the difference between the initial prediction text and the real label text to obtain target model parameters, and determining the text processing model according to the target model parameters.
In one embodiment, determining the initial predicted text at the corresponding position according to the magnitude of the prediction probability comprises:
acquiring a prediction text with a prediction probability value larger than a preset value;
and extracting an initial prediction text from the prediction text based on a homophone principle and a pinyin principle, and storing the initial prediction text in a blockchain node.
In one embodiment, the model parameters of the text processing model include pinyin model parameters and wubi model parameters; adjusting initial model parameters of an initial text processing model based on the difference between the initial predicted text and the real label text to obtain target model parameters, and determining a text processing model according to the target model parameters, wherein the method comprises the following steps:
adjusting the initial five-stroke parameters of the initial text processing model based on the difference between the initial predicted text and the real label text to obtain target five-stroke model parameters;
and determining a text processing model according to the pinyin model parameters and the target five-stroke model parameters.
In one embodiment, the word vector models are trained based on word2vec, and the language model is trained based on the BERT model.
A text processing model training apparatus, the apparatus comprising:
the first training sample set acquisition module is used for acquiring a first text sample set to be trained;
the word vector training module is used for respectively performing model training based on a first text sample set to be trained to obtain five-stroke word vector models and pinyin word vector models corresponding to different input methods;
the second training sample set acquisition module is used for acquiring a second text sample set to be trained and a pre-trained language model;
the coded data extraction module is used for respectively extracting coded data corresponding to the second text sample set to be trained based on the language model, the five-stroke word vector model and the pinyin word vector model;
and the model training module is used for executing model training according to the coded data to obtain a text processing model.
The text processing model training method, the text processing model training device, the computer equipment and the storage medium obtain a first text sample set to be trained; respectively executing model training based on the first text sample set to be trained to obtain five-stroke word vector models and pinyin word vector models corresponding to different input methods; acquiring a second text sample set to be trained and a pre-trained language model; respectively extracting coded data corresponding to a second text sample set to be trained based on the language model, the five-stroke word vector model and the pinyin word vector model; and performing model training according to the encoded data to obtain a text processing model. The method comprises the steps of firstly training based on a training sample set to obtain word vector models corresponding to different input methods, then executing model training again based on the trained word vector models and language models to obtain text processing models, guaranteeing that text information with more dimensions can be integrated in the process of training the text processing models, and obtaining the text processing models with higher precision and higher prediction accuracy. The trained text processing model can be used for processing the input text data to be processed, so that more text information is considered in the text processing process, the processing capacity of the text data is improved, and the acquisition of high-quality text data is possible.
A text data acquisition method comprises the following steps:
acquiring text data to be processed;
inputting the text data to be processed into a pre-trained text processing model, and performing data processing on the text data to be processed according to model parameters in the text processing model to obtain target text data; the text processing model is obtained by training based on word vector coded data and language coded data corresponding to different input methods as input data, the word vector coded data is obtained based on a pre-trained word vector model, and the language coded data is obtained based on the pre-trained language model.
A text data acquisition apparatus, the apparatus comprising:
the acquisition module is used for acquiring text data to be processed;
the processing module is used for inputting the text data to be processed into a pre-trained text processing model so as to perform data processing on the text data to be processed according to model parameters in the text processing model to obtain target text data; the text processing model is obtained by training based on word vector coded data and language coded data corresponding to different input methods as input data, the word vector coded data is obtained based on a pre-trained word vector model, and the language coded data is obtained based on the pre-trained language model.
A computer device comprising a memory storing a computer program and a processor implementing the steps of the method of any of the embodiments described above when the computer program is executed by the processor.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any of the above embodiments.
The text data acquisition method acquires text data to be processed; inputting the text data to be processed into a pre-trained text processing model, and performing data processing on the text data to be processed according to model parameters in the text processing model to obtain target text data; the text processing model is obtained by training based on word vector coded data and language coded data corresponding to different input methods as input data, the word vector coded data is obtained based on a pre-trained word vector model, and the language coded data is obtained based on the pre-trained language model. The word vector models corresponding to different input methods are trained, and the text data are processed comprehensively based on the word vector models corresponding to the different input methods and the language models, so that more text information is considered in the text processing process, the processing capacity of the text data is improved, and the possibility of acquiring high-quality text data is realized.
Drawings
FIG. 1 is a diagram of an application environment of a method for training a text processing model in one embodiment;
FIG. 2 is a schematic flow chart diagram illustrating a method for training a text processing model in one embodiment;
FIG. 3 is a block diagram of a text processing model provided in one embodiment;
FIG. 4 is a block diagram showing the structure of a text processing model training apparatus according to an embodiment;
FIG. 5 is a flowchart illustrating a text data obtaining method according to an embodiment;
FIG. 6 is a block diagram showing a configuration of a text data acquiring apparatus according to an embodiment;
FIG. 7 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The text data acquisition method provided by the application can be applied to the application environment shown in fig. 1. Wherein the terminal 102 communicates with the server 104 via a network. The server 104 acquires a first text sample set to be trained uploaded by the terminal 102; respectively executing model training based on the first text sample set to be trained to obtain five-stroke word vector models and pinyin word vector models corresponding to different input methods; acquiring a second text sample set to be trained and a pre-trained language model; respectively extracting coded data corresponding to a second text sample set to be trained based on the language model, the five-stroke word vector model and the pinyin word vector model; and performing model training according to the encoded data to obtain a text processing model. The terminal 102 may be, but not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices, and the server 104 may be implemented by an independent server or a server cluster formed by a plurality of servers.
In one embodiment, as shown in fig. 2, a text processing model training method is provided, which is described by taking the method as an example applied to the server in fig. 1, and includes the following steps:
step 202, a first text sample set to be trained is obtained.
The first training text sample set includes a plurality of text data, and specifically may include a plurality of text sentences. It should be noted that the text data in the first training text sample set may include text data that needs to be error-corrected, that is, the first training text sample set may include erroneous text information. Specifically, the sources of the first training text sample set include data such as Chinese Wikipedia, historical telemarketing records, news crawled on the Internet, and question-and-answer corpora, which are not limited here.
And 204, respectively executing model training based on the first text sample set to be trained to obtain five-stroke word vector models and pinyin word vector models corresponding to different input methods.
The input methods specifically include a pinyin input method and a five-stroke input method, which correspond to identifying text with different coding algorithms: the pinyin input method identifies text with a pinyin coding algorithm, and the five-stroke input method identifies text with a five-stroke coding algorithm. It should also be noted that the same text content may have different encodings under the different coding algorithms (the pinyin coding algorithm and the five-stroke coding algorithm); for example, for the character "字" (word), the pinyin is "zi" and the five-stroke code is "PBF". Therefore, based on the different encoding methods, word vector models corresponding to the different input methods can be trained respectively.
Specifically, the word vector models corresponding to the different input methods include a pinyin word vector model and a five-stroke word vector model; the pinyin word vector model is obtained by training on pinyin coded data, and the five-stroke word vector model is obtained by training on five-stroke coded data. Because the word vector models corresponding to the different input methods are trained on different coded data, they represent the same text data along different dimensions, and representing the text data along different dimensions makes the representation more accurate and reliable. Specifically, based on the different input methods, the word vector models include a pinyin word vector model and a five-stroke word vector model; for the same text, the corresponding pinyin coded data can be obtained with the pinyin word vector model, and the corresponding five-stroke coded data can be obtained with the five-stroke word vector model.
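For illustration only (not part of the disclosed embodiments), the sketch below shows how the same sentence could be encoded under the two input methods. The pypinyin package is a real open-source library; the small WUBI_TABLE dictionary is a hypothetical stand-in for a full Wubi code table, which is assumed to be available.

```python
# Hypothetical sketch: encode the same text under two input methods.
from pypinyin import lazy_pinyin

# Illustrative entries only; a real system would use a complete Wubi 86 code table.
WUBI_TABLE = {"中": "khk", "国": "lgyi", "字": "pbf"}

def to_pinyin_tokens(text: str) -> list[str]:
    # One pinyin token per character, e.g. "中国" -> ["zhong", "guo"].
    return lazy_pinyin(text)

def to_wubi_tokens(text: str) -> list[str]:
    # One Wubi code per character; unknown characters map to a placeholder.
    return [WUBI_TABLE.get(ch, "<unk>") for ch in text]

print(to_pinyin_tokens("中国字"))  # ['zhong', 'guo', 'zi']
print(to_wubi_tokens("中国字"))    # ['khk', 'lgyi', 'pbf']
```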
And step 206, acquiring a second text sample set to be trained and a pre-trained language model.
Specifically, the server obtains a second text sample set to be trained, where the second text sample set to be trained and the first text sample set to be trained may be the same or different sample sets, and are not limited herein.
The language model is a model with language prediction capability, and may be a BERT (Bidirectional Encoder Representations from Transformers) language model. Specifically, the BERT model has two pre-training tasks, MLM (Masked Language Model) and NSP (Next Sentence Prediction): the MLM task predicts the text content at a corresponding position, and the NSP task judges whether two sentences are consecutive.
And 208, respectively extracting the coded data corresponding to the second text sample set to be trained based on the language model, the five-stroke word vector model and the pinyin word vector model.
Because the language model, the five-stroke word vector model and the pinyin word vector model each express the information of the same text data in a different dimension, at least three different expressions of the same text data can be obtained from the different models. The coded data obtained by expressing the second training sample set in these different dimensions carries richer information, so the text processing model obtained when model training is performed based on this coded data achieves higher text processing accuracy.
And step 210, performing model training according to the coded data to obtain a text processing model.
The text processing model is used for carrying out error correction processing on the text data to be processed and is used for processing the text data to be processed into text data with higher precision. In specific business, model training can be performed according to text data with high precision as training data, and then the precision of model training is improved.
The text processing model training method, the text processing model training device, the computer equipment and the storage medium obtain a first text sample set to be trained; respectively executing model training based on the first text sample set to be trained to obtain five-stroke word vector models and pinyin word vector models corresponding to different input methods; acquiring a second text sample set to be trained and a pre-trained language model; respectively extracting coded data corresponding to a second text sample set to be trained based on the language model, the five-stroke word vector model and the pinyin word vector model; and performing model training according to the encoded data to obtain a text processing model. The method comprises the steps of firstly training based on a training sample set to obtain word vector models corresponding to different input methods, then executing model training again based on the trained word vector models and language models to obtain text processing models, guaranteeing that text information with more dimensions can be integrated in the process of training the text processing models, and obtaining the text processing models with higher precision and higher prediction accuracy. The trained text processing model can be used for processing the input text data to be processed, so that more text information is considered in the text processing process, the processing capacity of the text data is improved, and the acquisition of high-quality text data is possible.
In one embodiment, the method for obtaining five-stroke word vector models and pinyin word vector models corresponding to different input methods by respectively performing model training based on a first text sample set to be trained includes: converting a first text sample set to be trained into corresponding pinyin coding vectors, sequentially traversing the pinyin coding vectors according to a pre-configured sliding window, taking the traversed pinyin coding vectors as current pinyin vectors to be processed, predicting the pinyin coding vectors at preset positions in the current pinyin vectors to be processed based on a current word vector model corresponding to current pinyin model parameters, determining target pinyin model parameters according to the predicted pinyin coding vectors and real pinyin coding vectors, and obtaining a pinyin word vector model according to the determined target pinyin model parameters; converting a first text sample set to be trained into corresponding five-stroke coding vectors, sequentially traversing the five-stroke coding vectors according to a pre-configured sliding window, taking the traversed five-stroke coding vectors as the current five-stroke vectors to be processed, predicting the five-stroke coding vectors at preset positions in the current five-stroke vectors to be processed based on a current word vector model corresponding to the current five-stroke model parameters, determining target five-stroke model parameters according to the predicted five-stroke coding vectors and the real five-stroke coding vectors, and obtaining a five-stroke word vector model according to the determined target five-stroke model parameters.
Specifically, the server obtains a first text sample set to be trained, converts the first text sample set to be trained into a corresponding pinyin coding vector, performs word vector model training according to the pinyin coding vector to obtain a pinyin word vector model, converts the first text sample set to be trained into a corresponding five-stroke coding vector, and performs word vector model training according to the five-stroke coding vector to obtain a five-stroke word vector model.
In a specific embodiment, the server obtains a first training text sample set, converts each text in the first training text sample set into corresponding pinyin data to obtain pinyin coding vectors, and uses the obtained pinyin coding vectors as input data of a training pinyin word vector model to further obtain the trained pinyin word vector model. And the server acquires the first training text sample set, converts each text in the first training text sample set into corresponding five-stroke data to obtain five-stroke coding vectors, and uses the obtained five-stroke coding vectors as input data of a training five-stroke word vector model to further obtain the trained five-stroke word vector model. Specifically, the training mode of the word vector coding model may be a training mode based on a Bert language model, or may be a training mode based on a word vector such as word2vec, which is not limited herein.
Specifically, the method of training the word vector models in a word-vector manner such as word2vec includes: converting the characters corresponding to the first training text sample set into five-stroke coding vectors and setting a predefined sliding window, for example of size 5. The server then sequentially traverses the five-stroke coding vectors corresponding to the text data with the sliding window size as the unit step, takes the currently traversed five-stroke coding vectors as the current five-stroke coding vectors to be processed, and performs a data prediction step on them. Specifically, in each cycle, the current five-stroke word vector model corresponding to the current five-stroke model parameters predicts the five-stroke coding vector of the character at the middle position from the five-stroke coding vectors of the two characters before and the two characters after it; the predicted five-stroke coding vector is compared with the actual five-stroke coding vector, the current five-stroke word vector model parameters are adjusted according to the comparison result to obtain the target five-stroke word vector model parameters, and finally the target five-stroke word vector model is obtained from the target five-stroke word vector model parameters.
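A minimal sketch of how such a word vector model could be trained with off-the-shelf CBOW word2vec, assuming the first sample set has already been converted into per-character pinyin token sequences; the corpus, vector size and epochs below are illustrative assumptions, not values taken from the application.

```python
# Hypothetical sketch: train a pinyin word vector model with CBOW word2vec.
# A sliding window of 5 means 2 context tokens on each side of the target.
from gensim.models import Word2Vec

# Each sentence is the pinyin token sequence of one training text (illustrative data).
pinyin_corpus = [
    ["zhong", "guo", "xiong", "mao"],
    ["wen", "ben", "chu", "li", "mo", "xing"],
]

pinyin_w2v = Word2Vec(
    sentences=pinyin_corpus,
    vector_size=128,   # embedding dimension (assumed)
    window=2,          # 2 tokens on each side -> 5-token sliding window
    sg=0,              # CBOW: predict the centre token from its context
    min_count=1,
    epochs=10,
)

# The five-stroke word vector model would be trained the same way on Wubi code sequences.
vector = pinyin_w2v.wv["xiong"]   # pinyin embedding reused later for information enhancement
```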
In the above embodiment, the training sample set is represented with different input-method representations, so the same text data can be expressed in multiple dimensions; the model can therefore acquire multi-dimensional information about the same text data, which provides more information for training and improves the precision of model training. In addition, the models corresponding to the different input methods can be trained in a word-vector manner, which has low cost and high efficiency, further improving the efficiency of model training.
In one embodiment, as shown in fig. 5, a text data obtaining method is provided, which is described by taking the method as an example applied to the server in fig. 1, and includes the following steps:
step 502, text data to be processed is obtained.
The text data to be processed is text data that needs error correction, that is, it may contain erroneous text information. In a specific embodiment, the text data to be processed may be used as a training sample set for model training; if text data containing erroneous information is used for model training, the accuracy of the training is greatly affected, and therefore the text data to be processed needs to be processed to remove or correct the erroneous data it contains.
Specifically, the sources of the text data to be processed include data such as Chinese Wikipedia, historical telemarketing records, news crawled on the Internet, and question-and-answer corpora, which are not limited here.
Step 504, inputting the text data to be processed into a pre-trained text processing model, and performing data processing on the text data to be processed according to model parameters in the text processing model to obtain target text data; the text processing model is obtained by training based on word vector coded data and language coded data corresponding to different input methods as input data, the word vector coded data is obtained based on a pre-trained word vector model, and the language coded data is obtained based on the pre-trained language model.
The text processing model is used for carrying out error correction processing on the text data to be processed and is used for processing the text data to be processed into text data with higher precision. In specific business, model training can be performed according to text data with high precision as training data, and then the precision of model training is improved.
The target text data is obtained by performing error correction processing on the text data to be processed; that is, the data accuracy of the target text data is high, and the target text data can be used as a training sample set for model training. The input methods specifically include a pinyin input method and a five-stroke input method, which correspond to identifying text with different coding algorithms: the pinyin input method identifies text with a pinyin coding algorithm, and the five-stroke input method identifies text with a five-stroke coding algorithm. It should also be noted that the same text content may have different encodings under the different coding algorithms (the pinyin coding algorithm and the five-stroke coding algorithm); for example, for the character "字" (word), the pinyin is "zi" and the five-stroke code is "PBF". Therefore, based on the different encoding methods, word vector models corresponding to the different input methods can be trained respectively.
Specifically, the word vector model includes a pinyin word vector model and a five-stroke word vector model based on different input methods. And for the same text, corresponding pinyin coded data can be obtained by utilizing the pinyin word vector model, and corresponding five-stroke coded data can be obtained by utilizing the five-stroke word vector model.
The language model is a model with language prediction capability, and may be a BERT (Bidirectional Encoder Representations from Transformers) language model. Specifically, the BERT model has two pre-training tasks, MLM (Masked Language Model) and NSP (Next Sentence Prediction): the MLM task predicts the text content at a corresponding position, and the NSP task judges whether two sentences are consecutive.
The text data acquisition method acquires text data to be processed; inputting the text data to be processed into a pre-trained text processing model, and performing data processing on the text data to be processed according to model parameters in the text processing model to obtain target text data; the text processing model is obtained by training based on word vector coded data and language coded data corresponding to different input methods as input data, the word vector coded data is obtained based on a pre-trained word vector model, and the language coded data is obtained based on the pre-trained language model. The word vector models corresponding to different input methods are trained, and the text data are processed comprehensively based on the word vector models corresponding to the different input methods and the language models, so that more text information is considered in the text processing process, the processing capacity of the text data is improved, and the possibility of acquiring high-quality text data is realized.
When the language model is used alone for text processing, if the error detection module in the language model uses only the text-based token embedding feature, it is difficult in practical applications to handle characters that have the same pronunciation or similar glyph components; in particular, for speech recognition scenarios based on automatic speech recognition (ASR), pronunciation is a very important error correction clue. In the present application, word vector models corresponding to different input methods (such as pinyin and five-stroke) are added to work together with the language model in processing the text data to be processed, giving the model more text reference information and improving the text processing model's ability to process the text data to be processed.
In a specific business scenario, Chinese error correction is a basic task in natural language processing and often affects the accuracy of upstream tasks. Readily available, low-cost text data often contains various Chinese errors, for example pinyin errors and five-stroke errors introduced by users' input methods. In ASR recognition, some words are replaced by homophones; for example, "adversity" may be recognized by ASR as "mud gold", so that although the pronunciation is consistent, the meaning of the text is completely changed, and other noise may also be introduced to a greater or lesser extent. Feeding such noisy text into a deep learning model greatly affects the accuracy of the model; after all, for a language as nuanced as Chinese, changing a few characters may change the semantics. Chinese error correction is therefore often used as an underlying module to provide higher-quality text for upstream tasks. Accordingly, one purpose of the present application is to correct the erroneous data in the text data to be processed, so as to ensure that high-accuracy target text data is obtained and can be used as a training sample set for model training.
The present application creatively introduces word vector models corresponding to different input methods, bringing more reference information to the model and improving its ability to process the text data to be processed, so that higher-precision target data can be obtained.
In one embodiment, the extracting the encoded data corresponding to the second text sample set to be trained based on the language model, the five-stroke word vector model, and the pinyin word vector model includes: extracting five-stroke coding data from a second text sample set to be trained based on a pre-trained five-stroke word vector model; extracting pinyin encoding data from a second text sample set to be trained based on a pre-trained pinyin word vector model; acquiring a pre-trained language model, and extracting multi-dimensional language coding data from a second sample set to be trained on the basis of the language model; performing model training according to the encoded data to obtain a text processing model, comprising: and taking the five-stroke coded data, the pinyin coded data and the multidimensional language coded data as input data, and performing model training according to the input data to obtain a text processing model.
Specifically, the text processing model is obtained by training based on the trained word vector models and the language model together. In other words, in the specific training process, a pinyin word vector model and a five-stroke word vector model are first obtained by training on a training sample set, and a pre-trained language model is obtained; model training is then performed again based on the trained pinyin word vector model, the trained five-stroke word vector model and the language model to obtain the final text processing model. In other words, the present application includes at least two layers of model training: the first layer is the training of the input-method-based word vector models, and the second layer is the text processing model obtained by performing model training again based on the input-method-based word vector models obtained from the first layer and the pre-trained language model.
Specifically, the server obtains a second text sample set to be trained, where the second text sample set to be trained and the first text sample set to be trained may be the same or different sample sets, which is not limited here. The second text sample set to be trained is then input into the trained word vector models to obtain pinyin coded data and five-stroke coded data respectively, and into the trained language model to obtain multidimensional language coded data. The obtained pinyin coded data, five-stroke coded data and multidimensional language coded data are then used again as input data for model training, yielding the text processing model. In this process, the input data for training the text processing model come from multiple models, specifically the word vector models corresponding to different input methods and the pre-trained, high-precision language model, so the data sources during training of the text processing model are more accurate, the information is richer, and the training precision of the model is higher.
In a specific embodiment, the multidimensional language coded data includes one or more of word vector coded data (token embedding), classification coded data (segment/type embedding) and position coded data (position embedding). Specifically, the BERT embedding layer in the language model has three inputs, recorded as the multidimensional language coded data, which can express text information from multiple aspects and correspond respectively to token-embedding, segment-embedding and position-embedding. Token-embedding converts words into vector representations of fixed dimension; in Bert-base, each word is represented as a 768-dimensional vector. Segment-embedding is used when BERT solves a two-sentence classification task (for example, judging whether two segments of text are semantically similar): the two segments are concatenated and fed into the model, and the model distinguishes them through segment-embedding, with the segment-embedding of the first sentence being all 0 and that of the second sentence being all 1. BERT uses a Transformer encoder to learn sentence representations through the self-attention mechanism, and self-attention does not attend to the position information of tokens, so position-embedding is added at the input so that the Transformer can learn the position information of tokens.
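For illustration only, the three BERT-style inputs can be reproduced as the sum of three embedding tables. The hidden size of 768 matches Bert-base as stated above; the vocabulary size, maximum length and example inputs below are assumptions, not values from the application.

```python
# Hypothetical sketch of the BERT embedding layer: token, segment and
# position embeddings are summed into one 768-dimensional representation.
import torch
import torch.nn as nn

class BertEmbeddings(nn.Module):
    def __init__(self, vocab_size=21128, hidden=768, max_len=512, type_vocab=2):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden)      # token embedding
        self.segment_emb = nn.Embedding(type_vocab, hidden)    # segment / type embedding
        self.position_emb = nn.Embedding(max_len, hidden)      # position embedding

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return (self.token_emb(token_ids)
                + self.segment_emb(segment_ids)
                + self.position_emb(positions))                 # broadcast over the batch

emb = BertEmbeddings()
tokens = torch.randint(0, 21128, (1, 8))        # one sentence of 8 token ids
segments = torch.zeros(1, 8, dtype=torch.long)  # single-segment input, so all zeros
print(emb(tokens, segments).shape)              # torch.Size([1, 8, 768])
```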
In the process of training the text processing model, information enhancement is performed by introducing five-stroke and pinyin information. To further improve the efficiency of model training, the five-stroke embedding in the five-stroke word vector model and the pinyin embedding in the pinyin word vector model both use word2vec. Using word2vec reduces the data volume and improves the efficiency of model training.
In one embodiment, taking five-stroke coded data, pinyin coded data and multidimensional language coded data as input data, and performing model training according to the input data to obtain a text processing model, including: splicing the five-stroke coded data, the pinyin coded data and the multi-dimensional language coded data to obtain spliced coded data; predicting the spliced coded data based on the language model to obtain the corresponding prediction probability of each position; determining an initial prediction text at a corresponding position according to the prediction probability; and adjusting initial model parameters of the initial text processing model based on the difference between the initial prediction text and the real label text to obtain target model parameters, and determining the text processing model according to the target model parameters.
Specifically, the server splices the acquired five-stroke coded data, pinyin coded data and multidimensional language coded data to obtain spliced coded data, and inputs the spliced coded data into the prediction module to obtain the prediction probability corresponding to each position. Data whose prediction probability is larger than a preset threshold are extracted as the initial prediction data; for example, the data whose prediction probabilities rank in the top 5 can be used as the initial prediction data.
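The following is a hedged sketch of the splicing and per-position prediction step; the module name, feature dimensions and loss wiring are assumptions for illustration, not the disclosed implementation.

```python
# Hypothetical sketch: concatenate per-position BERT, pinyin and five-stroke
# encodings, then predict a character distribution at every position.
import torch
import torch.nn as nn

class SpliceAndPredict(nn.Module):
    def __init__(self, bert_dim=768, pinyin_dim=128, wubi_dim=128, vocab_size=21128):
        super().__init__()
        self.classifier = nn.Linear(bert_dim + pinyin_dim + wubi_dim, vocab_size)

    def forward(self, bert_enc, pinyin_enc, wubi_enc):
        # All inputs are (batch, seq_len, dim); splice along the feature dimension.
        spliced = torch.cat([bert_enc, pinyin_enc, wubi_enc], dim=-1)
        return self.classifier(spliced)          # (batch, seq_len, vocab_size) logits

model = SpliceAndPredict()
logits = model(torch.randn(1, 8, 768), torch.randn(1, 8, 128), torch.randn(1, 8, 128))
# Adjust parameters from the difference between predictions and the real label text ids.
loss = nn.CrossEntropyLoss()(logits.view(-1, 21128), torch.randint(0, 21128, (8,)))
```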
In one embodiment, determining the initial prediction text at the corresponding position according to the magnitude of the prediction probability comprises: acquiring a prediction text whose prediction probability value is larger than a preset value; and extracting an initial prediction text from the prediction text based on a homophone principle and a pinyin principle, and storing the initial prediction text in a blockchain node.
Referring to FIG. 3, FIG. 3 is a block diagram of a text processing model provided in one embodiment. Specifically, in the error correction module of the text processing model, the token embedding is not fed into a final classification layer of the error correction module; instead, it is output directly through the error correction module, and the output is constrained using pinyin features. Specifically, the method makes full use of the characteristics of language model training to detect errors in the text: current language models essentially predict the current position given the words on its left and right, and likewise predict the words on the left and right given the central word. For example, the correct pinyin of "Chinese panda" is "zhong guo xiong mao" and the incorrect pinyin is "zhong guo xun mao". If the quality of the trained pinyin word vector model is high, the probability that the pinyin before "mao" is "xiong" is higher than the probability that it is "xun"; similarly, the probability that "guo" is followed by "xiong" is higher than by "xun". It can therefore be seen that when the error detection module predicts the pinyin in front of "mao", the probability of "xiong" is far higher than that of "xun", which plays the role of detecting the error. This is also the reason why, in some embodiments, the pinyin word vector model is frozen during model training: the freezing ensures that the correct pinyin word vectors are not affected by lower-quality data.
With continued reference to FIG. 3, the error correction part uses a BERT model that performs a softmax output for each character; if the output differs from the input, the character needs to be corrected. For example, for the wrongly typed character "smoke" (xun), the softmax of the BERT model outputs the five highest-scoring candidates: "bear", "hit", "big", "good" and "gan". At this point it is desirable to further screen according to the pinyin: based on the output of the preceding pinyin embedding and dense layers, the pinyin at that position is predicted to be "xiong", the BERT result is filtered based on "xiong", the other pinyins are removed, and finally only "bear" remains; the other positions are handled in the same way.
That is, the result of the pinyin prediction is used to screen the prediction result in the error correction module: besides screening the Top 5 of the prediction result for homophones, the pinyin prediction result is also added to the screening. For example, for "Chinese smoked cat", for the wrongly typed character "smoke" (xun), suppose the Top 5 of the prediction result are "bear", "hit", "big", "good" and "gan". If only homophone screening with "xun" is performed, the highest-probability candidate "bear" is filtered out and error correction fails; but if the pinyin prediction result for this position is "xiong", then filtering with "xiong" and "xun" retains "bear", "hit" and "gan", and taking the highest-probability candidate "bear" makes error correction succeed.
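A simplified sketch of this screening step is shown below with made-up candidate data; the candidate list, probabilities and pinyin lookup table are placeholders standing in for the model outputs described above, not values from the application.

```python
# Hypothetical sketch: filter the Top-5 BERT candidates with the pinyin predicted
# by the pinyin branch, then keep the highest-probability surviving candidate.
def screen_by_pinyin(top5, predicted_pinyin, input_pinyin, char_to_pinyin):
    """top5: list of (char, prob) pairs; keep chars whose pinyin matches either the
    predicted pinyin or the original input pinyin, then return the most probable one."""
    allowed = {predicted_pinyin, input_pinyin}
    survivors = [(c, p) for c, p in top5 if char_to_pinyin.get(c) in allowed]
    return max(survivors, key=lambda cp: cp[1])[0] if survivors else None

# Illustrative data for the "xun"/"xiong" example (characters shown by English gloss).
top5 = [("bear", 0.61), ("hit", 0.15), ("big", 0.10), ("good", 0.08), ("gan", 0.06)]
char_to_pinyin = {"bear": "xiong", "hit": "da", "big": "da", "good": "hao", "gan": "gan"}

print(screen_by_pinyin(top5, "xiong", "xun", char_to_pinyin))
# -> "bear": correction succeeds because the pinyin branch predicted "xiong".
```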
It should be emphasized that, in order to further ensure the privacy and security of the initial prediction text, the initial prediction text may also be stored in a node of a blockchain.
In the above embodiment, dynamic screening can be implemented by using the pinyin embedding and bidirectional GRU + Dense word-list screening, rather than only fixed homophone screening. Specifically, the result of the pinyin model, rather than the original input pinyin, is used to screen the subsequent BERT error correction result, which improves the error correction accuracy and yields text data with higher accuracy.
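A minimal sketch of the pinyin branch described here (pinyin embedding, then bidirectional GRU, then Dense layer) is given below; the pinyin vocabulary size and hidden dimensions are assumptions for illustration only.

```python
# Hypothetical sketch: pinyin embedding -> bidirectional GRU -> dense layer that
# predicts the correct pinyin at every position, used to screen the BERT output.
import torch
import torch.nn as nn

class PinyinPredictor(nn.Module):
    def __init__(self, pinyin_vocab=410, emb_dim=128, hidden=128):
        super().__init__()
        self.embedding = nn.Embedding(pinyin_vocab, emb_dim)    # pinyin embedding (frozen)
        self.gru = nn.GRU(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.dense = nn.Linear(2 * hidden, pinyin_vocab)         # per-position pinyin logits

    def forward(self, pinyin_ids):
        out, _ = self.gru(self.embedding(pinyin_ids))
        return self.dense(out)                                   # (batch, seq_len, pinyin_vocab)

predictor = PinyinPredictor()
logits = predictor(torch.randint(0, 410, (1, 8)))                # 8 pinyin tokens
print(logits.argmax(-1).shape)                                   # predicted pinyin id per position
```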
In one embodiment, the model parameters of the text processing model include pinyin model parameters and wubi model parameters; adjusting initial model parameters of an initial text processing model based on the difference between the initial predicted text and the real label text to obtain target model parameters, and determining a text processing model according to the target model parameters, wherein the method comprises the following steps: adjusting the initial five-stroke parameters of the initial text processing model based on the difference between the initial predicted text and the real label text to obtain target five-stroke model parameters; and determining a text processing model according to the pinyin model parameters and the target five-stroke model parameters.
Specifically, in the process of training the text processing model, the pinyin embedding is fixed while the five-stroke embedding is trainable. Trainable means that its parameters are variable, that is, the five-stroke embedding participates in the back-propagation parameter updates during training, whereas the pinyin embedding is fixed, i.e. not updated during training.
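In a framework such as PyTorch, this freezing could look like the following sketch, under the assumption that both embeddings are nn.Embedding layers initialized from the word2vec vectors; the table sizes are illustrative.

```python
# Hypothetical sketch: the pinyin embedding is frozen (not updated by
# back-propagation), while the five-stroke embedding remains trainable.
import torch.nn as nn

pinyin_embedding = nn.Embedding(410, 128)    # initialized from the pinyin word2vec vectors
wubi_embedding = nn.Embedding(6000, 128)     # initialized from the five-stroke word2vec vectors

pinyin_embedding.weight.requires_grad = False    # frozen: keeps the "correct pinyin" prior
# wubi_embedding keeps requires_grad=True and is updated during training.

trainable = [p for p in (*pinyin_embedding.parameters(), *wubi_embedding.parameters())
             if p.requires_grad]                  # only the five-stroke weights remain here
```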
In one embodiment, the word vector models are trained based on word2vec, and the language model is trained based on the BERT model.
Although the BERT language model is powerful, the cost of building a pinyin BERT is high and the quality of its pre-training text cannot be guaranteed; even if a pinyin BERT were built, it could only provide information enhancement and could not by itself perform pinyin error detection. The model therefore prioritizes the quality of the training data for the pinyin embedding and chooses the lightweight word2vec model instead of BERT. Moreover, because the word2vec pre-training process is correlated with the downstream error detection task, its error detection capability is not much worse than BERT's. The five-stroke vectors, like the pinyin vectors, are trained with the word2vec method.
Specifically, the word2vec training method is as follows: all characters are converted into five-stroke codes, and the sliding window is set to 5, that is, each time the codes of the two characters before and the two characters after are used to predict the code of the middle character.
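For illustration, building the (context, centre) training pairs over five-stroke codes with a window of 5 could be sketched as follows; the wubi_codes sequence is a placeholder example.

```python
# Hypothetical sketch: with a sliding window of 5, the codes of the 2 characters
# before and the 2 characters after are used to predict the code in the middle.
def cbow_pairs(wubi_codes, window=5):
    half = window // 2                               # 2 codes on each side
    pairs = []
    for i in range(half, len(wubi_codes) - half):
        context = wubi_codes[i - half:i] + wubi_codes[i + 1:i + half + 1]
        pairs.append((context, wubi_codes[i]))       # (4 context codes, centre code)
    return pairs

codes = ["khk", "lgyi", "pbf", "rnt", "ygd"]         # illustrative five-stroke codes
print(cbow_pairs(codes))
# [(['khk', 'lgyi', 'rnt', 'ygd'], 'pbf')]
```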
In this embodiment, introducing the five-stroke embedding and the pinyin embedding trained on high-quality text into the error detection module for information enhancement can obviously improve the capability of the original Soft-Mask error detection network. By freezing the pinyin embedding, it can be ensured that the embedding does not introduce the pinyin information of the current erroneous text, thereby preserving the error correction capability. Secondly, homophone screening of the Top 5 candidates in the error correction module effectively controls the text output, and the correct pinyin predicted by the pinyin embedding layer, the bidirectional GRU layer and the Dense layer can be used to dynamically screen the result, which reduces the probability that homophone screening filters out the correct character.
The scheme of the present application highlights the importance of phonetic features and makes up for the shortcoming of relying only on the language model in ASR speech recognition error correction scenarios.
It should be understood that, although the steps in the flowchart of fig. 2 are shown in the order indicated by the arrows, the steps are not necessarily performed in that order. Unless explicitly stated otherwise herein, the steps are not strictly limited to this order and may be performed in other orders. Moreover, at least some of the steps in fig. 2 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 4, there is provided a text processing model training apparatus including:
a first training sample set obtaining module 402, configured to obtain a first text sample set to be trained.
And the word vector training module 404 is configured to perform model training respectively based on the first to-be-trained text sample set to obtain five-stroke word vector models and pinyin word vector models corresponding to different input methods.
And a second training sample set obtaining module 406, configured to obtain a second text sample set to be trained and a pre-trained language model.
And the encoded data extraction module 408 is configured to extract encoded data corresponding to the second text sample set to be trained based on the language model, the five-stroke word vector model, and the pinyin word vector model.
And the model training module 410 is used for executing model training according to the encoded data to obtain a text processing model.
In one embodiment, the encoded data extraction module 408 is further configured to convert the first to-be-trained text sample set into corresponding pinyin coding vectors, sequentially traverse the pinyin coding vectors according to a pre-configured sliding window, use the traversed pinyin coding vectors as current to-be-processed pinyin vectors, predict pinyin coding vectors at preset positions in the current to-be-processed pinyin vectors based on a current word vector model corresponding to current pinyin model parameters, determine target pinyin model parameters according to the predicted pinyin coding vectors and the real pinyin coding vectors, and obtain a pinyin word vector model according to the determined target pinyin model parameters; converting a first text sample set to be trained into corresponding five-stroke coding vectors, sequentially traversing the five-stroke coding vectors according to a pre-configured sliding window, taking the traversed five-stroke coding vectors as the current five-stroke vectors to be processed, predicting the five-stroke coding vectors at preset positions in the current five-stroke vectors to be processed based on a current word vector model corresponding to the current five-stroke model parameters, determining target five-stroke model parameters according to the predicted five-stroke coding vectors and the real five-stroke coding vectors, and obtaining a five-stroke word vector model according to the determined target five-stroke model parameters.
In one embodiment, the encoding data extraction module 408 is further configured to extract five-stroke encoding data from the second set of text samples to be trained based on a pre-trained five-stroke word vector model; extracting pinyin encoding data from a second text sample set to be trained based on a pre-trained pinyin word vector model; acquiring a pre-trained language model, and extracting multi-dimensional language coding data from a second sample set to be trained on the basis of the language model; the model training module 410 is further configured to use the five-stroke coded data, the pinyin coded data, and the multidimensional language coded data as input data, and perform model training according to the input data to obtain a text processing model.
In one embodiment, the model training module 410 is further configured to perform splicing processing on the five-stroke coded data, the pinyin coded data, and the multi-dimensional language coded data to obtain spliced coded data; predicting the spliced coded data based on the language model to obtain the corresponding prediction probability of each position; determining an initial prediction text at a corresponding position according to the prediction probability; and adjusting initial model parameters of the initial text processing model based on the difference between the initial prediction text and the real label text to obtain target model parameters, and determining the text processing model according to the target model parameters.
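A hedged sketch of the splicing and parameter-adjustment step just described: the three encoding streams are concatenated position by position, projected to per-character prediction probabilities, and the initial parameters are adjusted against the real label text. The projection head, dimensions and character vocabulary are assumptions for illustration, not the patent's prescribed architecture.

```python
# Hypothetical sketch: concatenate (splice) the three encodings per position, predict a
# character distribution at each position, and update parameters from the label text.
import torch
import torch.nn as nn

class TextProcessingHead(nn.Module):
    def __init__(self, dim_wubi, dim_pinyin, dim_lm, vocab_size):
        super().__init__()
        self.proj = nn.Linear(dim_wubi + dim_pinyin + dim_lm, vocab_size)

    def forward(self, wubi_enc, pinyin_enc, lm_enc):
        spliced = torch.cat([wubi_enc, pinyin_enc, lm_enc], dim=-1)  # spliced coded data
        return self.proj(spliced)                                    # (seq_len, vocab_size) logits

def training_step(head, optimizer, wubi_enc, pinyin_enc, lm_enc, label_ids):
    logits = head(wubi_enc, pinyin_enc, lm_enc)
    probs = logits.softmax(dim=-1)                          # prediction probability at each position
    initial_prediction = probs.argmax(dim=-1)               # initial predicted text (as character ids)
    loss = nn.functional.cross_entropy(logits, label_ids)   # difference vs the real label text
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return initial_prediction, loss.item()
```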
In one embodiment, the model training module 410 is further configured to obtain a predicted text with a prediction probability value greater than a preset value; and extract an initial prediction text from the predicted text based on a homophone principle and a pinyin principle, and store the initial prediction text into a blockchain node.
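For illustration only, a minimal sketch of the candidate-narrowing rule above: keep candidates whose prediction probability exceeds the preset value, then prefer candidates that share the pronunciation (pinyin) of the original character. The pinyin_of helper and the threshold value are assumptions; writing the result to a blockchain node is outside the scope of this sketch.

```python
# Hypothetical sketch of selecting the initial prediction text at one position.
def select_initial_prediction(candidates, original_char, pinyin_of, threshold=0.5):
    # candidates: list of (character, probability) pairs predicted for this position
    kept = [(ch, p) for ch, p in candidates if p > threshold]           # probability > preset value
    same_pinyin = [(ch, p) for ch, p in kept if pinyin_of(ch) == pinyin_of(original_char)]
    pool = same_pinyin or kept or candidates                            # fall back if nothing survives
    return max(pool, key=lambda item: item[1])[0]                       # highest-probability candidate
```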
In one embodiment, the model training module 410 is further configured to adjust the initial five-stroke parameters of the initial text processing model based on the difference between the initial predicted text and the real label text to obtain target five-stroke model parameters; and determining a text processing model according to the pinyin model parameters and the target five-stroke model parameters.
In one embodiment, as shown in fig. 6, there is provided a text data acquisition apparatus including:
an obtaining module 602, configured to obtain text data to be processed.
The processing module 604 is configured to input the text data to be processed into a pre-trained text processing model, and perform data processing on the text data to be processed according to model parameters in the text processing model to obtain target text data; the text processing model is obtained by training based on word vector coded data and language coded data corresponding to different input methods as input data, the word vector coded data is obtained based on a pre-trained word vector model, and the language coded data is obtained based on the pre-trained language model.
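At inference time the processing described above amounts to a single forward pass through the trained model, sketched below under the same illustrative assumptions as the earlier snippets (the extract_encoded_data helper, the projection head, and an id_to_char table are assumed, not defined by the patent):

```python
# Hypothetical sketch of acquiring target text data from to-be-processed text,
# reusing the assumed helpers defined in the earlier illustrative sketches.
import torch

def acquire_target_text(text, wubi_model, pinyin_model, lm, tokenizer, head, id_to_char,
                        wubi_codes_of, pinyin_codes_of):
    with torch.no_grad():
        wubi_enc, pinyin_enc, lm_enc = extract_encoded_data(
            text, wubi_model, pinyin_model, lm, tokenizer, wubi_codes_of, pinyin_codes_of)
        logits = head(wubi_enc, pinyin_enc, lm_enc)          # model parameters do the data processing
        ids = logits.argmax(dim=-1).tolist()                 # most probable character at each position
    return "".join(id_to_char[i] for i in ids)               # target text data
```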
For the specific limitations of the text data acquisition apparatus and the text processing model training apparatus, reference may be made to the limitations of the text data acquisition method and the text processing model training method above, which are not repeated here. All or part of the modules in the text data acquisition apparatus and the text processing model training apparatus may be implemented by software, by hardware, or by a combination of the two. The modules may be embedded in, or independent of, a processor of the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 7. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing text data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a text data acquisition method and a text processing model training method.
Those skilled in the art will appreciate that the architecture shown in fig. 7 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than those shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program: acquiring text data to be processed; inputting the text data to be processed into a pre-trained text processing model, and performing data processing on the text data to be processed according to model parameters in the text processing model to obtain target text data; the text processing model is obtained by training based on word vector coded data and language coded data corresponding to different input methods as input data, the word vector coded data is obtained based on a pre-trained word vector model, and the language coded data is obtained based on the pre-trained language model.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program: acquiring a first text sample set to be trained; respectively executing model training based on the first text sample set to be trained to obtain five-stroke word vector models and pinyin word vector models corresponding to different input methods; acquiring a second text sample set to be trained and a pre-trained language model; respectively extracting coded data corresponding to a second text sample set to be trained based on the language model, the five-stroke word vector model and the pinyin word vector model; and performing model training according to the encoded data to obtain a text processing model.
In one embodiment, the processor, when executing the computer program, further performs the steps of: converting a first text sample set to be trained into corresponding pinyin coding vectors, sequentially traversing the pinyin coding vectors according to a pre-configured sliding window, taking the traversed pinyin coding vectors as current pinyin vectors to be processed, predicting the pinyin coding vectors at preset positions in the current pinyin vectors to be processed based on a current word vector model corresponding to current pinyin model parameters, determining target pinyin model parameters according to the predicted pinyin coding vectors and real pinyin coding vectors, and obtaining a pinyin word vector model according to the determined target pinyin model parameters; converting a first text sample set to be trained into corresponding five-stroke coding vectors, sequentially traversing the five-stroke coding vectors according to a pre-configured sliding window, taking the traversed five-stroke coding vectors as the current five-stroke vectors to be processed, predicting the five-stroke coding vectors at preset positions in the current five-stroke vectors to be processed based on a current word vector model corresponding to the current five-stroke model parameters, determining target five-stroke model parameters according to the predicted five-stroke coding vectors and the real five-stroke coding vectors, and obtaining a five-stroke word vector model according to the determined target five-stroke model parameters.
In one embodiment, the processor, when executing the computer program, further performs the steps of: extracting five-stroke coding data from the second text sample set to be trained based on the pre-trained five-stroke word vector model; extracting pinyin coding data from the second text sample set to be trained based on the pre-trained pinyin word vector model; acquiring a pre-trained language model, and extracting multi-dimensional language coding data from the second text sample set to be trained based on the language model; and taking the five-stroke coded data, the pinyin coded data and the multi-dimensional language coded data as input data, and performing model training according to the input data to obtain a text processing model.
In one embodiment, the processor, when executing the computer program, further performs the steps of: splicing the five-stroke coded data, the pinyin coded data and the multi-dimensional language coded data to obtain spliced coded data; predicting the spliced coded data based on the language model to obtain the corresponding prediction probability of each position; determining an initial prediction text at a corresponding position according to the prediction probability; and adjusting initial model parameters of the initial text processing model based on the difference between the initial prediction text and the real label text to obtain target model parameters, and determining the text processing model according to the target model parameters.
In one embodiment, the processor, when executing the computer program, further performs the steps of: acquiring a prediction text with a prediction probability value greater than a preset value; and extracting an initial prediction text from the prediction text based on a homophone principle and a pinyin principle, and storing the initial prediction text into a blockchain node.
In one embodiment, the processor, when executing the computer program, further performs the steps of: adjusting the initial five-stroke parameters of the initial text processing model based on the difference between the initial predicted text and the real label text to obtain target five-stroke model parameters; and determining a text processing model according to the pinyin model parameters and the target five-stroke model parameters.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of: acquiring text data to be processed; inputting the text data to be processed into a pre-trained text processing model, and performing data processing on the text data to be processed according to model parameters in the text processing model to obtain target text data; the text processing model is obtained by training based on word vector coded data and language coded data corresponding to different input methods as input data, the word vector coded data is obtained based on a pre-trained word vector model, and the language coded data is obtained based on the pre-trained language model.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of: acquiring a first text sample set to be trained; respectively executing model training based on the first text sample set to be trained to obtain five-stroke word vector models and pinyin word vector models corresponding to different input methods; acquiring a second text sample set to be trained and a pre-trained language model; respectively extracting coded data corresponding to a second text sample set to be trained based on the language model, the five-stroke word vector model and the pinyin word vector model; and performing model training according to the encoded data to obtain a text processing model.
In one embodiment, the computer program when executed by the processor further performs the steps of: converting a first text sample set to be trained into corresponding pinyin coding vectors, sequentially traversing the pinyin coding vectors according to a pre-configured sliding window, taking the traversed pinyin coding vectors as current pinyin vectors to be processed, predicting the pinyin coding vectors at preset positions in the current pinyin vectors to be processed based on a current word vector model corresponding to current pinyin model parameters, determining target pinyin model parameters according to the predicted pinyin coding vectors and real pinyin coding vectors, and obtaining a pinyin word vector model according to the determined target pinyin model parameters; converting a first text sample set to be trained into corresponding five-stroke coding vectors, sequentially traversing the five-stroke coding vectors according to a pre-configured sliding window, taking the traversed five-stroke coding vectors as the current five-stroke vectors to be processed, predicting the five-stroke coding vectors at preset positions in the current five-stroke vectors to be processed based on a current word vector model corresponding to the current five-stroke model parameters, determining target five-stroke model parameters according to the predicted five-stroke coding vectors and the real five-stroke coding vectors, and obtaining a five-stroke word vector model according to the determined target five-stroke model parameters.
In one embodiment, the computer program when executed by the processor further performs the steps of: extracting five-stroke coding data from the second text sample set to be trained based on the pre-trained five-stroke word vector model; extracting pinyin coding data from the second text sample set to be trained based on the pre-trained pinyin word vector model; acquiring a pre-trained language model, and extracting multi-dimensional language coding data from the second text sample set to be trained based on the language model; and taking the five-stroke coded data, the pinyin coded data and the multi-dimensional language coded data as input data, and performing model training according to the input data to obtain a text processing model.
In one embodiment, the computer program when executed by the processor further performs the steps of: splicing the five-stroke coded data, the pinyin coded data and the multi-dimensional language coded data to obtain spliced coded data; predicting the spliced coded data based on the language model to obtain the corresponding prediction probability of each position; determining an initial prediction text at a corresponding position according to the prediction probability; and adjusting initial model parameters of the initial text processing model based on the difference between the initial prediction text and the real label text to obtain target model parameters, and determining the text processing model according to the target model parameters.
In one embodiment, the computer program when executed by the processor further performs the steps of: acquiring a prediction text with a prediction probability value greater than a preset value; and extracting an initial prediction text from the prediction text based on a homophone principle and a pinyin principle, and storing the initial prediction text into a blockchain node.
In one embodiment, the computer program when executed by the processor further performs the steps of: adjusting the initial five-stroke parameters of the initial text processing model based on the difference between the initial predicted text and the real label text to obtain target five-stroke model parameters; and determining a text processing model according to the pinyin model parameters and the target five-stroke model parameters.
The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain (Blockchain) is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, where each data block contains information about a batch of network transactions and is used to verify the validity (anti-counterfeiting) of that information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by a computer program instructing the related hardware; the computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, and the like. Volatile memory may include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as Static Random Access Memory (SRAM) and Dynamic Random Access Memory (DRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in a combination of these technical features, it should be considered to be within the scope of this specification.
The above examples express only several embodiments of the present application, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the invention. It should be noted that several variations and modifications can be made by those skilled in the art without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A method for training a text processing model, the method comprising:
acquiring a first text sample set to be trained;
respectively executing model training based on the first text sample set to be trained to obtain five-stroke word vector models and pinyin word vector models corresponding to different input methods;
acquiring a second text sample set to be trained and a pre-trained language model;
respectively extracting coded data corresponding to the second text sample set to be trained on the basis of the language model, the five-stroke word vector model and the pinyin word vector model;
and performing model training according to the coded data to obtain a text processing model.
2. The method of claim 1, wherein the respectively performing model training based on the first to-be-trained text sample set to obtain five-stroke word vector models and pinyin word vector models corresponding to different input methods comprises:
converting the first text sample set to be trained into corresponding pinyin coding vectors, sequentially traversing the pinyin coding vectors according to a pre-configured sliding window, taking the traversed pinyin coding vectors as current pinyin vectors to be processed, predicting the pinyin coding vectors at preset positions in the current pinyin vectors to be processed based on a current word vector model corresponding to current pinyin model parameters, determining target pinyin model parameters according to the predicted pinyin coding vectors and the real pinyin coding vectors, and obtaining a pinyin word vector model according to the determined target pinyin model parameters;
converting the first text sample set to be trained into corresponding five-stroke coding vectors, sequentially traversing the five-stroke coding vectors according to a pre-configured sliding window, taking the traversed five-stroke coding vectors as the current five-stroke vectors to be processed, predicting the five-stroke coding vectors at preset positions in the current five-stroke vectors to be processed based on a current word vector model corresponding to the current five-stroke model parameters, determining target five-stroke model parameters according to the predicted five-stroke coding vectors and the real five-stroke coding vectors, and obtaining a five-stroke word vector model according to the determined target five-stroke model parameters.
3. The method according to claim 1, wherein the extracting, based on the language model, the five-stroke word vector model, and the pinyin word vector model, the encoded data corresponding to the second text sample set to be trained includes:
extracting five-stroke coding data from the second text sample set to be trained based on the pre-trained five-stroke word vector model;
extracting pinyin coded data from the second text sample set to be trained based on the pre-trained pinyin word vector model;
acquiring a pre-trained language model, and extracting multi-dimensional language coding data from the second text sample set to be trained based on the language model;
the performing model training according to the encoded data to obtain a text processing model includes:
and taking the five-stroke coded data, the pinyin coded data and the multi-dimensional language coded data as input data, and performing model training according to the input data to obtain a text processing model.
4. The method as claimed in claim 3, wherein the using the five-stroke coded data, the pinyin coded data and the multidimensional language coded data as input data, and performing model training according to the input data to obtain a text processing model comprises:
splicing the five-stroke coded data, the pinyin coded data and the multidimensional language coded data to obtain spliced coded data;
predicting the spliced coded data based on the language model to obtain a corresponding prediction probability at each position;
determining an initial prediction text at a corresponding position according to the prediction probability;
and adjusting initial model parameters of an initial text processing model based on the difference between the initial prediction text and the real label text to obtain target model parameters, and determining the text processing model according to the target model parameters.
5. The method of claim 4, wherein determining the initial predicted text at the corresponding position according to the magnitude of the prediction probability comprises:
acquiring a prediction text with a prediction probability value larger than a preset value;
and extracting an initial prediction text from the prediction text based on a homophone principle and a pinyin principle, and storing the initial prediction text into a blockchain node.
6. The method of claim 4, wherein the model parameters of the text processing model include pinyin model parameters and five-stroke model parameters; the adjusting initial model parameters of an initial text processing model based on the difference between the initial predicted text and the real label text to obtain target model parameters, and determining a text processing model according to the target model parameters includes:
adjusting the initial five-stroke parameters of the initial text processing model based on the difference between the initial predicted text and the real label text to obtain target five-stroke model parameters;
and determining a text processing model according to the pinyin model parameters and the target five-stroke model parameters.
7. A text data acquisition method, characterized in that the method comprises:
acquiring text data to be processed;
inputting the text data to be processed into a pre-trained text processing model, and performing data processing on the text data to be processed according to model parameters in the text processing model to obtain target text data; the text processing model is obtained by training based on word vector coded data and language coded data corresponding to different input methods as input data, the word vector coded data is obtained based on a pre-trained word vector model, and the language coded data is obtained based on the pre-trained language model.
8. A text processing model training apparatus, the apparatus comprising:
the first training sample set acquisition module is used for acquiring a first text sample set to be trained;
the word vector training module is used for respectively executing model training based on the first text sample set to be trained to obtain five-stroke word vector models and pinyin word vector models corresponding to different input methods;
the second training sample set acquisition module is used for acquiring a second text sample set to be trained and a pre-trained language model;
the coded data extraction module is used for respectively extracting coded data corresponding to the second text sample set to be trained on the basis of the language model, the five-stroke word vector model and the pinyin word vector model;
and the model training module is used for executing model training according to the coded data to obtain a text processing model.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202011447964.2A 2020-12-11 2020-12-11 Text processing model training method, device, computer equipment and storage medium Active CN112528637B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011447964.2A CN112528637B (en) 2020-12-11 2020-12-11 Text processing model training method, device, computer equipment and storage medium
PCT/CN2021/096582 WO2022121251A1 (en) 2020-12-11 2021-05-28 Method and apparatus for training text processing model, computer device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011447964.2A CN112528637B (en) 2020-12-11 2020-12-11 Text processing model training method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112528637A true CN112528637A (en) 2021-03-19
CN112528637B CN112528637B (en) 2024-03-29

Family

ID=74998573

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011447964.2A Active CN112528637B (en) 2020-12-11 2020-12-11 Text processing model training method, device, computer equipment and storage medium

Country Status (2)

Country Link
CN (1) CN112528637B (en)
WO (1) WO2022121251A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113434699A (en) * 2021-06-30 2021-09-24 平安科技(深圳)有限公司 Pre-training method of BERT model, computer device and storage medium
CN113609157A (en) * 2021-08-09 2021-11-05 平安科技(深圳)有限公司 Language conversion model training method, language conversion device, language conversion equipment and medium
CN114139524A (en) * 2021-11-29 2022-03-04 浙江大学 Method and device for predicting story text and electronic equipment
WO2022121251A1 (en) * 2020-12-11 2022-06-16 平安科技(深圳)有限公司 Method and apparatus for training text processing model, computer device and storage medium

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116667326B (en) * 2023-05-30 2024-02-23 淮阴工学院 Electric automobile charging load prediction method
CN117609781A (en) * 2023-11-20 2024-02-27 北京中关村科金技术有限公司 Training method of text evaluation model, text evaluation method and device

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180336900A1 (en) * 2017-05-18 2018-11-22 Baidu Online Network Technology (Beijing) Co., Ltd . Artificial Intelligence-Based Cross-Language Speech Transcription Method and Apparatus, Device and Readable Medium
CN110288980A (en) * 2019-06-17 2019-09-27 平安科技(深圳)有限公司 Audio recognition method, the training method of model, device, equipment and storage medium
CN110795935A (en) * 2020-01-06 2020-02-14 广东博智林机器人有限公司 Training method and device for character word vector model, terminal and storage medium
CN111476036A (en) * 2020-04-10 2020-07-31 电子科技大学 Word embedding learning method based on Chinese word feature substrings
CN111488466A (en) * 2020-04-16 2020-08-04 清华大学 Chinese language error corpus generating method, calculating device and storage medium
CN111523306A (en) * 2019-01-17 2020-08-11 阿里巴巴集团控股有限公司 Text error correction method, device and system
WO2020186778A1 (en) * 2019-03-15 2020-09-24 平安科技(深圳)有限公司 Error word correction method and device, computer device, and storage medium
US20200364412A1 (en) * 2018-05-10 2020-11-19 Tencent Technology (Shenzhen) Company Limited Translation model training method, sentence translation method, device, and storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140214401A1 (en) * 2013-01-29 2014-07-31 Tencent Technology (Shenzhen) Company Limited Method and device for error correction model training and text error correction
CN107357775A (en) * 2017-06-05 2017-11-17 百度在线网络技术(北京)有限公司 The text error correction method and device of Recognition with Recurrent Neural Network based on artificial intelligence
CN110750959B (en) * 2019-10-28 2022-05-10 腾讯科技(深圳)有限公司 Text information processing method, model training method and related device
CN111310443B (en) * 2020-02-12 2023-08-18 新华智云科技有限公司 Text error correction method and system
CN111597815A (en) * 2020-05-22 2020-08-28 北京慧闻科技(集团)有限公司 Multi-embedded named entity identification method, device, equipment and storage medium
CN112528637B (en) * 2020-12-11 2024-03-29 平安科技(深圳)有限公司 Text processing model training method, device, computer equipment and storage medium

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180336900A1 (en) * 2017-05-18 2018-11-22 Baidu Online Network Technology (Beijing) Co., Ltd . Artificial Intelligence-Based Cross-Language Speech Transcription Method and Apparatus, Device and Readable Medium
US20200364412A1 (en) * 2018-05-10 2020-11-19 Tencent Technology (Shenzhen) Company Limited Translation model training method, sentence translation method, device, and storage medium
CN111523306A (en) * 2019-01-17 2020-08-11 阿里巴巴集团控股有限公司 Text error correction method, device and system
WO2020186778A1 (en) * 2019-03-15 2020-09-24 平安科技(深圳)有限公司 Error word correction method and device, computer device, and storage medium
CN110288980A (en) * 2019-06-17 2019-09-27 平安科技(深圳)有限公司 Audio recognition method, the training method of model, device, equipment and storage medium
CN110795935A (en) * 2020-01-06 2020-02-14 广东博智林机器人有限公司 Training method and device for character word vector model, terminal and storage medium
CN111476036A (en) * 2020-04-10 2020-07-31 电子科技大学 Word embedding learning method based on Chinese word feature substrings
CN111488466A (en) * 2020-04-16 2020-08-04 清华大学 Chinese language error corpus generating method, calculating device and storage medium

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022121251A1 (en) * 2020-12-11 2022-06-16 平安科技(深圳)有限公司 Method and apparatus for training text processing model, computer device and storage medium
CN113434699A (en) * 2021-06-30 2021-09-24 平安科技(深圳)有限公司 Pre-training method of BERT model, computer device and storage medium
CN113434699B (en) * 2021-06-30 2023-07-18 平安科技(深圳)有限公司 Pre-training method, computer device and storage medium for BERT model for text matching
CN113609157A (en) * 2021-08-09 2021-11-05 平安科技(深圳)有限公司 Language conversion model training method, language conversion device, language conversion equipment and medium
CN113609157B (en) * 2021-08-09 2023-06-30 平安科技(深圳)有限公司 Language conversion model training, language conversion method, device, equipment and medium
CN114139524A (en) * 2021-11-29 2022-03-04 浙江大学 Method and device for predicting story text and electronic equipment

Also Published As

Publication number Publication date
CN112528637B (en) 2024-03-29
WO2022121251A1 (en) 2022-06-16

Similar Documents

Publication Publication Date Title
US10380236B1 (en) Machine learning system for annotating unstructured text
CN112528637A (en) Text processing model training method and device, computer equipment and storage medium
US20180336193A1 (en) Artificial Intelligence Based Method and Apparatus for Generating Article
CN109376222B (en) Question-answer matching degree calculation method, question-answer automatic matching method and device
CN112270196B (en) Entity relationship identification method and device and electronic equipment
CN110163181B (en) Sign language identification method and device
CN114565104A (en) Language model pre-training method, result recommendation method and related device
US11023766B2 (en) Automatic optical character recognition (OCR) correction
CN112580346B (en) Event extraction method and device, computer equipment and storage medium
CN113254654B (en) Model training method, text recognition method, device, equipment and medium
CN113094478B (en) Expression reply method, device, equipment and storage medium
CN111145914B (en) Method and device for determining text entity of lung cancer clinical disease seed bank
CN113590078A (en) Virtual image synthesis method and device, computing equipment and storage medium
CN110942774A (en) Man-machine interaction system, and dialogue method, medium and equipment thereof
CN112101042A (en) Text emotion recognition method and device, terminal device and storage medium
CN110852071A (en) Knowledge point detection method, device, equipment and readable storage medium
CN114492661A (en) Text data classification method and device, computer equipment and storage medium
CN117349402A (en) Emotion cause pair identification method and system based on machine reading understanding
CN115858776B (en) Variant text classification recognition method, system, storage medium and electronic equipment
CN112052329A (en) Text abstract generation method and device, computer equipment and readable storage medium
CN110852066B (en) Multi-language entity relation extraction method and system based on confrontation training mechanism
CN112818688B (en) Text processing method, device, equipment and storage medium
CN112016281B (en) Method and device for generating wrong medical text and storage medium
KR20190060285A (en) Artificial intelligence based dialog system and response control method thereof
CN116306612A (en) Word and sentence generation method and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant