WO2022121251A1 - Text processing model training method and apparatus, computer device and storage medium - Google Patents
- Publication number: WO2022121251A1 (PCT/CN2021/096582)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords: model, text, wubi, data, pinyin
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/274—Converting codes to words; Guess-ahead of partial word inputs
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/02—Input arrangements using manually operated switches, e.g. using keyboards or dials
- G06F3/023—Arrangements for converting discrete items of information into a coded form, e.g. arrangements for interpreting keyboard generated codes as alphanumeric codes, operand codes or instruction codes
- G06F3/0233—Character input methods
- G06F3/0237—Character input methods using prediction or retrieval techniques
Definitions
- the present application relates to a text processing model training method, apparatus, computer equipment and storage medium.
- Chinese error correction is a basic task in natural language processing, which often affects the accuracy of upstream tasks.
- various Chinese errors are often included in real-world text, and in Chinese, changing just a few characters may change the semantics dramatically, so Chinese error correction is often used as an underlying module to provide higher-quality texts for upstream tasks.
- Bert, in the traditional technology, is the current mainstream pre-trained language model. Its MLM pre-training task introduces a small amount of noise through its mask mechanism (of the 15% of positions that are masked, about 10% are replaced with random words, i.e. roughly 1.5% of tokens), so Bert has a certain error detection ability; however, because so little noise is introduced, Bert is often weak at text error detection, making it difficult to obtain high-quality text data.
- a text processing model training method is provided.
- a text processing model training method comprising:
- a text data acquisition device comprising:
- a first training sample set obtaining module, used for obtaining the first text sample set to be trained;
- a word vector training module, used for performing model training based on the first text sample set to be trained to obtain Wubi word vector models and pinyin word vector models corresponding to different input methods;
- the second training sample set acquisition module is used to obtain the second to-be-trained text sample set and the pre-trained language model
- an encoding data extracting module for extracting encoding data corresponding to the second text sample set to be trained based on the language model, the Wubi word vector model and the pinyin word vector model;
- the model training module is used to perform model training according to the encoded data to obtain a text processing model.
- a method for acquiring text data comprising:
- the text processing model is obtained by training with the word vector encoding data and the language encoding data corresponding to different input methods as input data; the word vector encoding data is obtained based on the pre-trained word vector model, and the language encoding data is obtained based on the pre-trained language model.
- a text data acquisition device includes:
- an acquisition module for acquiring the text data to be processed
- the processing module is used to input the text data to be processed into the pre-trained text processing model, so as to perform data processing on the text data to be processed according to the model parameters in the text processing model to obtain the target text data; the text processing model is obtained by training with the word vector encoding data and language encoding data corresponding to different input methods as input data, the word vector encoding data is obtained based on the pre-trained word vector model, and the language encoding data is obtained based on the pre-trained language model.
- a computer device comprising a memory and one or more processors, the memory having computer-readable instructions stored therein, which, when executed by the one or more processors, cause the one or more processors to execute the following steps:
- one or more computer-readable storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps:
- the above-mentioned method for acquiring text data acquires text data to be processed; inputs the text data to be processed into a pre-trained text processing model; and performs data processing on the text data to be processed according to model parameters in the text processing model to obtain target text data. The text processing model is obtained by training with the word vector encoding data and language encoding data corresponding to different input methods as input data, where the word vector encoding data is obtained based on the pre-trained word vector model and the language encoding data is obtained based on the pre-trained language model.
- FIG. 1 is an application environment diagram of a text processing model training method according to one or more embodiments.
- FIG. 2 is a schematic flowchart of a text processing model training method according to one or more embodiments.
- FIG. 3 is a structural diagram of a text processing model provided in accordance with one or more embodiments.
- FIG. 4 is a structural block diagram of an apparatus for training a text processing model according to one or more embodiments.
- FIG. 5 is a schematic flowchart of a method for acquiring text data according to one or more embodiments.
- FIG. 6 is a structural block diagram of an apparatus for acquiring text data according to one or more embodiments.
- FIG. 7 is a block diagram of a computer device in accordance with one or more embodiments.
- the text data acquisition method provided in this application can be applied to the application environment shown in FIG. 1 .
- the terminal 102 communicates with the server 104 through the network.
- the server 104 obtains the first text sample set to be trained uploaded by the terminal 102; performs model training based on the first text sample set to be trained to obtain Wubi word vector models and pinyin word vector models corresponding to different input methods; obtains the second text sample set to be trained and a pre-trained language model; extracts, based on the language model, the Wubi word vector model and the pinyin word vector model respectively, the encoded data corresponding to the second text sample set to be trained; and performs model training according to the encoded data to obtain a text processing model.
- the terminal 102 can be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers and portable wearable devices, and the server 104 can be implemented by an independent server or a server cluster composed of multiple servers.
- a text processing model training method is provided, and the method is applied to the server in FIG. 1 as an example for description, including the following steps:
- Step 202: Obtain a first text sample set to be trained.
- the first training text sample set includes multiple text data, and specifically may include multiple text sentences.
- the text data in the first training text sample set may include text data requiring error correction processing, that is, the first training text sample set may include erroneous text information.
- the sources of the first training text sample set include Chinese Wikipedia, historical telemarketing records, news crawled on the Internet, Baidu Q&A and other data, which are not limited here.
- Step 204: Respectively perform model training based on the first text sample set to be trained to obtain Wubi word vector models and pinyin word vector models corresponding to different input methods.
- the input methods specifically include the Pinyin input method and the Wubi input method, which use different coding algorithms to represent text.
- the pinyin coding algorithm and the Wubi coding algorithm produce different coding content for the same character: for example, the character "字" ("word") has the pinyin code "zi" while its Wubi code is "PBF". Therefore, based on the different encoding methods, word vector models corresponding to the different input methods can be trained separately.
- the word vector models corresponding to different input methods include the pinyin word vector model and the Wubi word vector model; the pinyin word vector model is obtained by training on the pinyin coding data, and the Wubi word vector model is obtained by training on the Wubi coding data. Since the word vector models corresponding to different input methods are trained on different encoded data, they represent text data in different dimensions, and representing text data through different dimensions makes the characterization of the text data more accurate and reliable.
- the word vector model includes a pinyin word vector model and a Wubi word vector model. And for the same text, the pinyin word vector model can be used to obtain the corresponding pinyin encoded data, and the Wubi word vector model can be used to obtain the corresponding Wubi encoded data.
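The two encodings above can be illustrated with a minimal sketch (not from the patent): the mapping tables below are tiny hypothetical stand-ins for a real input-method dictionary, and `encode` is an assumed helper name; only the "字" → "zi" / "PBF" pair comes from the text.

```python
# Hypothetical stand-in tables for a real pinyin/Wubi dictionary; only the
# entry for "字" ("zi" / "pbf") is taken from the description above.
PINYIN = {"中": "zhong", "国": "guo", "字": "zi"}
WUBI = {"中": "khk", "国": "lgyi", "字": "pbf"}

def encode(text, table, unk="<unk>"):
    """Encode each character of `text` under one input-method table."""
    return [table.get(ch, unk) for ch in text]

print(encode("中国字", PINYIN))  # ['zhong', 'guo', 'zi']
print(encode("中国字", WUBI))    # ['khk', 'lgyi', 'pbf']
```

Because the two tables disagree character by character, the two word vector models trained on them see genuinely different views of the same sentence.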
- Step 206: Obtain a second text sample set to be trained and a pre-trained language model.
- the server obtains the second text sample set to be trained, where the second text sample set to be trained and the first text sample set to be trained may be the same or different sample sets, which are not limited herein.
- the language model is a model with language prediction ability, and specifically, it may be a Bert (Bidirectional Encoder Representation from Transformers) language model.
- Bert Bidirectional Encoder Representation from Transformers
- MLM mask language model
- NSP next sentence prediction
- the MLM task is to predict the text content at the corresponding position
- the NSP task is to judge whether the two sentences before and after are continuous in the original text.
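As a hedged illustration of the mask mechanism discussed above, the sketch below applies the standard Bert MLM corruption scheme (mask 15% of positions; of those, 80% become [MASK], 10% a random word, 10% left unchanged). The vocabulary and the function name are hypothetical.

```python
import random

# Standard Bert MLM corruption scheme (a sketch; VOCAB is a hypothetical
# stand-in for the real vocabulary). Only ~10% of the 15% selected positions
# receive a random word, i.e. roughly 1.5% of all tokens become noise.
VOCAB = ["中", "国", "熊", "猫", "字"]

def mlm_corrupt(tokens, rng, mask_rate=0.15):
    out, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            labels[i] = tok          # remember the original for the MLM loss
            r = rng.random()
            if r < 0.8:
                out[i] = "[MASK]"    # 80%: replace with the mask token
            elif r < 0.9:
                out[i] = rng.choice(VOCAB)  # 10%: random-word noise
            # else 10%: keep the token unchanged
    return out, labels

corrupted, labels = mlm_corrupt(list("中国熊猫"), random.Random(7))
```

The small share of random-word substitutions is the "noise" the description refers to, and explains why Bert alone detects errors only weakly.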
- Step 208: Based on the language model, the Wubi word vector model and the pinyin word vector model, respectively extract the encoded data corresponding to the second text sample set to be trained.
- since the language model, the Wubi word vector model and the pinyin word vector model express the same text data in different dimensions, at least three different representations of the same text data can be obtained from the different models.
- the encoded data obtained by expressing the second training sample set in different dimensions carries richer information, so the text processing model trained on this encoded data achieves higher text processing accuracy.
- Step 210: Perform model training according to the encoded data to obtain a text processing model.
- the text processing model is a model for performing error correction processing on the text data to be processed, and is used for processing the text data to be processed into text data with higher precision.
- model training can be performed based on text data with higher accuracy as training data, thereby improving the accuracy of model training.
- the above-mentioned text processing model training method, apparatus, computer equipment and storage medium obtain a first text sample set to be trained; respectively perform model training based on the first text sample set to be trained to obtain Wubi word vector models and pinyin word vector models corresponding to different input methods; obtain a second text sample set to be trained and a pre-trained language model; extract the encoded data corresponding to the second text sample set to be trained based on the language model, the Wubi word vector model and the pinyin word vector model respectively; and perform model training according to the encoded data to obtain a text processing model.
- based on the training sample set, word vector models corresponding to different input methods are trained first, and model training is then performed again based on the trained word vector models and the language model to obtain the text processing model, ensuring that information from more dimensions can be integrated in the process of training the text processing model.
- the text information obtained by the text processing model therefore has higher accuracy and higher prediction accuracy.
- the trained text processing model can be used to process the input text data to be processed, so that more text information is taken into account during text processing, thereby improving the processing ability for text data and making it possible to obtain high-quality text data.
- performing model training based on the first set of text samples to be trained to obtain Wubi word vector models and pinyin word vector models corresponding to different input methods includes: converting the first set of text samples to be trained into corresponding pinyin encoding vectors, traversing the pinyin encoding vectors in turn according to a pre-configured sliding window, taking the traversed pinyin encoding vector as the current pinyin vector to be processed, predicting the pinyin encoding vector at a preset position within the current pinyin vector based on the current pinyin model parameters, determining target pinyin model parameters according to the predicted and real pinyin encoding vectors, and obtaining the pinyin word vector model from the determined target pinyin model parameters; in the same way, the Wubi encoding vector at the preset position is predicted, target Wubi model parameters are determined according to the predicted Wubi encoding vector and the real Wubi encoding vector, and the Wubi word vector model is obtained according to the determined target Wubi model parameters.
- the server obtains the first text sample set to be trained, converts it into corresponding pinyin encoding vectors, and performs word vector model training according to the pinyin encoding vectors to obtain the pinyin word vector model; it also converts the first text sample set to be trained into corresponding Wubi encoding vectors and trains a word vector model according to the Wubi encoding vectors to obtain the Wubi word vector model.
- specifically, the server converts each text in the first training text sample set into corresponding pinyin data to obtain pinyin encoding vectors, and uses the obtained pinyin encoding vectors as the input data for training the pinyin word vector model, thereby obtaining the trained pinyin word vector model.
- likewise, the server converts each text in the first training text sample set into corresponding Wubi data to obtain Wubi encoding vectors, and uses the obtained Wubi encoding vectors as the input data for training the Wubi word vector model, thereby obtaining the trained Wubi word vector model.
- the training method of the word vector model may be based on a Bert language model, or based on a word vector method such as word2vec, which is not limited here.
- training the word vector model with a word2vec-style word vector method includes: converting the text corresponding to the first training text sample set into Wubi encoding vectors and setting a predefined sliding window, for example a window of size 5. The server then traverses the Wubi encoding vectors corresponding to the text data in turn, with the window size as the unit step, takes the currently traversed Wubi encoding vector as the Wubi encoding vector currently to be processed, and performs the data prediction step on it.
- within the window, the Wubi encoding vectors of the two characters before and the two characters after are used to predict the Wubi encoding vector of the character at the middle position.
- the predicted Wubi encoding vector is compared with the actual Wubi encoding vector, the current Wubi word vector model parameters are adjusted according to the comparison result to obtain the target Wubi word vector model parameters, and the target Wubi word vector model is finally obtained according to those parameters. In the same way, the pinyin word vector model can be obtained.
- in this way, the same text data can be expressed in multiple dimensions, so that the model can obtain multi-dimensional information about the same text data, which is then used for training.
- this word vector training method is cost-effective and efficient, which further improves the efficiency of model training.
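The sliding-window step described above can be sketched as follows: with a window of size 5, the two codes before and the two after each position form the context that a CBOW-style word2vec model would use to predict the middle code. The function name and the example Wubi codes are hypothetical.

```python
def cbow_pairs(codes, window=5):
    """Slide a window over the code sequence; the codes on either side of
    the centre position form the context used to predict the centre code."""
    half = window // 2
    pairs = []
    for i in range(half, len(codes) - half):
        context = codes[i - half:i] + codes[i + 1:i + half + 1]
        pairs.append((context, codes[i]))
    return pairs

# Hypothetical Wubi codes for a five-character sentence.
wubi = ["khk", "lgyi", "rqc", "eeeu", "pbf"]
print(cbow_pairs(wubi))
# [(['khk', 'lgyi', 'eeeu', 'pbf'], 'rqc')]
```

Each `(context, centre)` pair becomes one training example; comparing the prediction made from the context against the real centre code drives the parameter updates described above.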
- a method for acquiring text data is provided, which is described by taking the method applied to the server in FIG. 1 as an example, including the following steps:
- Step 502: Acquire text data to be processed.
- the text data to be processed is text data that needs to be processed for error correction, that is, the text data to be processed may include erroneous text information.
- the to-be-processed text data can be used as a training sample set for model training. Therefore, when the to-be-processed text data includes erroneous text information, performing model training with it greatly affects the accuracy of training; it is therefore necessary to perform data processing on the text data to be processed to remove or correct the erroneous data it contains.
- the sources of the text data to be processed include Chinese Wikipedia, historical telephone sales records, news crawled on the Internet, Baidu Q&A and other data, which are not limited here.
- Step 504: Input the text data to be processed into a pre-trained text processing model, and perform data processing on the text data to be processed according to the model parameters in the text processing model to obtain target text data. The text processing model is obtained by training with the word vector encoded data and the language encoded data corresponding to different input methods as input data; the word vector encoded data is obtained based on the pre-trained word vector model, and the language encoded data is obtained based on the pre-trained language model.
- the text processing model is a model for performing error correction processing on the text data to be processed, and is used for processing the text data to be processed into text data with higher precision.
- model training can be performed based on text data with higher accuracy as training data, thereby improving the accuracy of model training.
- the target text data is the data obtained after performing data error correction on the text data to be processed, that is to say, the target text data has high data accuracy and can be used as a training sample set during model training.
- the input methods specifically include the Pinyin input method and the Wubi input method, which use different coding algorithms to represent text.
- the pinyin coding algorithm and the Wubi coding algorithm produce different coding content for the same character: for example, the character "字" ("word") has the pinyin code "zi" while its Wubi code is "PBF". Therefore, based on the different encoding methods, word vector models corresponding to the different input methods can be trained separately.
- the word vector model includes a pinyin word vector model and a Wubi word vector model.
- the pinyin word vector model can be used to obtain the corresponding pinyin encoded data
- the Wubi word vector model can be used to obtain the corresponding Wubi encoded data.
- the language model is a model with language prediction ability, and specifically, it may be a Bert (Bidirectional Encoder Representation from Transformers) language model.
- Bert Bidirectional Encoder Representation from Transformers
- MLM mask language model
- NSP next sentence prediction
- the MLM task is to predict the text content at the corresponding position
- the NSP task is to judge whether the two sentences before and after are continuous in the original text.
- the above-mentioned method for acquiring text data acquires text data to be processed; inputs the text data to be processed into a pre-trained text processing model; and performs data processing on the text data to be processed according to model parameters in the text processing model to obtain target text data. The text processing model is obtained by training with the word vector encoding data and language encoding data corresponding to different input methods as input data, where the word vector encoding data is obtained based on the pre-trained word vector model and the language encoding data is obtained based on the pre-trained language model.
- Chinese error correction is often used as a low-level module to provide higher-quality texts for upstream tasks. Therefore, one of the purposes of this application is to perform error correction processing on the erroneous data in the text data to be processed, so as to ensure the acquisition of target text data with a high accuracy rate, and to use the target text data as a training sample set for model training.
- the present application creatively introduces word vector models corresponding to different input methods, which brings more reference information to the model, thereby improving the processing capability of the text data to be processed, and making it possible to obtain target data with higher precision.
- extracting the encoded data corresponding to the second text sample set to be trained based on the language model, the Wubi word vector model and the pinyin word vector model respectively includes: extracting Wubi encoded data from the second text sample set to be trained based on the pre-trained Wubi word vector model; extracting pinyin encoded data from the second text sample set to be trained based on the pre-trained pinyin word vector model; and obtaining the pre-trained language model and extracting multi-dimensional language encoded data from the second training sample set based on the language model.
- performing model training according to the encoded data to obtain the text processing model includes: taking the Wubi encoded data, the pinyin encoded data and the multi-dimensional language encoded data as input data, and performing model training according to the input data to obtain the text processing model.
- the text processing model is obtained by jointly training the trained word vector model and the language model. That is to say, in the specific training process, the pinyin word vector model and the Wubi word vector model are first trained based on the training sample set, and the pre-trained language model is obtained, and then based on the trained pinyin word vector model, Wubi word vector model and The language model is trained again to obtain the final text processing model.
- this application includes at least two layers of model training: the first layer is the training of the word vector models based on the input methods, and the second layer trains the model again, based on the input-method word vector models obtained from the first layer together with the language model, to obtain the text processing model.
- the server obtains the second text sample set to be trained, where the second text sample set to be trained and the first text sample set to be trained may be the same or different sample sets, which are not limited herein. Then input the second set of text samples to be trained into the trained word vector model to obtain pinyin coding data and Wubi coding data respectively, and input the second set of text samples to be trained into the trained language model to obtain multi-dimensional language encoded data. Then, the obtained pinyin coded data, Wubi coded data and multi-dimensional language coded data are used as input data to train the model again, and then a text processing model is obtained.
- the sources of the input data include multiple models, specifically the word vector models corresponding to different input methods and the high-precision pre-trained language model, so that in the process of training the text processing model the source data is more accurate and its information is more abundant, which makes the training accuracy of the model higher.
- the multi-dimensional language encoding data includes one or more of word vector encoding data (token embedding), classification encoding data (type embedding) and position encoding data (position embedding).
- the Bert embedding layer in the language model has three inputs, which are recorded as the multi-dimensional language encoding data; the multi-dimensional language encoding data can express text information from various aspects.
- the multi-dimensional language encoding data corresponds to token-embedding, segment-embedding and position-embedding, respectively.
- Token-embedding is used to convert words into a fixed-dimensional vector representation, and each word is represented as a 768-dimensional vector in Bert-base.
- Segment-embedding is used when Bert solves a two-sentence classification task (such as judging whether two texts are semantically similar): the two texts are spliced together and fed directly into the model, and the model distinguishes the two texts through segment-embedding.
- the segment-embedding values of the first sentence are all 0, and those of the second sentence are all 1.
- BERT uses the transformer encoder to learn the representation of the sentence through the self-attention mechanism. The self-attention does not pay attention to the position information of the token, so in order for the transformer to learn the position information of the token, position-embedding is added to the input.
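A minimal sketch of how the three embeddings described above combine: the Bert embedding layer sums them position-wise. The 768 dimension follows the Bert-base description above; the vocabulary size, sequence limit, and random tables are hypothetical stand-ins for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, num_segments, dim = 21128, 512, 2, 768  # assumed sizes
token_table = rng.normal(size=(vocab_size, dim))  # token-embedding
seg_table = rng.normal(size=(num_segments, dim))  # segment-embedding (0 or 1)
pos_table = rng.normal(size=(max_len, dim))       # position-embedding

def bert_embed(token_ids, segment_ids):
    """Sum the three embeddings position-wise, as in the Bert embedding layer."""
    pos_ids = np.arange(len(token_ids))
    return token_table[token_ids] + seg_table[segment_ids] + pos_table[pos_ids]

emb = bert_embed([101, 704, 1744, 102], [0, 0, 0, 0])
print(emb.shape)  # (4, 768)
```

The position table is what lets the self-attention layers, which are otherwise order-blind, recover token order.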
- the Wubi word vector model and the pinyin word vector model (Wubi embedding and pinyin embedding) use word2vec; by using word2vec, the amount of data can be reduced, thereby improving the efficiency of model training.
- taking the Wubi encoded data, the pinyin encoded data and the multi-dimensional language encoded data as input data and performing model training according to the input data to obtain the text processing model includes: splicing the Wubi encoded data, the pinyin encoded data and the multi-dimensional language encoded data to obtain spliced encoded data; performing prediction processing on the spliced encoded data based on the language model to obtain the corresponding prediction probability at each position; determining the initial predicted text at the corresponding position according to the size of the prediction probability; and adjusting the initial model parameters of the initial text processing model based on the difference between the initial predicted text and the real labeled text to obtain target model parameters, and determining the text processing model according to the target model parameters.
- the server performs splicing processing on the acquired Wubi coded data, pinyin coded data and multi-dimensional language data to obtain spliced coded data, and inputs the spliced coded data into the prediction module to obtain the corresponding prediction probability at each position.
- the data with the predicted probability greater than the preset threshold is extracted as the initial predicted data, for example, the data with the predicted probability ranked in the top 5 can be used as the initial predicted data.
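The splicing step described above can be sketched as a feature-wise concatenation of the three encodings. The 768-dim Bert vectors follow the text; the pinyin and Wubi vector widths are assumed purely for illustration.

```python
import numpy as np

n = 4  # characters in the sentence
bert_enc = np.zeros((n, 768))    # multi-dimensional language encoded data
pinyin_enc = np.zeros((n, 128))  # pinyin encoded data (assumed width)
wubi_enc = np.zeros((n, 128))    # Wubi encoded data (assumed width)

# Concatenate per position so the downstream prediction module
# sees all three views of each character at once.
spliced = np.concatenate([bert_enc, pinyin_enc, wubi_enc], axis=-1)
print(spliced.shape)  # (4, 1024)
```

The prediction module then produces a probability distribution at each of the `n` positions from this spliced representation.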
- determining the initial predicted text at the corresponding position according to the size of the predicted probability includes: obtaining the predicted texts whose predicted probability value is greater than a preset value, and extracting the initial predicted text from those predicted texts based on the homophone principle and the pinyin principle; the initial predicted text is stored in a blockchain node.
- FIG. 3 is a structural diagram of a text processing model provided in an embodiment.
- this application does not add token embedding to the last layer of the error correction module for classification output, but directly outputs through the error correction module and uses pinyin features to constrain the output.
- this application makes full use of the characteristics of language model training to detect errors in texts.
- current language models basically predict the current position given the words on the left and the right; some instead predict the words on the left and right sides given a central word. Through this training, the model learns which words a given word is adjacent to and the probability of that adjacency, and the same holds when training on pinyin.
- the correct pinyin of "Chinese panda" is "zhong guo xiong mao"
- the wrong pinyin is "zhong guo xun mao"
- in a well-trained model, the probability that "mao" is preceded by "xiong" is higher than the probability that it is preceded by "xun"
- likewise, the probability that "guo" is followed by "xiong" is higher than the probability that it is followed by "xun"
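The adjacency probabilities the model is expected to learn can be illustrated with a toy bigram count; the corpus below is fabricated for illustration and is not from the patent.

```python
from collections import Counter

# Toy pinyin corpus: the correct sequence dominates, the typo is rare.
corpus = [["zhong", "guo", "xiong", "mao"]] * 9 + [["zhong", "guo", "xun", "mao"]]
bigrams = Counter((a, b) for sent in corpus for a, b in zip(sent, sent[1:]))
prevs = Counter(a for sent in corpus for a in sent[:-1])

def p_next(prev, nxt):
    """Probability that `nxt` follows `prev` under the bigram counts."""
    return bigrams[(prev, nxt)] / prevs[prev]

# "guo" is far more likely to be followed by "xiong" than by "xun".
print(p_next("guo", "xiong"), p_next("guo", "xun"))  # -> 0.9 0.1
```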
- freezing the pinyin word vector model during model training also prevents the correct pinyin word vectors from being degraded by lower-quality data.
- for the Bert model used in the error correction part, a softmax output is produced for each word; if the output differs from the input, the word needs to be corrected. For example, for the typo "xun", suppose the five highest-scoring softmax outputs of the Bert model are bear, search, big, good, and xun. Further filtering by pinyin is then desired: according to the output of the preceding pinyin embedding and dense connection, the pinyin at this position is predicted to be "xiong". The Bert results are filtered on this basis, candidates with other pinyin are removed, and finally only "bear" is retained; other positions are handled in the same way.
- the results of pinyin prediction are also added for screening.
- consider "Chinese xun mao", with the typo "xun": assuming the top 5 predicted results are xiong, xun, da, hao and xun, if only homophones of the input were used for filtering, then after filtering by "xun" the highest-probability candidate "bear" (xiong) would be filtered out.
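A minimal sketch of the dynamic pinyin filter described above; the English glosses stand in for the Chinese characters, and the pinyin lookup table is illustrative.

```python
def filter_by_predicted_pinyin(candidates, predicted_pinyin, pinyin_of):
    """Keep only the Bert candidates whose pinyin matches the pinyin
    predicted by the Pinyin Embedding + GRU + Dense branch."""
    return [c for c in candidates if pinyin_of.get(c) == predicted_pinyin]

# Top-5 Bert outputs for the typo position, glossed in English.
top5 = ["bear", "search", "big", "good", "tame"]
pinyin_of = {"bear": "xiong", "search": "xun", "big": "da", "good": "hao", "tame": "xun"}

# The pinyin branch predicts "xiong", so only "bear" survives. A fixed
# homophone filter on the typo's own pinyin "xun" would instead have kept
# "search" and "tame" and discarded the correct "bear".
print(filter_by_predicted_pinyin(top5, "xiong", pinyin_of))  # -> ['bear']
```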
- the above-mentioned initial predicted text can also be stored in a node of a blockchain.
- dynamic screening can be achieved by using the Pinyin Embedding with a bidirectional GRU + Dense layer to filter the candidate word list, rather than relying on fixed homophone filtering alone.
- the results of the pinyin model, rather than the originally input pinyin, are used to filter the Bert error correction results, which improves the accuracy of error correction and yields text data of higher accuracy.
- the model parameters of the text processing model include pinyin model parameters and Wubi model parameters;
- adjusting the initial model parameters of the initial text processing model based on the difference between the initial predicted text and the real label text to obtain target model parameters, and determining the text processing model according to the target model parameters, includes: adjusting the initial Wubi model parameters of the initial text processing model based on the difference between the initial predicted text and the real label text to obtain target Wubi model parameters; and determining the text processing model according to the pinyin model parameters and the target Wubi model parameters.
- the Pinyin embedding is fixed (immutable), while the Wubi embedding is variable.
- variable means that the parameters can change: the Wubi embedding participates in the backpropagation parameter updates during training, whereas the fixed Pinyin embedding is not updated during training.
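The fixed/variable distinction can be sketched as an update step that simply skips frozen parameters. This is a toy SGD step with made-up shapes and gradients, not the patent's actual optimizer.

```python
import numpy as np

rng = np.random.default_rng(0)
pinyin_emb = rng.normal(size=(4, 8))  # frozen: excluded from backprop updates
wubi_emb = rng.normal(size=(4, 8))    # trainable: updated during training

def sgd_step(params, lr=0.1):
    """Apply a gradient step only to parameters marked trainable."""
    for param, grad, trainable in params:
        if trainable:
            param -= lr * grad  # in-place update

grad = np.ones((4, 8))
pinyin_before = pinyin_emb.copy()
wubi_before = wubi_emb.copy()
sgd_step([(pinyin_emb, grad, False), (wubi_emb, grad, True)])

print(np.allclose(pinyin_emb, pinyin_before))  # True: pinyin embedding unchanged
print(np.allclose(wubi_emb, wubi_before))      # False: Wubi embedding updated
```

In a deep-learning framework the same effect is typically obtained by marking the pinyin embedding as non-trainable so the optimizer never touches it.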
- the word vector model is obtained by training based on word2vector, and the language model is obtained by training based on Bert model.
- although the Bert language model is very strong, the cost of building a pinyin Bert is very high, and because the quality of the pre-training text cannot be guaranteed, even a Bert built from pinyin could only provide information enhancement and would not be well suited to pinyin error detection. The model therefore concentrates on the quality of the training data for the Pinyin Embedding and adopts the lighter word2vector model instead of Bert. Moreover, because the word2vector pre-training process is closely related to the downstream error detection task, its error detection ability is not expected to be much worse than Bert's.
- the Wubi vectors, like the pinyin vectors, are obtained by the Word2Vector method.
- the training method of word2vector includes: converting all characters into Wubi codes and setting the sliding window to 5, that is, each time using the codes of the two characters before and the two characters after to predict the code of the middle character.
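A sketch of the (context, target) training pairs implied by a sliding window of 5; the Wubi codes below are placeholders, not real encodings.

```python
def cbow_pairs(codes, window=5):
    """For window 5, the codes of the two characters before and the two
    characters after predict the code of the middle character (CBOW-style)."""
    half = window // 2
    pairs = []
    for i in range(half, len(codes) - half):
        context = codes[i - half:i] + codes[i + 1:i + half + 1]
        pairs.append((context, codes[i]))
    return pairs

codes = ["c1", "c2", "c3", "c4", "c5", "c6"]  # placeholder Wubi codes
for context, target in cbow_pairs(codes):
    print(context, "->", target)
# ['c1', 'c2', 'c4', 'c5'] -> c3
# ['c2', 'c3', 'c5', 'c6'] -> c4
```

These pairs would then be fed to a word2vector-style model to learn a vector per Wubi code.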
- Wubi Embedding and Pinyin Embedding of high-quality text are introduced into the error detection module for information enhancement, which can significantly improve the capability of the original Soft-mask error detection network.
- the Top-5 homophone filtering in the error correction module can effectively control the output text, and the correct pinyin predicted by the Pinyin Embedding + bidirectional GRU + Dense layers can be used to dynamically filter the results, which also reduces the probability that homophone filtering discards the correct word.
- a text processing model training apparatus including:
- the first training sample set obtaining module 402 is configured to obtain a first text sample set to be trained.
- the word vector training module 404 is configured to perform model training based on the first text sample set to be trained to obtain Wubi word vector models and pinyin word vector models corresponding to different input methods.
- the second training sample set obtaining module 406 is configured to obtain a second to-be-trained text sample set and a pre-trained language model.
- the coded data extraction module 408 is configured to extract coded data corresponding to the second text sample set to be trained based on the language model, the Wubi word vector model and the pinyin word vector model, respectively.
- the model training module 410 is configured to perform model training according to the encoded data to obtain a text processing model.
- the coded data extraction module 408 is further configured to: convert the first text sample set to be trained into corresponding pinyin coding vectors; traverse the pinyin coding vectors sequentially according to a preconfigured sliding window, taking each traversed pinyin coding vector as the current pinyin vector to be processed; predict the pinyin coding vector at a preset position in the current pinyin vector to be processed; determine the target pinyin model parameters according to the predicted pinyin coding vector and the actual pinyin coding vector; and obtain the pinyin word vector model according to the determined target pinyin model parameters. Likewise, the first text sample set to be trained is converted into corresponding Wubi coding vectors; the Wubi coding vectors are traversed in turn according to the preconfigured sliding window, each traversed Wubi coding vector serving as the current Wubi vector to be processed; based on the current word vector model corresponding to the current Wubi model parameters, the Wubi coding vector at the preset position is predicted in the current Wubi vector to be processed; the target Wubi model parameters are determined according to the predicted Wubi coding vector and the real Wubi coding vector; and the Wubi word vector model is obtained according to the determined target Wubi model parameters.
- the encoded data extraction module 408 is further configured to: extract Wubi encoded data from the second text sample set to be trained based on the pre-trained Wubi word vector model; extract pinyin encoded data from the second text sample set to be trained based on the pre-trained pinyin word vector model; and obtain the pre-trained language model and extract multi-dimensional language encoded data from the second training sample set based on the language model. The model training module 410 is further configured to use the Wubi encoded data, pinyin encoded data and multi-dimensional language encoded data as input data for model training to obtain the text processing model.
- the model training module 410 is further configured to: splice the Wubi coded data, pinyin coded data and multi-dimensional coded data to obtain spliced coded data; perform prediction processing on the spliced coded data based on the language model to obtain the prediction probability at each position; determine the initial predicted text at each position according to the magnitude of the predicted probability; adjust the initial model parameters of the initial text processing model based on the difference between the initial predicted text and the real label text to obtain target model parameters; and determine the text processing model according to the target model parameters.
- the model training module 410 is further configured to: obtain the predicted text whose predicted probability value is greater than a preset value; extract the initial predicted text from the predicted text based on the homophone principle and the pinyin principle; and store the initial predicted text in a blockchain node.
- the model training module 410 is further configured to adjust the initial Wubi parameters of the initial text processing model based on the difference between the initial predicted text and the real label text to obtain target Wubi model parameters, and to determine the text processing model according to the pinyin model parameters and the target Wubi model parameters.
- a text data acquisition device including:
- the obtaining module 602 is used for obtaining the text data to be processed.
- the processing module 604 is used to input the text data to be processed into a pre-trained text processing model, so as to process the text data according to the model parameters in the text processing model and obtain target text data. The text processing model is trained using, as input data, word vector encoding data corresponding to different input methods together with language encoding data, where the word vector encoding data is obtained based on pre-trained word vector models and the language encoding data is obtained based on a pre-trained language model.
- Each module in the above-mentioned text data acquisition device and text processing model training device can be implemented in whole or in part by software, hardware and combinations thereof.
- the above modules can be embedded in or independent of the processor in the computer device in the form of hardware, or stored in the memory in the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.
- In one of the embodiments, a computer device is provided; the computer device may be a server, and its internal structure diagram may be as shown in FIG. 7.
- the computer device includes a processor, memory, and a network interface connected by a system bus. Among them, the processor of the computer device is used to provide computing and control capabilities.
- the memory of the computer device includes a non-volatile storage medium and an internal memory.
- the non-volatile storage medium stores an operating system, computer readable instructions and a database.
- the internal memory provides an environment for the execution of the operating system and computer-readable instructions in the non-volatile storage medium.
- the computer device's database is used to store textual data.
- the network interface of the computer device is used to communicate with an external terminal through a network connection.
- the computer-readable instructions when executed by the processor, implement a text data acquisition method and a text processing model training method.
- FIG. 7 is only a block diagram of a partial structure related to the solution of the present application and does not constitute a limitation on the computer equipment to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
- a computer device comprising a memory and one or more processors, the memory storing computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to execute the steps of the method in any one of the foregoing embodiments.
- One or more computer-readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the method in any one of the above embodiments.
- the computer-readable storage medium may be non-volatile or volatile.
- the blockchain referred to in the present invention is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
- Blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods. Each data block contains a batch of network transaction information, which is used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
- the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
- Non-volatile memory may include read-only memory (Read-Only Memory, ROM), magnetic tape, floppy disk, flash memory, or optical memory, and the like.
- Volatile memory may include random access memory (RAM) or external cache memory.
- the RAM may be in various forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM).
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Artificial Intelligence (AREA)
- Human Computer Interaction (AREA)
- Machine Translation (AREA)
- Document Processing Apparatus (AREA)
Abstract
The invention relates to a method for training a text processing model, in the technical field of artificial intelligence, comprising the steps of: acquiring a first set of text samples to be trained (202); performing model training on the basis of said first set of text samples to obtain a Wubi word vector model and a pinyin word vector model corresponding to different input methods (204); acquiring a second set of text samples to be trained and a pre-trained language model (206); on the basis of the language model, the Wubi word vector model and the pinyin word vector model, respectively extracting coded data corresponding to said second set of text samples (208); and performing model training according to the coded data to obtain a text processing model (210).
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011447964.2A CN112528637B (zh) | 2020-12-11 | 2020-12-11 | 文本处理模型训练方法、装置、计算机设备和存储介质 |
CN202011447964.2 | 2020-12-11 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022121251A1 true WO2022121251A1 (fr) | 2022-06-16 |
Family
ID=74998573
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2021/096582 WO2022121251A1 (fr) | 2020-12-11 | 2021-05-28 | Procédé et appareil d'entraînement de modèle de traitement de texte, dispositif informatique et support de stockage |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN112528637B (fr) |
WO (1) | WO2022121251A1 (fr) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116667326A (zh) * | 2023-05-30 | 2023-08-29 | 淮阴工学院 | 一种电动汽车充电负荷预测方法 |
CN117609781A (zh) * | 2023-11-20 | 2024-02-27 | 北京中关村科金技术有限公司 | 文本评估模型的训练方法、文本评估方法及装置 |
CN117831573A (zh) * | 2024-03-06 | 2024-04-05 | 青岛理工大学 | 基于多模态的语言障碍人群言语录音分析方法及系统 |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112528637B (zh) * | 2020-12-11 | 2024-03-29 | 平安科技(深圳)有限公司 | 文本处理模型训练方法、装置、计算机设备和存储介质 |
CN113434699B (zh) * | 2021-06-30 | 2023-07-18 | 平安科技(深圳)有限公司 | 用于文本匹配的bert模型的预训练方法、计算机装置和存储介质 |
CN113609157B (zh) * | 2021-08-09 | 2023-06-30 | 平安科技(深圳)有限公司 | 语言转换模型训练、语言转换方法、装置、设备及介质 |
CN114139524B (zh) * | 2021-11-29 | 2022-09-13 | 浙江大学 | 故事文本的预测方法、装置以及电子设备 |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140214401A1 (en) * | 2013-01-29 | 2014-07-31 | Tencent Technology (Shenzhen) Company Limited | Method and device for error correction model training and text error correction |
US20180349327A1 (en) * | 2017-06-05 | 2018-12-06 | Baidu Online Network Technology (Beijing)Co., Ltd. | Text error correction method and apparatus based on recurrent neural network of artificial intelligence |
CN110750959A (zh) * | 2019-10-28 | 2020-02-04 | 腾讯科技(深圳)有限公司 | 文本信息处理的方法、模型训练的方法以及相关装置 |
CN111310443A (zh) * | 2020-02-12 | 2020-06-19 | 新华智云科技有限公司 | 一种文本纠错方法和系统 |
CN111476036A (zh) * | 2020-04-10 | 2020-07-31 | 电子科技大学 | 一种基于中文单词特征子串的词嵌入学习方法 |
CN111523306A (zh) * | 2019-01-17 | 2020-08-11 | 阿里巴巴集团控股有限公司 | 文本的纠错方法、装置和系统 |
CN111597815A (zh) * | 2020-05-22 | 2020-08-28 | 北京慧闻科技(集团)有限公司 | 一种多嵌入命名实体识别方法、装置、设备及存储介质 |
CN112528637A (zh) * | 2020-12-11 | 2021-03-19 | 平安科技(深圳)有限公司 | 文本处理模型训练方法、装置、计算机设备和存储介质 |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107170453B (zh) * | 2017-05-18 | 2020-11-03 | 百度在线网络技术(北京)有限公司 | 基于人工智能的跨语种语音转录方法、设备及可读介质 |
CN110472251B (zh) * | 2018-05-10 | 2023-05-30 | 腾讯科技(深圳)有限公司 | 翻译模型训练的方法、语句翻译的方法、设备及存储介质 |
CN110110041B (zh) * | 2019-03-15 | 2022-02-15 | 平安科技(深圳)有限公司 | 错词纠正方法、装置、计算机装置及存储介质 |
CN110288980A (zh) * | 2019-06-17 | 2019-09-27 | 平安科技(深圳)有限公司 | 语音识别方法、模型的训练方法、装置、设备及存储介质 |
CN110795935A (zh) * | 2020-01-06 | 2020-02-14 | 广东博智林机器人有限公司 | 文字词向量模型的训练方法、装置、终端及存储介质 |
CN111488466B (zh) * | 2020-04-16 | 2023-06-06 | 清华大学 | 中文带标记错误语料生成方法、计算装置和存储介质 |
- 2020-12-11: CN application CN202011447964.2A patent/CN112528637B/zh active Active
- 2021-05-28: WO application PCT/CN2021/096582 patent/WO2022121251A1/fr active Application Filing
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140214401A1 (en) * | 2013-01-29 | 2014-07-31 | Tencent Technology (Shenzhen) Company Limited | Method and device for error correction model training and text error correction |
US20180349327A1 (en) * | 2017-06-05 | 2018-12-06 | Baidu Online Network Technology (Beijing)Co., Ltd. | Text error correction method and apparatus based on recurrent neural network of artificial intelligence |
CN111523306A (zh) * | 2019-01-17 | 2020-08-11 | 阿里巴巴集团控股有限公司 | 文本的纠错方法、装置和系统 |
CN110750959A (zh) * | 2019-10-28 | 2020-02-04 | 腾讯科技(深圳)有限公司 | 文本信息处理的方法、模型训练的方法以及相关装置 |
CN111310443A (zh) * | 2020-02-12 | 2020-06-19 | 新华智云科技有限公司 | 一种文本纠错方法和系统 |
CN111476036A (zh) * | 2020-04-10 | 2020-07-31 | 电子科技大学 | 一种基于中文单词特征子串的词嵌入学习方法 |
CN111597815A (zh) * | 2020-05-22 | 2020-08-28 | 北京慧闻科技(集团)有限公司 | 一种多嵌入命名实体识别方法、装置、设备及存储介质 |
CN112528637A (zh) * | 2020-12-11 | 2021-03-19 | 平安科技(深圳)有限公司 | 文本处理模型训练方法、装置、计算机设备和存储介质 |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116667326A (zh) * | 2023-05-30 | 2023-08-29 | 淮阴工学院 | 一种电动汽车充电负荷预测方法 |
CN116667326B (zh) * | 2023-05-30 | 2024-02-23 | 淮阴工学院 | 一种电动汽车充电负荷预测方法 |
CN117609781A (zh) * | 2023-11-20 | 2024-02-27 | 北京中关村科金技术有限公司 | 文本评估模型的训练方法、文本评估方法及装置 |
CN117609781B (zh) * | 2023-11-20 | 2024-05-28 | 北京中关村科金技术有限公司 | 文本评估模型的训练方法、文本评估方法及装置 |
CN117831573A (zh) * | 2024-03-06 | 2024-04-05 | 青岛理工大学 | 基于多模态的语言障碍人群言语录音分析方法及系统 |
CN117831573B (zh) * | 2024-03-06 | 2024-05-14 | 青岛理工大学 | 基于多模态的语言障碍人群言语录音分析方法及系统 |
Also Published As
Publication number | Publication date |
---|---|
CN112528637B (zh) | 2024-03-29 |
CN112528637A (zh) | 2021-03-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2022121251A1 (fr) | Procédé et appareil d'entraînement de modèle de traitement de texte, dispositif informatique et support de stockage | |
US20210141799A1 (en) | Dialogue system, a method of obtaining a response from a dialogue system, and a method of training a dialogue system | |
JP5901001B1 (ja) | 音響言語モデルトレーニングのための方法およびデバイス | |
CN111310441A (zh) | 基于bert的语音识别后文本修正方法、装置、终端及介质 | |
CN114580382A (zh) | 文本纠错方法以及装置 | |
US11636272B2 (en) | Hybrid natural language understanding | |
CN114218932B (zh) | 基于故障因果图谱的航空故障文本摘要生成方法及其装置 | |
WO2021143206A1 (fr) | Procédé et appareil de traitement en langage naturel à énoncé individuel, dispositif informatique et support de stockage lisible par ordinateur | |
WO2017052817A1 (fr) | Adaptation dynamique de modèles de langue et suivi sémantique pour reconnaissance vocale automatique | |
CN113053367B (zh) | 语音识别方法、语音识别的模型训练方法以及装置 | |
US20230104228A1 (en) | Joint Unsupervised and Supervised Training for Multilingual ASR | |
CN116956835B (zh) | 一种基于预训练语言模型的文书生成方法 | |
US20230096805A1 (en) | Contrastive Siamese Network for Semi-supervised Speech Recognition | |
US20230237993A1 (en) | Systems and Methods for Training Dual-Mode Machine-Learned Speech Recognition Models | |
WO2022086939A1 (fr) | Modèles de langage dynamique d'évolution en continu de contenu | |
US20230410794A1 (en) | Audio recognition method, method of training audio recognition model, and electronic device | |
CN113160820A (zh) | 语音识别的方法、语音识别模型的训练方法、装置及设备 | |
TWI818427B (zh) | 使用基於文本的說話者變更檢測的說話者劃分糾正方法及系統 | |
CN115858776A (zh) | 一种变体文本分类识别方法、系统、存储介质和电子设备 | |
CN115525749A (zh) | 语音问答方法、装置、电子设备和存储介质 | |
US11687723B2 (en) | Natural language processing with missing tokens in a corpus | |
CN111090720B (zh) | 一种热词的添加方法和装置 | |
US20230116268A1 (en) | System and a method for phonetic-based transliteration | |
US20230252225A1 (en) | Automatic Text Summarisation Post-processing for Removal of Erroneous Sentences | |
CN109241539B (zh) | 机器学习人工智能翻译数据库的更新方法 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 21901964 Country of ref document: EP Kind code of ref document: A1 |
NENP | Non-entry into the national phase |
Ref country code: DE |
122 | Ep: pct application non-entry in european phase |
Ref document number: 21901964 Country of ref document: EP Kind code of ref document: A1 |