WO2022121251A1 - Text processing model training method and apparatus, computer device, and storage medium

Text processing model training method and apparatus, computer device, and storage medium

Info

Publication number
WO2022121251A1
Authority
WO
WIPO (PCT)
Prior art keywords
model
text
wubi
data
pinyin
Prior art date
Application number
PCT/CN2021/096582
Other languages
English (en)
Chinese (zh)
Inventor
吴天博
王健宗
程宁
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2022121251A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/274 Converting codes to words; Guess-ahead of partial word inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/02 Input arrangements using manually operated switches, e.g. using keyboards or dials
    • G06F 3/023 Arrangements for converting discrete items of information into a coded form, e.g. arrangements for interpreting keyboard generated codes as alphanumeric codes, operand codes or instruction codes
    • G06F 3/0233 Character input methods
    • G06F 3/0237 Character input methods using prediction or retrieval techniques

Definitions

  • the present application relates to a text processing model training method, apparatus, computer equipment and storage medium.
  • Chinese error correction is a basic task in natural language processing, which often affects the accuracy of upstream tasks.
  • collected Chinese text often contains various errors, and because Chinese is rich and subtle, changing even a few characters can change the semantics dramatically; Chinese error correction is therefore often used as an underlying module that provides higher-quality text for upstream tasks.
  • in the conventional technology, Bert is the current mainstream pre-trained language model. Its MLM pre-training task introduces only a small amount of noise through its mask mechanism (on the order of 15% × 10% of tokens, i.e. the randomly replaced share of the masked positions), so Bert has a certain error detection ability; however, because so little noise is introduced, Bert is often weak at text error detection, making it difficult to obtain high-quality text data.
  • a text processing model training method is provided.
  • a text processing model training method comprising:
  • a text processing model training apparatus, comprising:
  • a first training sample set obtaining module used for obtaining the first text sample set to be trained
  • a word vector training module for performing model training based on the first text sample set to be trained to obtain Wubi word vector models and pinyin word vector models corresponding to different input methods
  • the second training sample set acquisition module is used to obtain the second to-be-trained text sample set and the pre-trained language model
  • an encoding data extracting module for extracting encoding data corresponding to the second text sample set to be trained based on the language model, the Wubi word vector model and the pinyin word vector model;
  • the model training module is used to perform model training according to the encoded data to obtain a text processing model.
  • a method for acquiring text data comprising:
  • acquiring text data to be processed; and inputting the text data to be processed into a pre-trained text processing model to obtain target text data, where the text processing model is obtained by training with the word vector encoding data and language encoding data corresponding to different input methods as input data, the word vector encoding data being obtained based on the pre-trained word vector model and the language encoding data being obtained based on the pre-trained language model.
  • a text data acquisition device includes:
  • an acquisition module for acquiring the text data to be processed
  • the processing module is used to input the text data to be processed into the pre-trained text processing model, so as to perform data processing on the text data to be processed according to the model parameters in the text processing model to obtain the target text data; the text processing model is obtained by training with the word vector encoding data and language encoding data corresponding to different input methods as input data, the word vector encoding data being obtained based on the pre-trained word vector model and the language encoding data being obtained based on the pre-trained language model.
  • a computer device comprising a memory and one or more processors, the memory having computer-readable instructions stored therein, the computer-readable instructions, when executed by the one or more processors, causing the one or more processors to perform the following steps:
  • one or more computer-readable storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps:
  • the above-mentioned method for acquiring text data is to acquire text data to be processed; input the text data to be processed into a pre-trained text processing model, and perform data processing on the text data to be processed according to model parameters in the text processing model to obtain target text data; the text processing model is obtained by training based on the word vector coding data and language coding data corresponding to different input methods as input data, the word vector coding data being obtained based on the pre-trained word vector model and the language coding data being obtained based on the pre-trained language model.
  • FIG. 1 is an application environment diagram of a text processing model training method according to one or more embodiments.
  • FIG. 2 is a schematic flowchart of a text processing model training method according to one or more embodiments.
  • FIG. 3 is a structural diagram of a text processing model provided in accordance with one or more embodiments.
  • FIG. 4 is a structural block diagram of an apparatus for training a text processing model according to one or more embodiments.
  • FIG. 5 is a schematic flowchart of a method for acquiring text data according to one or more embodiments.
  • FIG. 6 is a structural block diagram of an apparatus for acquiring text data according to one or more embodiments.
  • FIG. 7 is a block diagram of a computer device in accordance with one or more embodiments.
  • the text data acquisition method provided in this application can be applied to the application environment shown in FIG. 1 .
  • the terminal 102 communicates with the server 104 through the network.
  • the server 104 obtains the first text sample set to be trained uploaded by the terminal 102; performs model training based on the first text sample set to be trained to obtain Wubi word vector models and pinyin word vector models corresponding to different input methods; obtains a second text sample set to be trained and a pre-trained language model; based on the language model, the Wubi word vector model and the pinyin word vector model, respectively extracts the encoded data corresponding to the second text sample set to be trained; and performs model training according to the encoded data to obtain a text processing model.
  • the terminal 102 can be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers and portable wearable devices, and the server 104 can be implemented by an independent server or a server cluster composed of multiple servers.
  • a text processing model training method is provided, and the method is applied to the server in FIG. 1 as an example for description, including the following steps:
  • Step 202 Obtain a first text sample set to be trained.
  • the first training text sample set includes multiple text data, and specifically may include multiple text sentences.
  • the text data in the first training text sample set may include text data requiring error correction processing, that is, the first training text sample set may include erroneous text information.
  • the sources of the first training text sample set include Chinese Wikipedia, historical telemarketing records, news crawled on the Internet, Baidu Q&A and other data, which are not limited here.
  • Step 204 respectively performing model training based on the first text sample set to be trained to obtain Wubi word vector models and pinyin word vector models corresponding to different input methods.
  • the input methods specifically include the Pinyin input method and the Wubi input method, which identify the same text using different coding algorithms.
  • the pinyin coding algorithm and the Wubi coding algorithm can produce different codes for the same character; for example, the character "字" ("word") has the pinyin code "zi" and the Wubi code "PBF". Therefore, based on the different encoding methods, word vector models corresponding to the different input methods can be trained separately.
  • the word vector models corresponding to different input methods include the pinyin word vector model and the Wubi word vector model; the pinyin word vector model is obtained by training based on pinyin coding data, and the Wubi word vector model is obtained by training based on Wubi coding data. Since the word vector models corresponding to different input methods are trained on different encoded data, they represent the same text data in different dimensions, and representing text data through different dimensions makes the characterization of the text data more accurate and reliable.
  • the word vector model includes a pinyin word vector model and a Wubi word vector model. And for the same text, the pinyin word vector model can be used to obtain the corresponding pinyin encoded data, and the Wubi word vector model can be used to obtain the corresponding Wubi encoded data.
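  • As a purely illustrative sketch (the lookup tables and helper names below are hypothetical and not part of this application), the same sentence can thus be encoded along two independent dimensions, one per input method:

```python
# Toy code tables for illustration only; a real system would use complete
# pinyin and Wubi code tables covering the whole character set.
PINYIN_TABLE = {"中": "zhong", "国": "guo", "熊": "xiong", "猫": "mao"}
WUBI_TABLE = {"中": "khk", "国": "lgyi", "熊": "cexo", "猫": "qtal"}

def to_pinyin_codes(text):
    """Encode each character by its pinyin syllable."""
    return [PINYIN_TABLE.get(ch, "<unk>") for ch in text]

def to_wubi_codes(text):
    """Encode each character by its Wubi key sequence."""
    return [WUBI_TABLE.get(ch, "<unk>") for ch in text]

sentence = "中国熊猫"
print(to_pinyin_codes(sentence))  # ['zhong', 'guo', 'xiong', 'mao']
print(to_wubi_codes(sentence))    # ['khk', 'lgyi', 'cexo', 'qtal']
```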
  • Step 206 Obtain a second text sample set to be trained and a pre-trained language model.
  • the server obtains the second text sample set to be trained, where the second text sample set to be trained and the first text sample set to be trained may be the same or different sample sets, which are not limited herein.
  • the language model is a model with language prediction ability, and specifically, it may be a Bert (Bidirectional Encoder Representation from Transformers) language model.
  • Bert: Bidirectional Encoder Representation from Transformers.
  • MLM: masked language model.
  • NSP: next sentence prediction.
  • the MLM task is to predict the text content at the corresponding (masked) positions.
  • the NSP task is to judge whether the two sentences before and after are continuous.
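  • As a rough sketch of the MLM idea only (the mask rate and token handling are simplified relative to the real Bert recipe):

```python
import random

def mlm_mask(tokens, mask_token="[MASK]", mask_rate=0.15):
    """Randomly mask a fraction of tokens; the model is trained to recover the originals."""
    masked, labels = [], []
    for tok in tokens:
        if random.random() < mask_rate:
            masked.append(mask_token)
            labels.append(tok)      # this position is scored against the original token
        else:
            masked.append(tok)
            labels.append(None)     # this position is not scored
    return masked, labels

print(mlm_mask(list("中国熊猫")))
```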
  • Step 208 based on the language model, the Wubi word vector model and the pinyin word vector model, respectively extract the encoded data corresponding to the second text sample set to be trained.
  • since the language model, the Wubi word vector model and the pinyin word vector model express the same text data in different dimensions, at least three different representations of the same text data can be obtained from the different models.
  • the encoded data obtained by expressing the second training sample set in these different dimensions therefore carries richer information, so the text processing model obtained when model training is performed based on this encoded data has higher text processing accuracy.
  • Step 210 Perform model training according to the encoded data to obtain a text processing model.
  • the text processing model is a model for performing error correction processing on the text data to be processed, and is used for processing the text data to be processed into text data with higher precision.
  • model training can be performed based on text data with higher accuracy as training data, thereby improving the accuracy of model training.
  • the above-mentioned text processing model training method, apparatus, computer device and storage medium obtain a first text sample set to be trained; respectively perform model training based on the first text sample set to be trained to obtain Wubi word vector models and pinyin word vector models corresponding to different input methods; obtain a second text sample set to be trained and a pre-trained language model; extract the encoded data corresponding to the second text sample set to be trained based on the language model, the Wubi word vector model and the pinyin word vector model respectively; and perform model training according to the encoded data to obtain a text processing model.
  • based on the training sample set, word vector models corresponding to different input methods are first trained, and then model training is performed again based on the trained word vector models and the language model to obtain the text processing model, ensuring that information from more dimensions can be integrated in the process of training the text processing model.
  • as a result, the text information output by the text processing model is more accurate and its predictions are more reliable.
  • the text processing model obtained by training can be used to process the input text data to be processed, so that more text information is taken into account during text processing, thereby improving the text data processing capability and making it possible to obtain high-quality text data.
  • performing model training based on the first text sample set to be trained to obtain the Wubi word vector model and the pinyin word vector model corresponding to different input methods includes: converting the first text sample set to be trained into corresponding pinyin encoding vectors, traversing the pinyin encoding vectors in turn according to a pre-configured sliding window, taking the traversed pinyin encoding vector as the current pinyin vector to be processed, predicting the pinyin encoding vector at a preset position in the current pinyin vector to be processed based on the current word vector model corresponding to the current pinyin model parameters, determining target pinyin model parameters according to the predicted pinyin encoding vector and the real pinyin encoding vector, and obtaining the pinyin word vector model according to the determined target pinyin model parameters.
  • likewise, the first text sample set to be trained is converted into corresponding Wubi encoding vectors, the Wubi encoding vector at a preset position is predicted in the current Wubi vector to be processed, the target Wubi model parameters are determined according to the predicted Wubi encoding vector and the real Wubi encoding vector, and the Wubi word vector model is obtained according to the determined target Wubi model parameters.
  • the server obtains the first text sample set to be trained, converts it into corresponding pinyin encoding vectors and performs word vector model training according to the pinyin encoding vectors to obtain the pinyin word vector model, and converts it into corresponding Wubi encoding vectors and performs word vector model training according to the Wubi encoding vectors to obtain the Wubi word vector model.
  • the server obtains the first training text sample set, converts each text in the first training text sample set into corresponding pinyin data to obtain pinyin encoding vectors, and uses the obtained pinyin encoding vectors as the input data for training the pinyin word vector model, thereby obtaining the trained pinyin word vector model.
  • the server obtains the first training text sample set, converts each text in the first training text sample set into corresponding Wubi data to obtain Wubi encoding vectors, and uses the obtained Wubi encoding vectors as the input data for training the Wubi word vector model, thereby obtaining the trained Wubi word vector model.
  • the training method of the word vector encoding model may be a training method based on a Bert language model, or a training method based on a word vector such as word2vec, which is not limited here.
  • performing the word vector model training based on a word-vector method such as word2vec includes: converting the text corresponding to the first training text sample set into Wubi encoding vectors and setting a predefined sliding window, for example a sliding window of size 5. The server then traverses the Wubi encoding vectors corresponding to the text data in turn, taking the size of the sliding window as the unit step, and uses the currently traversed Wubi encoding vector as the Wubi encoding vector currently to be processed, on which the data prediction step is performed.
  • specifically, the Wubi encoding vectors of the two characters before and after are used to predict the Wubi encoding vector of the character in the middle position. The predicted Wubi encoding vector is compared with the actual Wubi encoding vector, the current Wubi word vector model parameters are adjusted according to the comparison result to obtain the target Wubi word vector model parameters, and the target Wubi word vector model is finally obtained according to these parameters.
  • the pinyin word vector model can be obtained in the same way.
  • the same text data can be expressed in multiple dimensions, so that the model can obtain multi-dimensional information of the same text data, which is then used for training the model.
  • the word-vector training method is cost-effective and efficient, which further improves the efficiency of model training.
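  • A minimal sketch of this step, assuming gensim's word2vec is used for the CBOW training described above (window=2 covers the two codes on each side of the centre code, i.e. a five-position sliding window); the Wubi codes shown are illustrative:

```python
from gensim.models import Word2Vec

# Each "sentence" is the Wubi code sequence of one training sentence (codes are illustrative).
wubi_corpus = [
    ["khk", "lgyi", "cexo", "qtal"],
    ["khk", "lgyi", "rqcy", "qtal"],
]

# CBOW (sg=0) with window=2: the two codes before and after predict the middle code.
wubi_w2v = Word2Vec(sentences=wubi_corpus, vector_size=64, window=2,
                    min_count=1, sg=0, epochs=50)

print(wubi_w2v.wv["cexo"].shape)  # (64,) -- the learned Wubi code vector
```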
  • a method for acquiring text data is provided, which is described by taking the method applied to the server in FIG. 1 as an example, including the following steps:
  • Step 502 acquiring text data to be processed.
  • the text data to be processed is text data that needs to be processed for error correction, that is, the text data to be processed may include erroneous text information.
  • the to-be-processed text data can be used as a training sample set for model training; therefore, when the to-be-processed text data includes erroneous text information, performing model training with this data has a great impact on the accuracy of training, and it is necessary to perform data processing on the text data to be processed in order to remove or correct the erroneous data it contains.
  • the sources of the text data to be processed include Chinese Wikipedia, historical telephone sales records, news crawled on the Internet, Baidu Q&A and other data, which are not limited here.
  • Step 504 Input the text data to be processed into a pre-trained text processing model, and perform data processing on the text data to be processed according to the model parameters in the text processing model to obtain target text data; the text processing model is obtained by training with the word vector encoded data and language encoded data corresponding to different input methods as input data, the word vector encoded data being obtained based on the pre-trained word vector model and the language encoded data being obtained based on the pre-trained language model.
  • the text processing model is a model for performing error correction processing on the text data to be processed, and is used for processing the text data to be processed into text data with higher precision.
  • model training can be performed based on text data with higher accuracy as training data, thereby improving the accuracy of model training.
  • the target text data is the data obtained after performing data error correction on the text data to be processed, that is to say, the target text data has high data accuracy and can be used as a training sample set during model training.
  • the input methods specifically include the Pinyin input method and the Wubi input method, which identify the same text using different coding algorithms.
  • the pinyin coding algorithm and the Wubi coding algorithm can produce different codes for the same character; for example, the character "字" ("word") has the pinyin code "zi" and the Wubi code "PBF". Therefore, based on the different encoding methods, word vector models corresponding to the different input methods can be trained separately.
  • the word vector model includes a pinyin word vector model and a Wubi word vector model.
  • the pinyin word vector model can be used to obtain the corresponding pinyin encoded data
  • the Wubi word vector model can be used to obtain the corresponding Wubi encoded data.
  • the language model is a model with language prediction ability, and specifically, it may be a Bert (Bidirectional Encoder Representation from Transformers) language model.
  • Bert: Bidirectional Encoder Representation from Transformers.
  • MLM: masked language model.
  • NSP: next sentence prediction.
  • the MLM task is to predict the text content at the corresponding (masked) positions.
  • the NSP task is to judge whether the two sentences before and after are continuous.
  • the above-mentioned method for acquiring text data is to acquire text data to be processed; input the text data to be processed into a pre-trained text processing model, and perform data processing on the text data to be processed according to model parameters in the text processing model to obtain target text data; the text processing model is obtained by training based on the word vector coding data and language coding data corresponding to different input methods as input data, the word vector coding data being obtained based on the pre-trained word vector model and the language coding data being obtained based on the pre-trained language model.
  • Chinese error correction is often used as a low-level module to provide higher-quality texts for upstream tasks. Therefore, one of the purposes of this application is to perform error correction processing on the erroneous data in the text data to be processed, so as to ensure the acquisition of target text data with a high accuracy rate, and to use the target text data as a training sample set for model training.
  • the present application creatively introduces word vector models corresponding to different input methods, which brings more reference information to the model, thereby improving the processing capability of the text data to be processed, and making it possible to obtain target data with higher precision.
  • extracting the encoded data corresponding to the second text sample set to be trained based on the language model, the Wubi word vector model and the pinyin word vector model respectively includes: extracting Wubi coding data from the second text sample set to be trained based on the pre-trained Wubi word vector model; extracting pinyin coding data from the second text sample set to be trained based on the pre-trained pinyin word vector model; and obtaining the pre-trained language model and extracting multi-dimensional language coding data from the second training sample set based on the language model.
  • the text processing model is obtained by performing model training according to the coded data, including: taking Wubi coded data, pinyin coded data and multi-dimensional coded data as input data, and performing model training according to the input data to obtain a text processing model.
  • the text processing model is obtained by jointly training the trained word vector models and the language model. That is to say, in the specific training process, the pinyin word vector model and the Wubi word vector model are first trained based on the training sample set and the pre-trained language model is obtained, and then training is performed again based on the trained pinyin word vector model, the Wubi word vector model and the language model to obtain the final text processing model.
  • this application thus includes at least two layers of model training: the first layer is the training of the input-method-based word vector models, and the second layer is the further model training, based on the input-method word vector models obtained in the first layer together with the language model, which yields the text processing model.
  • the server obtains the second text sample set to be trained, where the second text sample set to be trained and the first text sample set to be trained may be the same or different sample sets, which are not limited herein. Then input the second set of text samples to be trained into the trained word vector model to obtain pinyin coding data and Wubi coding data respectively, and input the second set of text samples to be trained into the trained language model to obtain multi-dimensional language encoded data. Then, the obtained pinyin coded data, Wubi coded data and multi-dimensional language coded data are used as input data to train the model again, and then a text processing model is obtained.
  • the sources of input data include multiple models, specifically the word vector models corresponding to different input methods and the high-precision pre-trained language model, so that in the process of training the text processing model the data sources are more accurate and the information is more abundant, which makes the training accuracy of the model higher.
  • the multi-dimensional language encoding data includes one or more of word vector encoding data (token embedding), classification encoding data (type embedding) and position encoding data (position embedding).
  • the Bert embedding layer in the language model has three inputs recorded as multi-dimensional language encoding data, and multi-dimensional language encoding data can express text information from various aspects.
  • the multi-dimensional language encoding data corresponds to token-embedding, segment-embedding and position-embedding, respectively.
  • Token-embedding is used to convert words into a fixed-dimensional vector representation, and each word is represented as a 768-dimensional vector in Bert-base.
  • Segment-embedding is used when Bert solves a sentence-pair classification task (such as judging whether two texts are semantically similar): the two texts are directly spliced together and fed into the model, and segment-embedding is how the model distinguishes the two texts.
  • the segment-embedding part of the first sentence is all 0, and the segment-embedding part of the second sentence is all 1.
  • BERT uses the transformer encoder to learn the representation of the sentence through the self-attention mechanism. The self-attention does not pay attention to the position information of the token, so in order for the transformer to learn the position information of the token, position-embedding is added to the input.
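  • A schematic sketch of how the three embeddings are combined at the Bert input (the hidden size follows Bert-base; the vocabulary size, sequence length and segment ids are illustrative):

```python
import torch
import torch.nn as nn

vocab_size, max_len, hidden = 21128, 128, 768   # illustrative sizes (hidden matches Bert-base)

token_embedding = nn.Embedding(vocab_size, hidden)     # token -> 768-dimensional vector
segment_embedding = nn.Embedding(2, hidden)            # sentence A = 0, sentence B = 1
position_embedding = nn.Embedding(max_len, hidden)     # injects token position information

token_ids = torch.randint(0, vocab_size, (1, 16))      # a fake 16-token input
segment_ids = torch.zeros(1, 16, dtype=torch.long)     # single-sentence input: all zeros
position_ids = torch.arange(16).unsqueeze(0)

# The Bert input representation is the element-wise sum of the three embeddings.
bert_input = (token_embedding(token_ids)
              + segment_embedding(segment_ids)
              + position_embedding(position_ids))
print(bert_input.shape)  # torch.Size([1, 16, 768])
```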
  • the word vector models, namely the Wubi word vector model (Wubi embedding) and the pinyin word vector model (pinyin embedding), use word2vec; by using word2vec, the amount of data can be reduced, thereby improving the efficiency of model training.
  • taking the Wubi encoded data, the pinyin encoded data and the multi-dimensional language encoded data as input data and performing model training according to the input data to obtain the text processing model includes: splicing the Wubi encoded data, the pinyin encoded data and the multi-dimensional language encoded data to obtain spliced encoded data; performing prediction processing on the spliced encoded data based on the language model to obtain the corresponding prediction probability at each position; determining the initial predicted text at the corresponding position according to the size of the predicted probability; and adjusting the initial model parameters of the initial text processing model based on the difference between the initial predicted text and the real label text to obtain target model parameters, and determining the text processing model according to the target model parameters.
  • the server performs splicing processing on the acquired Wubi coded data, pinyin coded data and multi-dimensional language data to obtain spliced coded data, and inputs the spliced coded data into the prediction module to obtain the corresponding prediction probability at each position.
  • the data with the predicted probability greater than the preset threshold is extracted as the initial predicted data, for example, the data with the predicted probability ranked in the top 5 can be used as the initial predicted data.
  • determining the initial predicted text at the corresponding position according to the size of the predicted probability includes: obtaining the predicted text whose predicted probability value is greater than a preset value, and extracting the initial predicted text from the predicted text based on the homophone principle and the pinyin principle; the initial predicted text is stored in the blockchain node.
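  • A shape-level sketch of the splicing and top-5 candidate extraction described above (all dimensions, the prediction head and the random tensors are illustrative stand-ins for the real encoders):

```python
import torch
import torch.nn as nn

batch, seq_len = 2, 16
bert_enc = torch.randn(batch, seq_len, 768)    # multi-dimensional language encoded data
pinyin_enc = torch.randn(batch, seq_len, 64)   # pinyin encoded data
wubi_enc = torch.randn(batch, seq_len, 64)     # Wubi encoded data

# Splice the three encodings along the feature dimension.
spliced = torch.cat([bert_enc, pinyin_enc, wubi_enc], dim=-1)   # (2, 16, 896)

vocab_size = 21128
classifier = nn.Linear(spliced.size(-1), vocab_size)            # illustrative prediction head
probs = torch.softmax(classifier(spliced), dim=-1)              # prediction probability per position

top5_probs, top5_ids = probs.topk(5, dim=-1)                    # top-5 candidates at each position
print(spliced.shape, top5_ids.shape)  # torch.Size([2, 16, 896]) torch.Size([2, 16, 5])
```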
  • FIG. 3 is a structural diagram of a text processing model provided in an embodiment.
  • this application does not add token embedding to the last layer of the error correction module for classification output, but directly outputs through the error correction module and uses pinyin features to constrain the output.
  • this application makes full use of the characteristics of language model training to detect errors in texts.
  • current language models basically predict the word at the current position given the words on its left and right, and some instead predict the words on the left and right given the central word; through this training, the model can learn which words a word is adjacent to and the probability of that adjacency, and the same is true when training with pinyin.
  • for example, the correct pinyin of "Chinese panda" is "zhong guo xiong mao", while the wrong pinyin is "zhong guo xun mao".
  • the probability that "mao" is preceded by "xiong" is higher than the probability that it is preceded by "xun", and the probability that "guo" is followed by "xiong" is higher than the probability that it is followed by "xun".
  • this is also a reason for freezing the pinyin word vector model during model training: the freezing prevents the correct pinyin word vectors from being affected by lower-quality data.
  • the Bert model used in the error correction part performs a softmax output for each character; if the output result differs from the input, the character needs to be corrected. For example, for the typo "xun" (寻), suppose the 5 highest-scoring characters in the Bert softmax output are "xiong" (熊, bear), "xun" (寻, search), "da" (大, big), "hao" (好, good) and another character pronounced "xun". The output is then further filtered according to pinyin: based on the output of the preceding pinyin embedding and dense connection, the pinyin at this position is predicted to be "xiong", so the Bert candidates with other pinyins are removed and only "xiong" (熊) is retained; other positions are handled in the same way.
  • the results of the pinyin prediction are also added for screening.
  • for example, in "Chinese xun mao" with the typo "xun", assume the top 5 predicted results are "xiong", "xun", "da", "hao" and "xun"; if only homophones of the input character were kept, then filtering by "xun" would remove "xiong", the candidate with the highest probability.
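  • The filtering step for the "xiong"/"xun" example above might look like this sketch (the character list, pinyin lookup and candidate scores are illustrative):

```python
# Hypothetical pinyin lookup for the candidate characters.
PINYIN_OF = {"熊": "xiong", "寻": "xun", "大": "da", "好": "hao", "驯": "xun"}

# Top-5 Bert candidates at the erroneous position, highest probability first.
top5 = [("熊", 0.46), ("寻", 0.21), ("大", 0.12), ("好", 0.08), ("驯", 0.05)]

predicted_pinyin = "xiong"  # output of the pinyin embedding + bidirectional GRU + Dense branch

# Keep only candidates whose pinyin matches the predicted pinyin at this position.
kept = [(ch, p) for ch, p in top5 if PINYIN_OF[ch] == predicted_pinyin]
corrected_char = kept[0][0] if kept else top5[0][0]
print(corrected_char)  # 熊
```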
  • the above-mentioned initial predicted text can also be stored in a node of a blockchain.
  • dynamic screening can be achieved by using the Pinyin Embedding and a bidirectional GRU + Dense layer to perform candidate word list screening, not just fixed homophone screening.
  • the results of the Pinyin model are used for the screening of the Bert error correction results, instead of the original input Pinyin, which improves the accuracy of error correction and obtains text data with higher accuracy.
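  • A minimal sketch of a pinyin predictor built from a pinyin embedding followed by a bidirectional GRU and a Dense (linear) layer, as described above (all sizes are illustrative; the real network and its training are defined by the application, not by this sketch):

```python
import torch
import torch.nn as nn

class PinyinPredictor(nn.Module):
    """Pinyin embedding -> bidirectional GRU -> Dense layer predicting a pinyin per position."""

    def __init__(self, num_pinyin=430, emb_dim=64, hidden=128):
        super().__init__()
        self.pinyin_embedding = nn.Embedding(num_pinyin, emb_dim)
        self.gru = nn.GRU(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.dense = nn.Linear(2 * hidden, num_pinyin)  # one score per pinyin syllable

    def forward(self, pinyin_ids):
        x = self.pinyin_embedding(pinyin_ids)
        x, _ = self.gru(x)
        return self.dense(x)        # (batch, seq_len, num_pinyin) logits

model = PinyinPredictor()
fake_ids = torch.randint(0, 430, (1, 16))
print(model(fake_ids).shape)        # torch.Size([1, 16, 430])
```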
  • the model parameters of the text processing model include pinyin model parameters and Wubi model parameters;
  • adjusting the initial model parameters of the initial text processing model based on the difference between the initial predicted text and the real label text to obtain the target model parameters, and determining the text processing model according to the target model parameters, includes: adjusting the initial Wubi parameters of the initial text processing model based on the difference between the initial predicted text and the real label text to obtain the target Wubi model parameters; and determining the text processing model according to the pinyin model parameters and the target Wubi model parameters.
  • during this training, the Pinyin embedding is fixed and immutable, while the Wubi embedding is variable.
  • variable means that the parameters are trainable, that is, the Wubi embedding participates in the backpropagation parameter update during training, whereas the Pinyin embedding is fixed, that is, it will not be updated during training.
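  • In a PyTorch-style sketch, "fixed" versus "variable" can be expressed through the requires_grad flag of the embedding weights (all sizes are illustrative):

```python
import torch.nn as nn

pinyin_embedding = nn.Embedding(430, 64)   # e.g. loaded from the pre-trained pinyin word2vec vectors
wubi_embedding = nn.Embedding(6000, 64)    # e.g. loaded from the pre-trained Wubi word2vec vectors

# Freeze the pinyin embedding: it is not updated by backpropagation during training.
pinyin_embedding.weight.requires_grad = False

# The Wubi embedding stays trainable and participates in the parameter update.
wubi_embedding.weight.requires_grad = True   # default, shown explicitly for contrast
```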
  • the word vector model is obtained by training based on word2vector, and the language model is obtained by training based on Bert model.
  • although the Bert language model is very strong, the cost of building a pinyin Bert is very high, and because the quality of the pre-training text cannot be guaranteed, even a pinyin Bert could only provide information enhancement and would not be suitable for pinyin error detection. Therefore, this application focuses on the quality of the training data for the Pinyin Embedding and chooses the lighter word2vector model instead of Bert; moreover, because the word2vec pre-training process is closely related to the downstream error detection task, its error detection ability is not much worse than Bert's.
  • the Wubi vectors, like the pinyin vectors, are obtained by the Word2Vector method.
  • the training method of word2vector includes: converting all characters into Wubi codes, setting the sliding window to 5, that is, using the codes of the two characters before and after each time to predict the code of the middle character.
  • Wubi Embedding and Pinyin Embedding of high-quality text are introduced into the error detection module for information enhancement, which can significantly improve the capability of the original Soft-mask error detection network.
  • the top-5 homophone screening in the error correction module can effectively control the text output, and the correct pinyin predicted by the Pinyin Embedding + bidirectional GRU + Dense layer can be used to dynamically screen the results, which also reduces the probability that homophone screening filters out the correct character.
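  • For the information-enhancement idea, a soft-mask-style combination of the detector output with the input embeddings can be sketched as follows (the soft-masking formula follows the general Soft-Masked BERT idea; the exact network of this application is not reproduced here, and all shapes and values are illustrative):

```python
import torch

batch, seq_len, hidden = 2, 16, 768
token_emb = torch.randn(batch, seq_len, hidden)   # original token embeddings
mask_emb = torch.randn(hidden)                    # embedding of the [MASK] token

# Per-position error probability from the detection network, which here could be
# enhanced with Wubi and pinyin information (random stand-in values for illustration).
error_prob = torch.rand(batch, seq_len, 1)

# Soft masking: positions judged likely to be wrong are pushed towards the
# [MASK] embedding before being passed to the error correction module.
soft_masked = error_prob * mask_emb + (1.0 - error_prob) * token_emb
print(soft_masked.shape)  # torch.Size([2, 16, 768])
```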
  • a text processing model training apparatus including:
  • the first training sample set obtaining module 402 is configured to obtain a first text sample set to be trained.
  • the word vector training module 404 is configured to perform model training based on the first text sample set to be trained to obtain Wubi word vector models and pinyin word vector models corresponding to different input methods.
  • the second training sample set obtaining module 406 is configured to obtain a second to-be-trained text sample set and a pre-trained language model.
  • the coded data extraction module 408 is configured to extract coded data corresponding to the second text sample set to be trained based on the language model, the Wubi word vector model and the pinyin word vector model, respectively.
  • the model training module 410 is configured to perform model training according to the encoded data to obtain a text processing model.
  • the coded data extraction module 408 is further configured to convert the first text sample set to be trained into corresponding pinyin encoding vectors, traverse the pinyin encoding vectors in turn according to the preconfigured sliding window, take the traversed pinyin encoding vector as the current pinyin vector to be processed, predict the pinyin encoding vector at the preset position in the current pinyin vector to be processed based on the current word vector model corresponding to the current pinyin model parameters, determine the target pinyin model parameters according to the predicted pinyin encoding vector and the actual pinyin encoding vector, and obtain the pinyin word vector model according to the determined target pinyin model parameters; and to convert the first text sample set to be trained into corresponding Wubi encoding vectors, traverse the Wubi encoding vectors in turn according to the preconfigured sliding window, take the traversed Wubi encoding vector as the current Wubi vector to be processed, predict the Wubi encoding vector at the preset position in the current Wubi vector to be processed based on the current word vector model corresponding to the current Wubi model parameters, determine the target Wubi model parameters according to the predicted Wubi encoding vector and the real Wubi encoding vector, and obtain the Wubi word vector model according to the determined target Wubi model parameters.
  • the encoded data extraction module 408 is further configured to extract Wubi encoded data from the second text sample set to be trained based on the pre-trained Wubi word vector model, extract pinyin encoded data from the second text sample set to be trained based on the pre-trained pinyin word vector model, and obtain the pre-trained language model and extract multi-dimensional language encoded data from the second training sample set based on the language model; the model training module 410 is further configured to take the Wubi encoded data, the pinyin encoded data and the multi-dimensional language encoded data as input data and perform model training according to the input data to obtain the text processing model.
  • the model training module 410 is further configured to perform splicing processing on the Wubi encoded data, the pinyin encoded data and the multi-dimensional language encoded data to obtain spliced encoded data; perform prediction processing on the spliced encoded data based on the language model to obtain the corresponding prediction probability at each position; determine the initial predicted text at the corresponding position according to the size of the predicted probability; and adjust the initial model parameters of the initial text processing model based on the difference between the initial predicted text and the real label text to obtain the target model parameters, and determine the text processing model according to the target model parameters.
  • the model training module 410 is further configured to obtain the predicted text whose predicted probability value is greater than the preset value, extract the initial predicted text from the predicted text based on the homophone principle and the pinyin principle, and store the initial predicted text in a blockchain node.
  • the model training module 410 is further configured to adjust the initial Wubi parameters of the initial text processing model based on the difference between the initial predicted text and the actual label text to obtain the target Wubi model parameters, and determine the text processing model according to the pinyin model parameters and the target Wubi model parameters.
  • a text data acquisition device including:
  • the obtaining module 602 is used for obtaining the text data to be processed.
  • the processing module 604 is used to input the text data to be processed into a pre-trained text processing model, so as to perform data processing on the text data to be processed according to the model parameters in the text processing model to obtain target text data; the text processing model is obtained by training with the word vector encoding data and language encoding data corresponding to different input methods as input data, the word vector encoding data being obtained based on the pre-trained word vector model and the language encoding data being obtained based on the pre-trained language model.
  • Each module in the above-mentioned text data acquisition device and text processing model training device can be implemented in whole or in part by software, hardware and combinations thereof.
  • the above modules can be embedded in or independent of the processor in the computer device in the form of hardware, or stored in the memory in the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.
  • in one of the embodiments, a computer device is provided; the computer device may be a server, and its internal structure diagram may be as shown in FIG. 7.
  • the computer device includes a processor, memory, and a network interface connected by a system bus. Among them, the processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, computer readable instructions and a database.
  • the internal memory provides an environment for the execution of the operating system and computer-readable instructions in the non-volatile storage medium.
  • the computer device's database is used to store textual data.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer-readable instructions when executed by the processor, implement a text data acquisition method and a text processing model training method.
  • FIG. 7 is only a block diagram of a partial structure related to the solution of the present application and does not constitute a limitation on the computer equipment to which the solution of the present application is applied; a specific computer device may include more or fewer components than shown in the figure, or combine certain components, or have a different arrangement of components.
  • a computer device comprising a memory and one or more processors, the memory stores computer-readable instructions, and when the computer-readable instructions are executed by the processor, causes the one or more processors to execute the method in any one of the foregoing embodiments the steps involved.
  • One or more computer-readable storage media storing computer-readable instructions, when the computer-readable instructions are executed by one or more processors, cause the one or more processors to perform the method involved in any one of the above embodiments. step.
  • the computer-readable storage medium may be non-volatile or volatile.
  • the blockchain referred to in the present invention is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • Blockchain, essentially a decentralized database, is a series of data blocks associated with each other by cryptographic methods; each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
  • Non-volatile memory may include read-only memory (Read-Only Memory, ROM), magnetic tape, floppy disk, flash memory, or optical memory, and the like.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • the RAM may be in various forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention relates to a text processing model training method, relating to the technical field of artificial intelligence and comprising the steps of: acquiring a first text sample set to be trained (202); performing model training based on said first text sample set to obtain a Wubi word vector model and a pinyin word vector model corresponding to different input methods (204); acquiring a second text sample set to be trained and a pre-trained language model (206); on the basis of the language model, the Wubi word vector model and the pinyin word vector model, respectively extracting encoded data corresponding to said second text sample set (208); and performing model training according to the encoded data to obtain a text processing model (210).
PCT/CN2021/096582 2020-12-11 2021-05-28 Text processing model training method and apparatus, computer device and storage medium WO2022121251A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011447964.2A CN112528637B (zh) 2020-12-11 2020-12-11 文本处理模型训练方法、装置、计算机设备和存储介质
CN202011447964.2 2020-12-11

Publications (1)

Publication Number Publication Date
WO2022121251A1 true WO2022121251A1 (fr) 2022-06-16

Family

ID=74998573

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/096582 WO2022121251A1 (fr) 2020-12-11 2021-05-28 Procédé et appareil d'entraînement de modèle de traitement de texte, dispositif informatique et support de stockage

Country Status (2)

Country Link
CN (1) CN112528637B (fr)
WO (1) WO2022121251A1 (fr)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116667326A (zh) * 2023-05-30 2023-08-29 淮阴工学院 一种电动汽车充电负荷预测方法
CN117609781A (zh) * 2023-11-20 2024-02-27 北京中关村科金技术有限公司 文本评估模型的训练方法、文本评估方法及装置
CN117831573A (zh) * 2024-03-06 2024-04-05 青岛理工大学 基于多模态的语言障碍人群言语录音分析方法及系统

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112528637B (zh) * 2020-12-11 2024-03-29 平安科技(深圳)有限公司 文本处理模型训练方法、装置、计算机设备和存储介质
CN113434699B (zh) * 2021-06-30 2023-07-18 平安科技(深圳)有限公司 用于文本匹配的bert模型的预训练方法、计算机装置和存储介质
CN113609157B (zh) * 2021-08-09 2023-06-30 平安科技(深圳)有限公司 语言转换模型训练、语言转换方法、装置、设备及介质
CN114139524B (zh) * 2021-11-29 2022-09-13 浙江大学 故事文本的预测方法、装置以及电子设备

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140214401A1 (en) * 2013-01-29 2014-07-31 Tencent Technology (Shenzhen) Company Limited Method and device for error correction model training and text error correction
US20180349327A1 (en) * 2017-06-05 2018-12-06 Baidu Online Network Technology (Beijing)Co., Ltd. Text error correction method and apparatus based on recurrent neural network of artificial intelligence
CN110750959A (zh) * 2019-10-28 2020-02-04 腾讯科技(深圳)有限公司 文本信息处理的方法、模型训练的方法以及相关装置
CN111310443A (zh) * 2020-02-12 2020-06-19 新华智云科技有限公司 一种文本纠错方法和系统
CN111476036A (zh) * 2020-04-10 2020-07-31 电子科技大学 一种基于中文单词特征子串的词嵌入学习方法
CN111523306A (zh) * 2019-01-17 2020-08-11 阿里巴巴集团控股有限公司 文本的纠错方法、装置和系统
CN111597815A (zh) * 2020-05-22 2020-08-28 北京慧闻科技(集团)有限公司 一种多嵌入命名实体识别方法、装置、设备及存储介质
CN112528637A (zh) * 2020-12-11 2021-03-19 平安科技(深圳)有限公司 文本处理模型训练方法、装置、计算机设备和存储介质

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107170453B (zh) * 2017-05-18 2020-11-03 百度在线网络技术(北京)有限公司 基于人工智能的跨语种语音转录方法、设备及可读介质
CN110472251B (zh) * 2018-05-10 2023-05-30 腾讯科技(深圳)有限公司 翻译模型训练的方法、语句翻译的方法、设备及存储介质
CN110110041B (zh) * 2019-03-15 2022-02-15 平安科技(深圳)有限公司 错词纠正方法、装置、计算机装置及存储介质
CN110288980A (zh) * 2019-06-17 2019-09-27 平安科技(深圳)有限公司 语音识别方法、模型的训练方法、装置、设备及存储介质
CN110795935A (zh) * 2020-01-06 2020-02-14 广东博智林机器人有限公司 文字词向量模型的训练方法、装置、终端及存储介质
CN111488466B (zh) * 2020-04-16 2023-06-06 清华大学 中文带标记错误语料生成方法、计算装置和存储介质

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140214401A1 (en) * 2013-01-29 2014-07-31 Tencent Technology (Shenzhen) Company Limited Method and device for error correction model training and text error correction
US20180349327A1 (en) * 2017-06-05 2018-12-06 Baidu Online Network Technology (Beijing)Co., Ltd. Text error correction method and apparatus based on recurrent neural network of artificial intelligence
CN111523306A (zh) * 2019-01-17 2020-08-11 阿里巴巴集团控股有限公司 文本的纠错方法、装置和系统
CN110750959A (zh) * 2019-10-28 2020-02-04 腾讯科技(深圳)有限公司 文本信息处理的方法、模型训练的方法以及相关装置
CN111310443A (zh) * 2020-02-12 2020-06-19 新华智云科技有限公司 一种文本纠错方法和系统
CN111476036A (zh) * 2020-04-10 2020-07-31 电子科技大学 一种基于中文单词特征子串的词嵌入学习方法
CN111597815A (zh) * 2020-05-22 2020-08-28 北京慧闻科技(集团)有限公司 一种多嵌入命名实体识别方法、装置、设备及存储介质
CN112528637A (zh) * 2020-12-11 2021-03-19 平安科技(深圳)有限公司 文本处理模型训练方法、装置、计算机设备和存储介质

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116667326A (zh) * 2023-05-30 2023-08-29 淮阴工学院 一种电动汽车充电负荷预测方法
CN116667326B (zh) * 2023-05-30 2024-02-23 淮阴工学院 一种电动汽车充电负荷预测方法
CN117609781A (zh) * 2023-11-20 2024-02-27 北京中关村科金技术有限公司 文本评估模型的训练方法、文本评估方法及装置
CN117609781B (zh) * 2023-11-20 2024-05-28 北京中关村科金技术有限公司 文本评估模型的训练方法、文本评估方法及装置
CN117831573A (zh) * 2024-03-06 2024-04-05 青岛理工大学 基于多模态的语言障碍人群言语录音分析方法及系统
CN117831573B (zh) * 2024-03-06 2024-05-14 青岛理工大学 基于多模态的语言障碍人群言语录音分析方法及系统

Also Published As

Publication number Publication date
CN112528637B (zh) 2024-03-29
CN112528637A (zh) 2021-03-19

Similar Documents

Publication Publication Date Title
WO2022121251A1 (fr) Procédé et appareil d'entraînement de modèle de traitement de texte, dispositif informatique et support de stockage
US20210141799A1 (en) Dialogue system, a method of obtaining a response from a dialogue system, and a method of training a dialogue system
JP5901001B1 (ja) 音響言語モデルトレーニングのための方法およびデバイス
CN111310441A (zh) 基于bert的语音识别后文本修正方法、装置、终端及介质
CN114580382A (zh) 文本纠错方法以及装置
US11636272B2 (en) Hybrid natural language understanding
CN114218932B (zh) 基于故障因果图谱的航空故障文本摘要生成方法及其装置
WO2021143206A1 (fr) Procédé et appareil de traitement en langage naturel à énoncé individuel, dispositif informatique et support de stockage lisible par ordinateur
WO2017052817A1 (fr) Adaptation dynamique de modèles de langue et suivi sémantique pour reconnaissance vocale automatique
CN113053367B (zh) 语音识别方法、语音识别的模型训练方法以及装置
US20230104228A1 (en) Joint Unsupervised and Supervised Training for Multilingual ASR
CN116956835B (zh) 一种基于预训练语言模型的文书生成方法
US20230096805A1 (en) Contrastive Siamese Network for Semi-supervised Speech Recognition
US20230237993A1 (en) Systems and Methods for Training Dual-Mode Machine-Learned Speech Recognition Models
WO2022086939A1 (fr) Modèles de langage dynamique d'évolution en continu de contenu
US20230410794A1 (en) Audio recognition method, method of training audio recognition model, and electronic device
CN113160820A (zh) 语音识别的方法、语音识别模型的训练方法、装置及设备
TWI818427B (zh) 使用基於文本的說話者變更檢測的說話者劃分糾正方法及系統
CN115858776A (zh) 一种变体文本分类识别方法、系统、存储介质和电子设备
CN115525749A (zh) 语音问答方法、装置、电子设备和存储介质
US11687723B2 (en) Natural language processing with missing tokens in a corpus
CN111090720B (zh) 一种热词的添加方法和装置
US20230116268A1 (en) System and a method for phonetic-based transliteration
US20230252225A1 (en) Automatic Text Summarisation Post-processing for Removal of Erroneous Sentences
CN109241539B (zh) 机器学习人工智能翻译数据库的更新方法

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21901964

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21901964

Country of ref document: EP

Kind code of ref document: A1