WO2023093525A1 - Model training method, Chinese text error correction method, electronic device, and storage medium - Google Patents

Model training method, Chinese text error correction method, electronic device, and storage medium

Info

Publication number
WO2023093525A1
WO2023093525A1 · PCT/CN2022/130617 · also published as WO 2023093525 A1
Authority
WO
WIPO (PCT)
Prior art keywords
Chinese
error correction
model
training
phonetic
Prior art date
Application number
PCT/CN2022/130617
Other languages
English (en)
French (fr)
Inventor
郑浩杰 (Zheng Haojie)
屠要峰 (Tu Yaofeng)
李忠良 (Li Zhongliang)
Original Assignee
中兴通讯股份有限公司 (ZTE Corporation)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司 (ZTE Corporation)
Publication of WO2023093525A1

Classifications

    • G: PHYSICS
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/33: Information retrieval of unstructured textual data; querying
    • G06F 16/332: Query formulation
    • G06F 16/3329: Natural language query formulation or dialogue systems
    • G06F 16/3344: Query execution using natural language analysis
    • G06F 40/126: Handling natural language data; character encoding
    • G06F 40/232: Orthographic correction, e.g. spell checking or vowelisation
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/0442: Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G06N 3/047: Probabilistic or stochastic networks
    • G06N 3/048: Activation functions
    • G06N 3/08: Neural network learning methods
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • This application relates to, but is not limited to, the technical fields of natural language processing and artificial intelligence, and relates, for example, to a model training method, a Chinese text error correction method, an electronic device, and a storage medium.
  • At present, language models cannot learn information about near-phonetic characters and visually similar characters in Chinese. Consequently, when such a language model is used to correct errors in Chinese text, the information about near-phonetic and visually similar characters cannot be exploited to correct typos, and the accuracy and interpretability of the Chinese text error correction results are poor.
  • Embodiments of the present application provide a model training method, a Chinese text error correction method, an electronic device, and a storage medium.
  • In a first aspect, an embodiment of the present application provides a model training method, including: obtaining a training Chinese corpus and a phonetic-glyph confusion set, where the phonetic-glyph confusion set is the union of a Chinese near-phonetic character confusion set and a Chinese visually-similar character confusion set; constructing a phonetic model and a glyph model from the phonetic-glyph confusion set; determining character embeddings from the training Chinese corpus; inputting the training Chinese corpus into the phonetic model and the glyph model to obtain pinyin embeddings and glyph embeddings, respectively; inputting the character embeddings, the pinyin embeddings, and the glyph embeddings into a deep bidirectional pre-trained language model and pre-training it with a masking strategy; and fine-tuning the pre-trained model to obtain a Chinese text error correction language model.
  • In a second aspect, the present application further provides a Chinese text error correction method, including: obtaining the Chinese text to be corrected; and inputting the Chinese text to be corrected into a trained Chinese text error correction language model to obtain the corrected text, where the Chinese text error correction language model is trained by the model training method described in the first aspect.
  • In a third aspect, the present application further provides a Chinese speech recognition error correction method, including: obtaining the speech to be corrected; performing speech recognition on the speech to be corrected to obtain the Chinese text to be corrected; and inputting the Chinese text to be corrected into a trained Chinese text error correction language model to obtain the corrected text, where the Chinese text error correction language model is trained by the model training method described in the first aspect.
  • In a fourth aspect, the present application further provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, it implements the model training method of the first aspect, the Chinese text error correction method of the second aspect, or the Chinese speech recognition error correction method of the third aspect.
  • In a fifth aspect, the present application further provides a computer-readable storage medium storing a computer-executable program that causes a computer to execute the model training method of the first aspect, the Chinese text error correction method of the second aspect, or the Chinese speech recognition error correction method of the third aspect.
  • Fig. 1 is a flowchart of a model training method provided by an embodiment of the present application;
  • Fig. 2 is a flowchart of corpus and confusion set processing provided by another embodiment of the present application;
  • Fig. 3 is a flowchart of phonetic and glyph model processing provided by another embodiment of the present application;
  • Fig. 4 is a flowchart of determining the model loss provided by another embodiment of the present application;
  • Fig. 5 is a flowchart of model fine-tuning provided by another embodiment of the present application;
  • Fig. 6 is a flowchart of a Chinese text error correction method provided by another embodiment of the present application;
  • Fig. 7 is a flowchart of a Chinese speech recognition error correction method provided by another embodiment of the present application;
  • Fig. 8 is a system block diagram of a Chinese text error correction system provided by another embodiment of the present application;
  • Fig. 9 is a system block diagram of a Chinese speech recognition error correction system provided by another embodiment of the present application;
  • Fig. 10 is a system block diagram of language model design optimization provided by another embodiment of the present application;
  • Fig. 11 is a structural diagram of an electronic device provided by another embodiment of the present application.
  • The model training method includes: obtaining a training Chinese corpus and a phonetic-glyph confusion set, where the phonetic-glyph confusion set is the union of a Chinese near-phonetic character confusion set and a Chinese visually-similar character confusion set; constructing a phonetic model and a glyph model from the phonetic-glyph confusion set; determining character embeddings from the training Chinese corpus; inputting the training Chinese corpus into the phonetic model and the glyph model to obtain pinyin embeddings and glyph embeddings, respectively; inputting the character, pinyin, and glyph embeddings into a deep bidirectional pre-trained language model and pre-training it with a masking strategy; and fine-tuning the pre-trained model to obtain a Chinese text error correction language model.
  • In this way, the training Chinese corpus and the phonetic-glyph confusion set are used for end-to-end model training, and the resulting Chinese text error correction language model can learn near-phonetic and visually-similar character information. When correcting Chinese text, it can use this information to correct typos, improving the accuracy and interpretability of the Chinese text error correction results.
  • Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence; it studies theories and methods for effective communication between humans and computers in natural language.
  • Convolutional Neural Networks (CNN) are feed-forward neural networks whose artificial neurons respond to surrounding units within part of the receptive field; they are widely used for image feature extraction and gradually learn higher-level features by extracting and stacking local low-level features.
  • The Long Short-Term Memory (LSTM) network is a variant of the recurrent neural network that can model sequence features; it optimizes the recurrent neural network by introducing input gates, forget gates, and output gates.
  • The deep bidirectional pre-trained language model (Bidirectional Encoder Representations from Transformers, BERT) is a pre-trained language representation model. Instead of the traditional unidirectional language model, or a shallow concatenation of two unidirectional language models, it is pre-trained with a masked language model (Masked Language Model, MLM), so that deep bidirectional language representations can be generated.
  • Speech Recognition, also known as Automatic Speech Recognition (ASR), Computer Speech Recognition, or Speech To Text (STT), aims to automatically convert human speech into the corresponding text.
  • Optical Character Recognition (OCR) refers to the process of analyzing and recognizing image files of text material to obtain the text and layout information.
  • FIG. 1 is a flowchart of a model training method provided by an embodiment of the present application. The model training method includes, but is not limited to, the following steps:
  • Step 110, obtaining a training Chinese corpus and a phonetic-glyph confusion set, where the phonetic-glyph confusion set is the union of a Chinese near-phonetic character confusion set and a Chinese visually-similar character confusion set;
  • Step 120, constructing a phonetic model and a glyph model from the phonetic-glyph confusion set;
  • Step 130, determining character embeddings from the training Chinese corpus;
  • Step 140, inputting the training Chinese corpus into the phonetic model and the glyph model to obtain pinyin embeddings and glyph embeddings, respectively;
  • Step 150, inputting the character, pinyin, and glyph embeddings into a deep bidirectional pre-trained language model and pre-training it with a masking strategy;
  • Step 160, fine-tuning the pre-trained deep bidirectional pre-trained language model to obtain a Chinese text error correction language model.
  • The training Chinese corpus and the phonetic-glyph confusion set are obtained from existing databases, the phonetic model and glyph model are constructed from the confusion set, and the character, pinyin, and glyph embeddings are then determined and input into BERT, which is pre-trained with a preset masking strategy so that it can learn near-phonetic and visually-similar character information. BERT is then fine-tuned to obtain a Chinese text error correction language model that fits real Chinese text error correction application scenarios.
  • End-to-end model training with the training Chinese corpus and the phonetic-glyph confusion set thus yields a Chinese text error correction language model that can use near-phonetic and visually-similar character information to correct typos, improving the accuracy and interpretability of the correction results.
  • After pre-training and fine-tuning, unneeded parameters are removed and BERT is converted to prediction mode and deployed as the Chinese text error correction language model; at inference time it only takes the erroneous Chinese text as input and outputs the corrected text, without needing pinyin or glyph embeddings.
  • Extracting the training Chinese corpus refers to removing text data that contains a large amount of English and using the remaining text data as the training Chinese corpus.
  • After step 110 in the embodiment shown in FIG. 1, the following steps are also included, but not limited to:
  • Step 210, preprocessing the training Chinese corpus, where the preprocessing includes punctuation normalization and simplification;
  • Step 220, simplifying the phonetic-glyph confusion set.
  • The preprocessing includes, but is not limited to, punctuation normalization and simplification. Punctuation normalization unifies Chinese and English punctuation marks and full-width and half-width forms; in one example, all punctuation is converted to Chinese marks in full-width format. Simplification converts traditional Chinese characters into simplified characters.
  • Step 140 in the embodiment shown in FIG. 1 also includes, but is not limited to, the following steps:
  • Step 310, performing word segmentation on the training Chinese corpus to obtain Chinese characters;
  • Step 320, inputting the Chinese characters into a preset Chinese pinyin conversion module to obtain a pinyin sequence;
  • Step 330, inputting the pinyin sequence into the phonetic model to obtain the pinyin embedding;
  • Step 340, inputting the Chinese characters into a preset Chinese image conversion module to obtain character images;
  • Step 350, performing image enhancement on the character images to obtain an image data set;
  • Step 360, inputting the image data set into the glyph model to obtain the glyph embedding.
  • The word segmentation uses BERT's tokenizer; the Chinese pinyin conversion module is the pypinyin open-source toolkit, which obtains the pronunciation of the Chinese characters and generates the corresponding pinyin sequence; the Chinese image conversion module can convert Chinese characters into 64×64-pixel images.
  • The image enhancement includes, but is not limited to, mirroring, rotating, and adding noise to the character images, yielding an augmented image data set and thereby improving the quality of the glyph model.
  • The phonetic model includes a long short-term memory network (LSTM), and the glyph model includes a convolutional neural network (CNN).
  • A character's pronunciation is a sequence composed of pinyin and tone, which an LSTM models well. In addition, a Chinese glyph itself reflects the character's meaning to some extent; modeling it with a CNN that convolves over character images captures the strokes of Chinese characters, which reflect how similar two characters look better than sequence-model approaches do, improving the accuracy and interpretability of the Chinese text error correction results.
  • The hidden dimension of the LSTM is set to 32; the hidden dimension of the CNN is set to 32, the CNN kernel size is 2×2 or 3×3, the total number of kernels is 64, and the convolutional network has 2 layers.
  • Step 150 in the embodiment shown in FIG. 1 also includes, but is not limited to, the following steps:
  • Step 410, inputting the character, pinyin, and glyph embeddings into the deep bidirectional pre-trained language model to obtain character predictions, near-phonetic confusion set predictions, and visually-similar confusion set predictions;
  • Step 420, determining a mask loss from the character embeddings and the character predictions;
  • Step 430, determining a near-phonetic confusion set prediction loss from the pinyin embeddings and the near-phonetic confusion set predictions;
  • Step 440, determining a visually-similar confusion set prediction loss from the glyph embeddings and the visually-similar confusion set predictions;
  • Step 450, determining the model loss from the mask loss, the near-phonetic confusion set prediction loss, and the visually-similar confusion set prediction loss;
  • Step 460, pre-training the deep bidirectional pre-trained language model with the masking strategy according to the model loss.
  • The model loss is computed as
    L(θ) = L(mlm) + L(p) + L(v),
    where L(θ) is the model loss, L(mlm) is the mask loss, L(p) is the near-phonetic confusion set prediction loss, and L(v) is the visually-similar confusion set prediction loss.
  • L(mlm) is computed by applying a softmax activation to each token, followed by the cross-entropy loss:
    L(mlm) = Σ_D Σ_{i=1..n} λ(y_i, softmax(W_A · f_i)),
    where W_A is the parameter matrix to be trained, W_A ∈ R^{h×d}, h is the dimension of BERT's hidden layer, d is the vocabulary size, f_i is the hidden-layer representation of the i-th character, λ denotes the cross-entropy loss, y_i is the first MLM task label, D is the data set, and n is the sentence length.
  • L(p) and L(v) are computed in the same way as each other, except that a sigmoid activation is applied to each token before the cross-entropy loss. Taking L(p) as an example:
    L(p) = Σ_D Σ_{i=1..n} λ(p_i, sigmoid(W_B · f_i)),
    where W_B is the parameter matrix to be trained, W_B ∈ R^{h×d}, h is the dimension of BERT's hidden layer, d is the vocabulary size, f_i is the hidden-layer representation of the i-th character, λ denotes the cross-entropy loss, p_i is the second MLM task label, D is the data set, and n is the sentence length.
  • The masking strategy includes, but is not limited to: randomly selecting 15% of all Chinese characters as mask positions; of these, 10% are left unreplaced, 10% are replaced with random characters, and the remaining 80% are replaced with the special token [MASK]. When pre-training the model, the pre-training parameters are set to: maximum length: 512, batch size: 16, learning rate: dynamically decaying.
  • Step 160 in the embodiment shown in FIG. 1 also includes, but is not limited to, the following steps:
  • Step 510, obtaining a first error correction corpus and a second error correction corpus, where the first error correction corpus is generated by a preset Chinese error correction corpus generation algorithm and the second error correction corpus is obtained from a preset Chinese text error correction data set;
  • Step 520, preprocessing the first and second error correction corpora, where the preprocessing includes punctuation normalization and simplification;
  • Step 530, fine-tuning the deep bidirectional pre-trained language model with the preprocessed first error correction corpus and preset first fine-tuning parameters;
  • Step 540, fine-tuning the deep bidirectional pre-trained language model with the preprocessed second error correction corpus and preset second fine-tuning parameters to obtain a Chinese text error correction language model.
  • The first error correction corpus is algorithmically generated Chinese typo data; using it for the first round of BERT fine-tuning alleviates the problem of insufficient data. The Chinese text error correction data set is a data set of real Chinese error correction corpora, so the second error correction corpus is Chinese typo data matching real error correction scenarios; using it for a second round of fine-tuning makes the Chinese text error correction language model fit real application scenarios.
  • The preprocessing includes, but is not limited to, punctuation normalization and simplification, as described above.
  • The Chinese error correction corpus generation algorithm includes, but is not limited to, the Automatic-Corpus-Generation open-source algorithm; the Chinese text error correction data sets include, but are not limited to, the SIGHAN13, SIGHAN14, and SIGHAN15 data sets.
  • The first fine-tuning parameters are set as follows: number of iterations: 8, batch size: 32, learning rate: 0.00002, maximum sentence length: 512. The second fine-tuning parameters are set as follows: number of iterations: 6, batch size: 32, learning rate: 0.00002, maximum sentence length: 512.
  • The Chinese text error correction language model trained by the model training method of the present application can be applied in different scenarios. For example, Chinese text recognized by OCR can be input into the trained model for error correction, or the speech to be corrected can first be recognized as Chinese text and then input into the trained model for error correction.
  • The types of Chinese text errors differ considerably across scenarios and fields: Chinese text obtained by OCR contains more visually-similar character errors, while Chinese text obtained by speech recognition contains more near-phonetic character errors. Performing error correction with the trained Chinese text error correction language model improves the accuracy and interpretability of the correction results in both cases.
  • FIG. 6 is a flowchart of a Chinese text error correction method provided by another embodiment of the present application.
  • The Chinese text error correction method includes, but is not limited to, the following steps:
  • Step 610, obtaining the Chinese text to be corrected;
  • Step 620, inputting the Chinese text to be corrected into the trained Chinese text error correction language model to obtain the corrected text, where the Chinese text error correction language model is trained by the model training method described above.
  • The Chinese text to be corrected is input into the trained Chinese text error correction language model to obtain the corrected text. On this basis, end-to-end model training with the training Chinese corpus and the phonetic-glyph confusion set yields a Chinese text error correction language model that can learn near-phonetic and visually-similar character information, use it to correct typos, and improve the accuracy and interpretability of the Chinese text error correction results.
  • FIG. 7 is a flowchart of an error correction method for Chinese speech recognition provided by another embodiment of the present application.
  • The Chinese speech recognition error correction method includes, but is not limited to, the following steps:
  • Step 710, obtaining the speech to be corrected;
  • Step 720, performing speech recognition on the speech to be corrected to obtain the Chinese text to be corrected;
  • Step 730, inputting the Chinese text to be corrected into the trained Chinese text error correction language model to obtain the corrected text, where the Chinese text error correction language model is trained by the model training method described above.
  • The Chinese text to be corrected is obtained from the speech and input into the trained Chinese text error correction language model to obtain the corrected text. On this basis, end-to-end model training with the training Chinese corpus and the phonetic-glyph confusion set yields a Chinese text error correction language model that can learn near-phonetic and visually-similar character information, use it to correct typos, and improve the accuracy and interpretability of the Chinese text error correction results.
  • After the speech to be corrected has been processed by speech recognition to obtain the Chinese text to be corrected, the text needs to be preprocessed. The preprocessing includes, but is not limited to, punctuation normalization and simplification: punctuation normalization unifies Chinese and English punctuation marks and full-width and half-width forms (in one example, all punctuation is converted to Chinese marks in full-width format), and simplification converts traditional characters into simplified ones.
  • FIG. 8 is a system block diagram of a Chinese text error correction system provided by another embodiment of the present application.
  • The Chinese text error correction system includes, but is not limited to, a pre-training data processing module, a pre-training module, a fine-tuning module, and a Chinese text error correction module. The pre-training data processing module obtains the phonetic-glyph confusion set, simplifies it, obtains the training Chinese corpus, preprocesses and segments the corpus, and determines the pre-training data. The pre-training module builds the phonetic model and the glyph model, optimizes the language model design, determines the loss function, sets the pre-training parameters, and starts pre-training. The fine-tuning module obtains the first error correction corpus, preprocesses it and fine-tunes the model with it, then obtains the second error correction corpus, preprocesses it and fine-tunes the model with it. The Chinese text error correction module obtains the Chinese text to be corrected, preprocesses it, inputs it into the Chinese text error correction language model, and outputs the corrected text.
  • FIG. 9 is a system block diagram of a Chinese speech recognition error correction system provided by another embodiment of the present application.
  • The Chinese speech recognition error correction system includes, but is not limited to, a pre-training data processing module, a pre-training module, a fine-tuning module, and a Chinese speech recognition error correction module. The first three modules work as in the Chinese text error correction system above. The Chinese speech recognition error correction module obtains the speech to be corrected, performs speech recognition on it to obtain the Chinese text to be corrected, preprocesses the text, inputs it into the Chinese text error correction language model, and outputs the corrected text.
  • FIG. 10 is a system block diagram of language model design optimization provided by another embodiment of the present application.
  • The BERT design optimization includes: the input of the original BERT is the character embedding, while the optimized BERT additionally takes the pinyin embedding and the glyph embedding; the pre-training tasks of the original BERT are the masked language model task and the next sentence prediction task, while the optimized BERT removes the next sentence prediction task and adds the near-phonetic confusion set prediction task and the visually-similar confusion set prediction task. Pre-training of BERT is complete when the loss function reaches its minimum.
  • an embodiment of the present application also provides an electronic device.
  • The electronic device includes one or more processors and a memory; one processor and one memory are taken as an example in FIG. 11.
  • The processor and the memory may be connected through a bus or in other ways; connection through a bus is taken as an example in FIG. 11.
  • As a non-transitory computer-readable storage medium, the memory can be used to store non-transitory software programs and non-transitory computer-executable programs, such as those implementing the model training method, the Chinese text error correction method, or the Chinese speech recognition error correction method in the above embodiments of the present application.
  • The processor implements the model training method, the Chinese text error correction method, or the Chinese speech recognition error correction method in the above embodiments by running the non-transitory software programs stored in the memory.
  • The memory may include a program storage area and a data storage area; the program storage area may store an operating system and the application programs required by at least one function, and the data storage area may store the data required to execute the model training method, the Chinese text error correction method, or the Chinese speech recognition error correction method.
  • The memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device.
  • The memory may optionally include memory located remotely from the processor; such remote memories may be connected to the electronic device via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • The non-transitory software programs required to implement the model training method, the Chinese text error correction method, or the Chinese speech recognition error correction method in the above embodiments are stored in the memory. When executed by one or more processors, they execute the model training method in the above embodiments, for example method steps 110 to 160 in FIG. 1, steps 210 to 220 in FIG. 2, steps 310 to 360 in FIG. 3, steps 410 to 460 in FIG. 4, and steps 510 to 540 in FIG. 5; or the Chinese text error correction method, for example steps 610 to 620 in FIG. 6; or the Chinese speech recognition error correction method, for example steps 710 to 730 in FIG. 7.
  • That is: obtaining a training Chinese corpus and a phonetic-glyph confusion set, where the phonetic-glyph confusion set is the union of a Chinese near-phonetic character confusion set and a Chinese visually-similar character confusion set; constructing a phonetic model and a glyph model from the confusion set; determining character embeddings from the corpus; inputting the corpus into the phonetic and glyph models to obtain pinyin and glyph embeddings, respectively; inputting the character, pinyin, and glyph embeddings into a deep bidirectional pre-trained language model and pre-training it with a masking strategy; and fine-tuning the pre-trained model to obtain a Chinese text error correction language model.
  • End-to-end model training with the training Chinese corpus and the phonetic-glyph confusion set thus yields a Chinese text error correction language model that can learn near-phonetic and visually-similar character information, use it to correct typos, and improve the accuracy and interpretability of the Chinese text error correction results.
  • An embodiment of the present application also provides a computer-readable storage medium storing computer-executable instructions. When executed by a processor or controller, for example by a processor in the electronic device embodiment above, the instructions cause the processor to execute the model training method in the above embodiments, for example method steps 110 to 160 in FIG. 1, steps 210 to 220 in FIG. 2, steps 310 to 360 in FIG. 3, steps 410 to 460 in FIG. 4, and steps 510 to 540 in FIG. 5; or the Chinese text error correction method, for example steps 610 to 620 in FIG. 6; or the Chinese speech recognition error correction method, for example steps 710 to 730 in FIG. 7, with the same training procedure and the same benefits in accuracy and interpretability as described above.
  • Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technologies, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer.
  • Communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Human Computer Interaction (AREA)
  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

This application provides a model training method, a Chinese text error correction method, an electronic device, and a storage medium. The model training method includes: obtaining a training Chinese corpus and a phonetic-glyph confusion set (110); constructing a phonetic model and a glyph model from the phonetic-glyph confusion set (120); determining character embeddings from the training Chinese corpus (130); inputting the training Chinese corpus into the phonetic model and the glyph model to obtain pinyin embeddings and glyph embeddings, respectively (140); inputting the character, pinyin, and glyph embeddings into a deep bidirectional pre-trained language model and pre-training it with a masking strategy (150); and fine-tuning the pre-trained deep bidirectional pre-trained language model to obtain a Chinese text error correction language model (160).

Description

Model training method, Chinese text error correction method, electronic device, and storage medium
Cross-reference to related application
This application is based on, and claims priority to, Chinese patent application No. 202111394466.0 filed on November 23, 2021, the entire contents of which are incorporated herein by reference.
Technical field
This application relates to, but is not limited to, the technical fields of natural language processing and artificial intelligence, and relates, for example, to a model training method, a Chinese text error correction method, an electronic device, and a storage medium.
Background
The Internet holds a massive amount of text, much of which contains typos. In daily life, erroneous characters are frequently found on self-media platforms such as WeChat official accounts and Weibo. According to statistics, the text error rate in new media is around 2%, and in some question-answering systems it is as high as 9%. About 83% of errors in Chinese text are related to similar pronunciation, because Chinese on the Internet is mostly entered via pinyin input methods, while 48% of errors are related to similar glyphs, mainly because the Wubi input method and visually similar characters are prone to mis-selection. Input accuracy is a prerequisite for higher-level tasks in natural language processing, so text error correction is key to improving the performance of those tasks and remains a major challenge in the field.
At present, language models cannot learn information about near-phonetic characters and visually similar characters in Chinese. Consequently, when such a language model is used to correct Chinese text, the information about near-phonetic and visually similar characters cannot be exploited to correct typos, and the accuracy and interpretability of the Chinese text error correction results are poor.
Summary
The following is an overview of the subject matter described in detail herein. This overview is not intended to limit the scope of the claims.
Embodiments of the present application provide a model training method, a Chinese text error correction method, an electronic device, and a storage medium.
In a first aspect, an embodiment of the present application provides a model training method, including: obtaining a training Chinese corpus and a phonetic-glyph confusion set, where the phonetic-glyph confusion set is the union of a Chinese near-phonetic character confusion set and a Chinese visually-similar character confusion set; constructing a phonetic model and a glyph model from the phonetic-glyph confusion set; determining character embeddings from the training Chinese corpus; inputting the training Chinese corpus into the phonetic model and the glyph model to obtain pinyin embeddings and glyph embeddings, respectively; inputting the character embeddings, the pinyin embeddings, and the glyph embeddings into a deep bidirectional pre-trained language model and pre-training it with a masking strategy; and fine-tuning the pre-trained deep bidirectional pre-trained language model to obtain a Chinese text error correction language model.
In a second aspect, the present application further provides a Chinese text error correction method, including: obtaining the Chinese text to be corrected; and inputting the Chinese text to be corrected into a trained Chinese text error correction language model to obtain the corrected text, where the Chinese text error correction language model is trained by the model training method of the first aspect.
In a third aspect, the present application further provides a Chinese speech recognition error correction method, including: obtaining the speech to be corrected; performing speech recognition on the speech to be corrected to obtain the Chinese text to be corrected; and inputting the Chinese text to be corrected into a trained Chinese text error correction language model to obtain the corrected text, where the Chinese text error correction language model is trained by the model training method of the first aspect.
In a fourth aspect, the present application further provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, it implements the model training method of the first aspect, the Chinese text error correction method of the second aspect, or the Chinese speech recognition error correction method of the third aspect.
In a fifth aspect, the present application further provides a computer-readable storage medium storing a computer-executable program that causes a computer to execute the model training method of the first aspect, the Chinese text error correction method of the second aspect, or the Chinese speech recognition error correction method of the third aspect.
Other features and advantages of the present application will be set forth in the following description and will in part become apparent from the description or be understood by practicing the application. The objectives and other advantages of the application can be realized and obtained by the structures particularly pointed out in the description, the claims, and the accompanying drawings.
Brief description of the drawings
The accompanying drawings are provided for a further understanding of the technical solution of the present application and constitute a part of the specification. Together with the embodiments, they serve to explain the technical solution of the present application and do not limit it.
Fig. 1 is a flowchart of a model training method provided by an embodiment of the present application;
Fig. 2 is a flowchart of corpus and confusion set processing provided by another embodiment of the present application;
Fig. 3 is a flowchart of phonetic and glyph model processing provided by another embodiment of the present application;
Fig. 4 is a flowchart of determining the model loss provided by another embodiment of the present application;
Fig. 5 is a flowchart of model fine-tuning provided by another embodiment of the present application;
Fig. 6 is a flowchart of a Chinese text error correction method provided by another embodiment of the present application;
Fig. 7 is a flowchart of a Chinese speech recognition error correction method provided by another embodiment of the present application;
Fig. 8 is a system block diagram of a Chinese text error correction system provided by another embodiment of the present application;
Fig. 9 is a system block diagram of a Chinese speech recognition error correction system provided by another embodiment of the present application;
Fig. 10 is a system block diagram of language model design optimization provided by another embodiment of the present application;
Fig. 11 is a structural diagram of an electronic device provided by another embodiment of the present application.
Detailed description
To make the objectives, technical solutions, and advantages of the present application clearer, the application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the application and do not limit it.
It should be noted that although functional modules are divided in the device diagrams and a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed with a module division different from that in the device, or in an order different from that in the flowchart. The terms "first", "second", and the like in the specification, the claims, or the above drawings are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence.
The Internet holds a massive amount of text, much of which contains typos. In daily life, erroneous characters are frequently found on self-media platforms such as WeChat official accounts and Weibo. According to statistics, the text error rate in new media is around 2%, and in some question-answering systems it is as high as 9%. About 83% of errors in Chinese text are related to similar pronunciation, because Chinese on the Internet is mostly entered via pinyin input methods, while 48% of errors are related to similar glyphs, mainly because the Wubi input method and visually similar characters are prone to mis-selection. Input accuracy is a prerequisite for higher-level tasks in natural language processing, so text error correction is key to improving the performance of those tasks and remains a major challenge in the field.
At present, language models cannot learn information about near-phonetic characters and visually similar characters in Chinese. Consequently, when such a language model is used to correct Chinese text, this information cannot be exploited to correct typos, and the accuracy and interpretability of the Chinese text error correction results are poor.
To address the problem that language models cannot learn near-phonetic and visually-similar character information in Chinese, the present application provides a model training method, a Chinese text error correction method, an electronic device, and a storage medium. The model training method includes: obtaining a training Chinese corpus and a phonetic-glyph confusion set, where the phonetic-glyph confusion set is the union of a Chinese near-phonetic character confusion set and a Chinese visually-similar character confusion set; constructing a phonetic model and a glyph model from the phonetic-glyph confusion set; determining character embeddings from the training Chinese corpus; inputting the training Chinese corpus into the phonetic model and the glyph model to obtain pinyin embeddings and glyph embeddings, respectively; inputting the character, pinyin, and glyph embeddings into a deep bidirectional pre-trained language model and pre-training it with a masking strategy; and fine-tuning the pre-trained model to obtain a Chinese text error correction language model. According to the solution provided by the embodiments of this application, end-to-end model training with the training Chinese corpus and the phonetic-glyph confusion set yields a Chinese text error correction language model that can learn near-phonetic and visually-similar character information; when correcting Chinese text, it can use this information to correct typos, improving the accuracy and interpretability of the Chinese text error correction results.
First, several terms used in this application are explained:
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence; it studies theories and methods for effective communication between humans and computers in natural language.
Convolutional Neural Networks (CNN) are feed-forward neural networks whose artificial neurons respond to surrounding units within part of the receptive field. CNNs are widely used for image feature extraction; by extracting local low-level features and stacking layers, they gradually learn higher-level features.
The Long Short-Term Memory (LSTM) network is a variant of the recurrent neural network that can model sequence features; it optimizes the recurrent neural network by introducing input gates, forget gates, and output gates.
The deep bidirectional pre-trained language model (Bidirectional Encoder Representations from Transformers, BERT) is a pre-trained language representation model. Instead of the traditional unidirectional language model, or a shallow concatenation of two unidirectional language models, it is pre-trained with a masked language model (Masked Language Model, MLM), so that deep bidirectional language representations can be generated. When the BERT paper was published, it reported new state-of-the-art results on 11 NLP tasks; the Transformer is currently the mainstream feature extractor in natural language processing and has strong abstract representation power.
Speech Recognition, also known as Automatic Speech Recognition (ASR), Computer Speech Recognition, or Speech To Text (STT), aims to automatically convert human speech into the corresponding text.
Optical Character Recognition (OCR) refers to the process of analyzing and recognizing image files of text material to obtain the text and layout information.
The embodiments of the present application are further described below with reference to the drawings.
As shown in Fig. 1, Fig. 1 is a flowchart of a model training method provided by an embodiment of the present application. The model training method includes, but is not limited to, the following steps:
Step 110, obtaining a training Chinese corpus and a phonetic-glyph confusion set, where the phonetic-glyph confusion set is the union of a Chinese near-phonetic character confusion set and a Chinese visually-similar character confusion set;
Step 120, constructing a phonetic model and a glyph model from the phonetic-glyph confusion set;
Step 130, determining character embeddings from the training Chinese corpus;
Step 140, inputting the training Chinese corpus into the phonetic model and the glyph model to obtain pinyin embeddings and glyph embeddings, respectively;
Step 150, inputting the character, pinyin, and glyph embeddings into a deep bidirectional pre-trained language model and pre-training it with a masking strategy;
Step 160, fine-tuning the pre-trained deep bidirectional pre-trained language model to obtain a Chinese text error correction language model.
It can be understood that the training Chinese corpus and the phonetic-glyph confusion set are obtained from existing databases, the phonetic model and glyph model are constructed from the confusion set, and the character, pinyin, and glyph embeddings are then determined and input into BERT, which is pre-trained with a preset masking strategy so that it can learn near-phonetic and visually-similar character information. BERT is then fine-tuned to obtain a Chinese text error correction language model that fits real Chinese text error correction application scenarios. On this basis, end-to-end model training with the training Chinese corpus and the phonetic-glyph confusion set yields a Chinese text error correction language model that can use near-phonetic and visually-similar character information to correct typos, improving the accuracy and interpretability of the Chinese text error correction results.
It is worth noting that after pre-training and fine-tuning, unneeded parameters need to be removed, BERT is converted into prediction mode, and the model is then deployed as the Chinese text error correction language model. The deployed model only takes the Chinese text to be corrected as input and outputs the corrected text; pinyin and glyph embeddings do not need to be supplied. A sketch of this prediction mode is given below.
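A minimal inference sketch of the deployed prediction mode, written in Python: plain erroneous text in, corrected text out. The HuggingFace-style tokenizer/model interface is an assumption for illustration; the patent does not specify a serving framework.

    import torch

    @torch.no_grad()
    def correct(text: str, tokenizer, model) -> str:
        # The deployed model drops the pinyin/glyph branches; it only
        # maps each input character to its most likely correction.
        model.eval()
        enc = tokenizer(text, return_tensors="pt")
        pred_ids = model(**enc).logits.argmax(-1)[0]
        tokens = tokenizer.convert_ids_to_tokens(pred_ids.tolist())
        return "".join(tokens[1:-1])  # strip [CLS]/[SEP]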
In one example, large-scale text data is obtained first, for example more than 20 GB of text, and the training Chinese corpus required for model training is extracted from it. Extraction here means removing text data that contains a large amount of English and using the remaining text data as the training Chinese corpus.
It should be noted that the specific steps for obtaining the training Chinese corpus and the phonetic-glyph confusion set are well known to those skilled in the art and are not described in detail here.
In addition, referring to Fig. 2, in one embodiment, after step 110 of the embodiment shown in Fig. 1, the method further includes, but is not limited to, the following steps:
Step 210, preprocessing the training Chinese corpus, where the preprocessing includes punctuation normalization and simplification;
Step 220, simplifying the phonetic-glyph confusion set.
It should be noted that the preprocessing includes, but is not limited to, punctuation normalization and simplification. Punctuation normalization unifies Chinese and English punctuation marks and full-width and half-width forms; in one example, all punctuation is converted to Chinese marks in full-width format. Simplification converts traditional Chinese characters into simplified characters. A sketch of such preprocessing is given below.
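A minimal Python sketch of steps 210 and 220. The patent does not name the tools used, so OpenCC (for traditional-to-simplified conversion) and the explicit punctuation table below are assumptions for illustration only.

    from opencc import OpenCC  # assumed tool; not named in the patent

    T2S = OpenCC("t2s")  # traditional -> simplified converter

    # Half-width ASCII punctuation mapped to full-width Chinese punctuation.
    PUNCT_MAP = str.maketrans({
        ",": "，", ".": "。", "?": "？", "!": "！",
        ":": "：", ";": "；", "(": "（", ")": "）",
    })

    def preprocess(text: str) -> str:
        # Punctuation normalization, then simplification.
        return T2S.convert(text.translate(PUNCT_MAP))

    print(preprocess("妳好嗎?"))  # -> 你好吗？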
In addition, referring to Fig. 3, in one embodiment, step 140 of the embodiment shown in Fig. 1 further includes, but is not limited to, the following steps:
Step 310, performing word segmentation on the training Chinese corpus to obtain Chinese characters;
Step 320, inputting the Chinese characters into a preset Chinese pinyin conversion module to obtain a pinyin sequence;
Step 330, inputting the pinyin sequence into the phonetic model to obtain the pinyin embedding;
Step 340, inputting the Chinese characters into a preset Chinese image conversion module to obtain character images;
Step 350, performing image enhancement on the character images to obtain an image data set;
Step 360, inputting the image data set into the glyph model to obtain the glyph embedding.
In one example, the word segmentation uses BERT's tokenizer; the Chinese pinyin conversion module is the pypinyin open-source toolkit, which obtains the pronunciation of the Chinese characters and generates the corresponding pinyin sequence; the Chinese image conversion module can convert Chinese characters into 64×64-pixel images. A sketch of these two conversions is given below.
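A hedged sketch of the two conversion modules in steps 320 and 340: pypinyin is the toolkit named in the text, but the tone-number style, the font file, and the rendering details are illustrative assumptions.

    from pypinyin import lazy_pinyin, Style
    from PIL import Image, ImageDraw, ImageFont

    def to_pinyin_sequence(chars: str) -> list[str]:
        # Tone numbers keep pinyin and tone in one sequence, e.g. "zhong1".
        return lazy_pinyin(chars, style=Style.TONE3)

    def to_char_image(char: str, font_path: str = "simhei.ttf") -> Image.Image:
        # Render one character onto a 64x64 grayscale canvas (font path assumed).
        img = Image.new("L", (64, 64), color=255)
        font = ImageFont.truetype(font_path, 56)
        ImageDraw.Draw(img).text((4, 0), char, fill=0, font=font)
        return img

    print(to_pinyin_sequence("中文纠错"))  # ['zhong1', 'wen2', 'jiu1', 'cuo4']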
It can be understood that the training Chinese corpus also needs to be segmented before the character embeddings are determined.
It should be noted that the image enhancement includes, but is not limited to, mirroring, rotating, and adding noise to the character images, yielding an augmented image data set and thereby improving the quality of the glyph model.
In one embodiment, the phonetic model includes a long short-term memory network (LSTM), and the glyph model includes a convolutional neural network (CNN).
It can be understood that a character's pronunciation is a sequence composed of pinyin and tone, which an LSTM models well. In addition, a Chinese glyph itself reflects the character's meaning to some extent; modeling it with a CNN that convolves over character images completes the modeling of Chinese glyphs. Compared with sequence-model approaches, the strokes of Chinese characters better reflect how similar two characters look, improving the accuracy and interpretability of the Chinese text error correction results.
In one example, the hidden dimension of the LSTM is set to 32; the hidden dimension of the CNN is set to 32, the CNN kernel size is 2×2 or 3×3, the total number of kernels is 64, and the convolutional network has 2 layers. A sketch of the two encoders under these settings follows.
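A PyTorch-style sketch of the two encoders with the hyperparameters stated above; the framework and the exact layer arrangement (for example, splitting the 64 kernels as 32 + 32 across the two layers) are illustrative assumptions, not the patent's specification.

    import torch.nn as nn

    class PhoneticModel(nn.Module):
        # LSTM over a character's pinyin-and-tone sequence, hidden dim 32.
        def __init__(self, pinyin_vocab: int, hidden: int = 32):
            super().__init__()
            self.embed = nn.Embedding(pinyin_vocab, hidden)
            self.lstm = nn.LSTM(hidden, hidden, batch_first=True)

        def forward(self, pinyin_ids):            # (batch, seq_len)
            _, (h, _) = self.lstm(self.embed(pinyin_ids))
            return h[-1]                          # (batch, 32) pinyin embedding

    class GlyphModel(nn.Module):
        # Two conv layers over 64x64 character images, hidden dim 32.
        def __init__(self, hidden: int = 32):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),                  # 64x64 -> 32x32
                nn.Conv2d(32, 32, kernel_size=2), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
            )
            self.proj = nn.Linear(32, hidden)

        def forward(self, char_img):              # (batch, 1, 64, 64)
            return self.proj(self.conv(char_img).flatten(1))  # (batch, 32)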
In addition, referring to Fig. 4, in one embodiment, step 150 of the embodiment shown in Fig. 1 further includes, but is not limited to, the following steps:
Step 410, inputting the character, pinyin, and glyph embeddings into the deep bidirectional pre-trained language model to obtain character predictions, near-phonetic confusion set predictions, and visually-similar confusion set predictions;
Step 420, determining a mask loss from the character embeddings and the character predictions;
Step 430, determining a near-phonetic confusion set prediction loss from the pinyin embeddings and the near-phonetic confusion set predictions;
Step 440, determining a visually-similar confusion set prediction loss from the glyph embeddings and the visually-similar confusion set predictions;
Step 450, determining the model loss from the mask loss, the near-phonetic confusion set prediction loss, and the visually-similar confusion set prediction loss;
Step 460, pre-training the deep bidirectional pre-trained language model with the masking strategy according to the model loss.
In one example, the model loss is computed as

    L(θ) = L(mlm) + L(p) + L(v),

where L(θ) is the model loss, L(mlm) is the mask loss, L(p) is the near-phonetic confusion set prediction loss, and L(v) is the visually-similar confusion set prediction loss.
L(mlm) is computed by first applying a softmax activation to each token and then the cross-entropy loss:

    L(mlm) = Σ_D Σ_{i=1..n} λ(y_i, softmax(W_A · f_i)),

where W_A is the parameter matrix to be trained, W_A ∈ R^{h×d}, h is the dimension of BERT's hidden layer, d is the vocabulary size, f_i is the hidden-layer representation of the i-th character, λ denotes the cross-entropy loss, y_i is the first MLM task label, D is the data set, and n is the sentence length.
L(p) and L(v) are computed in the same way as each other: a sigmoid activation is first applied to each token, followed by the cross-entropy loss. Taking L(p) as an example:

    L(p) = Σ_D Σ_{i=1..n} λ(p_i, sigmoid(W_B · f_i)),

where W_B is the parameter matrix to be trained, W_B ∈ R^{h×d}, h is the dimension of BERT's hidden layer, d is the vocabulary size, f_i is the hidden-layer representation of the i-th character, λ denotes the cross-entropy loss, p_i is the second MLM task label, D is the data set, and n is the sentence length.
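A sketch of the joint loss in code. Treating the two confusion-set tasks as per-token multi-label targets with a sigmoid is our reading of the description above, and the third projection matrix W_C for L(v) is an assumption (the text only names W_A and W_B).

    import torch.nn.functional as F

    def model_loss(f, y, p, v, W_A, W_B, W_C):
        # f: (batch, n, h) BERT hidden states; y: (batch, n) character labels;
        # p, v: (batch, n, d) multi-hot confusion-set labels.
        l_mlm = F.cross_entropy((f @ W_A.T).transpose(1, 2), y)
        l_p = F.binary_cross_entropy_with_logits(f @ W_B.T, p)
        l_v = F.binary_cross_entropy_with_logits(f @ W_C.T, v)
        return l_mlm + l_p + l_v    # L(θ) = L(mlm) + L(p) + L(v)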
The masking strategy includes, but is not limited to: randomly selecting 15% of all Chinese characters as mask positions; of these, 10% are left unreplaced, 10% are replaced with random characters, and the remaining 80% are replaced with the special token [MASK]. When pre-training the model, the pre-training parameters are set to: maximum length: 512, batch size: 16, learning rate: dynamically decaying. A sketch of this strategy is shown below.
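A minimal sketch of the 15% / 80-10-10 masking strategy, applied to a list of token ids; MASK_ID, the -100 ignore label, and the random-replacement id range are placeholder conventions, not the patent's.

    import random

    MASK_ID = 103  # placeholder id for the [MASK] token

    def apply_mask(token_ids, vocab_size, mask_rate=0.15):
        ids, labels = list(token_ids), [-100] * len(token_ids)
        k = max(1, int(len(ids) * mask_rate))
        for i in random.sample(range(len(ids)), k=k):
            labels[i] = ids[i]                    # predict the original char
            r = random.random()
            if r < 0.80:
                ids[i] = MASK_ID                  # 80%: replace with [MASK]
            elif r < 0.90:
                ids[i] = random.randrange(vocab_size)  # 10%: random char
            # remaining 10%: keep the original character
        return ids, labels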
In addition, referring to Fig. 5, in one embodiment, step 160 of the embodiment shown in Fig. 1 further includes, but is not limited to, the following steps:
Step 510, obtaining a first error correction corpus and a second error correction corpus, where the first error correction corpus is generated by a preset Chinese error correction corpus generation algorithm and the second error correction corpus is obtained from a preset Chinese text error correction data set;
Step 520, preprocessing the first and second error correction corpora, where the preprocessing includes punctuation normalization and simplification;
Step 530, fine-tuning the deep bidirectional pre-trained language model with the preprocessed first error correction corpus and preset first fine-tuning parameters;
Step 540, fine-tuning the deep bidirectional pre-trained language model with the preprocessed second error correction corpus and preset second fine-tuning parameters to obtain a Chinese text error correction language model.
It can be understood that the first error correction corpus is algorithmically generated Chinese typo data; using it for the first round of BERT fine-tuning alleviates the problem of insufficient data. The Chinese text error correction data set is a data set of real Chinese error correction corpora, and the second error correction corpus is Chinese typo data matching real Chinese error correction scenarios; using it for a second round of BERT fine-tuning makes the Chinese text error correction language model fit real application scenarios.
It should be noted that the preprocessing includes, but is not limited to, punctuation normalization and simplification. Punctuation normalization unifies Chinese and English punctuation marks and full-width and half-width forms; in one example, all punctuation is converted to Chinese marks in full-width format. Simplification converts traditional Chinese characters into simplified characters.
In one example, the Chinese error correction corpus generation algorithm includes, but is not limited to, the Automatic-Corpus-Generation open-source algorithm; the Chinese text error correction data sets include, but are not limited to, the SIGHAN13, SIGHAN14, and SIGHAN15 data sets. The first fine-tuning parameters are set as follows: number of iterations: 8, batch size: 32, learning rate: 0.00002, maximum sentence length: 512. The second fine-tuning parameters are set as follows: number of iterations: 6, batch size: 32, learning rate: 0.00002, maximum sentence length: 512. These two configurations are collected in the sketch below.
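The two fine-tuning configurations stated above, collected as plain Python dicts for reference; the key names and any trainer wiring are our own, not the patent's.

    # Round 1: algorithmically generated corpus; Round 2: SIGHAN-style real data.
    FINETUNE_ROUND_1 = dict(epochs=8, batch_size=32, lr=2e-5, max_len=512)
    FINETUNE_ROUND_2 = dict(epochs=6, batch_size=32, lr=2e-5, max_len=512)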
It can be understood that the Chinese text error correction language model trained by the model training method of the present application can be applied in different scenarios. For example, Chinese text recognized by OCR can be input into the trained model for error correction, or the speech to be corrected can first be recognized as Chinese text and then input into the trained model for error correction. The types of Chinese text errors differ considerably across scenarios and fields: Chinese text obtained by OCR contains more visually-similar character errors, while Chinese text obtained by speech recognition contains more near-phonetic character errors. Performing error correction with the trained Chinese text error correction language model improves the accuracy and interpretability of the correction results.
As shown in Fig. 6, Fig. 6 is a flowchart of a Chinese text error correction method provided by another embodiment of the present application. The Chinese text error correction method includes, but is not limited to, the following steps:
Step 610, obtaining the Chinese text to be corrected;
Step 620, inputting the Chinese text to be corrected into the trained Chinese text error correction language model to obtain the corrected text, where the Chinese text error correction language model is trained by the model training method described above.
It can be understood that the Chinese text to be corrected is input into the trained Chinese text error correction language model to obtain the corrected text. On this basis, end-to-end model training with the training Chinese corpus and the phonetic-glyph confusion set yields a Chinese text error correction language model that can learn near-phonetic and visually-similar character information, use it to correct typos, and improve the accuracy and interpretability of the Chinese text error correction results.
As shown in Fig. 7, Fig. 7 is a flowchart of a Chinese speech recognition error correction method provided by another embodiment of the present application. The Chinese speech recognition error correction method includes, but is not limited to, the following steps:
Step 710, obtaining the speech to be corrected;
Step 720, performing speech recognition on the speech to be corrected to obtain the Chinese text to be corrected;
Step 730, inputting the Chinese text to be corrected into the trained Chinese text error correction language model to obtain the corrected text, where the Chinese text error correction language model is trained by the model training method described above.
It can be understood that after speech recognition, the Chinese text to be corrected is obtained and input into the trained Chinese text error correction language model to obtain the corrected text. On this basis, end-to-end model training with the training Chinese corpus and the phonetic-glyph confusion set yields a Chinese text error correction language model that can learn near-phonetic and visually-similar character information, use it to correct typos, and improve the accuracy and interpretability of the Chinese text error correction results.
In one example, after the speech to be corrected is recognized as the Chinese text to be corrected, the text needs to be preprocessed. The preprocessing includes, but is not limited to, punctuation normalization and simplification: punctuation normalization unifies Chinese and English punctuation marks and full-width and half-width forms (in one example, all punctuation is converted to Chinese marks in full-width format), and simplification converts traditional characters into simplified ones.
It should be noted that speech recognition techniques are well known to those skilled in the art and are not described in detail here.
As shown in Fig. 8, Fig. 8 is a system block diagram of a Chinese text error correction system provided by another embodiment of the present application.
It can be understood that the Chinese text error correction system includes, but is not limited to, a pre-training data processing module, a pre-training module, a fine-tuning module, and a Chinese text error correction module. The pre-training data processing module obtains the phonetic-glyph confusion set, simplifies it, obtains the training Chinese corpus, preprocesses and segments the corpus, and determines the pre-training data. The pre-training module builds the phonetic model and the glyph model, optimizes the language model design, determines the loss function, sets the pre-training parameters, and starts pre-training. The fine-tuning module obtains the first error correction corpus, preprocesses it and fine-tunes the model with it, then obtains the second error correction corpus, preprocesses it and fine-tunes the model with it. The Chinese text error correction module obtains the Chinese text to be corrected, preprocesses it, inputs it into the Chinese text error correction language model, and outputs the corrected text.
As shown in Fig. 9, Fig. 9 is a system block diagram of a Chinese speech recognition error correction system provided by another embodiment of the present application.
It can be understood that the Chinese speech recognition error correction system includes, but is not limited to, a pre-training data processing module, a pre-training module, a fine-tuning module, and a Chinese speech recognition error correction module. The first three modules work as in the Chinese text error correction system above. The Chinese speech recognition error correction module obtains the speech to be corrected, performs speech recognition on it to obtain the Chinese text to be corrected, preprocesses the text, inputs it into the Chinese text error correction language model, and outputs the corrected text.
As shown in Fig. 10, Fig. 10 is a system block diagram of language model design optimization provided by another embodiment of the present application.
It can be understood that the BERT design optimization includes: the input of the original BERT is the character embedding, while the optimized BERT additionally takes the pinyin embedding and the glyph embedding; the pre-training tasks of the original BERT are the masked language model task and the next sentence prediction task, while the optimized BERT removes the next sentence prediction task and adds the near-phonetic confusion set prediction task and the visually-similar confusion set prediction task. Pre-training of BERT is complete when the loss function reaches its minimum.
In addition, referring to Fig. 11, an embodiment of the present application also provides an electronic device.
In one example, the electronic device includes one or more processors and a memory; one processor and one memory are taken as an example in Fig. 11. The processor and the memory may be connected through a bus or in other ways; connection through a bus is taken as an example in Fig. 11.
As a non-transitory computer-readable storage medium, the memory can be used to store non-transitory software programs and non-transitory computer-executable programs, such as those implementing the model training method, the Chinese text error correction method, or the Chinese speech recognition error correction method in the above embodiments of the present application. The processor implements these methods by running the non-transitory software programs stored in the memory.
The memory may include a program storage area and a data storage area; the program storage area may store an operating system and the application programs required by at least one function, and the data storage area may store the data required to execute the model training method, the Chinese text error correction method, or the Chinese speech recognition error correction method in the above embodiments. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some implementations, the memory may optionally include memory located remotely from the processor; such remote memories may be connected to the electronic device via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The non-transitory software programs required to implement the model training method, the Chinese text error correction method, or the Chinese speech recognition error correction method in the above embodiments are stored in the memory. When executed by one or more processors, they execute the model training method in the above embodiments, for example method steps 110 to 160 in Fig. 1, steps 210 to 220 in Fig. 2, steps 310 to 360 in Fig. 3, steps 410 to 460 in Fig. 4, and steps 510 to 540 in Fig. 5; or the Chinese text error correction method, for example steps 610 to 620 in Fig. 6; or the Chinese speech recognition error correction method, for example steps 710 to 730 in Fig. 7. That is: obtaining a training Chinese corpus and a phonetic-glyph confusion set, where the phonetic-glyph confusion set is the union of a Chinese near-phonetic character confusion set and a Chinese visually-similar character confusion set; constructing a phonetic model and a glyph model from the confusion set; determining character embeddings from the corpus; inputting the corpus into the phonetic and glyph models to obtain pinyin and glyph embeddings, respectively; inputting the character, pinyin, and glyph embeddings into a deep bidirectional pre-trained language model and pre-training it with a masking strategy; and fine-tuning the pre-trained model to obtain a Chinese text error correction language model. End-to-end training in this way yields a Chinese text error correction language model that can learn near-phonetic and visually-similar character information, use it to correct typos, and improve the accuracy and interpretability of the Chinese text error correction results.
In addition, an embodiment of the present application also provides a computer-readable storage medium storing computer-executable instructions. When executed by a processor or controller, for example by a processor in the above electronic device embodiment, the instructions cause the processor to execute the model training method in the above embodiments, for example method steps 110 to 160 in Fig. 1, steps 210 to 220 in Fig. 2, steps 310 to 360 in Fig. 3, steps 410 to 460 in Fig. 4, and steps 510 to 540 in Fig. 5; or the Chinese text error correction method, for example steps 610 to 620 in Fig. 6; or the Chinese speech recognition error correction method, for example steps 710 to 730 in Fig. 7, with the same training procedure and benefits as described above.
Those of ordinary skill in the art will understand that all or some of the steps and systems in the methods disclosed above may be implemented as software, firmware, hardware, or an appropriate combination thereof. Some or all physical components may be implemented as software executed by a processor such as a central processing unit, a digital signal processor, or a microprocessor, or as hardware, or as an integrated circuit such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term "computer storage media" includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for the storage of information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technologies, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer. Furthermore, it is well known to those of ordinary skill in the art that communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.
The above is a specific description of several implementations of the present application, but the application is not limited to the above embodiments. Those skilled in the art can also make various equivalent modifications or substitutions without departing from the essence of the application, and these equivalent modifications or substitutions are all included within the scope defined by the claims of the application.

Claims (10)

  • 1. A model training method, comprising:
    obtaining a training Chinese corpus and a phonetic-glyph confusion set, wherein the phonetic-glyph confusion set is the union of a Chinese near-phonetic character confusion set and a Chinese visually-similar character confusion set;
    constructing a phonetic model and a glyph model from the phonetic-glyph confusion set;
    determining character embeddings from the training Chinese corpus;
    inputting the training Chinese corpus into the phonetic model and the glyph model to obtain pinyin embeddings and glyph embeddings, respectively;
    inputting the character embeddings, the pinyin embeddings, and the glyph embeddings into a deep bidirectional pre-trained language model and pre-training it with a masking strategy;
    fine-tuning the pre-trained deep bidirectional pre-trained language model to obtain a Chinese text error correction language model.
  • 2. The method of claim 1, wherein, after the step of obtaining the training Chinese corpus and the phonetic-glyph confusion set, the method further comprises:
    preprocessing the training Chinese corpus, wherein the preprocessing comprises punctuation normalization and simplification;
    simplifying the phonetic-glyph confusion set.
  • 3. The method of claim 1, wherein inputting the training Chinese corpus into the phonetic model and the glyph model to obtain the pinyin embeddings and the glyph embeddings respectively comprises:
    performing word segmentation on the training Chinese corpus to obtain Chinese characters;
    inputting the Chinese characters into a preset Chinese pinyin conversion module to obtain a pinyin sequence;
    inputting the pinyin sequence into the phonetic model to obtain the pinyin embedding;
    inputting the Chinese characters into a preset Chinese image conversion module to obtain character images;
    performing image enhancement on the character images to obtain an image data set;
    inputting the image data set into the glyph model to obtain the glyph embedding.
  • 4. The method of claim 1, wherein the phonetic model comprises a long short-term memory network and the glyph model comprises a convolutional neural network.
  • 5. The method of claim 1, wherein inputting the character embeddings, the pinyin embeddings, and the glyph embeddings into the deep bidirectional pre-trained language model and pre-training it with the masking strategy comprises:
    inputting the character embeddings, the pinyin embeddings, and the glyph embeddings into the deep bidirectional pre-trained language model to obtain character predictions, near-phonetic confusion set predictions, and visually-similar confusion set predictions;
    determining a mask loss from the character embeddings and the character predictions;
    determining a near-phonetic confusion set prediction loss from the pinyin embeddings and the near-phonetic confusion set predictions;
    determining a visually-similar confusion set prediction loss from the glyph embeddings and the visually-similar confusion set predictions;
    determining a model loss from the mask loss, the near-phonetic confusion set prediction loss, and the visually-similar confusion set prediction loss;
    pre-training the deep bidirectional pre-trained language model with the masking strategy according to the model loss.
  • 6. The method of claim 1, wherein fine-tuning the pre-trained deep bidirectional pre-trained language model to obtain the Chinese text error correction language model comprises:
    obtaining a first error correction corpus and a second error correction corpus, wherein the first error correction corpus is generated by a preset Chinese error correction corpus generation algorithm and the second error correction corpus is obtained from a preset Chinese text error correction data set;
    preprocessing the first error correction corpus and the second error correction corpus, wherein the preprocessing comprises punctuation normalization and simplification;
    fine-tuning the deep bidirectional pre-trained language model according to the preprocessed first error correction corpus and preset first fine-tuning parameters;
    fine-tuning the deep bidirectional pre-trained language model according to the preprocessed second error correction corpus and preset second fine-tuning parameters to obtain the Chinese text error correction language model.
  • 7. A Chinese text error correction method, comprising:
    obtaining Chinese text to be corrected;
    inputting the Chinese text to be corrected into a trained Chinese text error correction language model to obtain corrected text, wherein the Chinese text error correction language model is trained by the model training method of any one of claims 1 to 6.
  • 8. A Chinese speech recognition error correction method, comprising:
    obtaining speech to be corrected;
    performing speech recognition on the speech to be corrected to obtain Chinese text to be corrected;
    inputting the Chinese text to be corrected into a trained Chinese text error correction language model to obtain corrected text, wherein the Chinese text error correction language model is trained by the model training method of any one of claims 1 to 6.
  • 9. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein, when executing the computer program, the processor implements the model training method of any one of claims 1 to 6, or the Chinese text error correction method of claim 7, or the Chinese speech recognition error correction method of claim 8.
  • 10. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer-executable program for causing a computer to execute the model training method of any one of claims 1 to 6, or the Chinese text error correction method of claim 7, or the Chinese speech recognition error correction method of claim 8.
PCT/CN2022/130617 2021-11-23 2022-11-08 Model training method, Chinese text error correction method, electronic device, and storage medium WO2023093525A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111394466.0 2021-11-23
CN202111394466.0A CN116167362A (zh) 2021-11-23 2021-11-23 Model training method, Chinese text error correction method, electronic device, and storage medium

Publications (1)

Publication Number Publication Date
WO2023093525A1 (zh)

Family

ID=86410059

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/130617 Model training method, Chinese text error correction method, electronic device, and storage medium 2021-11-23 2022-11-08

Country Status (2)

Country Link
CN (1) CN116167362A (zh)
WO (1) WO2023093525A1 (zh)


Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118013957B (zh) * 2024-04-07 2024-07-12 江苏网进科技股份有限公司 Text sequence error correction method, device and storage medium
CN118133813A (zh) * 2024-05-08 2024-06-04 北京澜舟科技有限公司 Training method for a Chinese spelling error correction model and storage medium


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108491392A (zh) * 2018-03-29 2018-09-04 广州视源电子科技股份有限公司 Method, system, computer device and storage medium for correcting text spelling errors
WO2021189851A1 (zh) * 2020-09-03 2021-09-30 平安科技(深圳)有限公司 Text error correction method, system, device and readable storage medium
CN112287670A (zh) * 2020-11-18 2021-01-29 北京明略软件系统有限公司 Text error correction method, system, computer device and readable storage medium
CN112966496A (zh) * 2021-05-19 2021-06-15 灯塔财经信息有限公司 Chinese error correction method and system based on pinyin feature representation

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116822498A (zh) * 2023-08-30 2023-09-29 深圳前海环融联易信息科技服务有限公司 Text error correction processing method, model processing method, apparatus, device and medium
CN116822498B (zh) * 2023-08-30 2023-12-01 深圳前海环融联易信息科技服务有限公司 Text error correction processing method, model processing method, apparatus, device and medium
CN117056522A (zh) * 2023-10-11 2023-11-14 青岛网信信息科技有限公司 Internet speech optimization processing method, medium and system
CN117056522B (zh) * 2023-10-11 2024-03-15 青岛网信信息科技有限公司 Internet speech optimization processing method, medium and system
CN117829147A (zh) * 2024-01-04 2024-04-05 北京新数科技有限公司 Defense method against adversarial text attacks based on a part-of-speech masking strategy, and system, device and readable storage medium
CN118114743A (zh) * 2024-04-29 2024-05-31 支付宝(杭州)信息技术有限公司 Medical model pre-training method and apparatus, electronic device and storage medium

Also Published As

Publication number Publication date
CN116167362A (zh) 2023-05-26

Similar Documents

Publication Publication Date Title
WO2023093525A1 (zh) Model training method, Chinese text error correction method, electronic device, and storage medium
WO2021212749A1 (zh) Named entity labeling method and apparatus, computer device and storage medium
CN107729309B (zh) Deep-learning-based Chinese semantic analysis method and apparatus
US20210200961A1 (en) Context-based multi-turn dialogue method and storage medium
CN109344830B (zh) Sentence output and model training method and apparatus, computer device and storage medium
CN110276069A (zh) Automatic Chinese Braille error detection method, system and storage medium
CN113177412A (zh) BERT-based named entity recognition method, system, electronic device and storage medium
CN110781672A (zh) Machine-intelligence-based question bank production method and system
CN111914825B (zh) Character recognition method and apparatus, and electronic device
CN113268576B (zh) Deep-learning-based method and apparatus for extracting departmental semantic information
CN113743101B (zh) Text error correction method and apparatus, electronic device and computer storage medium
CN115759119B (zh) Financial text sentiment analysis method, system, medium and device
CN112307130B (zh) Document-level distant-supervision relation extraction method and system
US11615247B1 (en) Labeling method and apparatus for named entity recognition of legal instrument
CN112016271A (zh) Training method for a language style transfer model, text processing method and apparatus
Shan et al. Robust encoder-decoder learning framework towards offline handwritten mathematical expression recognition based on multi-scale deep neural network
CN115658898A (zh) Chinese-English text entity relation extraction method, system and device
Kišš et al. AT-ST: self-training adaptation strategy for OCR in domains with limited transcriptions
CN111832248A (zh) Text normalization method and apparatus, electronic device and storage medium
CN115064154A (zh) Method and apparatus for generating a mixed-language speech recognition model
CN111126059B (zh) Short text generation method, generation apparatus and readable storage medium
CN112307749A (zh) Text error detection method and apparatus, computer device and storage medium
Sharma et al. Full-page handwriting recognition and automated essay scoring for in-the-wild essays
CN112488111A (zh) Referring expression comprehension method based on a multi-level expression-guided attention network
CN112989839A (zh) Intent recognition method and system based on a keyword-feature-embedded language model

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22897608

Country of ref document: EP

Kind code of ref document: A1