CN115563959A - Chinese pinyin spelling error correction-oriented self-supervision pre-training method, system and medium - Google Patents

Chinese pinyin spelling error correction-oriented self-supervision pre-training method, system and medium

Info

Publication number
CN115563959A
CN115563959A (application number CN202211156374.3A)
Authority
CN
China
Prior art keywords
pinyin
list
character
input
chinese
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211156374.3A
Other languages
Chinese (zh)
Inventor
苏锦钿
曹庭毓
顾伟正
吴清培
高浩然
刘亚菲
洪奕槐
郑欣若
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202211156374.3A
Publication of CN115563959A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/232 Orthographic correction, e.g. spell checking or vowelisation

Abstract

The invention discloses a Chinese pinyin spelling error correction oriented self-supervision pre-training method, system and medium, wherein the method comprises the following steps: acquiring a Chinese text sequence, and converting the Chinese text sequence into an input sentence X; acquiring from the input sentence X a list of characters to be replaced using a pinyin confusion set, recorded as PYList (X); for each character X in PYList (X), obtaining the pinyin of the character X, obtaining a homophone list according to the pinyin, and replacing the character X with a new character according to the homophone list; after all the characters in PYList (X) are processed, obtaining a new Input sentence PYInput (X), and obtaining the Input (X) of the BERT model according to the Input sentence PYInput (X); and after Input (X) is fed to the BERT model for training, predicting the correct value of each word in Input (X) through the masked language model in the BERT model. The method replaces selected characters with words from their pinyin confusion sets, converts the MLM task in BERT into predicting the correct value of each replaced character, enhances the error correction capability of BERT, and can be widely applied in the field of natural language processing.

Description

Chinese pinyin spelling error correction-oriented self-supervision pre-training method, system and medium
Technical Field
The invention relates to the field of natural language processing, in particular to a Chinese pinyin spelling error correction-oriented self-supervision pre-training method, system and medium.
Background
Text error correction is an important subtask and research direction in Natural Language Processing (NLP). Its main purpose is to automatically perform text checking, error recognition and error correction by machine learning, thereby improving the accuracy of language expression and reducing the cost of manual proofreading. Chinese text correction faces more difficulties and challenges than English text correction, in particular because there are no delimiters between Chinese words and Chinese has no morphological inflection, so that both the syntactic and the semantic interpretation of Chinese depend heavily on context. As an important component of Chinese text correction research, the purpose of Chinese Spelling Correction (CSC) is to detect and correct misspelled characters (or words) appearing in Chinese text. Early research on Chinese spelling correction mainly employed rule-based or statistical methods built on long-term accumulated error correction rules and error correction dictionaries, and divided error correction into the three steps of error detection, candidate recall and error correction; its main disadvantages are that various complicated rules need to be formulated manually, that such rules lack generality, and that the methods depend heavily on the quality and quantity of the training data.
In recent years, as the two-stage paradigm of pre-training a language model and fine-tuning it on downstream tasks has achieved new state-of-the-art results on many natural language processing tasks, some researchers have begun to apply pre-trained language models such as BERT/RoBERTa to the Chinese spelling correction task, obtaining results clearly superior to traditional rule-based or statistical machine translation methods on corpora such as SIGHAN2013, SIGHAN2014 and SIGHAN2015. Some typical representative works are FASPell, Soft-Masked BERT, SpellGCN, DCN, etc.
Compared with traditional methods based on rules or statistical machine translation, CSC methods based on the BERT/RoBERTa pre-trained language model can greatly improve performance on the Chinese spelling correction task through pre-training on massive texts and fine-tuning on the downstream task. However, because BERT and similar models mainly adopt the Masked Language Model (MLM) and Next Sentence Prediction (NSP) tasks during pre-training, i.e. a word is masked with [MASK] and its correct value is then predicted, the model lacks error recognition and correction capabilities, and the pre-training and fine-tuning objectives of the model are inconsistent.
Disclosure of Invention
In order to solve at least one of the technical problems in the prior art to a certain extent, the present invention aims to provide a self-supervision pre-training method, system and medium for spelling error correction of Chinese pinyin.
The technical scheme adopted by the invention is as follows:
a Chinese pinyin spelling error correction oriented self-supervision pre-training method comprises the following steps:
acquiring a Chinese text sequence, and converting the Chinese text sequence into an input sentence X meeting the requirements of a BERT model according to a preset mark;
acquiring a list of characters needing to be replaced by a pinyin confusion set from an input sentence X, and recording the list as PYList (X);
for each character X in the PYList (X), obtaining the pinyin of the character X, obtaining a homophone list according to the pinyin, and replacing the character X with a new character according to the homophone list;
after all the characters in the PYList (X) are processed, obtaining a new Input sentence PYInput (X), and obtaining the Input (X) of the BERT model according to the Input sentence PYInput (X);
after Input (X) is fed to the BERT model for training, predicting the correct values of all words other than [CLS] and [SEP] in Input (X) through the masked language model BertForMaskedLM in the BERT model.
Further, the obtaining a list of words to be replaced with the pinyin confusion set from the input sentence X, and denoted as PYList (X), includes:
for an input sentence X, selecting the corresponding masked words by adopting the masking strategy of BERT, namely selecting 15% of the words in the input sentence X for replacement; wherein a selected word is replaced with 80% probability by a word from its pinyin confusion set, replaced with 10% probability by a random word from the vocabulary, and kept unchanged with 10% probability;
for convenience of representation, a list of all the selected words in the input sentence X as the words to be replaced with the pinyin confusion set is denoted as PYList (X).
Further, the obtaining a pinyin of the word X for each word X in the PYList (X), obtaining a homophone list according to the pinyin, and replacing the word X with a new word according to the homophone list includes:
for each word X in PYList (X) the following steps are performed:
obtaining the pinyin and tone of the character by using the Chinese-character-to-pinyin tool pypinyin in Python;
according to the pinyin of the character, obtaining the list of characters under that pinyin by using the Pinyin-to-Chinese-character tool Pinyin2Hanzi in Python; if the list is not empty, the homophone list SamePYList (x) of the character (same pinyin, same or different tone) is obtained; if the list is empty, the list SamePYList (x) is set to empty;
if the pinyin of the character ends with g, removing the trailing g and obtaining the near-sound list DiffPYList (x) of the character by using the pinyin-to-Chinese-character tool in Python;
selecting a replacement word according to the list SamePYList (x) and the list DiffPYList (x); if the list SamePYList (x) or the list DiffPYList (x) is empty, the original word is left unchanged.
Further, the obtaining a new Input sentence PYInput (X) after processing all the words in PYList (X), obtaining the Input (X) of the BERT model according to the Input sentence PYInput (X), includes:
converting each word in PYInput (X) into its index in the vocabulary and combining it with the position, token and segment embedding information as the input sentence of the BERT layer; at the same time, setting the label indices of the characters in the input sentence X that do not need to be replaced, and of the padding mark [PAD], to -100 (the index ignored by the cross-entropy loss), so as to obtain the Input (X) of the BERT model.
Further, in the training process the loss function adopts standard cross entropy, and the loss value is recorded as Loss_MAPYSC;
if the training data are all single sentences, the loss value only contains Loss_MAPYSC; if the training data consists of multiple sentences, the Next Sentence Prediction (NSP) task in BERT or the Sentence Order Prediction (SOP) task in ALBERT is combined to further predict the relation between sentences, the corresponding loss value is calculated, and it is summed with Loss_MAPYSC to obtain the loss of the model.
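For illustration, a minimal sketch of this loss combination; the variable names are illustrative and not taken from the patent:

```python
def model_loss(loss_mapysc, loss_sentence_relation=None):
    """Single-sentence data: only Loss_MAPYSC.
    Sentence-pair data: Loss_MAPYSC plus the NSP/SOP loss."""
    if loss_sentence_relation is None:
        return loss_mapysc
    return loss_mapysc + loss_sentence_relation
```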
Further, in order to better utilize the grammar and semantic knowledge learned by the pre-training language model on massive texts, the pre-training process adopts a preset pre-training language model as a basic model for initialization.
Further, the pre-training process adopts batch training with a learning rate of 5e-5, trains for 10 epochs in total, and is optimized using an Adam optimizer.
The other technical scheme adopted by the invention is as follows:
a self-supervision pre-training system for Chinese pinyin spelling error correction comprises:
the Chinese text input module is used for acquiring a Chinese text sequence and converting the Chinese text sequence into an input sentence X meeting the requirement of a BERT model according to a preset mark;
a substituted character selection module, which is used for obtaining a list of characters needing to be substituted by using the pinyin confusion set from the input sentence X and recording the list as PYList (X);
the character replacing module is used for acquiring the pinyin of each character X in the PYList (X), acquiring a homophone list according to the pinyin, and replacing the character X with a new character according to the homophone list;
the training set acquisition module is used for acquiring a new Input sentence PYInput (X) after all the characters in the PYList (X) are processed, and acquiring the Input (X) of the BERT model according to the Input sentence PYInput (X);
and the model training module is used for predicting the correct values of all the other words except [ CLS ] and [ SEP ] in the Input (X) through a mask language model BertForMaskedLM in the BERT model after the Input (X) is used as the Input of the BERT model and is trained.
The invention adopts another technical scheme that:
a Chinese pinyin spelling error correction oriented self-supervision pre-training system comprises:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method described above.
The other technical scheme adopted by the invention is as follows:
a computer readable storage medium, in which a program executable by a processor is stored, the program executable by the processor being for performing the method as described above when executed by the processor.
The beneficial effects of the invention are: the method combines Chinese character to pinyin conversion and pinyin to Chinese character tools to construct a pinyin confusion set of the masked code words for replacement, converts the MLM task in the BERT into correct value prediction of the masked code words, and finally further enhances the error detection and correction capability of the BERT on the premise that the BERT does not change the structure of the model through retraining again.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It should be understood that the drawings in the following description merely illustrate some embodiments of the present invention, and that those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic structural diagram of a MAPYSC-based BERT pre-training model in an embodiment of the present invention;
FIG. 2 is a flowchart of the self-supervised pre-training method for Chinese pinyin spelling error correction in an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention and are not to be construed as limiting the present invention. The step numbers in the following embodiments are provided only for convenience of illustration, the order between the steps is not limited at all, and the execution order of each step in the embodiments can be adapted according to the understanding of those skilled in the art.
In the description of the present invention, it should be understood that the orientation or positional relationship referred to in the description of the orientation, such as the upper, lower, front, rear, left, right, etc., is based on the orientation or positional relationship shown in the drawings, and is only for convenience of description and simplification of description, and does not indicate or imply that the device or element referred to must have a specific orientation, be constructed and operated in a specific orientation, and thus, should not be construed as limiting the present invention.
In the description of the present invention, "several" means one or more and "a plurality" means two or more; "greater than", "less than", "exceeding", etc. are understood as excluding the stated number, while "above", "below", "within", etc. are understood as including the stated number. If "first" and "second" are described, they are only used to distinguish technical features, and are not to be understood as indicating or implying relative importance, implicitly indicating the number of the indicated technical features, or implicitly indicating the precedence of the indicated technical features.
In the description of the present invention, unless otherwise explicitly limited, terms such as arrangement, installation, connection and the like should be understood in a broad sense, and those skilled in the art can reasonably determine the specific meanings of the above terms in the present invention in combination with the specific contents of the technical solutions.
Aiming at the defects of the prior art, the invention provides a self-supervised pre-training method MAPYSC (Masked language model As PinYin Spelling Correction) oriented to Chinese pinyin spelling error correction. It addresses the problem that the masked language model (MLM) adopted by the pre-trained language model BERT in the pre-training stage lacks the ability to recognize and correct Chinese pinyin spelling errors and that its objective is inconsistent with the fine-tuning stage. The masking and correct-value prediction task of the MLM on characters is converted into the recognition and correction of misspelled characters: by combining Chinese-character-to-pinyin and pinyin-to-Chinese-character conversion tools, a pinyin confusion set of the masked characters is constructed from the aspects of homophones and near-sound characters, and the training objective of the model is changed to predicting the correct characters. Finally, by retraining with MAPYSC on the basis of publicly available pre-trained language models and data sets, the ability of BERT to detect and correct Chinese pinyin spelling errors is further enhanced without changing the model structure, and the resulting model can be directly applied to various BERT-based Chinese spelling correction models to improve their performance.
MAPYSC converts the masking and correct-value prediction task of the MLM on characters in the original BERT into the recognition and correction of misspelled characters. By combining Chinese-character-to-pinyin and pinyin-to-Chinese-character conversion tools, a pinyin confusion set of the masked characters is constructed from the aspects of homophones, near-sound characters and the like and used for replacement, so as to form a pre-training data set containing Chinese pinyin spelling errors. On this basis, combined with the pre-training and loss functions of the model, a public pre-trained language model trained on massive texts with the MLM is retrained, which further enhances the error detection and correction capability of BERT for Chinese pinyin spelling, and the pre-trained BERT model obtained after retraining is finally applied to fine-tuning and testing on downstream tasks. Experiments on the public test corpora SIGHAN2013, SIGHAN2014 and SIGHAN2015 show that the MAPYSC method not only improves the error detection and correction capability of the pre-trained language model BERT for Chinese pinyin spelling correction, but can also be directly applied, without changing the BERT model structure, to other BERT-based Chinese spelling correction models such as Soft-Masked BERT, SpellGCN and DCN, further improving their performance.
As shown in fig. 1 and fig. 2, this embodiment provides a self-supervised pre-training method for Chinese pinyin spelling error correction. Exploiting the strength of the masked language model (MLM) and of pre-training in BERT at learning the context information of a text sequence and its relationships, the self-supervised pre-training method MAPYSC combines Chinese-character-to-pinyin and pinyin-to-Chinese-character tools to construct a pinyin confusion set of each masked word for replacement, converts the MLM task in BERT into predicting the correct value of the masked word, and finally, through re-pre-training, further enhances the error detection and correction capability of BERT for Chinese pinyin without changing the model structure. The method specifically comprises the following steps:
s1, obtaining a Chinese text sequence, and converting the Chinese text sequence into an input sentence X meeting the requirements of a BERT model according to a preset mark.
For the input Chinese text sequence, the text sequence is combined with special marks such as [CLS] and [SEP] to form an input sequence X meeting the BERT input requirements.
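For illustration, a minimal sketch of this preprocessing step; using the HuggingFace BertTokenizer and the bert-base-chinese checkpoint is an assumption, since the patent only requires that the input carry the [CLS]/[SEP] marks expected by BERT:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")

def build_input_sentence(text, max_len=128):
    """Build the input sentence X: [CLS] x1 ... xn [SEP], padded with [PAD]."""
    enc = tokenizer(text, max_length=max_len, padding="max_length", truncation=True)
    return enc["input_ids"], enc["attention_mask"]
```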
S2, acquiring a list of characters needing to be replaced by the pinyin confusion set from the input sentence X, and recording the list as PYList (X).
For an input sentence X, the corresponding masked words are selected by adopting the original masking strategy of BERT, namely 15% of the words in the input sentence (excluding special marks such as [CLS] and [SEP]) are selected for replacement; a selected word is replaced with 80% probability by a word from its pinyin confusion set, replaced with 10% probability by a random word from the vocabulary, and kept unchanged with 10% probability. For convenience of representation, the list of all the selected words in the input sentence X, i.e. the words to be replaced using the pinyin confusion set, is denoted as PYList (X).
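A minimal sketch of this selection step; the function and token names are illustrative, not taken from the patent:

```python
import random

def select_for_replacement(tokens, special=("[CLS]", "[SEP]", "[PAD]")):
    """Return PYList(X): the positions chosen for replacement, together with the
    action drawn for each position (80% pinyin confusion set, 10% random
    vocabulary word, 10% keep unchanged)."""
    pylist, actions = [], {}
    for i, tok in enumerate(tokens):
        if tok in special or random.random() >= 0.15:   # select ~15% of ordinary characters
            continue
        pylist.append(i)
        r = random.random()
        actions[i] = ("pinyin_confusion" if r < 0.8
                      else "random_vocab" if r < 0.9
                      else "keep")
    return pylist, actions
```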
S3, for each character X in the PYList (X), obtaining the pinyin of the character X, obtaining a homophone list according to the pinyin, and replacing the character X with a new character according to the homophone list.
For each word X in PYList (X), MAPYSC applies the following strategy and flow: 1. Obtain the pinyin and tone of the character using the Chinese-character-to-pinyin tool pypinyin in Python; by default pypinyin does not use the polyphone mode, i.e. heteronym is set to False. 2. According to the pinyin of the character, obtain the list of characters under that pinyin using the Pinyin-to-Chinese-character tool Pinyin2Hanzi in Python; if the list is not empty, the homophone list SamePYList (x) of the character (same pinyin, same or different tone) is obtained; if the list is empty, SamePYList (x) is set to empty. 3. If the pinyin of the character ends with g, remove the trailing g and obtain the near-sound list DiffPYList (x) of the character using process 2 above. 4. On the basis of SamePYList (x) and DiffPYList (x), randomly select a word from SamePYList (x) or DiffPYList (x) with probabilities of 70% and 30%, respectively; if SamePYList (x) or DiffPYList (x) is empty, the original word is kept unchanged.
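A minimal sketch of this replacement strategy using the pypinyin and Pinyin2Hanzi packages named above; the dag(...) call and its return format follow the packages' public examples and may need adjustment:

```python
import random
from pypinyin import lazy_pinyin
from Pinyin2Hanzi import DefaultDagParams, dag

dag_params = DefaultDagParams()

def chars_for_pinyin(py, top_k=10):
    """Candidate Chinese characters whose pinyin is `py` (empty list if none)."""
    try:
        return [item.path[0] for item in dag(dag_params, [py], path_num=top_k)]
    except Exception:
        return []

def replace_with_confusion(char):
    """Replace `char` by a homophone (70%) or a near-sound character (30%)."""
    py = lazy_pinyin(char)[0]                      # toneless pinyin, heteronym off
    same_py_list = chars_for_pinyin(py)            # SamePYList(x)
    diff_py_list = chars_for_pinyin(py[:-1]) if py.endswith("g") else []  # DiffPYList(x)
    pool = same_py_list if random.random() < 0.7 else diff_py_list
    return random.choice(pool) if pool else char   # keep the original if the chosen list is empty
```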
And S4, after all the words in the PYList (X) are processed, obtaining a new Input sentence PYInput (X), and obtaining the Input (X) of the BERT model according to the Input sentence PYInput (X).
Step S3 above is repeated until all the words in PYList (X) have been processed, finally yielding the new input sentence PYInput (X). Each character in PYInput (X) is converted into its index in the vocabulary and combined with the position, token and segment embedding information as the input sentence of the BERT layer; the label indices of the characters in the original X that were not selected, and of the padding mark [PAD], are set to -100 (the index ignored by the cross-entropy loss), finally yielding the Input (X) of the BERT model.
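A sketch of this input construction; using -100 as the label ignored by the cross-entropy loss follows the HuggingFace BertForMaskedLM convention, and the helper name is illustrative:

```python
import torch

def build_model_input(orig_ids, replaced_ids, selected_positions, pad_id):
    """input_ids come from the replaced sentence PYInput(X); labels come from the
    original sentence X, with every non-selected position and every [PAD]
    position set to -100 so the loss ignores it."""
    input_ids = torch.tensor([replaced_ids])
    labels = torch.tensor([orig_ids])
    for pos, tok_id in enumerate(orig_ids):
        if pos not in selected_positions or tok_id == pad_id:
            labels[0, pos] = -100
    return input_ids, labels
```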
And S5, after the Input (X) is used as the Input of the BERT model and is trained, predicting the correct values of the words except [ CLS ] and [ SEP ] in the Input (X) through a mask language model BertForMaskedLM in the BERT model.
Input (X) is used as the input of BertForMaskedLM, which is trained to predict the correct value of each word; the loss function adopts standard cross entropy, and the loss value is recorded as Loss_MAPYSC. In order to better utilize the grammatical and semantic knowledge learned by the pre-trained language model on massive texts, the pre-training process is initialized with a public pre-trained language model (such as BERT-base or BERT-large) as the base model. The pre-training process adopts batch training with a learning rate of 5e-5, trains for 10 epochs in total, and uses an Adam optimizer. If the training data are all single sentences, the loss value only contains Loss_MAPYSC. If the training data consists of multiple sentences, the Next Sentence Prediction (NSP) task in BERT or the Sentence Order Prediction (SOP) task in ALBERT can be combined to predict the relation between sentences; the corresponding loss value is calculated and summed with Loss_MAPYSC as the loss of the model.
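A minimal training-loop sketch under the settings above; train_loader is an assumed PyTorch DataLoader yielding (input_ids, attention_mask, labels) batches built as in step S4, and initializing from bert-base-chinese is an illustrative choice rather than something prescribed by the patent:

```python
import torch
from torch.optim import Adam
from transformers import BertForMaskedLM

model = BertForMaskedLM.from_pretrained("bert-base-chinese")
optimizer = Adam(model.parameters(), lr=5e-5)

model.train()
for epoch in range(10):                               # 10 epochs, as described above
    for input_ids, attention_mask, labels in train_loader:
        outputs = model(input_ids=input_ids,
                        attention_mask=attention_mask,
                        labels=labels)                 # standard cross entropy = Loss_MAPYSC
        optimizer.zero_grad()
        outputs.loss.backward()
        optimizer.step()
```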
Steps S4 and S5 cover model input construction and correct-value prediction training: the original sentence is replaced using the constructed confusion set to obtain a new input sentence; on this basis BertForMaskedLM predicts the correct value of each masked word, and the corresponding loss value is calculated with the standard cross entropy to optimize the model parameters. The output of step S5 is the cross entropy between the predicted vocabulary index of each masked word and the index of its correct value; the smaller the value, the more accurate the model prediction, and the larger the value, the less accurate the prediction.
As an alternative implementation, the self-supervised pre-training method for Chinese pinyin spelling error correction of this embodiment is completed by automatically constructing the confusion sets and the training data on the basis of the pre-trained language model BERT. Step S1 is the data preprocessing stage. Step S2 is the masking strategy and masking procedure, i.e. 15% of the words in the input sentence (excluding [CLS], [SEP], etc.) are selected, where a selected word is replaced with 80% probability by a word from its pinyin confusion set, replaced with 10% probability by a random word from the vocabulary, and kept unchanged with 10% probability. Step S3 constructs the pinyin confusion sets and replaces the masked characters, i.e. the homophone and near-sound confusion sets of the masked characters are obtained using the Chinese-character-to-pinyin and pinyin-to-Chinese-character tools, and homophones and near-sound characters from the pinyin confusion set are selected for replacement with probabilities of 70% and 30%, respectively. Step S4 constructs the input of BERT, where the label indices of the non-selected words and the padding mark [PAD] are set to -100. Step S5 predicts the correct value of each masked word based on the BertForMaskedLM model, calculates the loss value with standard cross entropy, and then optimizes the model parameters. During pre-training the model is optimized with respect to the loss value using an Adam optimizer, with a uniform learning rate of 5e-5.
In summary, compared with the prior art, the method of this embodiment has the following advantages and beneficial effects: the invention provides a self-supervised pre-training method MAPYSC oriented to Chinese pinyin spelling error correction, which addresses the problems that the masked language model MLM adopted by the pre-trained language model BERT in the pre-training stage lacks the ability to recognize and correct Chinese pinyin spelling errors and is inconsistent with the objective of the fine-tuning stage. First, the input sentence is preprocessed to form an input meeting the BERT requirements; then, the corresponding masked words are selected using the MLM masking strategy of BERT; next, Chinese-character-to-pinyin and pinyin-to-Chinese-character tools are combined to construct a pinyin confusion set of the masked characters from the aspects of homophones, near-sound characters and the like, the masked characters are replaced accordingly, and the training objective of the model is changed to predicting the correct characters; finally, on the basis of a public pre-trained language model and data set, retraining with MAPYSC further enhances the error detection and correction capability for Chinese spelling without changing the BERT model structure, and the resulting pre-trained BERT model can be directly applied to various other BERT-based Chinese spelling correction methods, effectively improving their performance.
The embodiment also provides a self-supervision pre-training system for Chinese pinyin spelling error correction, which includes:
the Chinese text input module is used for acquiring a Chinese text sequence and converting the Chinese text sequence into an input sentence X meeting the requirement of a BERT model according to a preset mark;
a substituted character selection module, which is used for acquiring a list of characters needing to be substituted by using the pinyin confusion set from the input sentence X and recording the list as PYList (X);
the character replacing module is used for acquiring the pinyin of each character X in the PYList (X), acquiring a homophone list according to the pinyin, and replacing the character X with a new character according to the homophone list;
a training set obtaining module, configured to obtain a new Input sentence PYInput (X) after all words in the PYList (X) are processed, and to obtain the Input (X) of the BERT model according to the Input sentence PYInput (X);
and the model training module is used for predicting the correct value of each word after the Input (X) is used as the Input of the BertForMaskedLM and is trained.
The Chinese pinyin spelling error correction-oriented self-supervision pre-training system can execute the Chinese pinyin spelling error correction-oriented self-supervision pre-training method provided by the embodiment of the method, can execute the implementation steps of any combination of the embodiment of the method, and has corresponding functions and beneficial effects of the method.
The embodiment further provides an auto-supervised pre-training system for chinese pinyin spelling error correction, which includes:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method shown in fig. 2.
The Chinese pinyin spelling error correction-oriented self-supervision pre-training system can execute the Chinese pinyin spelling error correction-oriented self-supervision pre-training method provided by the embodiment of the method, can execute the implementation steps of any combination of the embodiment of the method, and has corresponding functions and beneficial effects of the method.
The embodiment of the application also discloses a computer program product or a computer program, which comprises computer instructions, and the computer instructions are stored in a computer readable storage medium. The computer instructions may be read by a processor of a computer device from a computer-readable storage medium, and executed by the processor, causing the computer device to perform the method illustrated in fig. 2.
The embodiment also provides a storage medium, which stores an instruction or a program capable of executing the Chinese pinyin spelling correction oriented self-supervision pre-training method, and when the instruction or the program is run, the method can be executed by any combination of the embodiments, and has corresponding functions and beneficial effects.
In alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flow charts of the present invention are provided by way of example in order to provide a more comprehensive understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed and in which sub-operations described as part of larger operations are performed independently.
Furthermore, although the present invention is described in the context of functional modules, it should be understood that, unless otherwise stated to the contrary, one or more of the described functions and/or features may be integrated in a single physical device and/or software module, or one or more functions and/or features may be implemented in a separate physical device or software module. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary for an understanding of the present invention. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be understood within the ordinary skill of an engineer given the nature, function, and interrelationships of the modules. Accordingly, those skilled in the art can, using ordinary skill, practice the invention as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative of and not intended to limit the scope of the invention, which is defined by the appended claims and their full scope of equivalents.
The functions may be stored in a computer-readable storage medium if they are implemented in the form of software functional units and sold or used as separate products. Based on such understanding, the technical solution of the present invention or a part thereof which substantially contributes to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
In the foregoing description of the specification, reference to the description of "one embodiment/example," "another embodiment/example," or "certain embodiments/examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A Chinese pinyin spelling error correction oriented self-supervision pre-training method is characterized by comprising the following steps:
acquiring a Chinese text sequence, and converting the Chinese text sequence into an input sentence X meeting the requirements of a BERT model according to a preset mark;
acquiring a list of characters needing to be replaced by a pinyin confusion set from an input sentence X, and recording the list as PYList (X);
for each character X in the PYList (X), obtaining the pinyin of the character X, obtaining a homophone list according to the pinyin, and replacing the character X with a new character according to the homophone list;
after all the words in PYList (X) have been processed, obtaining a new Input sentence PYInput (X), and obtaining an Input (X) of the BERT model from the Input sentence PYInput (X);
and after the Input (X) is used as the Input of the BERT model and is trained, predicting the correct value of each word in the Input (X) through a mask language model in the BERT model.
2. The Chinese-pinyin-spelling-error-correction-oriented self-supervision-pretraining method as claimed in claim 1, wherein the obtaining a list of words to be replaced by a pinyin confusion set from an input sentence X and recording the list as PYList (X) comprises:
for an input sentence X, selecting the corresponding masked words by adopting the masking strategy of BERT, namely selecting 15% of the words in the input sentence X for replacement; wherein a selected word is replaced with 80% probability by a word from its pinyin confusion set, replaced with 10% probability by a random word from the vocabulary, and kept unchanged with 10% probability;
for convenience of representation, a list of all the selected words in the input sentence X as the words to be replaced with the pinyin confusion set is denoted as PYList (X).
3. The self-supervised pre-training method for Chinese pinyin spelling error correction as claimed in claim 1, wherein, for each word X in PYList (X), obtaining the pinyin of the word X, obtaining a homophone list according to the pinyin, and replacing the word X with a new word according to the homophone list comprises:
for each word X in PYList (X) the following steps are performed:
obtaining the pinyin and tone of the character by using the Chinese-character-to-pinyin tool pypinyin in Python;
according to the pinyin of the character, obtaining the list of characters under that pinyin by using the Pinyin-to-Chinese-character tool Pinyin2Hanzi in Python; if the list is not empty, the homophone list SamePYList (x) of the character (same pinyin, same or different tone) is obtained; if the list is empty, the list SamePYList (x) is set to empty;
if the pinyin of the character ends with g, removing the trailing g and obtaining the near-sound list DiffPYList (x) of the character by using the pinyin-to-Chinese-character tool in Python;
selecting a replacement word according to the list SamePYList (x) and the list DiffPYList (x); if the list SamePYList (x) or the list DiffPYList (x) is empty, the original word is left unchanged.
4. The self-supervised pre-training method for chinese pinyin spelling error correction as recited in claim 1, wherein the obtaining a new Input sentence PYInput (X) after processing all words in PYList (X), and obtaining the Input (X) of the BERT model according to the Input sentence PYInput (X) comprises:
converting each character in PYInput (X) into its index in the vocabulary and merging it with the position, token and segment embedding information as the input sentence of the BERT layer; and meanwhile setting the label indices of the characters in the Input sentence X that do not need to be replaced, and of the padding mark [PAD], to -100 to obtain the Input (X) of the BERT model.
5. The self-supervised pre-training method for Chinese pinyin spelling error correction as claimed in claim 1, wherein in the training process the loss function adopts standard cross entropy, and the loss value is recorded as Loss_MAPYSC;
if the training data are all single sentences, the loss value only contains Loss_MAPYSC; if the training data consists of multiple sentences, the Next Sentence Prediction (NSP) task in BERT or the Sentence Order Prediction (SOP) task in ALBERT is combined to further predict the relation between sentences, and the corresponding loss value is calculated and summed with Loss_MAPYSC as the loss of the model.
6. The self-supervised pre-training method for Chinese phonetic spelling error correction according to claim 5, wherein in order to better utilize the grammar and semantic knowledge learned by the pre-trained language model on massive texts, the pre-training process is initialized by using a preset pre-trained language model as a basic model.
7. The self-supervised pre-training method for Chinese pinyin spelling error correction as claimed in claim 6, wherein the pre-training process adopts batch training with a learning rate of 5e-5, trains for 10 epochs in total, and is optimized using an Adam optimizer.
8. A Chinese pinyin spelling error correction oriented self-supervision pre-training system is characterized by comprising:
the Chinese text input module is used for acquiring a Chinese text sequence and converting the Chinese text sequence into an input sentence X meeting the requirement of a BERT model according to a preset mark;
a substituted character selection module, which is used for obtaining a list of characters needing to be substituted by using the pinyin confusion set from the input sentence X and recording the list as PYList (X);
the character replacing module is used for acquiring the pinyin of each character X in the PYList (X), acquiring a homophone list according to the pinyin, and replacing the character X with a new character according to the homophone list;
the training set acquisition module is used for acquiring a new Input sentence PYInput (X) after all the characters in the PYList (X) are processed, and acquiring the Input (X) of the BERT model according to the Input sentence PYInput (X);
and the model training module is used for predicting the correct value of each word in the Input (X) through a mask language model in the BERT model after the Input (X) is used as the Input of the BERT model and is trained.
9. A self-supervision pre-training system for Chinese pinyin spelling error correction is characterized by comprising:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method of any one of claims 1-7.
10. A computer readable storage medium in which a program executable by a processor is stored, wherein the program, when executed by the processor, performs the method according to any one of claims 1 to 7.
CN202211156374.3A 2022-09-22 2022-09-22 Chinese pinyin spelling error correction-oriented self-supervision pre-training method, system and medium Pending CN115563959A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211156374.3A CN115563959A (en) 2022-09-22 2022-09-22 Chinese pinyin spelling error correction-oriented self-supervision pre-training method, system and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211156374.3A CN115563959A (en) 2022-09-22 2022-09-22 Chinese pinyin spelling error correction-oriented self-supervision pre-training method, system and medium

Publications (1)

Publication Number Publication Date
CN115563959A true CN115563959A (en) 2023-01-03

Family

ID=84740245

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211156374.3A Pending CN115563959A (en) 2022-09-22 2022-09-22 Chinese pinyin spelling error correction-oriented self-supervision pre-training method, system and medium

Country Status (1)

Country Link
CN (1) CN115563959A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116127953A (en) * 2023-04-18 2023-05-16 之江实验室 Chinese spelling error correction method, device and medium based on contrast learning


Similar Documents

Publication Publication Date Title
CN110196894B (en) Language model training method and language model prediction method
CN107729313B (en) Deep neural network-based polyphone pronunciation distinguishing method and device
CN106297800B (en) Self-adaptive voice recognition method and equipment
US11031009B2 (en) Method for creating a knowledge base of components and their problems from short text utterances
CN111951789B (en) Training of speech recognition model, speech recognition method, apparatus, device and medium
CN114580382A (en) Text error correction method and device
CN115795009A (en) Cross-language question-answering system construction method and device based on generating type multi-language model
CN110705262B (en) Improved intelligent error correction method applied to medical technology inspection report
CN116127953B (en) Chinese spelling error correction method, device and medium based on contrast learning
CN111553159B (en) Question generation method and system
CN114896971B (en) Method, device and storage medium for recognizing specific prefix and suffix negative words
Singh et al. HINDIA: a deep-learning-based model for spell-checking of Hindi language
CN112101032A (en) Named entity identification and error correction method based on self-distillation
CN115455175A (en) Cross-language abstract generation method and device based on multi-language model
CN115563959A (en) Chinese pinyin spelling error correction-oriented self-supervision pre-training method, system and medium
CN112183060B (en) Reference resolution method of multi-round dialogue system
CN112883713A (en) Evaluation object extraction method and device based on convolutional neural network
CN114896966A (en) Method, system, equipment and medium for positioning grammar error of Chinese text
CN114896382A (en) Artificial intelligent question-answering model generation method, question-answering method, device and storage medium
WO2022251720A1 (en) Character-level attention neural networks
CN115203206A (en) Data content searching method and device, computer equipment and readable storage medium
CN111090720B (en) Hot word adding method and device
CN115099222A (en) Punctuation mark misuse detection and correction method, device, equipment and storage medium
CN114333760A (en) Information prediction module construction method, information prediction method and related equipment
CN113012685B (en) Audio recognition method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination