WO2020186778A1 - Error word correction method and device, computer device, and storage medium - Google Patents

Error word correction method and device, computer device, and storage medium Download PDF

Info

Publication number
WO2020186778A1
Authority
WO
WIPO (PCT)
Prior art keywords
pinyin
sentence
data set
natural language
language data
Prior art date
Application number
PCT/CN2019/117237
Other languages
French (fr)
Chinese (zh)
Inventor
解笑
徐国强
邱寒
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2020186778A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3343 Query execution using phonetics
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/232 Orthographic correction, e.g. spell checking or vowelisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L2015/088 Word spotting

Definitions

  • This application relates to the technical field of speech recognition, and in particular to a method, device, computer device and non-volatile readable storage medium for correcting wrong words.
  • the first aspect of this application provides a method for correcting a wrong word.
  • the method includes:
  • Pre-training the neural network model by using the first sample set to obtain a pre-trained neural network model; and
  • the pinyin sequence of the sentence to be corrected is input into the fine-tuned neural network model for error correction, and the corrected sentence is obtained.
  • the second aspect of the present application provides a wrong word correction device, the device includes:
  • the first acquisition module is configured to acquire a universal natural language data set, the universal natural language data set containing multiple sentences;
  • a conversion module configured to convert each sentence contained in the universal natural language data set into a pinyin sequence to obtain a pinyin-sentence pair of the universal natural language data set;
  • the generating module is used to select multiple pinyin-sentence pairs from the pinyin-sentence pairs in the universal natural language data set, and replace part of the pinyin of each selected pinyin-sentence pair with similar pinyin to obtain the replaced pinyin-sentence pairs, combining the unselected pinyin-sentence pairs of the general natural language data set and the replaced pinyin-sentence pairs into a first sample set;
  • the pre-training module is used to pre-train the neural network model with the first sample set to obtain the pre-trained neural network model
  • the second acquisition module is used to acquire multiple pinyin-sentence pairs containing similar pinyin related to a specific field as a second sample set;
  • a fine-tuning module configured to fine-tune the pre-trained neural network model by using the second sample set to obtain a fine-tuned neural network model
  • the error correction module is used to input the pinyin sequence of the sentence to be corrected into the fine-tuned neural network model for error correction, and obtain the corrected sentence.
  • a third aspect of the present application provides a computer device, the computer device includes a processor, and the processor is configured to implement the wrong word correction method when executing computer-readable instructions stored in a memory.
  • a fourth aspect of the present application provides a non-volatile readable storage medium having computer readable instructions stored thereon, and when the computer readable instructions are executed by a processor, the wrong word correction method is implemented.
  • This application obtains a universal natural language data set containing multiple sentences; converts each sentence in the data set into a pinyin sequence to obtain the pinyin-sentence pairs of the data set; selects multiple pinyin-sentence pairs from these pairs and replaces part of the pinyin of each selected pair with similar pinyin to obtain the replaced pinyin-sentence pairs; composes the unselected pinyin-sentence pairs of the general natural language data set and the replaced pinyin-sentence pairs into a first sample set; pre-trains the neural network model with the first sample set to obtain a pre-trained neural network model; acquires a number of pinyin-sentence pairs containing similar pinyin related to a specific field as a second sample set; fine-tunes the pre-trained neural network model with the second sample set to obtain a fine-tuned neural network model; and inputs the pinyin sequence of the sentence to be corrected into the fine-tuned neural network model for error correction to obtain the corrected sentence.
  • Fig. 1 is a flowchart of a method for correcting a wrong word provided by an embodiment of the present application.
  • Figure 2 is a structural diagram of a wrong word correction device provided by an embodiment of the present application.
  • Fig. 3 is a schematic diagram of a computer device provided by an embodiment of the present application.
  • the wrong word correction method of this application is applied to one or more computer devices.
  • the computer device is a device that can automatically perform numerical calculation and/or information processing in accordance with preset or stored instructions. Its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), embedded devices, etc.
  • the computer device may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server.
  • the computer device can interact with the user through a keyboard, a mouse, a remote control, a touch panel, or a voice control device.
  • FIG. 1 is a flowchart of a method for correcting a wrong word provided in Embodiment 1 of the present application.
  • the wrong word correction method is applied to a computer device.
  • the method for correcting wrong words in this application corrects sentences obtained by speech recognition.
  • the method for correcting wrong words can solve the problem that domain-specific terms cannot be accurately predicted due to the general-purpose nature of the speech recognition system, and at the same time enhance the error correction system's ability to find wrong words when domain-specific terms are misrecognized as common words, improving the user experience.
  • the wrong word correction method includes:
  • Step 101 Obtain a universal natural language data set, the universal natural language data set containing multiple sentences.
  • the universal natural language data set is a Chinese text containing everyday words.
  • the universal natural language data set can be collected from data sources such as books, news, web pages (such as Baidu Baike, Wikipedia, etc.).
  • text recognition can be performed on text in a book to obtain the universal natural language data set.
  • speech recognition can be performed on broadcast news to obtain the universal natural language data set.
  • text can be captured from a web page to obtain the universal natural language data set.
  • the universal natural language data set can be read from a preset database.
  • the preset database can store a large amount of Chinese texts in advance.
  • the Chinese text input by the user may be received, and the Chinese text input by the user may be used as the universal natural language data set.
  • Step 102 Convert each sentence contained in the universal natural language data set into a pinyin sequence to obtain a pinyin-sentence pair of the universal natural language data set.
  • the universal natural language data set may include multiple Chinese texts, and each Chinese text may include multiple sentences.
  • each Chinese text can be divided into multiple sentences according to punctuation marks (such as commas, semicolons, periods, etc.), and each sentence obtained by the division can be converted into a pinyin sequence to obtain the pinyin-sentence pair corresponding to each sentence.
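  • the splitting and conversion described above can be sketched as follows (a minimal illustration; the tiny character-to-pinyin table, the sample characters, and the tone-digit notation are all assumptions standing in for a full comparison table):

```python
import re

# Tiny character-to-pinyin table; a stand-in for a full mapping built from
# a real comparison table. Tones are written as a trailing digit.
HANZI_TO_PINYIN = {"谁": "shei2", "投": "tou2", "保": "bao3", "淘": "tao2"}

def split_sentences(text):
    """Split Chinese text into sentences on common punctuation marks."""
    return [p for p in re.split(r"[，。；！？,;.!?]", text) if p]

def to_pinyin_sequence(sentence):
    """Convert a sentence to a pinyin sequence via the lookup table."""
    return " ".join(HANZI_TO_PINYIN.get(ch, ch) for ch in sentence)

def make_pinyin_sentence_pairs(text):
    """Build one (pinyin sequence, sentence) pair per sentence."""
    return [(to_pinyin_sequence(s), s) for s in split_sentences(text)]

pairs = make_pinyin_sentence_pairs("谁投保。谁淘保")
# pairs[0] -> ("shei2 tou2 bao3", "谁投保")
```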
  • the sentence can be converted into a pinyin sequence according to the ASCII code of the Chinese character. Since Chinese characters are represented by ASCII codes in the computer system, the correspondence between each pinyin and each ASCII code, whether already existing in the computer system or established by the user, can be used to convert sentences into pinyin sequences. If the sentence contains polyphonic characters, the multiple pinyins of each polyphonic character can be listed, and the correct pinyin selected by the user can be received.
  • the sentence can be converted into a pinyin sequence according to the Unicode value of the Chinese character. Specific steps are as follows:
  • the numbers of the multiple pinyins corresponding to each polyphonic character can be added to the Unicode value-pinyin number comparison table according to the Unicode value of the polyphonic character.
  • the Unicode value of the polyphonic character is determined, and the numbers of the multiple pinyins corresponding to the polyphonic character are obtained from the Unicode value-pinyin number comparison table according to that Unicode value.
  • the multiple pinyins corresponding to the polyphonic character are then obtained from the pinyin-number comparison table according to those numbers.
  • the correct pinyin selected by the user from the plurality of pinyin can be received, and the pinyin selected by the user can be used as the correct pinyin of the polyphone in the sentence.
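  • the table lookup described in these steps can be sketched as follows (both tables are illustrative stand-ins with a few made-up entries; real comparison tables would cover the full character set):

```python
# Illustrative stand-ins for the Unicode value-pinyin number comparison
# table and the pinyin-number comparison table described above.
UNICODE_TO_PINYIN_NUMBERS = {ord("投"): [1], ord("行"): [2, 3]}  # 行 is polyphonic
PINYIN_BY_NUMBER = {1: "tou2", 2: "xing2", 3: "hang2"}

def pinyin_candidates(ch):
    """Look up all candidate pinyins of a character via its Unicode value."""
    numbers = UNICODE_TO_PINYIN_NUMBERS.get(ord(ch), [])
    return [PINYIN_BY_NUMBER[n] for n in numbers]

candidates = pinyin_candidates("行")  # two candidates; the user picks one
```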
  • Step 103 Select multiple pinyin-sentence pairs from the pinyin-sentence pairs of the universal natural language data set, and replace part of the pinyin of each selected pinyin-sentence pair with similar pinyin to obtain the replaced pinyin-sentence pairs, and compose the unselected pinyin-sentence pairs of the universal natural language data set and the replaced pinyin-sentence pairs into a first sample set.
  • the multiple pinyin-sentence pairs may be randomly selected from the pinyin-sentence pairs in the universal natural language data set, and part of the pinyin in each selected pinyin-sentence pair may be replaced with similar pinyin.
  • a plurality of pinyin-sentence pairs can be selected from the pinyin-sentence pairs of the general natural language data set according to a preset ratio. For example, 20% of the pinyin-sentence pairs can be selected from the pinyin-sentence pairs in the universal natural language data set for pinyin replacement. For example, if the universal natural language data set includes 100 sentences (that is, includes 100 pinyin-sentence pairs), then 20 pinyin-sentence pairs are selected for pinyin replacement.
  • the training samples of the first sample set include the unselected pinyin-sentence pairs, that is, correct pinyin-sentence pairs, and also the replaced pinyin-sentence pairs, that is, pinyin-sentence pairs in which part of the pinyin has been replaced with similar pinyin.
  • This application is mainly used to correct errors in sentences obtained by speech recognition.
  • Most sentence errors produced by speech recognition are cases where the words in the sentence are meaningful but the sentence is meaningless. For example, "who needs to insure for whom" is sometimes recognized as "who needs to Taobao for". Therefore, not only are the correct pinyin-sentence pairs needed as training samples, but pinyin-sentence pairs in which part of the pinyin has been replaced with similar pinyin are also needed as training samples for the model.
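  • the construction of the first sample set can be sketched as follows (the similar-pinyin confusion table and the 20% default ratio are illustrative assumptions):

```python
import random

# Made-up similar-pinyin confusion table for illustration; a real system
# would use a fuller table of acoustically confusable syllables.
SIMILAR_PINYIN = {"tou2": ["tao2"], "bao3": ["biao3"]}

def corrupt_pair(pair, rng):
    """Swap one syllable of the pinyin for a confusable one."""
    pinyin, sentence = pair
    syllables = pinyin.split()
    candidates = [i for i, s in enumerate(syllables) if s in SIMILAR_PINYIN]
    if candidates:
        i = rng.choice(candidates)
        syllables[i] = rng.choice(SIMILAR_PINYIN[syllables[i]])
    # The sentence (the training target) stays correct: the model must
    # recover the right characters from the corrupted pinyin.
    return " ".join(syllables), sentence

def build_first_sample_set(pairs, ratio=0.2, seed=0):
    """Corrupt a preset ratio of the pairs; keep the rest unchanged."""
    rng = random.Random(seed)
    selected = set(rng.sample(range(len(pairs)), int(len(pairs) * ratio)))
    return [corrupt_pair(p, rng) if i in selected else p
            for i, p in enumerate(pairs)]

pairs = [("shei2 tou2 bao3", "谁投保")] * 100
first_set = build_first_sample_set(pairs)  # 20 of the 100 pairs corrupted
```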
  • Step 104 Pre-train the neural network model by using the first sample set to obtain a pre-trained neural network model.
  • the input of the neural network model is a pinyin sequence, and the output is the corresponding sentence (i.e., a sequence of Chinese characters).
  • for each pinyin in the input sequence, the corresponding Chinese character is predicted.
  • each unselected pinyin-sentence pair (i.e., unreplaced pinyin-sentence pair) and each replaced pinyin-sentence pair are used as training samples.
  • the pinyin sequence in the pinyin-sentence pair is the input of the neural network model, and the sentence in the pinyin-sentence pair is the ground-truth output.
  • the neural network model may be a transformer model.
  • the transformer model accepts a sequence as input and outputs a sequence.
  • the Transformer model uses a Pinyin sequence as input and outputs a sequence of Chinese characters.
  • the transformer model includes an encoding layer, a self-attention layer, and a decoding layer.
  • the coding layer and the decoding layer correspond to the coding of Pinyin and the decoding of Chinese characters respectively.
  • the self-attention layer is used to predict Chinese characters from repeated pinyin. Because Chinese pinyin contains many repetitions, different Chinese characters and words correspond to the same pinyin (for example, "Bangxiao" and "baoxiao" have the same pinyin and tone), so when making a prediction for each pinyin, the model needs to "pay attention" to the pinyin sequence of the entire sentence instead of only the pinyin at the current position.
  • the self-attention mechanism can make the pinyin of a certain position obtain the pinyin representations of all other positions, so as to make predictions of Chinese characters more in line with the sentence scenario.
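  • the self-attention idea can be illustrated with a toy scaled dot-product computation in plain Python (the embeddings and dimensions are made up, and a real transformer adds learned query/key/value projections and multiple attention heads):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with Q = K = V = the inputs:
    each position's output is a weighted mix of every position's vector."""
    d = len(vectors[0])
    outputs = []
    for q in vectors:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in vectors]
        weights = softmax(scores)  # how much this position attends to each
        outputs.append([sum(w * v[j] for w, v in zip(weights, vectors))
                        for j in range(d)])
    return outputs

# Three pinyin positions with toy 2-dimensional embeddings: every output
# row blends information from all three positions, not just its own.
mixed = self_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```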
  • After training with a large number of samples, the transformer model can output the corresponding Chinese character sequence for an input pinyin sequence.
  • Step 105 Acquire a plurality of pinyin-sentence pairs that contain similar pinyin related to a specific field as a second sample set.
  • Each training sample in the second sample set is a pinyin-sentence pair related to a specific field, and the pinyin-sentence pair contains similar pinyin related to the specific field.
  • the specific field is the specialized field to which this method is applied, such as law, insurance, etc.
  • the language data set obtained in step 101 is a general natural language data set, which mainly contains some everyday words.
  • the first sample set obtained from the general natural language data set consists of training samples about everyday words. Therefore, the pre-trained neural network model can correct sentences from daily life well when they contain obvious speech recognition errors. However, when encountering certain specialized fields such as law and insurance, the error correction effect of the neural network model is reduced, and many domain-specific words will be recognized as everyday words. For example, "insure" in "who needs to insure" is recognized as "Taobao". Therefore, when the method is applied to a specific field for error correction, sample data from that field is required.
  • the pinyin of the domain-specific words in the pinyin-sentence pairs of a domain text data set is replaced with similar pinyin to obtain pinyin-sentence pairs containing similar pinyin related to the specific field. For example, the pinyin of "insure" in "who needs to insure for" (tou, second tone; bao, third tone) is replaced with the pinyin of "Taobao" (tao, second tone; bao, third tone).
  • a database may be established in advance to store the pinyin-sentence pairs that are incorrectly recognized in the specific field, and a plurality of pinyin-sentence pairs containing similar pinyin related to the specific field can be obtained from the database.
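  • the replacement described above can be sketched as follows (the confusion map, with the pinyin of "insure" mapped to the pinyin of "Taobao", is an illustrative assumption):

```python
# Illustrative domain confusion map: the pinyin of the domain term 投保
# ("insure", tou2 bao3) is confusable with 淘宝 ("Taobao", tao2 bao3).
DOMAIN_CONFUSIONS = {"tou2 bao3": "tao2 bao3"}

def make_domain_sample_set(pairs):
    """Return pairs whose pinyin contains a domain term, with that term's
    pinyin replaced by the similar everyday pinyin; the sentence stays
    correct so the model learns to recover the domain term."""
    out = []
    for pinyin, sentence in pairs:
        for correct, similar in DOMAIN_CONFUSIONS.items():
            if correct in pinyin:
                out.append((pinyin.replace(correct, similar), sentence))
    return out

second_set = make_domain_sample_set([("shei2 xu1 yao4 tou2 bao3", "谁需要投保")])
# second_set[0][0] == "shei2 xu1 yao4 tao2 bao3"
```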
  • Step 106 Use the second sample set to fine-tune the pre-trained neural network model to obtain a fine-tuned neural network model.
  • the purpose of fine-tuning the neural network model by using the second sample set is to make the neural network model more suitable for a specific field and improve the error correction accuracy rate in the specific field.
  • the model after fine-tuning training is more inclined to predict the exclusive words in the specific field, thereby improving the effect of correcting the wrong words of speech recognition errors.
  • the weights of the neurons in the first few layers of the neural network model can be fixed, and the weights of the neurons in the subsequent layers of the neural network model can be fine-tuned. This is mainly to avoid over-fitting when the second sample set is too small.
  • the neurons in the first few layers of the neural network model generally capture general features, which are important for many tasks, while the neurons in the later layers learn high-level features, which differ greatly between data sets.
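  • the freezing strategy can be sketched schematically as follows (layers are modeled as plain dicts; in a real framework one would instead disable gradient updates for the early layers' parameters):

```python
# Schematic sketch of freezing the first few layers before fine-tuning.
def freeze_early_layers(layers, n_frozen):
    """Mark the first n_frozen layers non-trainable; later layers keep
    learning during fine-tuning on the small domain sample set."""
    for i, layer in enumerate(layers):
        layer["trainable"] = i >= n_frozen
    return layers

model_layers = [{"name": f"layer{i}"} for i in range(6)]
freeze_early_layers(model_layers, 4)
# layers 0-3 (general features) are frozen; only layers 4-5 are updated,
# which helps avoid over-fitting when the second sample set is small.
```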
  • Step 107 Input the pinyin sequence of the sentence to be corrected into the fine-tuned neural network model for error correction, and obtain the corrected sentence.
  • the result of speech recognition may include multiple Chinese texts, and each Chinese text may include multiple sentences to be corrected.
  • the Chinese text obtained by speech recognition can be divided into multiple sentences to be corrected according to punctuation marks (such as commas, semicolons, periods, etc.), and each sentence to be corrected is converted into a pinyin sequence.
  • the sentence to be corrected can be converted into a pinyin sequence according to the ASCII code of the Chinese character.
  • the sentence to be corrected can be converted into a pinyin sequence according to the Unicode value of the Chinese character. Refer to step 102 for the method of converting the sentence to be corrected into a pinyin sequence.
  • the sentence to be corrected input by the user may be received, and the sentence to be corrected may be converted into a pinyin sequence.
  • a user interface may be generated, and a sentence to be corrected input by the user may be received from the user interface. It is also possible to directly receive the pinyin sequence of the sentence to be corrected input by the user.
  • the wrong word correction method of the first embodiment obtains a universal natural language data set containing multiple sentences; converts each sentence in the data set into a pinyin sequence to obtain the pinyin-sentence pairs of the data set; selects multiple pinyin-sentence pairs from these pairs and replaces part of the pinyin of each selected pair with similar pinyin to obtain the replaced pinyin-sentence pairs; composes the unselected pinyin-sentence pairs of the general natural language data set and the replaced pinyin-sentence pairs into a first sample set; pre-trains the neural network model with the first sample set to obtain a pre-trained neural network model; acquires multiple pinyin-sentence pairs containing similar pinyin related to a specific field as a second sample set; fine-tunes the pre-trained neural network model with the second sample set to obtain a fine-tuned neural network model; and inputs the pinyin sequence of the sentence to be corrected into the fine-tuned neural network model for error correction to obtain the corrected sentence.
  • the method for correcting wrong words may further include: recognizing the input voice to obtain the sentence to be corrected.
  • Various speech recognition technologies can be used to recognize the voice, such as Dynamic Time Warping (DTW), Hidden Markov Models (HMM), Vector Quantization (VQ), and Artificial Neural Networks (ANN).
  • Fig. 2 is a structural diagram of a wrong word correction device provided in the second embodiment of the present application.
  • the wrong word correction device 20 is applied to a computer device.
  • the wrong word correction device 20 may include a first acquisition module 201, a conversion module 202, a generation module 203, a pre-training module 204, a second acquisition module 205, a fine-tuning module 206, and an error correction module 207.
  • the first acquisition module 201 is configured to acquire a universal natural language data set, the universal natural language data set containing multiple sentences.
  • the universal natural language data set is a Chinese text containing everyday words.
  • the universal natural language data set can be collected from data sources such as books, news, web pages (such as Baidu Baike, Wikipedia, etc.).
  • text recognition can be performed on text in a book to obtain the universal natural language data set.
  • speech recognition can be performed on broadcast news to obtain the universal natural language data set.
  • text can be captured from a web page to obtain the universal natural language data set.
  • the universal natural language data set can be read from a preset database.
  • the preset database can store a large amount of Chinese texts in advance.
  • the Chinese text input by the user may be received, and the Chinese text input by the user may be used as the universal natural language data set.
  • the conversion module 202 is configured to convert each sentence contained in the universal natural language data set into a pinyin sequence to obtain a pinyin-sentence pair of the universal natural language data set.
  • the universal natural language data set may include multiple Chinese texts, and each Chinese text may include multiple sentences.
  • each Chinese text can be divided into multiple sentences according to punctuation marks (such as commas, semicolons, periods, etc.), and each sentence obtained by the division can be converted into a pinyin sequence to obtain the pinyin-sentence pair corresponding to each sentence.
  • the sentence can be converted into a pinyin sequence according to the ASCII code of the Chinese character. Since Chinese characters are represented by ASCII codes in the computer system, the correspondence between each pinyin and each ASCII code, whether already existing in the computer system or established by the user, can be used to convert sentences into pinyin sequences. If the sentence contains polyphonic characters, the multiple pinyins of each polyphonic character can be listed, and the correct pinyin selected by the user can be received.
  • the sentence can be converted into a pinyin sequence according to the Unicode value of the Chinese character. Specific steps are as follows:
  • the numbers of the multiple pinyins corresponding to each polyphonic character can be added to the Unicode value-pinyin number comparison table according to the Unicode value of the polyphonic character.
  • the Unicode value of the polyphonic character is determined, and the numbers of the multiple pinyins corresponding to the polyphonic character are obtained from the Unicode value-pinyin number comparison table according to that Unicode value.
  • the multiple pinyins corresponding to the polyphonic character are then obtained from the pinyin-number comparison table according to those numbers.
  • the correct pinyin selected by the user from the plurality of pinyin can be received, and the pinyin selected by the user can be used as the correct pinyin of the polyphone in the sentence.
  • the generating module 203 is configured to select multiple pinyin-sentence pairs from the pinyin-sentence pairs in the universal natural language data set, and replace part of the pinyin of each selected pinyin-sentence pair with similar pinyin to obtain the replaced pinyin-sentence pairs, combining the unselected pinyin-sentence pairs of the universal natural language data set and the replaced pinyin-sentence pairs into a first sample set.
  • the multiple pinyin-sentence pairs may be randomly selected from the pinyin-sentence pairs in the universal natural language data set, and part of the pinyin in each selected pinyin-sentence pair may be replaced with similar pinyin.
  • a plurality of pinyin-sentence pairs can be selected from the pinyin-sentence pairs of the general natural language data set according to a preset ratio. For example, 20% of the pinyin-sentence pairs can be selected from the pinyin-sentence pairs in the universal natural language data set for pinyin replacement. For example, if the general natural language data set includes 100 sentences (that is, includes 100 pinyin-sentence pairs), then 20 pinyin-sentence pairs are selected for pinyin replacement.
  • the training samples of the first sample set include the unselected pinyin-sentence pairs, that is, correct pinyin-sentence pairs, and also the replaced pinyin-sentence pairs, that is, pinyin-sentence pairs in which part of the pinyin has been replaced with similar pinyin.
  • This application is mainly used to correct errors in sentences obtained by speech recognition.
  • Most sentence errors produced by speech recognition are cases where the words in the sentence are meaningful but the sentence is meaningless. For example, "who needs to insure for whom" is sometimes recognized as "who needs to Taobao for". Therefore, not only are the correct pinyin-sentence pairs required as training samples, but pinyin-sentence pairs in which part of the pinyin has been replaced with similar pinyin are also needed as training samples for the model.
  • the pre-training module 204 is configured to pre-train the neural network model by using the first sample set to obtain the pre-trained neural network model.
  • the input of the neural network model is a pinyin sequence, and the output is the corresponding sentence (i.e., a sequence of Chinese characters).
  • for each pinyin in the input sequence, the corresponding Chinese character is predicted.
  • each unselected pinyin-sentence pair (i.e., unreplaced pinyin-sentence pair) and each replaced pinyin-sentence pair are used as training samples.
  • the pinyin sequence in the pinyin-sentence pair is the input of the neural network model, and the sentence in the pinyin-sentence pair is the ground-truth output.
  • the neural network model may be a transformer model.
  • the transformer model accepts a sequence as input and outputs a sequence.
  • the Transformer model uses a Pinyin sequence as input and outputs a sequence of Chinese characters.
  • the transformer model includes an encoding layer, a self-attention layer, and a decoding layer.
  • the coding layer and the decoding layer correspond to the coding of Pinyin and the decoding of Chinese characters respectively.
  • the self-attention layer is used to predict Chinese characters from repeated pinyin. Because Chinese pinyin contains many repetitions, different Chinese characters and words correspond to the same pinyin (for example, "Bangxiao" and "baoxiao" have the same pinyin and tone), so when making a prediction for each pinyin, the model needs to "pay attention" to the pinyin sequence of the entire sentence instead of only the pinyin at the current position.
  • the self-attention mechanism can make the pinyin of a certain position obtain the pinyin representations of all other positions, so as to make predictions of Chinese characters more in line with the sentence scenario.
  • After training with a large number of samples, the transformer model can output the corresponding Chinese character sequence for an input pinyin sequence.
  • the second acquisition module 205 is configured to acquire a plurality of pinyin-sentence pairs containing similar pinyin related to a specific field as a second sample set.
  • Each training sample in the second sample set is a pinyin-sentence pair related to a specific field, and the pinyin-sentence pair contains similar pinyin related to the specific field.
  • the specific field is the specialized field to which this method is applied, such as law, insurance, etc.
  • the language data set obtained by the first acquisition module 201 is a general natural language data set, which mainly contains some everyday words.
  • the first sample set obtained from the general natural language data set consists of training samples about everyday words, so the pre-trained neural network model can correct sentences from daily life well when they contain obvious speech recognition errors.
  • however, when encountering certain specialized fields such as law and insurance, the error correction effect of the neural network model is reduced, and many domain-specific words will be recognized as everyday words. For example, "insure" in "who needs to insure" is recognized as "Taobao". Therefore, when the method is applied to a specific field for error correction, sample data from that field is required.
  • the pinyin of a specific word of the specific field in a pinyin-sentence pair of the text data set is replaced with similar pinyin to obtain a pinyin-sentence pair containing similar pinyin related to the specific field. For example, the pinyin of "insure" ("tou", second tone; "bao", third tone) in "who needs to insure" is replaced with the pinyin of "taobao" ("tao", second tone; "bao", third tone).
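The replacement just described can be sketched as below. The `SIMILAR_PINYIN` table and the tone-numbered pinyin strings are illustrative assumptions, not the patent's actual data:

```python
# Hypothetical similar-pinyin table: maps the pinyin of a domain word to
# the pinyin of a commonly confused everyday word (entries are illustrative).
SIMILAR_PINYIN = {
    ("tou2", "bao3"): ("tao2", "bao3"),  # 投保 (insure) vs. 淘宝 (Taobao)
}

def make_domain_pair(pinyin_seq, sentence):
    """Replace the pinyin of a domain-specific word with similar pinyin,
    keeping the original sentence as the correction target."""
    seq = list(pinyin_seq)
    for (a, b), (a2, b2) in SIMILAR_PINYIN.items():
        for i in range(len(seq) - 1):
            if seq[i] == a and seq[i + 1] == b:
                seq[i], seq[i + 1] = a2, b2
    return tuple(seq), sentence

pair = make_domain_pair(("xu1", "yao4", "wei4", "shei2", "tou2", "bao3"),
                        "需要为谁投保")
print(pair)
```

The resulting pair keeps the correct sentence as the label while the input pinyin contains the confusable syllables, which is exactly the kind of sample needed to fine-tune for the specific field.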
  • a database may be established in advance to store the pinyin-sentence pairs that are incorrectly recognized in the specific field, and a plurality of pinyin-sentence pairs containing similar pinyin related to the specific field can be obtained from the database.
  • the fine-tuning module 206 is configured to use the second sample set to fine-tune the pre-trained neural network model to obtain a fine-tuned neural network model.
  • the purpose of fine-tuning the neural network model by using the second sample set is to make the neural network model more suitable for a specific field and improve the error correction accuracy rate in the specific field.
  • the model after fine-tuning training is more inclined to predict the exclusive words in the specific field, thereby improving the effect of correcting the wrong words of speech recognition errors.
  • the weights of neurons in the first few layers of the neural network model can be fixed, and only the weights of neurons in the last few layers are fine-tuned. This is mainly to avoid over-fitting when the second sample set is small.
  • the neurons in the first few layers of the neural network model generally capture general features that are important for many tasks, while the neurons in the later layers learn high-level features that vary greatly across data sets.
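A minimal sketch of this freezing strategy, with plain dicts standing in for real framework layers (in an actual framework such as PyTorch this would correspond to disabling gradient updates, e.g. `requires_grad = False`, on the early layers):

```python
def freeze_early_layers(layers, n_frozen):
    """Mark the first n_frozen layers as not trainable; only the later
    layers are then updated during fine-tuning on the small second
    sample set, which helps avoid over-fitting."""
    for i, layer in enumerate(layers):
        layer["trainable"] = i >= n_frozen
    return layers

# Toy 6-layer model: freeze the first 4 layers, fine-tune the last 2.
model = [{"name": f"layer{i}"} for i in range(6)]
freeze_early_layers(model, n_frozen=4)
print([layer["trainable"] for layer in model])  # [False, False, False, False, True, True]
```

The split point (here 4 of 6 layers) is a hyperparameter; the smaller the domain sample set, the more layers one would typically keep frozen.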
  • the error correction module 207 is configured to input the pinyin sequence of the sentence to be corrected into the fine-tuned neural network model for error correction, and obtain the corrected sentence.
  • the result of language recognition may include multiple Chinese texts, and each Chinese text may include multiple sentences to be corrected (ie, multiple sentences).
  • the Chinese text obtained by language recognition can be divided into multiple sentences to be corrected according to punctuation (such as comma, semicolon, period, etc.), and each sentence to be corrected is converted into a pinyin sequence.
  • the sentence to be corrected can be converted into a pinyin sequence according to the ASCII code of the Chinese character.
  • the sentence to be corrected can be converted into a pinyin sequence according to the Unicode value of the Chinese character. Refer to the description of the conversion module 202 for the method of converting the sentence to be corrected into a pinyin sequence.
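The splitting-and-conversion step described above can be sketched as follows. The `HANZI_PINYIN` table is a toy stand-in for the full character-to-pinyin lookup described for the conversion module 202:

```python
import re

# Toy hanzi-to-pinyin table; a real system would use the full
# Unicode-based lookup tables described elsewhere in this document.
HANZI_PINYIN = {"需": "xu1", "要": "yao4", "为": "wei4",
                "谁": "shei2", "投": "tou2", "保": "bao3"}

def to_pinyin_sequences(text):
    """Split recognized Chinese text on punctuation, then convert each
    resulting sentence to be corrected into a pinyin sequence."""
    sentences = [s for s in re.split(r"[，。；,;.?？!！]", text) if s]
    return [(tuple(HANZI_PINYIN.get(ch, ch) for ch in s), s) for s in sentences]

pairs = to_pinyin_sequences("需要为谁投保？")
print(pairs)
```

Each element pairs a pinyin sequence with its source sentence, matching the pinyin-sentence pair format used throughout the method.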
  • the sentence to be corrected input by the user may be received, and the sentence to be corrected may be converted into a pinyin sequence.
  • a user interface may be generated, and a sentence to be corrected input by the user may be received from the user interface. It is also possible to directly receive the pinyin sequence of the sentence to be corrected input by the user.
  • the wrong word correction device 20 of this embodiment obtains a universal natural language data set containing multiple sentences; converts each sentence contained in the universal natural language data set into a pinyin sequence to obtain pinyin-sentence pairs of the universal natural language data set; selects multiple pinyin-sentence pairs from the pinyin-sentence pairs of the general natural language data set, replaces part of the pinyin of each selected pinyin-sentence pair with similar pinyin to obtain replaced pinyin-sentence pairs, and composes the unselected pinyin-sentence pairs of the general natural language data set and the replaced pinyin-sentence pairs into a first sample set; pre-trains the neural network model with the first sample set to obtain a pre-trained neural network model; obtains a plurality of pinyin-sentence pairs containing similar pinyin related to a specific field as a second sample set; fine-tunes the pre-trained neural network model with the second sample set to obtain a fine-tuned neural network model; and inputs the pinyin sequence of the sentence to be corrected into the fine-tuned neural network model for error correction to obtain the corrected sentence.
  • the wrong word correction device 20 may further include: a recognition module, which recognizes the input voice to obtain the sentence to be corrected.
  • various speech recognition technologies can be used to recognize the voice, such as Dynamic Time Warping (DTW), Hidden Markov Models (HMM), Vector Quantization (VQ), and Artificial Neural Networks (ANN).
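As one example of the techniques listed above, a minimal Dynamic Time Warping distance between two one-dimensional feature sequences might look like this (real recognizers operate on acoustic feature vectors such as MFCCs rather than scalars, but the recurrence is the same):

```python
def dtw_distance(a, b):
    """Dynamic Time Warping distance between two 1-D feature sequences,
    a classic template-matching measure used in speech recognition.
    Time-axis stretching (repeating a frame) incurs no extra cost."""
    INF = float("inf")
    n, m = len(a), len(b)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the best of: insertion, deletion, or match
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # 0.0: stretching is free under DTW
```

A recognizer built on DTW compares an input utterance against stored templates and picks the template with the smallest warped distance.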
  • This embodiment provides a non-volatile readable storage medium with computer-readable instructions stored thereon; when the computer-readable instructions are executed by a processor, the steps in the above embodiment of the wrong word correction method are implemented, for example, steps 101-107 shown in Figure 1:
  • Step 101 Obtain a universal natural language data set, where the universal natural language data set contains multiple sentences;
  • Step 102 Convert each sentence contained in the universal natural language data set into a pinyin sequence to obtain a pinyin-sentence pair of the universal natural language data set;
  • Step 103 Select a plurality of pinyin-sentence pairs from the pinyin-sentence pairs of the general natural language data set, and replace part of the pinyin of each selected pinyin-sentence pair with similar pinyin to obtain the replaced pinyin-sentence pair , Compose the unselected pinyin-sentence pair of the universal natural language data set and the replaced pinyin-sentence pair into a first sample set;
  • Step 104 Pre-train the neural network model by using the first sample set to obtain a pre-trained neural network model
  • Step 105 Obtain a plurality of pinyin-sentence pairs containing similar pinyin related to a specific field as a second sample set;
  • Step 106 Use the second sample set to fine-tune the pre-trained neural network model to obtain a fine-tuned neural network model
  • Step 107 Input the pinyin sequence of the sentence to be corrected into the fine-tuned neural network model for error correction, and obtain the corrected sentence.
  • when executed by the processor, the computer-readable instructions realize the functions of the modules in the above device embodiment, for example, modules 201-207 in Figure 2:
  • the first acquiring module 201 is configured to acquire a universal natural language data set, the universal natural language data set containing multiple sentences;
  • the conversion module 202 is configured to convert each sentence contained in the universal natural language data set into a pinyin sequence to obtain a pinyin-sentence pair of the universal natural language data set;
  • the generating module 203 is configured to select multiple pinyin-sentence pairs from the pinyin-sentence pairs in the universal natural language data set, and replace part of the pinyin of each selected pinyin-sentence pair with similar pinyin to obtain the replaced pinyin -Sentence pairs, combining the unselected pinyin-sentence pairs of the general natural language data set and the replaced pinyin-sentence pairs into a first sample set;
  • the pre-training module 204 is configured to pre-train the neural network model by using the first sample set to obtain a pre-trained neural network model
  • the second acquiring module 205 is configured to acquire a plurality of pinyin-sentence pairs containing similar pinyin related to a specific field as a second sample set;
  • the fine-tuning module 206 is configured to use the second sample set to fine-tune the pre-trained neural network model to obtain a fine-tuned neural network model;
  • the error correction module 207 is configured to input the pinyin sequence of the sentence to be corrected into the fine-tuned neural network model for error correction, and obtain the corrected sentence.
  • FIG. 3 is a schematic diagram of a computer device provided in Embodiment 4 of this application.
  • the computer device 30 includes a memory 301, a processor 302, and computer-readable instructions 303 stored in the memory 301 and running on the processor 302, such as a wrong word correction program.
  • when the processor 302 executes the computer-readable instructions 303, the steps in the embodiment of the above wrong word correction method are implemented, for example, steps 101-107 shown in Fig. 1:
  • Step 101 Obtain a universal natural language data set, where the universal natural language data set contains multiple sentences;
  • Step 102 Convert each sentence contained in the universal natural language data set into a pinyin sequence to obtain a pinyin-sentence pair of the universal natural language data set;
  • Step 103 Select a plurality of pinyin-sentence pairs from the pinyin-sentence pairs of the general natural language data set, and replace part of the pinyin of each selected pinyin-sentence pair with similar pinyin to obtain the replaced pinyin-sentence pair , Compose the unselected pinyin-sentence pair of the universal natural language data set and the replaced pinyin-sentence pair into a first sample set;
  • Step 104 Pre-train the neural network model by using the first sample set to obtain a pre-trained neural network model
  • Step 105 Obtain a plurality of pinyin-sentence pairs containing similar pinyin related to a specific field as a second sample set;
  • Step 106 Use the second sample set to fine-tune the pre-trained neural network model to obtain a fine-tuned neural network model
  • Step 107 Input the pinyin sequence of the sentence to be corrected into the fine-tuned neural network model for error correction, and obtain the corrected sentence.
  • the first acquiring module 201 is configured to acquire a universal natural language data set, the universal natural language data set containing multiple sentences;
  • the conversion module 202 is configured to convert each sentence contained in the universal natural language data set into a pinyin sequence to obtain a pinyin-sentence pair of the universal natural language data set;
  • the generating module 203 is configured to select multiple pinyin-sentence pairs from the pinyin-sentence pairs in the universal natural language data set, and replace part of the pinyin of each selected pinyin-sentence pair with similar pinyin to obtain the replaced pinyin -Sentence pairs, combining the unselected pinyin-sentence pairs of the general natural language data set and the replaced pinyin-sentence pairs into a first sample set;
  • the pre-training module 204 is configured to pre-train the neural network model by using the first sample set to obtain a pre-trained neural network model
  • the second acquiring module 205 is configured to acquire a plurality of pinyin-sentence pairs containing similar pinyin related to a specific field as a second sample set;
  • the fine-tuning module 206 is configured to use the second sample set to fine-tune the pre-trained neural network model to obtain a fine-tuned neural network model;
  • the error correction module 207 is configured to input the pinyin sequence of the sentence to be corrected into the fine-tuned neural network model for error correction, and obtain the corrected sentence.
  • the computer-readable instructions 303 may be divided into one or more modules, and the one or more modules are stored in the memory 301 and executed by the processor 302 to complete this method.
  • the computer-readable instructions 303 can be divided into the first acquisition module 201, the conversion module 202, the generating module 203, the pre-training module 204, the second acquisition module 205, the fine-tuning module 206, and the error correction module 207 in FIG. 2. Refer to the second embodiment for the specific functions of each module.
  • the computer device 30 may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server.
  • the schematic diagram in FIG. 3 is only an example of the computer device 30 and does not constitute a limitation on the computer device 30; it may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
  • the computer device 30 may also include input and output devices, network access devices, buses, etc.
  • the so-called processor 302 may be a central processing unit (Central Processing Unit, CPU), other general-purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (ASIC), Field-Programmable Gate Array (FPGA) or other programmable logic devices, discrete gates or transistor logic devices, discrete hardware components, etc.
  • the general-purpose processor may be a microprocessor, or the processor 302 may be any conventional processor, etc.
  • the processor 302 is the control center of the computer device 30 and uses various interfaces and lines to connect the various parts of the entire computer device 30.
  • the memory 301 may be used to store the computer-readable instructions 303; the processor 302 runs or executes the computer-readable instructions or modules stored in the memory 301 and calls the data stored in the memory 301 to realize the various functions of the computer device 30.
  • the memory 301 may mainly include a program storage area and a data storage area.
  • the program storage area may store an operating system and an application program required by at least one function (such as a sound playback function, an image playback function, etc.); the data storage area may store data created according to the use of the computer device 30.
  • the memory 301 may include non-volatile memory, such as a hard disk, memory, a plug-in hard disk, a smart media card (SMC), a Secure Digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
  • the integrated module of the computer device 30 may be stored in a non-volatile readable storage medium.
  • all or part of the processes in the methods of the above embodiments of this application may also be completed by instructing relevant hardware through computer-readable instructions.
  • the computer-readable instructions can be stored in a non-volatile readable storage medium; when the computer-readable instructions are executed by a processor, the steps of the foregoing method embodiments can be implemented.
  • the computer-readable instructions may be in the form of source code, object code, executable file, or some intermediate forms, etc.
  • the computer-readable medium may include: any entity or device capable of carrying the computer-readable instruction code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, or a read-only memory (ROM).


Abstract

An error word correction method and device, a computer device, and a storage medium. The error word correction method comprises: obtaining a general natural language data set (101); converting each sentence comprised in the natural language data set into a Pinyin sequence to obtain Pinyin-sentence pairs of the general natural language data set (102); performing Pinyin replacement on some of the Pinyin-sentence pairs of the general natural language data set to obtain a first sample set (103); pre-training a neural network model using the first sample set to obtain a pre-trained neural network model (104); obtaining a plurality of Pinyin-sentence pairs comprising similar Pinyin and related to a specific field as a second sample set (105); performing fine-tuning on the pre-trained neural network model using the second sample set to obtain a fine-tuned neural network model (106); and inputting a Pinyin sequence of a sentence to be corrected into the fine-tuned neural network model for correction to obtain a corrected sentence (107). By means of the method, error correction can be performed on proprietary words recognized as common words in language recognition.

Description

Wrong word correction method, device, computer device and storage medium
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on March 15, 2019 with application number 201910199221.9 and titled "Wrong word correction method, device, computer device and storage medium", the entire content of which is incorporated in this application by reference.
Technical Field
This application relates to the technical field of speech recognition, and in particular to a wrong word correction method, device, computer device and non-volatile readable storage medium.
Background
With the rapid expansion of speech recognition application scenarios, speech recognition technology has become more and more mature, and market demand for highly accurate speech recognition is increasingly strong. Companies that develop products with speech recognition functions often use the speech recognition module of a general-purpose system; without adapting recognition to the specific application scenario, it is easy for certain proprietary words to be recognized as common words. For example, "who needs to be insured" may be recognized as "who needs to Taobao"; since the result contains no obvious error, existing wrong word correction systems have difficulty finding this type of error.
At present, there is no effective solution for improving the correction effect of language recognition in actual application scenarios. How to formulate a suitable solution to reduce the deviation of speech recognition and improve user experience is a technical problem that relevant technicians currently need to solve.
Summary
In view of the above, it is necessary to provide a wrong word correction method, device, computer device and non-volatile readable storage medium that can correct errors in which proprietary words are recognized as common words in language recognition.
The first aspect of this application provides a wrong word correction method. The method includes:
acquiring a universal natural language data set, the universal natural language data set containing multiple sentences;
converting each sentence contained in the universal natural language data set into a pinyin sequence to obtain pinyin-sentence pairs of the universal natural language data set;
selecting multiple pinyin-sentence pairs from the pinyin-sentence pairs of the universal natural language data set, replacing part of the pinyin of each selected pinyin-sentence pair with similar pinyin to obtain replaced pinyin-sentence pairs, and composing the unselected pinyin-sentence pairs of the universal natural language data set and the replaced pinyin-sentence pairs into a first sample set;
pre-training a neural network model with the first sample set to obtain a pre-trained neural network model;
acquiring multiple pinyin-sentence pairs containing similar pinyin related to a specific field as a second sample set;
fine-tuning the pre-trained neural network model with the second sample set to obtain a fine-tuned neural network model;
inputting the pinyin sequence of the sentence to be corrected into the fine-tuned neural network model for error correction to obtain the corrected sentence.
The second aspect of this application provides a wrong word correction device. The device includes:
a first acquisition module, configured to acquire a universal natural language data set, the universal natural language data set containing multiple sentences;
a conversion module, configured to convert each sentence contained in the universal natural language data set into a pinyin sequence to obtain pinyin-sentence pairs of the universal natural language data set;
a generating module, configured to select multiple pinyin-sentence pairs from the pinyin-sentence pairs of the universal natural language data set, replace part of the pinyin of each selected pinyin-sentence pair with similar pinyin to obtain replaced pinyin-sentence pairs, and compose the unselected pinyin-sentence pairs of the universal natural language data set and the replaced pinyin-sentence pairs into a first sample set;
a pre-training module, configured to pre-train a neural network model with the first sample set to obtain a pre-trained neural network model;
a second acquisition module, configured to acquire multiple pinyin-sentence pairs containing similar pinyin related to a specific field as a second sample set;
a fine-tuning module, configured to fine-tune the pre-trained neural network model with the second sample set to obtain a fine-tuned neural network model;
an error correction module, configured to input the pinyin sequence of the sentence to be corrected into the fine-tuned neural network model for error correction to obtain the corrected sentence.
The third aspect of this application provides a computer device. The computer device includes a processor, and the processor is configured to implement the wrong word correction method when executing computer-readable instructions stored in a memory.
The fourth aspect of this application provides a non-volatile readable storage medium on which computer-readable instructions are stored; when the computer-readable instructions are executed by a processor, the wrong word correction method is implemented.
This application obtains a universal natural language data set containing multiple sentences; converts each sentence contained in the universal natural language data set into a pinyin sequence to obtain pinyin-sentence pairs of the universal natural language data set; selects multiple pinyin-sentence pairs from the pinyin-sentence pairs of the universal natural language data set, replaces part of the pinyin of each selected pinyin-sentence pair with similar pinyin to obtain replaced pinyin-sentence pairs, and composes the unselected pinyin-sentence pairs of the universal natural language data set and the replaced pinyin-sentence pairs into a first sample set; pre-trains a neural network model with the first sample set to obtain a pre-trained neural network model; acquires multiple pinyin-sentence pairs containing similar pinyin related to a specific field as a second sample set; fine-tunes the pre-trained neural network model with the second sample set to obtain a fine-tuned neural network model; and inputs the pinyin sequence of the sentence to be corrected into the fine-tuned neural network model for error correction to obtain the corrected sentence. This embodiment can solve the problem that proprietary words cannot be accurately predicted in a specific field due to the generality of the speech recognition system, and can correct errors in which proprietary words are recognized as common words in language recognition.
Brief Description of the Drawings
Fig. 1 is a flowchart of the wrong word correction method provided by an embodiment of this application.
Fig. 2 is a structural diagram of the wrong word correction device provided by an embodiment of this application.
Fig. 3 is a schematic diagram of the computer device provided by an embodiment of this application.
Detailed Description
In order to understand the above objectives, features and advantages of this application more clearly, this application is described in detail below with reference to the accompanying drawings and specific embodiments. It should be noted that, provided there is no conflict, the embodiments of this application and the features in the embodiments can be combined with each other.
Preferably, the wrong word correction method of this application is applied in one or more computer devices. A computer device is a device that can automatically perform numerical calculation and/or information processing in accordance with preset or stored instructions; its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), embedded equipment, etc.
The computer device may be a computing device such as a desktop computer, a notebook, a palmtop computer, or a cloud server. The computer device can interact with the user through a keyboard, a mouse, a remote control, a touch panel, or a voice control device.
Embodiment 1
Fig. 1 is a flowchart of the wrong word correction method provided in Embodiment 1 of this application. The wrong word correction method is applied to a computer device.
The wrong word correction method of this application corrects sentences obtained by language recognition. The method can solve the problem that proprietary words cannot be accurately predicted in a specific field due to the generality of the speech recognition system, and at the same time enhances the error correction system's ability to find wrong words when proprietary words are replaced with common words, improving the user experience.
As shown in Fig. 1, the wrong word correction method includes:
Step 101: Obtain a universal natural language data set, the universal natural language data set containing multiple sentences.
The universal natural language data set is Chinese text containing everyday words.
The universal natural language data set can be collected from data sources such as books, news, and web pages (for example, Baidu Baike, Wikipedia, etc.). For example, text recognition can be performed on the text in books to obtain the universal natural language data set. For another example, language recognition can be performed on broadcast news to obtain the universal natural language data set. For another example, text can be captured from web pages to obtain the universal natural language data set.
Alternatively, the universal natural language data set can be read from a preset database. The preset database can store a large amount of Chinese text in advance.
Alternatively, Chinese text input by the user may be received and used as the universal natural language data set.
Step 102: Convert each sentence contained in the universal natural language data set into a pinyin sequence to obtain pinyin-sentence pairs of the universal natural language data set.
In this embodiment, the universal natural language data set may include multiple Chinese texts, and each Chinese text may include multiple sentences. In this case, each Chinese text can be divided into multiple sentences according to punctuation marks (such as commas, semicolons, periods, etc.), and each resulting sentence can be converted into a pinyin sequence to obtain the pinyin-sentence pair corresponding to each sentence.
可以根据汉字的ASCII码将所述句子转换为拼音序列。由于汉字在计算机系统中以ASCII码表示,只需要利用计算机系统中已有的或用户建立的每个拼音与每个ASCII码对应关系,即可实现将句子转换成拼音序列。若句子含有多音字,可以列出多音字的多个拼音,接收用户选择的正确拼音。The sentence can be converted into a pinyin sequence according to the ASCII code of the Chinese character. Since Chinese characters are represented by ASCII codes in the computer system, only the correspondence between each pinyin and each ASCII code existing in the computer system or established by the user can be used to convert sentences into pinyin sequences. If the sentence contains polyphonic characters, multiple pinyins of the polyphonic characters can be listed, and the correct pinyin selected by the user can be received.
或者,可以根据汉字的Unicode值将所述句子转换为拼音序列。具体步骤如下:Alternatively, the sentence can be converted into a pinyin sequence according to the Unicode value of the Chinese character. Specific steps are as follows:
(1)建立拼音-编号对照表,对所有拼音进行编号并将所有拼音对应的编号添加到所述拼音-编号对照表中。所有汉字的拼音不超过512个,可以用两个字节对拼音进行编号。每个拼音对应一个编号。(1) Establish a pinyin-number comparison table, number all the pinyins and add the corresponding numbers of all the pinyins to the pinyin-number comparison table. The pinyin of all Chinese characters does not exceed 512, and the pinyin can be numbered with two bytes. Each pinyin corresponds to a number.
(2)建立Unicode值-拼音编号对照表,将汉字对应拼音的编号按照汉字的Unicode值添加到所述Unicode值-拼音编号对照表中。(2) Establish a Unicode value-Pinyin number comparison table, and add the corresponding pinyin number of the Chinese character to the Unicode value-Pinyin number comparison table according to the Unicode value of the Chinese character.
(3)逐一读取所述句子中的待转换汉字，确定所述待转换汉字的Unicode值，根据所述待转换汉字的Unicode值从所述Unicode值-拼音编号对照表中获取所述待转换汉字对应的拼音的编号，根据所述待转换汉字对应的拼音的编号从所述拼音-编号对照表获得所述待转换汉字对应的拼音，从而将所述句子中的每个汉字转换为拼音。(3) Read the Chinese characters to be converted in the sentence one by one, determine the Unicode value of each character to be converted, obtain the number of the pinyin corresponding to the character from the Unicode value-pinyin number comparison table according to its Unicode value, and obtain the pinyin corresponding to the character from the pinyin-number comparison table according to that number, thereby converting each Chinese character in the sentence into pinyin.
若所述句子中含有多音字，可以在上述步骤(2)中将所述多音字对应的多个拼音的编号按照所述多音字的Unicode值添加到所述Unicode值-拼音编号对照表中，在上述(3)中确定所述多音字的Unicode值，根据所述多音字的Unicode值从所述Unicode值-拼音编号对照表中获取所述多音字对应的多个拼音的编号，根据所述多音字对应的多个拼音的编号从所述拼音-编号对照表获得所述多音字对应的多个拼音。可以接收用户从所述多个拼音中选择的正确拼音，将用户选择的拼音作为所述多音字在所述句子中的正确拼音。If the sentence contains polyphonic characters, in step (2) above the numbers of the multiple pinyins corresponding to a polyphonic character can be added to the Unicode value-pinyin number comparison table according to the Unicode value of the polyphonic character. In step (3) above, the Unicode value of the polyphonic character is determined, the numbers of the multiple pinyins corresponding to the polyphonic character are obtained from the Unicode value-pinyin number comparison table according to that Unicode value, and the multiple pinyins are obtained from the pinyin-number comparison table according to those numbers. The correct pinyin selected by the user from the multiple pinyins can then be received and used as the correct pinyin of the polyphonic character in the sentence.
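A minimal sketch of the two-table lookup in steps (1) to (3) follows. The miniature tables here are hypothetical (a real system would number all pinyins and cover all CJK code points), and the `choose` callback stands in for the user's selection among a polyphone's readings:

```python
# Hypothetical miniature versions of the two comparison tables.
PINYIN_TABLE = {1: "tou2", 2: "bao3", 3: "hang2", 4: "xing2"}        # number -> pinyin
UNICODE_TABLE = {ord("投"): [1], ord("保"): [2], ord("行"): [3, 4]}  # code point -> pinyin number(s)

def to_pinyin(sentence, choose=lambda options: options[0]):
    """Convert each character via Unicode value -> pinyin number -> pinyin.
    For polyphones (several numbers), `choose` selects the correct reading."""
    result = []
    for ch in sentence:
        numbers = UNICODE_TABLE[ord(ch)]
        options = [PINYIN_TABLE[n] for n in numbers]
        result.append(choose(options))
    return result

print(to_pinyin("投保"))  # ['tou2', 'bao3']
```

The polyphone "行" yields both readings, from which the caller picks one.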
步骤103，从所述通用自然语言数据集的拼音-句子对中选择多个拼音-句子对，将选择的每个拼音-句子对的部分拼音替换为相近拼音，得到替换后的拼音-句子对，将所述通用自然语言数据集的未选择的拼音-句子对和所述替换后的拼音-句子对组成第一样本集。Step 103: Select multiple pinyin-sentence pairs from the pinyin-sentence pairs of the universal natural language data set, replace part of the pinyin of each selected pinyin-sentence pair with similar pinyin to obtain replaced pinyin-sentence pairs, and combine the unselected pinyin-sentence pairs of the universal natural language data set and the replaced pinyin-sentence pairs into a first sample set.
可以从所述通用自然语言数据集的拼音-句子对中随机选择所述多个拼音-句子对,将选择的每个拼音-句子中的部分拼音替换为相近拼音。The multiple pinyin-sentence pairs may be randomly selected from the pinyin-sentence pairs in the universal natural language data set, and part of the pinyin in each selected pinyin-sentence may be replaced with a similar pinyin.
可以按照预设比例从通用自然语言数据集的拼音-句子对中选择多个拼音-句子对。例如,可以从所述通用自然语言数据集的拼音-句子对中选择20%的拼音-句子对进行拼音替换。举例来说,若所述通用自然语言数据集包括100个句子(即包括100个拼音-句子对),则选择20个拼音-句子对进行拼音替换。A plurality of pinyin-sentence pairs can be selected from the pinyin-sentence pairs of the general natural language data set according to a preset ratio. For example, 20% of the pinyin-sentence pairs can be selected from the pinyin-sentence pairs in the universal natural language data set for pinyin replacement. For example, if the universal natural language data set includes 100 sentences (that is, includes 100 pinyin-sentence pairs), then 20 pinyin-sentence pairs are selected for pinyin replacement.
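The selection-and-replacement step can be sketched as follows. The 20% ratio matches the example above, while the similar-pinyin map is a hypothetical stand-in for a real pinyin confusion table:

```python
import random

def build_first_sample_set(pairs, ratio=0.2, similar=None):
    """Select `ratio` of the pinyin-sentence pairs and replace pinyins in
    each selected pair with similar pinyins; keep the rest unchanged.
    `similar` maps a pinyin to a near-sounding pinyin (illustrative)."""
    similar = similar or {"tou2": "tao2"}  # hypothetical confusion map
    k = int(len(pairs) * ratio)
    chosen = set(random.sample(range(len(pairs)), k))
    samples = []
    for i, (pinyin_seq, sentence) in enumerate(pairs):
        if i in chosen:
            pinyin_seq = [similar.get(p, p) for p in pinyin_seq]
        samples.append((pinyin_seq, sentence))
    return samples
```

The returned list is the first sample set: the unselected (correct) pairs plus the pairs whose pinyin was perturbed, each still paired with the original sentence as the training target.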
所述第一样本集的训练样本包括未选择的拼音-句子对，即正确的拼音-句子对，还包括替换后的拼音-句子对，即将部分拼音替换为相近拼音的拼音-句子对。The training samples of the first sample set include the unselected pinyin-sentence pairs, i.e., correct pinyin-sentence pairs, and also include the replaced pinyin-sentence pairs, i.e., pinyin-sentence pairs in which part of the pinyin has been replaced with similar pinyin.
本申请主要用于对语言识别得到的句子进行纠错。由于语音识别得到的句子错误大多是句子中的词语有意义而句子无意义，例如“需要为谁投保”有时会被识别成“需要为谁淘宝”。因此，不仅需要正确的拼音-句子对作为训练样本，还需要将部分拼音替换为相近拼音的拼音-句子对作为模型的训练样本。This application is mainly used to correct errors in sentences obtained by speech recognition. In most such errors, the words in the sentence are meaningful but the sentence as a whole is not; for example, “需要为谁投保” (“for whom to insure”) is sometimes recognized as “需要为谁淘宝” (“for whom to Taobao”). Therefore, not only correct pinyin-sentence pairs are needed as training samples, but also pinyin-sentence pairs in which part of the pinyin has been replaced with similar pinyin.
步骤104,利用所述第一样本集对神经网络模型进行预训练,得到预训练后的神经网络模型。Step 104: Pre-train the neural network model by using the first sample set to obtain a pre-trained neural network model.
所述神经网络模型的输入为拼音序列,输出为对应的句子(即汉字序列),对拼音序列中的每一个拼音,预测其对应的汉字。The input of the neural network model is a pinyin sequence, and the output is a corresponding sentence (ie, a sequence of Chinese characters). For each pinyin in the pinyin sequence, the corresponding Chinese character is predicted.
在对神经网络模型进行训练时,以每个未选择的拼音-句子对(即未替换的拼音-句子对)和每个替换后的拼音-句子对作为训练样本。拼音-句子对中的拼音序列为神经网络模型的输入,拼音-句子对中的句子为真实结果。When training the neural network model, each unselected pinyin-sentence pair (ie unreplaced pinyin-sentence pair) and each replaced pinyin-sentence pair are used as training samples. The pinyin sequence in the pinyin-sentence pair is the input of the neural network model, and the sentence in the pinyin-sentence pair is the real result.
在本实施例中,所述神经网络模型可以是transformer模型。In this embodiment, the neural network model may be a transformer model.
transformer模型可以接受一串序列作为输入,同时输出一串序列,在本申请中,Transformer模型将拼音序列作为输入,输出汉字序列。The transformer model can accept a string of sequences as input and output a string of sequences at the same time. In this application, the Transformer model uses a Pinyin sequence as input and outputs a sequence of Chinese characters.
transformer模型包含编码层、自注意力层、解码层。其中编码层和解码层分别对应拼音的编码和到汉字的解码。自注意力层则用于重复拼音的汉字预测。由于汉字拼音有大量重复，不同的汉字和词语对应于相同的拼音，例如“爆笑”和“报效”拥有同样的拼音和声调，因此在每一个拼音所在进行预测时，需要“关注”整个句子的拼音序列，而不是只看当前位置的拼音。自注意力机制可以使得某一位置的拼音获得其它所有位置的拼音表示，从而做出更符合该句子场景的汉字预测。The transformer model includes an encoding layer, a self-attention layer, and a decoding layer. The encoding layer and the decoding layer correspond to the encoding of pinyin and the decoding into Chinese characters, respectively. The self-attention layer is used to predict Chinese characters for repeated pinyin. Chinese pinyin contains a great deal of repetition, and different characters and words correspond to the same pinyin; for example, “爆笑” and “报效” share the same pinyin and tones (bào xiào). Therefore, when predicting the character at each pinyin position, the model needs to “attend” to the pinyin sequence of the entire sentence rather than look only at the pinyin at the current position. The self-attention mechanism allows the pinyin at one position to obtain the pinyin representations of all other positions, so as to predict Chinese characters that better fit the context of the sentence.
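To illustrate the attention idea only (the patent's model is a full transformer, which is not reproduced here), the following is a toy scaled dot-product self-attention over a list of position vectors, where queries, keys, and values are all the same vectors:

```python
import math

def self_attention(vectors):
    """Toy scaled dot-product self-attention: each position's output is a
    softmax-weighted mix of every position's vector, so the representation
    of one pinyin position incorporates the whole sequence."""
    d = len(vectors[0])
    outputs = []
    for q in vectors:
        # attention scores of this position against all positions
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in vectors]
        m = max(scores)  # subtract max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # weighted sum of all position vectors
        outputs.append([sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(d)])
    return outputs
```

In the model described above, such attention lets the prediction at an ambiguous pinyin (e.g. bào xiào) depend on the rest of the sentence rather than the current position alone.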
在经过大量样本的训练后，该transformer模型可以通过输入拼音序列来输出对应的汉字序列。After training on a large number of samples, the transformer model can output the corresponding Chinese character sequence from an input pinyin sequence.
步骤105,获取多个与特定领域相关的包含相近拼音的拼音-句子对作为第二样本集。Step 105: Acquire a plurality of pinyin-sentence pairs that contain similar pinyin related to a specific field as a second sample set.
所述第二样本集中的每个训练样本是与特定领域相关的一个拼音-句子对,该拼音-句子对中包含与所述特定领域相关的相近拼音。Each training sample in the second sample set is a pinyin-sentence pair related to a specific field, and the pinyin-sentence pair contains similar pinyin related to the specific field.
特定领域是本方法所要应用的专有领域,例如法律、保险等。The specific field is the exclusive field to be applied in this method, such as law, insurance, etc.
步骤101获得的语言数据集是通用自然语言数据集，主要包含一些日常用语，根据通用自然语言数据集得到的第一样本集是关于日常用语的训练样本，因此预训练得到的神经网络模型在当日常生活中的句子有明显的语音识别错误时，可以进行很好地纠错。但当遇到某些例如法律、保险等专有领域，则神经网络模型的纠错效果有所下降，会将很多专有词语识别为日常用语。例如将“需要为谁投保”中的“投保”识别为“淘宝”。因此要应用到特定领域进行错词纠错时，需要该特定领域的样本数据。The language data set obtained in step 101 is a universal natural language data set that mainly contains everyday expressions, and the first sample set derived from it consists of training samples of everyday expressions. The pre-trained neural network model can therefore correct obvious speech recognition errors in everyday sentences well. However, in proprietary fields such as law and insurance, the error correction performance of the neural network model declines, and many domain-specific terms are recognized as everyday words, for example, “投保” (“insure”) in “需要为谁投保” is recognized as “淘宝” (“Taobao”). Therefore, when the method is applied to a specific field for wrong-word correction, sample data from that field is required.
可以按照下述方法获取多个与特定领域相关的包含相近拼音的拼音-句子对:You can obtain multiple pinyin-sentence pairs with similar pinyin related to a specific field according to the following method:
获取所述特定领域的文本数据集,所述文本数据集包含多个句子;Acquiring a text data set of the specific field, the text data set containing multiple sentences;
将所述文本数据集包含的每个句子转换为拼音序列,得到所述文本数据集的拼音-句子对;Converting each sentence contained in the text data set into a pinyin sequence to obtain a pinyin-sentence pair of the text data set;
将所述文本数据集的拼音-句子对中所述特定领域的专有词语的拼音替换为相近拼音，得到与特定领域相关的包含相近拼音的拼音-句子对。例如，将“需要为谁投保”中的“投保”的拼音(tou,二声,bao,三声)替换为“淘宝”的拼音(tao,二声,bao,三声)。Replace the pinyin of the domain-specific terms of the specific field in the pinyin-sentence pairs of the text data set with similar pinyin to obtain pinyin-sentence pairs containing similar pinyin related to the specific field. For example, the pinyin of “投保” (“insure”: tou, second tone; bao, third tone) in “需要为谁投保” is replaced with the pinyin of “淘宝” (“Taobao”: tao, second tone; bao, third tone).
或者,可以预先建立数据库,用于存储所述特定领域识别错误的拼音-句子对,从所述数据库获取多个与特定领域相关的包含相近拼音的拼音-句子对。Alternatively, a database may be established in advance to store the pinyin-sentence pairs that are incorrectly recognized in the specific field, and a plurality of pinyin-sentence pairs containing similar pinyin related to the specific field can be obtained from the database.
步骤106,利用所述第二样本集对所述预训练后的神经网络模型进行微调,得到微调后的神经网络模型。Step 106: Use the second sample set to fine-tune the pre-trained neural network model to obtain a fine-tuned neural network model.
利用所述第二样本集对所述神经网络模型进行微调的目的是使所述神经网络模型更适用于特定领域,提高特定领域的纠错准确率。The purpose of fine-tuning the neural network model by using the second sample set is to make the neural network model more suitable for a specific field and improve the error correction accuracy rate in the specific field.
微调训练后的模型在拼音近似的情况下，更倾向于预测为该特定领域的专有词语，从而提高语音识别错误的错词纠正效果。When pinyins are similar, the fine-tuned model is more inclined to predict the domain-specific terms of that field, thereby improving the correction of wrong words caused by speech recognition errors.
可以固定所述神经网络模型的前面几层神经元的权值，微调神经网络模型的后面几层神经元的权值。这样做主要是为了避免第二样本集过小出现过拟合现象，神经网络模型前几层神经元一般包含更多的一般特征，对于许多任务而言非常重要，但是后面几层神经元的特征学习注重高层特征，不同的数据集间差异较大。The weights of the neurons in the first few layers of the neural network model can be fixed, and the weights of the neurons in the last few layers can be fine-tuned. This is mainly to avoid overfitting when the second sample set is small: the first few layers of a neural network model generally learn more general features, which are important for many tasks, while the later layers learn high-level features that differ considerably between data sets.
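A schematic, framework-agnostic sketch of this freezing strategy follows: gradient updates are applied only to the non-frozen later layers. The layer names and one-weight "layers" are illustrative; in practice this would use a deep-learning framework's parameter-freezing facilities:

```python
def fine_tune_step(model, grads, lr=0.1, frozen=("layer1",)):
    """One fine-tuning update: skip layers listed in `frozen`, mirroring
    'fix the first layers' weights, fine-tune the later layers'."""
    for name, weights in model.items():
        if name in frozen:
            continue  # frozen layer keeps its pre-trained weights
        model[name] = [w - lr * g for w, g in zip(weights, grads[name])]
    return model

model = {"layer1": [1.0], "layer2": [1.0]}
grads = {"layer1": [1.0], "layer2": [1.0]}
fine_tune_step(model, grads)
print(model)  # layer1 unchanged, layer2 updated to 0.9
```

Only "layer2" moves toward the second (domain-specific) sample set, while the frozen early layer retains the general features learned during pre-training.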
步骤107,将待纠错句子的拼音序列输入所述微调后的神经网络模型进行纠错,得到纠错后的句子。Step 107: Input the pinyin sequence of the sentence to be corrected into the fine-tuned neural network model for error correction, and obtain the corrected sentence.
语言识别得到的结果可以包括多个中文文本，每个中文文本可以包括多个待纠错句子(即多句话)。这种情况下，可以根据标点符号(例如逗号、分号、句号等)将语言识别得到的中文文本划分为多个待纠错句子，将划分得到的每个待纠错句子转换为拼音序列。The result of speech recognition may include multiple Chinese texts, and each Chinese text may include multiple sentences to be corrected. In this case, the Chinese text obtained by speech recognition can be divided into multiple sentences to be corrected according to punctuation marks (such as commas, semicolons, periods, etc.), and each resulting sentence to be corrected can be converted into a pinyin sequence.
可以根据汉字的ASCII码将所述待纠错句子转换为拼音序列。或者,可以根据汉字的Unicode值将所述待纠错句子转换为拼音序列。将待纠错句子转换为拼音序列的方法可以参考步骤102。The sentence to be corrected can be converted into a pinyin sequence according to the ASCII code of the Chinese character. Alternatively, the sentence to be corrected can be converted into a pinyin sequence according to the Unicode value of the Chinese character. Refer to step 102 for the method of converting the sentence to be corrected into a pinyin sequence.
或者,可以接收用户输入的待纠错句子,将所述待纠错句子转换为拼音序列。例如,可以生成用户界面,从所述用户界面接收用户输入的待纠错句子。也可以直接接收用户输入的待纠错句子的拼音序列。Alternatively, the sentence to be corrected input by the user may be received, and the sentence to be corrected may be converted into a pinyin sequence. For example, a user interface may be generated, and a sentence to be corrected input by the user may be received from the user interface. It is also possible to directly receive the pinyin sequence of the sentence to be corrected input by the user.
实施例一的错词纠正方法获取通用自然语言数据集，所述通用自然语言数据集包含多个句子；将所述通用自然语言数据集包含的每个句子转换为拼音序列，得到所述通用自然语言数据集的拼音-句子对；从所述通用自然语言数据集的拼音-句子对中选择多个拼音-句子对，将选择的每个拼音-句子对的部分拼音替换为相近拼音，得到替换后的拼音-句子对，将所述通用自然语言数据集的未选择的拼音-句子对和所述替换后的拼音-句子对组成第一样本集；利用所述第一样本集对神经网络模型进行预训练，得到预训练后的神经网络模型；获取多个与特定领域相关的包含相近拼音的拼音-句子对作为第二样本集；利用所述第二样本集对所述预训练后的神经网络模型进行微调，得到微调后的神经网络模型；将待纠错句子的拼音序列输入所述微调后的神经网络模型进行纠错，得到纠错后的句子。本实施例可以解决由于语音识别系统的通用性在特定领域内无法准确预测专有词语的问题，能够对语言识别中专有词语被识别为常用词进行纠错。The wrong word correction method of the first embodiment obtains a universal natural language data set containing multiple sentences; converts each sentence contained in the universal natural language data set into a pinyin sequence to obtain the pinyin-sentence pairs of the universal natural language data set; selects multiple pinyin-sentence pairs from the pinyin-sentence pairs of the universal natural language data set, replaces part of the pinyin of each selected pinyin-sentence pair with similar pinyin to obtain replaced pinyin-sentence pairs, and combines the unselected pinyin-sentence pairs of the universal natural language data set and the replaced pinyin-sentence pairs into a first sample set; pre-trains a neural network model with the first sample set to obtain a pre-trained neural network model; obtains multiple pinyin-sentence pairs containing similar pinyin related to a specific field as a second sample set; fine-tunes the pre-trained neural network model with the second sample set to obtain a fine-tuned neural network model; and inputs the pinyin sequence of a sentence to be corrected into the fine-tuned neural network model for error correction to obtain the corrected sentence.
This embodiment can solve the problem that domain-specific terms cannot be accurately predicted in a specific field due to the generality of the speech recognition system, and can correct domain-specific terms that are recognized as common words in speech recognition.
在另一实施例中，所述错词纠正方法还可以包括：对输入的语音进行识别，得到所述待纠错句子。可以采用各种语音识别技术，例如动态时间规整(Dynamic Time Warping, DTW)、隐马尔可夫模型(Hidden Markov Model,HMM)、矢量量化(Vector Quantization,VQ)、人工神经网络(Artificial Neural Network,ANN)等技术对所述语音进行识别。In another embodiment, the wrong word correction method may further include: recognizing input speech to obtain the sentence to be corrected. Various speech recognition technologies can be used to recognize the speech, such as dynamic time warping (DTW), hidden Markov models (HMM), vector quantization (VQ), and artificial neural networks (ANN).
实施例二Example two
图2是本申请实施例二提供的错词纠正装置的结构图。所述错词纠正装置20应用于计算机装置。如图2所示,所述错词纠正装置20可以包括第一获取模块201、转换模块202、生成模块203、预训练模块204、第二获取模块205、微调模块206、纠错模块207。Fig. 2 is a structural diagram of a wrong word correction device provided in the second embodiment of the present application. The wrong word correction device 20 is applied to a computer device. As shown in FIG. 2, the wrong word correction device 20 may include a first acquisition module 201, a conversion module 202, a generation module 203, a pre-training module 204, a second acquisition module 205, a fine-tuning module 206, and an error correction module 207.
第一获取模块201,用于获取通用自然语言数据集,所述通用自然语言数据集包含多个句子。The first acquisition module 201 is configured to acquire a universal natural language data set, the universal natural language data set containing multiple sentences.
所述通用自然语言数据集是包含日常用语的中文文本。The universal natural language data set is a Chinese text containing everyday words.
可以从书籍、新闻、网页(例如百度百科、维基百科等)等数据源中收集所述通用自然语言数据集。例如,可以对书籍中的文字进行文字识别,得到所述通用自然语言数据集。又如,可以对播报的新闻进行语言识别,得到所述通用自然语言数据集。再如,可以从网页中抓取文本,得到所述通用自然语言数据集。The universal natural language data set can be collected from data sources such as books, news, web pages (such as Baidu Baike, Wikipedia, etc.). For example, text recognition can be performed on text in a book to obtain the universal natural language data set. For another example, language recognition can be performed on the broadcast news to obtain the universal natural language data set. For another example, text can be captured from a web page to obtain the universal natural language data set.
或者,可以从预设数据库读取所述通用自然语言数据集。所述预设数据库可以预先存储大量的中文文本。Alternatively, the universal natural language data set can be read from a preset database. The preset database can store a large amount of Chinese texts in advance.
或者,可以接收用户输入的中文文本,将用户输入的中文文本作为所述通用自然语言数据集。Alternatively, the Chinese text input by the user may be received, and the Chinese text input by the user may be used as the universal natural language data set.
转换模块202,用于将所述通用自然语言数据集包含的每个句子转换为拼音序列,得到所述通用自然语言数据集的拼音-句子对。The conversion module 202 is configured to convert each sentence contained in the universal natural language data set into a pinyin sequence to obtain a pinyin-sentence pair of the universal natural language data set.
在本实施例中，所述通用自然语言数据集可以包括多个中文文本，每个中文文本可以包括多个句子(即多句话)。这种情况下，可以根据标点符号(例如逗号、分号、句号等)将每个中文文本划分为多个句子，将划分得到的每个句子转换为拼音序列，即得到每个句子对应的拼音-句子对。In this embodiment, the universal natural language data set may include multiple Chinese texts, and each Chinese text may include multiple sentences. In this case, each Chinese text can be divided into multiple sentences according to punctuation marks (such as commas, semicolons, periods, etc.), and each resulting sentence can be converted into a pinyin sequence, thereby obtaining the pinyin-sentence pair corresponding to each sentence.
可以根据汉字的ASCII码将所述句子转换为拼音序列。由于汉字在计算机系统中以ASCII码表示，只需要利用计算机系统中已有的或用户建立的每个拼音与每个ASCII码对应关系，即可实现将句子转换成拼音序列。若句子含有多音字，可以列出多音字的多个拼音，接收用户选择的正确拼音。The sentence can be converted into a pinyin sequence according to the ASCII codes of the Chinese characters. Since Chinese characters are represented by ASCII codes in the computer system, the sentence can be converted into a pinyin sequence simply by using the correspondence between each pinyin and each ASCII code that already exists in the computer system or is established by the user. If the sentence contains polyphonic characters, the multiple pinyins of each polyphonic character can be listed, and the correct pinyin selected by the user can be received.
或者,可以根据汉字的Unicode值将所述句子转换为拼音序列。具体步骤如下:Alternatively, the sentence can be converted into a pinyin sequence according to the Unicode value of the Chinese character. Specific steps are as follows:
(1)建立拼音-编号对照表，对所有拼音进行编号并将所有拼音对应的编号添加到所述拼音-编号对照表中。所有汉字的拼音不超过512个，可以用两个字节对拼音进行编号。每个拼音对应一个编号。(1) Establish a pinyin-number comparison table, number all the pinyins and add the corresponding numbers of all the pinyins to the pinyin-number comparison table. The pinyin of all Chinese characters does not exceed 512, and the pinyin can be numbered with two bytes. Each pinyin corresponds to a number.
(2)建立Unicode值-拼音编号对照表,将汉字对应拼音的编号按照汉字的Unicode值添加到所述Unicode值-拼音编号对照表中。(2) Establish a Unicode value-Pinyin number comparison table, and add the corresponding pinyin number of the Chinese character to the Unicode value-Pinyin number comparison table according to the Unicode value of the Chinese character.
(3)逐一读取所述句子中的待转换汉字，确定所述待转换汉字的Unicode值，根据所述待转换汉字的Unicode值从所述Unicode值-拼音编号对照表中获取所述待转换汉字对应的拼音的编号，根据所述待转换汉字对应的拼音的编号从所述拼音-编号对照表获得所述待转换汉字对应的拼音，从而将所述句子中的每个汉字转换为拼音。(3) Read the Chinese characters to be converted in the sentence one by one, determine the Unicode value of each character to be converted, obtain the number of the pinyin corresponding to the character from the Unicode value-pinyin number comparison table according to its Unicode value, and obtain the pinyin corresponding to the character from the pinyin-number comparison table according to that number, thereby converting each Chinese character in the sentence into pinyin.
若所述句子中含有多音字，可以在上述步骤(2)中将所述多音字对应的多个拼音的编号按照所述多音字的Unicode值添加到所述Unicode值-拼音编号对照表中，在上述(3)中确定所述多音字的Unicode值，根据所述多音字的Unicode值从所述Unicode值-拼音编号对照表中获取所述多音字对应的多个拼音的编号，根据所述多音字对应的多个拼音的编号从所述拼音-编号对照表获得所述多音字对应的多个拼音。可以接收用户从所述多个拼音中选择的正确拼音，将用户选择的拼音作为所述多音字在所述句子中的正确拼音。If the sentence contains polyphonic characters, in step (2) above the numbers of the multiple pinyins corresponding to a polyphonic character can be added to the Unicode value-pinyin number comparison table according to the Unicode value of the polyphonic character. In step (3) above, the Unicode value of the polyphonic character is determined, the numbers of the multiple pinyins corresponding to the polyphonic character are obtained from the Unicode value-pinyin number comparison table according to that Unicode value, and the multiple pinyins are obtained from the pinyin-number comparison table according to those numbers. The correct pinyin selected by the user from the multiple pinyins can then be received and used as the correct pinyin of the polyphonic character in the sentence.
生成模块203，用于从所述通用自然语言数据集的拼音-句子对中选择多个拼音-句子对，将选择的每个拼音-句子对的部分拼音替换为相近拼音，得到替换后的拼音-句子对，将所述通用自然语言数据集的未选择的拼音-句子对和所述替换后的拼音-句子对组成第一样本集。The generating module 203 is configured to select multiple pinyin-sentence pairs from the pinyin-sentence pairs of the universal natural language data set, replace part of the pinyin of each selected pinyin-sentence pair with similar pinyin to obtain replaced pinyin-sentence pairs, and combine the unselected pinyin-sentence pairs of the universal natural language data set and the replaced pinyin-sentence pairs into a first sample set.
可以从所述通用自然语言数据集的拼音-句子对中随机选择所述多个拼音-句子对,将选择的每个拼音-句子中的部分拼音替换为相近拼音。The multiple pinyin-sentence pairs may be randomly selected from the pinyin-sentence pairs in the universal natural language data set, and part of the pinyin in each selected pinyin-sentence may be replaced with a similar pinyin.
可以按照预设比例从通用自然语言数据集的拼音-句子对中选择多个拼音-句子对。例如,可以从所述通用自然语言数据集的拼音-句子对中选择20%的拼音-句子对进行拼音替换。举例来说,若所述通用自然语言数据集包括100个句子(即包括100个拼音-句子对),则选择20个拼音-句子对进行拼音替换。A plurality of pinyin-sentence pairs can be selected from the pinyin-sentence pairs of the general natural language data set according to a preset ratio. For example, 20% of the pinyin-sentence pairs can be selected from the pinyin-sentence pairs in the universal natural language data set for pinyin replacement. For example, if the general natural language data set includes 100 sentences (that is, includes 100 pinyin-sentence pairs), then 20 pinyin-sentence pairs are selected for pinyin replacement.
所述第一样本集的训练样本包括未选择的拼音-句子对，即正确的拼音-句子对，还包括替换后的拼音-句子对，即将部分拼音替换为相近拼音的拼音-句子对。The training samples of the first sample set include the unselected pinyin-sentence pairs, i.e., correct pinyin-sentence pairs, and also include the replaced pinyin-sentence pairs, i.e., pinyin-sentence pairs in which part of the pinyin has been replaced with similar pinyin.
本申请主要用于对语言识别得到的句子进行纠错。由于语音识别得到的句子错误大多是句子中的词语有意义而句子无意义，例如“需要为谁投保”有时会被识别成“需要为谁淘宝”。因此，不仅需要正确的拼音-句子对作为训练样本，还需要将部分拼音替换为相近拼音的拼音-句子对作为模型的训练样本。This application is mainly used to correct errors in sentences obtained by speech recognition. In most such errors, the words in the sentence are meaningful but the sentence as a whole is not; for example, “需要为谁投保” (“for whom to insure”) is sometimes recognized as “需要为谁淘宝” (“for whom to Taobao”). Therefore, not only correct pinyin-sentence pairs are needed as training samples, but also pinyin-sentence pairs in which part of the pinyin has been replaced with similar pinyin.
预训练模块204,用于利用所述第一样本集对神经网络模型进行预训练,得到预训练后的神经网络模型。The pre-training module 204 is configured to pre-train the neural network model by using the first sample set to obtain the pre-trained neural network model.
所述神经网络模型的输入为拼音序列,输出为对应的句子(即汉字序列),对拼音序列中的每一个拼音,预测其对应的汉字。The input of the neural network model is a pinyin sequence, and the output is a corresponding sentence (ie, a sequence of Chinese characters). For each pinyin in the pinyin sequence, the corresponding Chinese character is predicted.
在对神经网络模型进行训练时,以每个未选择的拼音-句子对(即未替换的拼音-句子对)和每个替换后的拼音-句子对作为训练样本。拼音-句子对中的拼音序列为神经网络模型的输入,拼音-句子对中的句子为真实结果。When training the neural network model, each unselected pinyin-sentence pair (ie unreplaced pinyin-sentence pair) and each replaced pinyin-sentence pair are used as training samples. The pinyin sequence in the pinyin-sentence pair is the input of the neural network model, and the sentence in the pinyin-sentence pair is the real result.
在本实施例中,所述神经网络模型可以是transformer模型。In this embodiment, the neural network model may be a transformer model.
transformer模型可以接受一串序列作为输入,同时输出一串序列,在本申请中,Transformer模型将拼音序列作为输入,输出汉字序列。The transformer model can accept a string of sequences as input and output a string of sequences at the same time. In this application, the Transformer model uses a Pinyin sequence as input and outputs a sequence of Chinese characters.
transformer模型包含编码层、自注意力层、解码层。其中编码层和解码层分别对应拼音的编码和到汉字的解码。The transformer model includes an encoding layer, a self-attention layer, and a decoding layer. The coding layer and the decoding layer correspond to the coding of Pinyin and the decoding of Chinese characters respectively.
自注意力层则用于重复拼音的汉字预测。由于汉字拼音有大量重复，不同的汉字和词语对应于相同的拼音，例如“爆笑”和“报效”拥有同样的拼音和声调，因此在每一个拼音所在进行预测时，需要“关注”整个句子的拼音序列，而不是只看当前位置的拼音。自注意力机制可以使得某一位置的拼音获得其它所有位置的拼音表示，从而做出更符合该句子场景的汉字预测。The self-attention layer is used to predict Chinese characters for repeated pinyin. Chinese pinyin contains a great deal of repetition, and different characters and words correspond to the same pinyin; for example, “爆笑” and “报效” share the same pinyin and tones (bào xiào). Therefore, when predicting the character at each pinyin position, the model needs to “attend” to the pinyin sequence of the entire sentence rather than look only at the pinyin at the current position. The self-attention mechanism allows the pinyin at one position to obtain the pinyin representations of all other positions, so as to predict Chinese characters that better fit the context of the sentence.
在经过大量样本的训练后，该transformer模型可以通过输入拼音序列来输出对应的汉字序列。After training on a large number of samples, the transformer model can output the corresponding Chinese character sequence from an input pinyin sequence.
第二获取模块205,用于获取多个与特定领域相关的包含相近拼音的拼音-句子对作为第二样本集。The second acquisition module 205 is configured to acquire a plurality of pinyin-sentence pairs containing similar pinyin related to a specific field as a second sample set.
所述第二样本集中的每个训练样本是与特定领域相关的一个拼音-句子对,该拼音-句子对中包含与所述特定领域相关的相近拼音。Each training sample in the second sample set is a pinyin-sentence pair related to a specific field, and the pinyin-sentence pair contains similar pinyin related to the specific field.
特定领域是本方法所要应用的专有领域,例如法律、保险等。The specific field is the exclusive field to be applied in this method, such as law, insurance, etc.
第一获取模块201获得的语言数据集是通用自然语言数据集，主要包含一些日常用语，根据通用自然语言数据集得到的第一样本集是关于日常用语的训练样本，因此预训练得到的神经网络模型在当日常生活中的句子有明显的语音识别错误时，可以进行很好地纠错。但当遇到某些例如法律、保险等专有领域，则神经网络模型的纠错效果有所下降，会将很多专有词语识别为日常用语。例如将“需要为谁投保”中的“投保”识别为“淘宝”。因此要应用到特定领域进行错词纠错时，需要该特定领域的样本数据。The language data set obtained by the first acquisition module 201 is a universal natural language data set that mainly contains everyday expressions, and the first sample set derived from it consists of training samples of everyday expressions. The pre-trained neural network model can therefore correct obvious speech recognition errors in everyday sentences well. However, in proprietary fields such as law and insurance, the error correction performance of the neural network model declines, and many domain-specific terms are recognized as everyday words, for example, “投保” (“insure”) in “需要为谁投保” is recognized as “淘宝” (“Taobao”). Therefore, when the method is applied to a specific field for wrong-word correction, sample data from that field is required.
可以按照下述方法获取多个与特定领域相关的包含相近拼音的拼音-句子对:You can obtain multiple pinyin-sentence pairs with similar pinyin related to a specific field according to the following method:
获取所述特定领域的文本数据集,所述文本数据集包含多个句子;Acquiring a text data set of the specific field, the text data set containing multiple sentences;
将所述文本数据集包含的每个句子转换为拼音序列，得到所述文本数据集的拼音-句子对；Converting each sentence contained in the text data set into a pinyin sequence to obtain a pinyin-sentence pair of the text data set;
将所述文本数据集的拼音-句子对中所述特定领域的专有词语的拼音替换为相近拼音，得到与特定领域相关的包含相近拼音的拼音-句子对。例如，将“需要为谁投保”中的“投保”的拼音(tou,二声,bao,三声)替换为“淘宝”的拼音(tao,二声,bao,三声)。Replace the pinyin of the domain-specific terms of the specific field in the pinyin-sentence pairs of the text data set with similar pinyin to obtain pinyin-sentence pairs containing similar pinyin related to the specific field. For example, the pinyin of “投保” (“insure”: tou, second tone; bao, third tone) in “需要为谁投保” is replaced with the pinyin of “淘宝” (“Taobao”: tao, second tone; bao, third tone).
或者,可以预先建立数据库,用于存储所述特定领域识别错误的拼音-句子对,从所述数据库获取多个与特定领域相关的包含相近拼音的拼音-句子对。Alternatively, a database may be established in advance to store the pinyin-sentence pairs that are incorrectly recognized in the specific field, and a plurality of pinyin-sentence pairs containing similar pinyin related to the specific field can be obtained from the database.
微调模块206,用于利用所述第二样本集对所述预训练后的神经网络模型进行微调,得到微调后的神经网络模型。The fine-tuning module 206 is configured to use the second sample set to fine-tune the pre-trained neural network model to obtain a fine-tuned neural network model.
利用所述第二样本集对所述神经网络模型进行微调的目的是使所述神经网络模型更适用于特定领域,提高特定领域的纠错准确率。The purpose of fine-tuning the neural network model by using the second sample set is to make the neural network model more suitable for a specific field and improve the error correction accuracy rate in the specific field.
微调训练后的模型在拼音近似的情况下，更倾向于预测为该特定领域的专有词语，从而提高语音识别错误的错词纠正效果。When pinyins are similar, the fine-tuned model is more inclined to predict the domain-specific terms of that field, thereby improving the correction of wrong words caused by speech recognition errors.
可以固定所述神经网络模型的前面几层神经元的权值，微调神经网络模型的后面几层神经元的权值。这样做主要是为了避免第二样本集过小出现过拟合现象，神经网络模型前几层神经元一般包含更多的一般特征，对于许多任务而言非常重要，但是后面几层神经元的特征学习注重高层特征，不同的数据集间差异较大。The weights of the neurons in the first few layers of the neural network model can be fixed, and the weights of the neurons in the last few layers can be fine-tuned. This is mainly to avoid overfitting when the second sample set is small: the first few layers of a neural network model generally learn more general features, which are important for many tasks, while the later layers learn high-level features that differ considerably between data sets.
纠错模块207,用于将待纠错句子的拼音序列输入所述微调后的神经网络模型进行纠错,得到纠错后的句子。The error correction module 207 is configured to input the pinyin sequence of the sentence to be corrected into the fine-tuned neural network model for error correction, and obtain the corrected sentence.
语音识别得到的结果可以包括多个中文文本，每个中文文本可以包括多个待纠错句子（即多句话）。这种情况下，可以根据标点符号（例如逗号、分号、句号等）将语音识别得到的中文文本划分为多个待纠错句子，将划分得到的每个待纠错句子转换为拼音序列。The result of speech recognition may include multiple Chinese texts, and each Chinese text may include multiple sentences to be corrected (i.e., multiple sentences). In this case, the Chinese text obtained by speech recognition can be divided into multiple sentences to be corrected according to punctuation (such as commas, semicolons, periods, etc.), and each resulting sentence to be corrected can be converted into a pinyin sequence.
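The punctuation-based splitting above can be sketched with a short Python snippet. The helper name and the exact punctuation set are illustrative assumptions; the disclosure only names commas, semicolons, and periods as examples.

```python
import re

# Illustrative sketch: split recognized Chinese text into sentences to be
# corrected, using common Chinese and ASCII punctuation as boundaries.

def split_sentences(text):
    parts = re.split(r"[，。；！？,;.!?]+", text)
    return [p for p in parts if p]  # drop empty trailing fragments

text = "需要为谁投保，请提供被保险人信息。谢谢！"
print(split_sentences(text))
# ['需要为谁投保', '请提供被保险人信息', '谢谢']
```

Each element of the resulting list would then be converted into a pinyin sequence and fed to the model separately.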
可以根据汉字的ASCII码将所述待纠错句子转换为拼音序列。或者,可以根据汉字的Unicode值将所述待纠错句子转换为拼音序列。将待纠错句子转换为拼音序列的方法可以参考转换模块202的描述。The sentence to be corrected can be converted into a pinyin sequence according to the ASCII code of the Chinese character. Alternatively, the sentence to be corrected can be converted into a pinyin sequence according to the Unicode value of the Chinese character. Refer to the description of the conversion module 202 for the method of converting the sentence to be corrected into a pinyin sequence.
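The Unicode-value-based conversion referred to above (described in detail for the conversion module and in claim 3) can be illustrated with a toy two-table lookup. The three-character tables below are illustrative assumptions only; a real system would cover the full Chinese character set.

```python
# Toy sketch of the two-table lookup: a pinyin-number comparison table plus a
# Unicode value -> pinyin number comparison table. Tables are illustrative.

PINYIN_BY_ID = {0: "tou2", 1: "bao3", 2: "shui2"}          # pinyin-number table
PINYIN_ID_BY_UNICODE = {ord("投"): 0, ord("保"): 1, ord("谁"): 2}

def to_pinyin_sequence(sentence):
    seq = []
    for ch in sentence:
        pid = PINYIN_ID_BY_UNICODE[ord(ch)]  # Unicode value -> pinyin number
        seq.append(PINYIN_BY_ID[pid])        # pinyin number -> pinyin
    return seq

print(to_pinyin_sequence("谁投保"))  # ['shui2', 'tou2', 'bao3']
```

In practice, polyphonic characters would need more than a single number per Unicode value; the sketch ignores that complication.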
或者,可以接收用户输入的待纠错句子,将所述待纠错句子转换为拼音序列。例如,可以生成用户界面,从所述用户界面接收用户输入的待纠错句子。也可以直接接收用户输入的待纠错句子的拼音序列。Alternatively, the sentence to be corrected input by the user may be received, and the sentence to be corrected may be converted into a pinyin sequence. For example, a user interface may be generated, and a sentence to be corrected input by the user may be received from the user interface. It is also possible to directly receive the pinyin sequence of the sentence to be corrected input by the user.
本实施例的错词纠正装置20获取通用自然语言数据集，所述通用自然语言数据集包含多个句子；将所述通用自然语言数据集包含的每个句子转换为拼音序列，得到所述通用自然语言数据集的拼音-句子对；从所述通用自然语言数据集的拼音-句子对中选择多个拼音-句子对，将选择的每个拼音-句子对的部分拼音替换为相近拼音，得到替换后的拼音-句子对，将所述通用自然语言数据集的未选择的拼音-句子对和所述替换后的拼音-句子对组成第一样本集；利用所述第一样本集对神经网络模型进行预训练，得到预训练后的神经网络模型；获取多个与特定领域相关的包含相近拼音的拼音-句子对作为第二样本集；利用所述第二样本集对所述预训练后的神经网络模型进行微调，得到微调后的神经网络模型；将待纠错句子的拼音序列输入所述微调后的神经网络模型进行纠错，得到纠错后的句子。本实施例可以解决由于语音识别系统的通用性在特定领域内无法准确预测专有词语的问题，能够对语言识别中专有词语被识别为常用词进行纠错。The wrong word correction device 20 of this embodiment obtains a universal natural language data set containing multiple sentences; converts each sentence in the data set into a pinyin sequence to obtain pinyin-sentence pairs of the data set; selects multiple pinyin-sentence pairs, replaces part of the pinyin of each selected pair with similar pinyin to obtain replaced pinyin-sentence pairs, and composes the unselected pairs and the replaced pairs into a first sample set; pre-trains a neural network model with the first sample set; obtains multiple pinyin-sentence pairs containing similar pinyin related to a specific field as a second sample set; fine-tunes the pre-trained model with the second sample set; and inputs the pinyin sequence of a sentence to be corrected into the fine-tuned model to obtain the corrected sentence. This embodiment can solve the problem that a general-purpose speech recognition system cannot accurately predict domain-specific terms, and can correct domain-specific terms that speech recognition mistook for common words.
在另一实施例中，所述错词纠正装置20还可以包括：识别模块，对输入的语音进行识别，得到所述待纠错句子。可以采用各种语音识别技术，例如动态时间规整（Dynamic Time Warping，DTW）、隐马尔可夫模型（Hidden Markov Model，HMM）、矢量量化（Vector Quantization，VQ）、人工神经网络（Artificial Neural Network，ANN）等技术对所述语音进行识别。In another embodiment, the wrong word correction device 20 may further include a recognition module that recognizes input speech to obtain the sentence to be corrected. Various speech recognition technologies can be used to recognize the speech, such as Dynamic Time Warping (DTW), Hidden Markov Model (HMM), Vector Quantization (VQ), and Artificial Neural Network (ANN).
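Of the techniques listed above, DTW is the simplest to illustrate. The following stdlib-only sketch (an illustration, not the recognition module of the disclosure) computes the classic DTW alignment cost between two 1-D feature sequences.

```python
# Illustrative sketch of Dynamic Time Warping (DTW): minimal cumulative
# alignment cost between two 1-D feature sequences, allowing time stretching.

def dtw_distance(a, b):
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # best of: insertion, deletion, match
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # 0.0 — identical up to time warping
```

Real speech recognizers apply this to multi-dimensional acoustic feature frames (e.g. MFCCs) rather than scalars, but the recurrence is the same.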
实施例三Embodiment Three
本实施例提供一种非易失性可读存储介质，该非易失性可读存储介质上存储有计算机可读指令，该计算机可读指令被处理器执行时实现上述错词纠正方法实施例中的步骤，例如图1所示的步骤101-107：This embodiment provides a non-volatile readable storage medium storing computer-readable instructions which, when executed by a processor, implement the steps in the above wrong word correction method embodiment, for example steps 101-107 shown in Figure 1:
步骤101,获取通用自然语言数据集,所述通用自然语言数据集包含多个句子;Step 101: Obtain a universal natural language data set, where the universal natural language data set contains multiple sentences;
步骤102,将所述通用自然语言数据集包含的每个句子转换为拼音序列,得到所述通用自然语言数据集的拼音-句子对;Step 102: Convert each sentence contained in the universal natural language data set into a pinyin sequence to obtain a pinyin-sentence pair of the universal natural language data set;
步骤103，从所述通用自然语言数据集的拼音-句子对中选择多个拼音-句子对，将选择的每个拼音-句子对的部分拼音替换为相近拼音，得到替换后的拼音-句子对，将所述通用自然语言数据集的未选择的拼音-句子对和所述替换后的拼音-句子对组成第一样本集；Step 103: Select multiple pinyin-sentence pairs from the pinyin-sentence pairs of the universal natural language data set, replace part of the pinyin of each selected pinyin-sentence pair with similar pinyin to obtain replaced pinyin-sentence pairs, and compose the unselected pinyin-sentence pairs of the universal natural language data set and the replaced pinyin-sentence pairs into a first sample set;
步骤104,利用所述第一样本集对神经网络模型进行预训练,得到预训练后的神经网络模型;Step 104: Pre-train the neural network model by using the first sample set to obtain a pre-trained neural network model;
步骤105,获取多个与特定领域相关的包含相近拼音的拼音-句子对作为第二样本集;Step 105: Obtain a plurality of pinyin-sentence pairs containing similar pinyin related to a specific field as a second sample set;
步骤106,利用所述第二样本集对所述预训练后的神经网络模型进行微调,得到微调后的神经网络模型;Step 106: Use the second sample set to fine-tune the pre-trained neural network model to obtain a fine-tuned neural network model;
步骤107,将待纠错句子的拼音序列输入所述微调后的神经网络模型进行纠错,得到纠错后的句子。Step 107: Input the pinyin sequence of the sentence to be corrected into the fine-tuned neural network model for error correction, and obtain the corrected sentence.
或者，该计算机可读指令被处理器执行时实现上述装置实施例中各模块的功能，例如图2中的模块201-207：Alternatively, when executed by the processor, the computer-readable instructions implement the functions of the modules in the above device embodiment, for example modules 201-207 in Figure 2:
第一获取模块201,用于获取通用自然语言数据集,所述通用自然语言数据集包含多个句子;The first acquiring module 201 is configured to acquire a universal natural language data set, the universal natural language data set containing multiple sentences;
转换模块202,用于将所述通用自然语言数据集包含的每个句子转换为拼音序列,得到所述通用自然语言数据集的拼音-句子对;The conversion module 202 is configured to convert each sentence contained in the universal natural language data set into a pinyin sequence to obtain a pinyin-sentence pair of the universal natural language data set;
生成模块203，用于从所述通用自然语言数据集的拼音-句子对中选择多个拼音-句子对，将选择的每个拼音-句子对的部分拼音替换为相近拼音，得到替换后的拼音-句子对，将所述通用自然语言数据集的未选择的拼音-句子对和所述替换后的拼音-句子对组成第一样本集；The generating module 203 is configured to select multiple pinyin-sentence pairs from the pinyin-sentence pairs of the universal natural language data set, replace part of the pinyin of each selected pinyin-sentence pair with similar pinyin to obtain replaced pinyin-sentence pairs, and compose the unselected pinyin-sentence pairs of the universal natural language data set and the replaced pinyin-sentence pairs into a first sample set;
预训练模块204,用于利用所述第一样本集对神经网络模型进行预训练,得到预训练后的神经网络模型;The pre-training module 204 is configured to pre-train the neural network model by using the first sample set to obtain a pre-trained neural network model;
第二获取模块205,用于获取多个与特定领域相关的包含相近拼音的拼音-句子对作为第二样本集;The second acquiring module 205 is configured to acquire a plurality of pinyin-sentence pairs containing similar pinyin related to a specific field as a second sample set;
微调模块206,用于利用所述第二样本集对所述预训练后的神经网络模型进行微调,得到微调后的神经网络模型;The fine-tuning module 206 is configured to use the second sample set to fine-tune the pre-trained neural network model to obtain a fine-tuned neural network model;
纠错模块207,用于将待纠错句子的拼音序列输入所述微调后的神经网络模型进行纠错,得到纠错后的句子。The error correction module 207 is configured to input the pinyin sequence of the sentence to be corrected into the fine-tuned neural network model for error correction, and obtain the corrected sentence.
实施例四Embodiment Four
图3为本申请实施例四提供的计算机装置的示意图。所述计算机装置30包括存储器301、处理器302以及存储在所述存储器301中并可在所述处理器302上运行的计算机可读指令303,例如错词纠正程序。所述处理器302执行所述计算机可读指令303时实现上述错词纠正方法实施例中的步骤,例如图1所示的步骤101-107:FIG. 3 is a schematic diagram of a computer device provided in Embodiment 4 of this application. The computer device 30 includes a memory 301, a processor 302, and computer-readable instructions 303 stored in the memory 301 and running on the processor 302, such as a wrong word correction program. When the processor 302 executes the computer-readable instruction 303, the steps in the embodiment of the above-mentioned wrong word correction method are implemented, for example, steps 101-107 shown in Fig. 1:
步骤101,获取通用自然语言数据集,所述通用自然语言数据集包含多个句子;Step 101: Obtain a universal natural language data set, where the universal natural language data set contains multiple sentences;
步骤102,将所述通用自然语言数据集包含的每个句子转换为拼音序列,得到所述通用自然语言数据集的拼音-句子对;Step 102: Convert each sentence contained in the universal natural language data set into a pinyin sequence to obtain a pinyin-sentence pair of the universal natural language data set;
步骤103，从所述通用自然语言数据集的拼音-句子对中选择多个拼音-句子对，将选择的每个拼音-句子对的部分拼音替换为相近拼音，得到替换后的拼音-句子对，将所述通用自然语言数据集的未选择的拼音-句子对和所述替换后的拼音-句子对组成第一样本集；Step 103: Select multiple pinyin-sentence pairs from the pinyin-sentence pairs of the universal natural language data set, replace part of the pinyin of each selected pinyin-sentence pair with similar pinyin to obtain replaced pinyin-sentence pairs, and compose the unselected pinyin-sentence pairs of the universal natural language data set and the replaced pinyin-sentence pairs into a first sample set;
步骤104,利用所述第一样本集对神经网络模型进行预训练,得到预训练后的神经网络模型;Step 104: Pre-train the neural network model by using the first sample set to obtain a pre-trained neural network model;
步骤105,获取多个与特定领域相关的包含相近拼音的拼音-句子对作为第二样本集;Step 105: Obtain a plurality of pinyin-sentence pairs containing similar pinyin related to a specific field as a second sample set;
步骤106,利用所述第二样本集对所述预训练后的神经网络模型进行微调,得到微调后的神经网络模型;Step 106: Use the second sample set to fine-tune the pre-trained neural network model to obtain a fine-tuned neural network model;
步骤107,将待纠错句子的拼音序列输入所述微调后的神经网络模型进行纠错,得到纠错后的句子。Step 107: Input the pinyin sequence of the sentence to be corrected into the fine-tuned neural network model for error correction, and obtain the corrected sentence.
或者，该计算机可读指令被处理器执行时实现上述装置实施例中各模块的功能，例如图2中的模块201-207：Alternatively, when executed by the processor, the computer-readable instructions implement the functions of the modules in the above device embodiment, for example modules 201-207 in Figure 2:
第一获取模块201,用于获取通用自然语言数据集,所述通用自然语言数据集包含多个句子;The first acquiring module 201 is configured to acquire a universal natural language data set, the universal natural language data set containing multiple sentences;
转换模块202,用于将所述通用自然语言数据集包含的每个句子转换为拼音序列,得到所述通用自然语言数据集的拼音-句子对;The conversion module 202 is configured to convert each sentence contained in the universal natural language data set into a pinyin sequence to obtain a pinyin-sentence pair of the universal natural language data set;
生成模块203，用于从所述通用自然语言数据集的拼音-句子对中选择多个拼音-句子对，将选择的每个拼音-句子对的部分拼音替换为相近拼音，得到替换后的拼音-句子对，将所述通用自然语言数据集的未选择的拼音-句子对和所述替换后的拼音-句子对组成第一样本集；The generating module 203 is configured to select multiple pinyin-sentence pairs from the pinyin-sentence pairs of the universal natural language data set, replace part of the pinyin of each selected pinyin-sentence pair with similar pinyin to obtain replaced pinyin-sentence pairs, and compose the unselected pinyin-sentence pairs of the universal natural language data set and the replaced pinyin-sentence pairs into a first sample set;
预训练模块204,用于利用所述第一样本集对神经网络模型进行预训练,得到预训练后的神经网络模型;The pre-training module 204 is configured to pre-train the neural network model by using the first sample set to obtain a pre-trained neural network model;
第二获取模块205,用于获取多个与特定领域相关的包含相近拼音的拼音-句子对作为第二样本集;The second acquiring module 205 is configured to acquire a plurality of pinyin-sentence pairs containing similar pinyin related to a specific field as a second sample set;
微调模块206,用于利用所述第二样本集对所述预训练后的神经网络模型进行微调,得到微调后的神经网络模型;The fine-tuning module 206 is configured to use the second sample set to fine-tune the pre-trained neural network model to obtain a fine-tuned neural network model;
纠错模块207,用于将待纠错句子的拼音序列输入所述微调后的神经网络模型进行纠错,得到纠错后的句子。The error correction module 207 is configured to input the pinyin sequence of the sentence to be corrected into the fine-tuned neural network model for error correction, and obtain the corrected sentence.
示例性的，所述计算机可读指令303可以被分割成一个或多个模块，所述一个或者多个模块被存储在所述存储器301中，并由所述处理器302执行，以完成本方法。例如，所述计算机可读指令303可以被分割成图2中的第一获取模块201、转换模块202、生成模块203、预训练模块204、第二获取模块205、微调模块206、纠错模块207，各模块具体功能参见实施例二。Exemplarily, the computer-readable instructions 303 may be divided into one or more modules, which are stored in the memory 301 and executed by the processor 302 to complete the method. For example, the computer-readable instructions 303 may be divided into the first acquisition module 201, the conversion module 202, the generating module 203, the pre-training module 204, the second acquisition module 205, the fine-tuning module 206, and the error correction module 207 in Figure 2; refer to Embodiment Two for the specific functions of each module.
所述计算机装置30可以是桌上型计算机、笔记本、掌上电脑及云端服务器等计算设备。本领域技术人员可以理解，所述示意图3仅仅是计算机装置30的示例，并不构成对计算机装置30的限定，可以包括比图示更多或更少的部件，或者组合某些部件，或者不同的部件，例如所述计算机装置30还可以包括输入输出设备、网络接入设备、总线等。The computer device 30 may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a cloud server. Those skilled in the art can understand that FIG. 3 is only an example of the computer device 30 and does not constitute a limitation on it; the device may include more or fewer components than shown, combine certain components, or have different components. For example, the computer device 30 may also include input/output devices, network access devices, buses, and so on.
所称处理器302可以是中央处理单元（Central Processing Unit，CPU），还可以是其他通用处理器、数字信号处理器（Digital Signal Processor，DSP）、专用集成电路（Application Specific Integrated Circuit，ASIC）、现场可编程门阵列（Field-Programmable Gate Array，FPGA）或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。通用处理器可以是微处理器或者该处理器302也可以是任何常规的处理器等，所述处理器302是所述计算机装置30的控制中心，利用各种接口和线路连接整个计算机装置30的各个部分。The processor 302 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor, or the processor 302 may be any conventional processor. The processor 302 is the control center of the computer device 30 and uses various interfaces and lines to connect the various parts of the entire computer device 30.
所述存储器301可用于存储所述计算机可读指令303，所述处理器302通过运行或执行存储在所述存储器301内的计算机可读指令或模块，以及调用存储在存储器301内的数据，实现所述计算机装置30的各种功能。所述存储器301可主要包括存储程序区和存储数据区，其中，存储程序区可存储操作系统、至少一个功能所需的应用程序（比如声音播放功能、图像播放功能等）等；存储数据区可存储根据计算机装置30的使用所创建的数据。此外，存储器301可以包括非易失性存储器，例如硬盘、内存、插接式硬盘，智能存储卡（Smart Media Card，SMC），安全数字（Secure Digital，SD）卡，闪存卡（Flash Card）、至少一个磁盘存储器件、闪存器件、或其他非易失性固态存储器件。The memory 301 may be used to store the computer-readable instructions 303. The processor 302 implements the various functions of the computer device 30 by running or executing the computer-readable instructions or modules stored in the memory 301 and calling data stored in the memory 301. The memory 301 may mainly include a program storage area and a data storage area: the program storage area may store an operating system and application programs required by at least one function (such as a sound playback function or an image playback function); the data storage area may store data created according to the use of the computer device 30. In addition, the memory 301 may include non-volatile memory, such as a hard disk, internal memory, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
所述计算机装置30集成的模块如果以软件功能模块的形式实现并作为独立的产品销售或使用时，可以存储在一个非易失性可读存储介质中。基于这样的理解，本申请实现上述实施例方法中的全部或部分流程，也可以通过计算机可读指令来指令相关的硬件来完成，所述的计算机可读指令可存储于一非易失性可读存储介质中，该计算机可读指令在被处理器执行时，可实现上述各个方法实施例的步骤。其中，所述计算机可读指令可以为源代码形式、对象代码形式、可执行文件或某些中间形式等。所述计算机可读介质可以包括：能够携带所述计算机可读指令代码的任何实体或装置、记录介质、U盘、移动硬盘、磁碟、光盘、只读存储器（ROM，Read-Only Memory）。If the integrated modules of the computer device 30 are implemented in the form of software function modules and sold or used as independent products, they may be stored in a non-volatile readable storage medium. Based on this understanding, all or part of the processes in the above method embodiments of this application may also be completed by instructing relevant hardware through computer-readable instructions, which may be stored in a non-volatile readable storage medium; when executed by a processor, the computer-readable instructions can implement the steps of the foregoing method embodiments. The computer-readable instructions may be in source code form, object code form, an executable file, some intermediate form, etc. The computer-readable medium may include any entity or device capable of carrying the computer-readable instruction code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, or a read-only memory (ROM).
在本申请所提供的几个实施例中,应该理解到,所揭露的系统,装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述模块的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式。In the several embodiments provided in this application, it should be understood that the disclosed system, device, and method may be implemented in other ways. For example, the device embodiments described above are only illustrative. For example, the division of the modules is only a logical function division, and there may be other division methods in actual implementation.
最后应说明的是，以上实施例仅用以说明本申请的技术方案而非限制，尽管参照较佳实施例对本申请进行了详细说明，本领域的普通技术人员应当理解，可以对本申请的技术方案进行修改或等同替换，而不脱离本申请技术方案的精神和范围。Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of this application and not to limit them. Although this application has been described in detail with reference to the preferred embodiments, those of ordinary skill in the art should understand that the technical solutions of this application may be modified or equivalently replaced without departing from the spirit and scope of the technical solutions of this application.

Claims (20)

  1. 一种错词纠正方法,其特征在于,所述方法包括:A method for correcting wrong words, characterized in that the method includes:
    获取通用自然语言数据集,所述通用自然语言数据集包含多个句子;Acquiring a universal natural language data set, the universal natural language data set containing a plurality of sentences;
    将所述通用自然语言数据集包含的每个句子转换为拼音序列,得到所述通用自然语言数据集的拼音-句子对;Converting each sentence contained in the universal natural language data set into a pinyin sequence to obtain a pinyin-sentence pair of the universal natural language data set;
    从所述通用自然语言数据集的拼音-句子对中选择多个拼音-句子对，将选择的每个拼音-句子对的部分拼音替换为相近拼音，得到替换后的拼音-句子对，将所述通用自然语言数据集的未选择的拼音-句子对和所述替换后的拼音-句子对组成第一样本集；Select multiple pinyin-sentence pairs from the pinyin-sentence pairs of the universal natural language data set, and replace part of the pinyin of each selected pinyin-sentence pair with similar pinyin to obtain replaced pinyin-sentence pairs; the unselected pinyin-sentence pairs of the universal natural language data set and the replaced pinyin-sentence pairs form a first sample set;
    利用所述第一样本集对神经网络模型进行预训练,得到预训练后的神经网络模型;Pre-training the neural network model by using the first sample set to obtain a pre-trained neural network model;
    获取多个与特定领域相关的包含相近拼音的拼音-句子对作为第二样本集;Acquire multiple pinyin-sentence pairs with similar pinyin related to a specific field as the second sample set;
    利用所述第二样本集对所述预训练后的神经网络模型进行微调,得到微调后的神经网络模型;Using the second sample set to fine-tune the pre-trained neural network model to obtain a fine-tuned neural network model;
    将待纠错句子的拼音序列输入所述微调后的神经网络模型进行纠错,得到纠错后的句子。The pinyin sequence of the sentence to be corrected is input into the fine-tuned neural network model for error correction, and the corrected sentence is obtained.
  2. 如权利要求1所述的方法,其特征在于,所述将所述通用自然语言数据集包含的每个句子转换为拼音序列包括:The method according to claim 1, wherein the converting each sentence contained in the universal natural language data set into a pinyin sequence comprises:
    根据汉字的ASCII码将所述句子转换为拼音序列;或Convert the sentence into a Pinyin sequence according to the ASCII code of the Chinese character; or
    根据汉字的Unicode值将所述句子转换为拼音序列。The sentence is converted into a Pinyin sequence according to the Unicode value of the Chinese character.
  3. 如权利要求2所述的方法,其特征在于,所述根据汉字的Unicode值将所述句子转换为拼音序列包括:3. The method of claim 2, wherein the converting the sentence into a pinyin sequence according to the Unicode value of the Chinese character comprises:
    建立拼音-编号对照表,对所有拼音进行编号并将所有拼音对应的编号添加到所述拼音-编号对照表中;Establish a pinyin-number comparison table, number all the pinyins and add the corresponding numbers of all the pinyins to the pinyin-number comparison table;
    建立Unicode值-拼音编号对照表,将汉字对应拼音的编号按照汉字的Unicode值添加到所述Unicode值-拼音编号对照表中;Establish a Unicode value-Pinyin number comparison table, and add the number of the Chinese character corresponding to the pinyin to the Unicode value-Pinyin number comparison table according to the Unicode value of the Chinese character;
    逐一读取所述句子中的待转换汉字，确定所述待转换汉字的Unicode值，根据所述待转换汉字的Unicode值从所述Unicode值-拼音编号对照表中获取所述待转换汉字对应的拼音的编号，根据所述待转换汉字对应的拼音的编号从所述拼音-编号对照表获得所述待转换汉字对应的拼音，从而将所述句子中的每个汉字转换为拼音。Read the Chinese characters to be converted in the sentence one by one, determine the Unicode value of each Chinese character to be converted, obtain the number of the pinyin corresponding to the Chinese character to be converted from the Unicode value-pinyin number comparison table according to its Unicode value, and obtain the pinyin corresponding to the Chinese character to be converted from the pinyin-number comparison table according to that number, thereby converting each Chinese character in the sentence into pinyin.
  4. 如权利要求1所述的方法,其特征在于,所述从所述通用自然语言数据集的拼音-句子对中选择多个拼音-句子对包括:The method according to claim 1, wherein the selecting a plurality of pinyin-sentence pairs from the pinyin-sentence pairs of the universal natural language data set comprises:
    从所述通用自然语言数据集的拼音-句子对中随机选择所述多个拼音-句子对;和/或Randomly selecting the plurality of pinyin-sentence pairs from the pinyin-sentence pairs of the universal natural language data set; and/or
    按照预设比例从所述通用自然语言数据集的拼音-句子对中选择所述多个拼音-句子对。The multiple pinyin-sentence pairs are selected from the pinyin-sentence pairs in the universal natural language data set according to a preset ratio.
  5. 如权利要求1所述的方法,其特征在于,所述神经网络模型是transformer模型。The method of claim 1, wherein the neural network model is a transformer model.
  6. 如权利要求1所述的方法,其特征在于,所述对所述预训练后的神经网络模型进行微调包括:The method of claim 1, wherein the fine-tuning the neural network model after the pre-training comprises:
    固定所述神经网络模型的前面几层神经元的权值,微调所述神经网络模型的后面几层神经元的权值。Fix the weights of the neurons in the first few layers of the neural network model, and fine-tune the weights of the neurons in the next few layers of the neural network model.
  7. 如权利要求1-6中任一项所述的方法,其特征在于,所述方法还包括:The method according to any one of claims 1-6, wherein the method further comprises:
    对输入的语音进行识别,得到所述待纠错句子。Recognizing the input voice to obtain the sentence to be corrected.
  8. 一种错词纠正装置,其特征在于,所述装置包括:A wrong word correction device, characterized in that the device includes:
    第一获取模块,用于获取通用自然语言数据集,所述通用自然语言数据集包含多个句子;The first acquisition module is configured to acquire a universal natural language data set, the universal natural language data set containing multiple sentences;
    转换模块,用于将所述通用自然语言数据集包含的每个句子转换为拼音序列,得到所述通用自然语言数据集的拼音-句子对;A conversion module, configured to convert each sentence contained in the universal natural language data set into a pinyin sequence to obtain a pinyin-sentence pair of the universal natural language data set;
    生成模块，用于从所述通用自然语言数据集的拼音-句子对中选择多个拼音-句子对，将选择的每个拼音-句子对的部分拼音替换为相近拼音，得到替换后的拼音-句子对，将所述通用自然语言数据集的未选择的拼音-句子对和所述替换后的拼音-句子对组成第一样本集；A generating module, configured to select multiple pinyin-sentence pairs from the pinyin-sentence pairs of the universal natural language data set, replace part of the pinyin of each selected pinyin-sentence pair with similar pinyin to obtain replaced pinyin-sentence pairs, and compose the unselected pinyin-sentence pairs of the universal natural language data set and the replaced pinyin-sentence pairs into a first sample set;
    预训练模块,用于用所述第一样本集对神经网络模型进行预训练,得到预训练后的神经网络模型;The pre-training module is used to pre-train the neural network model with the first sample set to obtain the pre-trained neural network model;
    第二获取模块,用于获取多个与特定领域相关的包含相近拼音的拼音-句子对作为第二样本集;The second acquisition module is used to acquire multiple pinyin-sentence pairs containing similar pinyin related to a specific field as a second sample set;
    微调模块,用于利用所述第二样本集对所述预训练后的神经网络模型进行微调,得到微调后的神经网络模型;A fine-tuning module, configured to fine-tune the pre-trained neural network model by using the second sample set to obtain a fine-tuned neural network model;
    纠错模块,用于将待纠错句子的拼音序列输入所述微调后的神经网络模型进行纠错,得到纠错后的句子。The error correction module is used to input the pinyin sequence of the sentence to be corrected into the fine-tuned neural network model for error correction, and obtain the corrected sentence.
  9. 一种计算机装置,其特征在于,所述计算机装置包括处理器和存储器,所述处理器用于执行所述存储器中存储的计算机可读指令以实现以下步骤:A computer device, wherein the computer device includes a processor and a memory, and the processor is configured to execute computer-readable instructions stored in the memory to implement the following steps:
    获取通用自然语言数据集,所述通用自然语言数据集包含多个句子;Acquiring a universal natural language data set, the universal natural language data set containing a plurality of sentences;
    将所述通用自然语言数据集包含的每个句子转换为拼音序列,得到所述通用自然语言数据集的拼音-句子对;Converting each sentence contained in the universal natural language data set into a pinyin sequence to obtain a pinyin-sentence pair of the universal natural language data set;
    从所述通用自然语言数据集的拼音-句子对中选择多个拼音-句子对，将选择的每个拼音-句子对的部分拼音替换为相近拼音，得到替换后的拼音-句子对，将所述通用自然语言数据集的未选择的拼音-句子对和所述替换后的拼音-句子对组成第一样本集；Select multiple pinyin-sentence pairs from the pinyin-sentence pairs of the universal natural language data set, and replace part of the pinyin of each selected pinyin-sentence pair with similar pinyin to obtain replaced pinyin-sentence pairs; the unselected pinyin-sentence pairs of the universal natural language data set and the replaced pinyin-sentence pairs form a first sample set;
    利用所述第一样本集对神经网络模型进行预训练,得到预训练后的神经网络模型;Pre-training the neural network model by using the first sample set to obtain a pre-trained neural network model;
    获取多个与特定领域相关的包含相近拼音的拼音-句子对作为第二样本集;Acquire multiple pinyin-sentence pairs with similar pinyin related to a specific field as the second sample set;
    利用所述第二样本集对所述预训练后的神经网络模型进行微调,得到微调后的神经网络模型;Using the second sample set to fine-tune the pre-trained neural network model to obtain a fine-tuned neural network model;
    将待纠错句子的拼音序列输入所述微调后的神经网络模型进行纠错,得到纠错后的句子。The pinyin sequence of the sentence to be corrected is input into the fine-tuned neural network model for error correction, and the corrected sentence is obtained.
  10. 如权利要求9所述的计算机装置，其特征在于，所述处理器执行所述存储器中存储的计算机可读指令以实现所述将所述通用自然语言数据集包含的每个句子转换为拼音序列时，包括：The computer device of claim 9, wherein when the processor executes the computer-readable instructions stored in the memory to implement the conversion of each sentence contained in the universal natural language data set into a pinyin sequence, the steps include:
    根据汉字的ASCII码将所述句子转换为拼音序列;或Convert the sentence into a Pinyin sequence according to the ASCII code of the Chinese character; or
    根据汉字的Unicode值将所述句子转换为拼音序列。The sentence is converted into a Pinyin sequence according to the Unicode value of the Chinese character.
  11. 如权利要求10所述的计算机装置,其特征在于,所述处理器执行所述存储器中存储的计算机可读指令以实现所述根据汉字的Unicode值将所述句子转换为拼音序列时,包括:10. The computer device according to claim 10, wherein when the processor executes the computer-readable instructions stored in the memory to implement the conversion of the sentence into a Pinyin sequence according to the Unicode value of the Chinese character, it comprises:
    建立拼音-编号对照表,对所有拼音进行编号并将所有拼音对应的编号添加到所述拼音-编号对照表中;Establish a pinyin-number comparison table, number all the pinyins and add the corresponding numbers of all the pinyins to the pinyin-number comparison table;
    建立Unicode值-拼音编号对照表,将汉字对应拼音的编号按照汉字的Unicode值添加到所述Unicode值-拼音编号对照表中;Establish a Unicode value-Pinyin number comparison table, and add the number of the Chinese character corresponding to the pinyin to the Unicode value-Pinyin number comparison table according to the Unicode value of the Chinese character;
    逐一读取所述句子中的待转换汉字，确定所述待转换汉字的Unicode值，根据所述待转换汉字的Unicode值从所述Unicode值-拼音编号对照表中获取所述待转换汉字对应的拼音的编号，根据所述待转换汉字对应的拼音的编号从所述拼音-编号对照表获得所述待转换汉字对应的拼音，从而将所述句子中的每个汉字转换为拼音。Read the Chinese characters to be converted in the sentence one by one, determine the Unicode value of each Chinese character to be converted, obtain the number of the pinyin corresponding to the Chinese character to be converted from the Unicode value-pinyin number comparison table according to its Unicode value, and obtain the pinyin corresponding to the Chinese character to be converted from the pinyin-number comparison table according to that number, thereby converting each Chinese character in the sentence into pinyin.
  12. 如权利要求9所述的计算机装置，其特征在于，所述处理器执行所述存储器中存储的计算机可读指令以实现所述从所述通用自然语言数据集的拼音-句子对中选择多个拼音-句子对时，包括：The computer device of claim 9, wherein when the processor executes the computer-readable instructions stored in the memory to implement the selection of multiple pinyin-sentence pairs from the pinyin-sentence pairs of the universal natural language data set, the steps include:
    从所述通用自然语言数据集的拼音-句子对中随机选择所述多个拼音-句子对;和/或Randomly selecting the plurality of pinyin-sentence pairs from the pinyin-sentence pairs of the universal natural language data set; and/or
    按照预设比例从所述通用自然语言数据集的拼音-句子对中选择所述多个拼音-句子对。The multiple pinyin-sentence pairs are selected from the pinyin-sentence pairs in the universal natural language data set according to a preset ratio.
  13. 如权利要求9所述的计算机装置,其特征在于,所述处理器执行所述存储器中存储的计算机可读指令以实现所述对所述预训练后的神经网络模型进行微调时,包括:9. The computer device according to claim 9, wherein when the processor executes the computer-readable instructions stored in the memory to implement the fine-tuning of the pre-trained neural network model, it comprises:
    固定所述神经网络模型的前面几层神经元的权值,微调所述神经网络模型的后面几层神经元的权值。Fix the weights of the neurons in the first few layers of the neural network model, and fine-tune the weights of the neurons in the next few layers of the neural network model.
  14. 如权利要求9-13中任一项所述的计算机装置,其特征在于,所述处理器执行所述存储器中存储的计算机可读指令还用以实现以下步骤:The computer device according to any one of claims 9-13, wherein the processor executing the computer-readable instructions stored in the memory is further used to implement the following steps:
    对输入的语音进行识别,得到所述待纠错句子。Recognizing the input voice to obtain the sentence to be corrected.
  15. 一种非易失性可读存储介质,所述非易失性可读存储介质上存储有计算机可读指令,其特征在于,所述计算机可读指令被处理器执行时实现以下步骤:A non-volatile readable storage medium having computer readable instructions stored on the non-volatile readable storage medium, characterized in that, when the computer readable instructions are executed by a processor, the following steps are implemented:
    获取通用自然语言数据集,所述通用自然语言数据集包含多个句子;Acquiring a universal natural language data set, the universal natural language data set containing a plurality of sentences;
    将所述通用自然语言数据集包含的每个句子转换为拼音序列,得到所述通用自然语言数据集的拼音-句子对;Converting each sentence contained in the universal natural language data set into a pinyin sequence to obtain a pinyin-sentence pair of the universal natural language data set;
    从所述通用自然语言数据集的拼音-句子对中选择多个拼音-句子对，将选择的每个拼音-句子对的部分拼音替换为相近拼音，得到替换后的拼音-句子对，将所述通用自然语言数据集的未选择的拼音-句子对和所述替换后的拼音-句子对组成第一样本集；Select a plurality of pinyin-sentence pairs from the pinyin-sentence pairs of the universal natural language data set, replace part of the pinyin of each selected pinyin-sentence pair with similar pinyin to obtain replaced pinyin-sentence pairs, and combine the unselected pinyin-sentence pairs of the universal natural language data set with the replaced pinyin-sentence pairs to form a first sample set;
    利用所述第一样本集对神经网络模型进行预训练,得到预训练后的神经网络模型;Pre-training the neural network model by using the first sample set to obtain a pre-trained neural network model;
    获取多个与特定领域相关的包含相近拼音的拼音-句子对作为第二样本集;Acquire multiple pinyin-sentence pairs with similar pinyin related to a specific field as the second sample set;
    利用所述第二样本集对所述预训练后的神经网络模型进行微调,得到微调后的神经网络模型;Using the second sample set to fine-tune the pre-trained neural network model to obtain a fine-tuned neural network model;
    将待纠错句子的拼音序列输入所述微调后的神经网络模型进行纠错,得到纠错后的句子。The pinyin sequence of the sentence to be corrected is input into the fine-tuned neural network model for error correction, and the corrected sentence is obtained.
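The sample-construction step above — replacing part of a pinyin sequence with similar pinyin while keeping the original sentence as the target — can be sketched as follows. This is a minimal illustration: the confusion table, the `replace_ratio` parameter, and the function names are assumptions for the sketch, not taken from the claims.

```python
import random

# Hypothetical confusion table mapping a pinyin syllable to acoustically
# similar syllables (e.g. flat vs. retroflex initials, front vs. back nasals).
SIMILAR_PINYIN = {
    "zhi": ["zi"], "zi": ["zhi"],
    "shi": ["si"], "si": ["shi"],
    "lin": ["ling"], "ling": ["lin"],
}

def corrupt_pair(pinyin_seq, sentence, replace_ratio=0.3, rng=None):
    """Replace a fraction of the syllables with similar-sounding ones;
    the unchanged sentence remains the training target."""
    rng = rng or random.Random(0)
    noisy = []
    for syllable in pinyin_seq:
        if syllable in SIMILAR_PINYIN and rng.random() < replace_ratio:
            noisy.append(rng.choice(SIMILAR_PINYIN[syllable]))
        else:
            noisy.append(syllable)
    return noisy, sentence
```

Pairs corrupted this way, together with the untouched pairs, would make up the first sample set used for pre-training.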
  16. 如权利要求15所述的存储介质，其特征在于，所述计算机可读指令被所述处理器执行以实现所述将所述通用自然语言数据集包含的每个句子转换为拼音序列时，包括：The storage medium according to claim 15, wherein when the computer-readable instructions are executed by the processor to implement the conversion of each sentence contained in the universal natural language data set into a pinyin sequence, the implementation comprises:
    根据汉字的ASCII码将所述句子转换为拼音序列;或Convert the sentence into a Pinyin sequence according to the ASCII code of the Chinese character; or
    根据汉字的Unicode值将所述句子转换为拼音序列。The sentence is converted into a Pinyin sequence according to the Unicode value of the Chinese character.
  17. 如权利要求16所述的存储介质，其特征在于，所述计算机可读指令被所述处理器执行以实现所述根据汉字的Unicode值将所述句子转换为拼音序列时，包括：The storage medium according to claim 16, wherein when the computer-readable instructions are executed by the processor to implement the conversion of the sentence into a pinyin sequence according to the Unicode value of the Chinese character, the implementation comprises:
    建立拼音-编号对照表,对所有拼音进行编号并将所有拼音对应的编号添加到所述拼音-编号对照表中;Establish a pinyin-number comparison table, number all the pinyins and add the corresponding numbers of all the pinyins to the pinyin-number comparison table;
    建立Unicode值-拼音编号对照表,将汉字对应拼音的编号按照汉字的Unicode值添加到所述Unicode值-拼音编号对照表中;Establish a Unicode value-Pinyin number comparison table, and add the number of the Chinese character corresponding to the pinyin to the Unicode value-Pinyin number comparison table according to the Unicode value of the Chinese character;
    逐一读取所述句子中的待转换汉字，确定所述待转换汉字的Unicode值，根据所述待转换汉字的Unicode值从所述Unicode值-拼音编号对照表中获取所述待转换汉字对应的拼音的编号，根据所述待转换汉字对应的拼音的编号从所述拼音-编号对照表获得所述待转换汉字对应的拼音，从而将所述句子中的每个汉字转换为拼音。Read the Chinese characters to be converted in the sentence one by one, determine the Unicode value of each character to be converted, obtain the number of the pinyin corresponding to that character from the Unicode value-pinyin number comparison table according to its Unicode value, and obtain the corresponding pinyin from the pinyin-number comparison table according to that number, thereby converting each Chinese character in the sentence into pinyin.
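The two-table lookup described in the claim above can be sketched as follows. The three example characters and the table contents are illustrative assumptions; a real comparison table would cover the full CJK range.

```python
# Pinyin-number comparison table: every pinyin is assigned a number.
PINYIN_BY_NUMBER = {0: "ni", 1: "hao", 2: "ma"}

# Unicode value-pinyin number comparison table: a Chinese character's
# Unicode code point maps to the number of its pinyin.
PINYIN_NUMBER_BY_UNICODE = {
    ord("你"): 0,
    ord("好"): 1,
    ord("吗"): 2,
}

def sentence_to_pinyin(sentence):
    """Convert each Chinese character in the sentence to pinyin via its
    Unicode value; characters missing from the table pass through as-is."""
    pinyin_seq = []
    for ch in sentence:
        number = PINYIN_NUMBER_BY_UNICODE.get(ord(ch))
        pinyin_seq.append(PINYIN_BY_NUMBER[number] if number is not None else ch)
    return pinyin_seq
```

For example, `sentence_to_pinyin("你好吗")` would yield the sequence `["ni", "hao", "ma"]`.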
  18. 如权利要求15所述的存储介质，其特征在于，所述计算机可读指令被所述处理器执行以实现所述从所述通用自然语言数据集的拼音-句子对中选择多个拼音-句子对时，包括：The storage medium according to claim 15, wherein when the computer-readable instructions are executed by the processor to implement the selection of a plurality of pinyin-sentence pairs from the pinyin-sentence pairs of the universal natural language data set, the implementation comprises:
    从所述通用自然语言数据集的拼音-句子对中随机选择所述多个拼音-句子对;和/或Randomly selecting the plurality of pinyin-sentence pairs from the pinyin-sentence pairs of the universal natural language data set; and/or
    按照预设比例从所述通用自然语言数据集的拼音-句子对中选择所述多个拼音-句子对。The multiple pinyin-sentence pairs are selected from the pinyin-sentence pairs in the universal natural language data set according to a preset ratio.
  19. 如权利要求15所述的存储介质，其特征在于，所述计算机可读指令被所述处理器执行以实现所述对所述预训练后的神经网络模型进行微调时，包括：The storage medium according to claim 15, wherein when the computer-readable instructions are executed by the processor to implement the fine-tuning of the pre-trained neural network model, the implementation comprises:
    固定所述神经网络模型的前面几层神经元的权值,微调所述神经网络模型的后面几层神经元的权值。Fix the weights of the neurons in the first few layers of the neural network model, and fine-tune the weights of the neurons in the next few layers of the neural network model.
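The fine-tuning scheme above — fixing the weights of the front layers while updating only the back layers — can be illustrated framework-free as follows. The layer structure, gradient values, and learning rate are hypothetical stand-ins, not part of the claims.

```python
class Layer:
    """Stand-in for one network layer: a scalar weight plus a flag
    controlling whether fine-tuning may update it."""
    def __init__(self, weight):
        self.weight = weight
        self.trainable = True

def fine_tune_step(layers, grads, lr=0.1):
    """One SGD update that skips layers whose weights are fixed."""
    for layer, grad in zip(layers, grads):
        if layer.trainable:
            layer.weight -= lr * grad

layers = [Layer(1.0) for _ in range(4)]
for layer in layers[:2]:          # fix the weights of the front layers
    layer.trainable = False
fine_tune_step(layers, grads=[0.5, 0.5, 0.5, 0.5])
```

After the step, the front layers keep their original weights while the back layers have been updated, which is the essence of fine-tuning only the later layers on domain-specific data.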
  20. 如权利要求15-18中任一项所述的存储介质，其特征在于，所述计算机可读指令被所述处理器执行还用以实现以下步骤：The storage medium according to any one of claims 15-18, wherein the computer-readable instructions, when executed by the processor, are further used to implement the following steps:
    对输入的语音进行识别,得到所述待纠错句子。Recognizing the input voice to obtain the sentence to be corrected.
PCT/CN2019/117237 2019-03-15 2019-11-11 Error word correction method and device, computer device, and storage medium WO2020186778A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910199221.9 2019-03-15
CN201910199221.9A CN110110041B (en) 2019-03-15 2019-03-15 Wrong word correcting method, wrong word correcting device, computer device and storage medium

Publications (1)

Publication Number Publication Date
WO2020186778A1 true WO2020186778A1 (en) 2020-09-24

Family

ID=67484339

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/117237 WO2020186778A1 (en) 2019-03-15 2019-11-11 Error word correction method and device, computer device, and storage medium

Country Status (2)

Country Link
CN (1) CN110110041B (en)
WO (1) WO2020186778A1 (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112509581A (en) * 2020-11-20 2021-03-16 北京有竹居网络技术有限公司 Method and device for correcting text after speech recognition, readable medium and electronic equipment
CN112528637A (en) * 2020-12-11 2021-03-19 平安科技(深圳)有限公司 Text processing model training method and device, computer equipment and storage medium
CN112580324A (en) * 2020-12-24 2021-03-30 北京百度网讯科技有限公司 Text error correction method and device, electronic equipment and storage medium
CN113012701A (en) * 2021-03-16 2021-06-22 联想(北京)有限公司 Identification method, identification device, electronic equipment and storage medium
CN113159168A (en) * 2021-04-19 2021-07-23 清华大学 Pre-training model accelerated reasoning method and system based on redundant word deletion
CN113192497A (en) * 2021-04-28 2021-07-30 平安科技(深圳)有限公司 Speech recognition method, apparatus, device and medium based on natural language processing
CN113284499A (en) * 2021-05-24 2021-08-20 湖北亿咖通科技有限公司 Voice instruction recognition method and electronic equipment
CN113380225A (en) * 2021-06-18 2021-09-10 广州虎牙科技有限公司 Language model training method, speech recognition method and related device
CN113449514A (en) * 2021-06-21 2021-09-28 浙江康旭科技有限公司 Text error correction method and device suitable for specific vertical field
EP4027337A1 (en) * 2021-04-12 2022-07-13 Apollo Intelligent Connectivity (Beijing) Technology Co., Ltd. Speech recognition method and apparatus, electronic device and storage medium

Families Citing this family (19)

Publication number Priority date Publication date Assignee Title
CN110110041B (en) * 2019-03-15 2022-02-15 平安科技(深圳)有限公司 Wrong word correcting method, wrong word correcting device, computer device and storage medium
CN110705262B (en) * 2019-09-06 2023-08-29 宁波市科技园区明天医网科技有限公司 Improved intelligent error correction method applied to medical technology inspection report
CN110705217B (en) * 2019-09-09 2023-07-21 上海斑马来拉物流科技有限公司 Wrongly written or mispronounced word detection method and device, computer storage medium and electronic equipment
CN112786014A (en) * 2019-10-23 2021-05-11 北京京东振世信息技术有限公司 Method and device for identifying data
CN110956959B (en) * 2019-11-25 2023-07-25 科大讯飞股份有限公司 Speech recognition error correction method, related device and readable storage medium
CN112988955B (en) * 2019-12-02 2024-03-15 卢文祥 Multilingual voice recognition and topic semantic analysis method and device
CN110909535B (en) * 2019-12-06 2023-04-07 北京百分点科技集团股份有限公司 Named entity checking method and device, readable storage medium and electronic equipment
CN111414772B (en) * 2020-03-12 2023-09-26 北京小米松果电子有限公司 Machine translation method, device and medium
CN113807080A (en) * 2020-06-15 2021-12-17 科沃斯商用机器人有限公司 Text correction method, text correction device and storage medium
CN111783471A (en) * 2020-06-29 2020-10-16 中国平安财产保险股份有限公司 Semantic recognition method, device, equipment and storage medium of natural language
CN112686036B (en) * 2020-08-18 2022-04-01 平安国际智慧城市科技股份有限公司 Risk text recognition method and device, computer equipment and storage medium
CN112164403A (en) * 2020-09-27 2021-01-01 江苏四象软件有限公司 Natural language processing system based on artificial intelligence
CN111931490B (en) * 2020-09-27 2021-01-08 平安科技(深圳)有限公司 Text error correction method, device and storage medium
CN112116907A (en) * 2020-10-22 2020-12-22 浙江同花顺智能科技有限公司 Speech recognition model establishing method, speech recognition device, speech recognition equipment and medium
CN112329447B (en) * 2020-10-29 2024-03-26 语联网(武汉)信息技术有限公司 Training method of Chinese error correction model, chinese error correction method and device
CN112037755B (en) * 2020-11-03 2021-02-02 北京淇瑀信息科技有限公司 Voice synthesis method and device based on timbre clone and electronic equipment
CN113449090A (en) * 2021-06-23 2021-09-28 山东新一代信息产业技术研究院有限公司 Error correction method, device and medium for intelligent question answering
CN114861635B (en) * 2022-05-10 2023-04-07 广东外语外贸大学 Chinese spelling error correction method, device, equipment and storage medium
CN115437511B (en) * 2022-11-07 2023-02-21 北京澜舟科技有限公司 Pinyin Chinese character conversion method, conversion model training method and storage medium

Citations (5)

Publication number Priority date Publication date Assignee Title
CN105869634A (en) * 2016-03-31 2016-08-17 重庆大学 Field-based method and system for feeding back text error correction after speech recognition
CN107357775A (en) * 2017-06-05 2017-11-17 百度在线网络技术(北京)有限公司 The text error correction method and device of Recognition with Recurrent Neural Network based on artificial intelligence
CN108021554A (en) * 2017-11-14 2018-05-11 无锡小天鹅股份有限公司 Audio recognition method, device and washing machine
CN108874174A (en) * 2018-05-29 2018-11-23 腾讯科技(深圳)有限公司 A kind of text error correction method, device and relevant device
CN110110041A (en) * 2019-03-15 2019-08-09 平安科技(深圳)有限公司 Wrong word correcting method, device, computer installation and storage medium

Family Cites Families (5)

Publication number Priority date Publication date Assignee Title
US9262397B2 (en) * 2010-10-08 2016-02-16 Microsoft Technology Licensing, Llc General purpose correction of grammatical and word usage errors
US9396723B2 (en) * 2013-02-01 2016-07-19 Tencent Technology (Shenzhen) Company Limited Method and device for acoustic language model training
CN103971677B (en) * 2013-02-01 2015-08-12 腾讯科技(深圳)有限公司 A kind of acoustics language model training method and device
CN103235789B (en) * 2013-03-29 2016-08-10 惠州市德赛西威汽车电子股份有限公司 A kind of Chinese character is converted to the method for spelling and initial
CN108091328B (en) * 2017-11-20 2021-04-16 北京百度网讯科技有限公司 Speech recognition error correction method and device based on artificial intelligence and readable medium


Cited By (17)

Publication number Priority date Publication date Assignee Title
CN112509581A (en) * 2020-11-20 2021-03-16 北京有竹居网络技术有限公司 Method and device for correcting text after speech recognition, readable medium and electronic equipment
CN112509581B (en) * 2020-11-20 2024-03-01 北京有竹居网络技术有限公司 Error correction method and device for text after voice recognition, readable medium and electronic equipment
CN112528637A (en) * 2020-12-11 2021-03-19 平安科技(深圳)有限公司 Text processing model training method and device, computer equipment and storage medium
CN112528637B (en) * 2020-12-11 2024-03-29 平安科技(深圳)有限公司 Text processing model training method, device, computer equipment and storage medium
CN112580324A (en) * 2020-12-24 2021-03-30 北京百度网讯科技有限公司 Text error correction method and device, electronic equipment and storage medium
CN112580324B (en) * 2020-12-24 2023-07-25 北京百度网讯科技有限公司 Text error correction method, device, electronic equipment and storage medium
CN113012701A (en) * 2021-03-16 2021-06-22 联想(北京)有限公司 Identification method, identification device, electronic equipment and storage medium
CN113012701B (en) * 2021-03-16 2024-03-22 联想(北京)有限公司 Identification method, identification device, electronic equipment and storage medium
EP4027337A1 (en) * 2021-04-12 2022-07-13 Apollo Intelligent Connectivity (Beijing) Technology Co., Ltd. Speech recognition method and apparatus, electronic device and storage medium
CN113159168B (en) * 2021-04-19 2022-09-02 清华大学 Pre-training model accelerated reasoning method and system based on redundant word deletion
CN113159168A (en) * 2021-04-19 2021-07-23 清华大学 Pre-training model accelerated reasoning method and system based on redundant word deletion
CN113192497A (en) * 2021-04-28 2021-07-30 平安科技(深圳)有限公司 Speech recognition method, apparatus, device and medium based on natural language processing
CN113192497B (en) * 2021-04-28 2024-03-01 平安科技(深圳)有限公司 Speech recognition method, device, equipment and medium based on natural language processing
CN113284499A (en) * 2021-05-24 2021-08-20 湖北亿咖通科技有限公司 Voice instruction recognition method and electronic equipment
CN113380225A (en) * 2021-06-18 2021-09-10 广州虎牙科技有限公司 Language model training method, speech recognition method and related device
CN113449514A (en) * 2021-06-21 2021-09-28 浙江康旭科技有限公司 Text error correction method and device suitable for specific vertical field
CN113449514B (en) * 2021-06-21 2023-10-31 浙江康旭科技有限公司 Text error correction method and device suitable for vertical field

Also Published As

Publication number Publication date
CN110110041B (en) 2022-02-15
CN110110041A (en) 2019-08-09

Similar Documents

Publication Publication Date Title
WO2020186778A1 (en) Error word correction method and device, computer device, and storage medium
CN107220235B (en) Speech recognition error correction method and device based on artificial intelligence and storage medium
KR102577514B1 (en) Method, apparatus for text generation, device and storage medium
CN108847241B (en) Method for recognizing conference voice as text, electronic device and storage medium
CN110956018B (en) Training method of text processing model, text processing method, text processing device and storage medium
CN110795552B (en) Training sample generation method and device, electronic equipment and storage medium
JP2021089705A (en) Method and device for evaluating translation quality
WO2022121251A1 (en) Method and apparatus for training text processing model, computer device and storage medium
JP7266683B2 (en) Information verification method, apparatus, device, computer storage medium, and computer program based on voice interaction
CN110211562B (en) Voice synthesis method, electronic equipment and readable storage medium
WO2021218028A1 (en) Artificial intelligence-based interview content refining method, apparatus and device, and medium
WO2021174922A1 (en) Statement sentiment classification method and related device
CN105404621A (en) Method and system for blind people to read Chinese character
WO2023201975A1 (en) Difference description sentence generation method and apparatus, and device and medium
CN111930914A (en) Question generation method and device, electronic equipment and computer-readable storage medium
CN112463942A (en) Text processing method and device, electronic equipment and computer readable storage medium
CN113450758B (en) Speech synthesis method, apparatus, device and medium
CN110516125B (en) Method, device and equipment for identifying abnormal character string and readable storage medium
CN112052329A (en) Text abstract generation method and device, computer equipment and readable storage medium
CN116909435A (en) Data processing method and device, electronic equipment and storage medium
CN115132182B (en) Data identification method, device, equipment and readable storage medium
Rajendran et al. A robust syllable centric pronunciation model for Tamil text to speech synthesizer
WO2007105615A1 (en) Request content identification system, request content identification method using natural language, and program
CN111090720B (en) Hot word adding method and device
US11651256B1 (en) Method for training a natural language processing model

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19919734

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19919734

Country of ref document: EP

Kind code of ref document: A1