CN115545013A - Sound-like error correction method and device for conversation scene - Google Patents

Sound-like error correction method and device for conversation scene

Info

Publication number
CN115545013A
CN115545013A (application CN202211196704.1A)
Authority
CN
China
Prior art keywords
error correction
sound
language model
characters
rule
Prior art date
Legal status (assumed by Google; not a legal conclusion)
Pending
Application number
CN202211196704.1A
Other languages
Chinese (zh)
Inventor
张洪健
刘大全
谭瑞
Current Assignee (listed assignees may be inaccurate)
Chongqing Changan Automobile Co Ltd
Original Assignee
Chongqing Changan Automobile Co Ltd
Priority date (assumed; not a legal conclusion)
Filing date
Publication date
Application filed by Chongqing Changan Automobile Co Ltd filed Critical Chongqing Changan Automobile Co Ltd
Priority to CN202211196704.1A priority Critical patent/CN115545013A/en
Publication of CN115545013A publication Critical patent/CN115545013A/en
Pending legal-status Critical Current

Classifications

    • G06F40/232: Orthographic correction, e.g. spell checking or vowelisation
    • G06F16/3329: Natural language query formulation or dialogue systems
    • G06F16/3332: Query translation
    • G06F16/353: Clustering; Classification into predefined classes
    • G06F40/242: Dictionaries
    • G06F40/30: Semantic analysis
    • G10L15/063: Training of speech recognition systems
    • G10L15/1822: Parsing for meaning understanding
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/26: Speech to text systems
    • G10L2015/0631: Creating reference templates; Clustering
    • G10L2015/0633: Creating reference templates using lexical or orthographic knowledge sources
    • G10L2015/0638: Interactive procedures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a sound-like error correction method and device for dialogue scenes, wherein the method comprises the following steps: constructing rule data from a plurality of entities in the in-vehicle dialogue domain based on a preset semantic protocol; performing generalization enhancement on the rule data with a pre-trained generalization model to generate a generalization corpus; collecting confusable words and near-sound characters in the in-vehicle dialogue domain, randomly substituting them into the generalization corpus at a preset ratio to construct error correction data and generate an error correction corpus; modifying the masking rule of the language model so that sound-alike characters, rather than random characters, are used for masking, and training the language model; and adding near-sound noise to train the language model based on a pre-constructed pinyin rule table and the error correction corpus, obtaining a language model biased toward sound-alike text for sound-like error correction in dialogue scenes. Through data enhancement, the method and device train a language model whose distribution is biased toward sound-alike text, correct sound-like errors in dialogue scenes, improve the accuracy of error detection and correction, and improve the interactive experience of the vehicle.

Description

Sound-like error correction method and device for conversation scene
Technical Field
The present application relates to the field of language processing technologies, and in particular, to a sound-like error correction method and apparatus for dialogue scenes.
Background
In the related art, text input by the user is first fed into a BERT error correction model for text-level correction; the once-corrected text is then fed into a pinyin error correction model for a second round of correction; finally, the twice-corrected text is passed to a hot-word replacement rule model, which converts spoken forms such as dialect words into proper nouns, further strengthening the correction effect.
However, because the related art performs only semantic learning, the pinyin model must maintain a large number of homophones, which raises the cost of hot-word construction; moreover, the pinyin features are single-dimensional, which weakens the error correction capability and fails to meet users' needs, so a solution is urgently required.
Disclosure of Invention
The application provides a sound-like error correction method and device for dialogue scenes, aiming to solve the technical problems of the related art: only semantic learning is performed, a large number of homophonic hot words must be maintained in the pinyin model, the cost of hot-word construction increases, the pinyin features are single-dimensional, the error correction capability is reduced, and the use requirements of users cannot be met.
An embodiment of the first aspect of the present application provides a sound-like error correction method for dialogue scenes, comprising the following steps: constructing rule data from a plurality of entities in the in-vehicle dialogue domain based on a preset semantic protocol; performing generalization enhancement on the rule data based on a pre-trained generalization model to generate a generalization corpus; collecting confusable words and near-sound characters in the in-vehicle dialogue domain, randomly substituting them into the generalization corpus at a preset ratio, constructing error correction data, and generating an error correction corpus; modifying the masking rule of a language model, using sound-alike characters rather than random characters for masking, and training the language model; and adding near-sound noise to train the language model based on a pre-constructed pinyin rule table and the error correction corpus, obtaining a language model biased toward sound-alike text for sound-like error correction in dialogue scenes.
Through this technical means, a language model whose distribution is biased toward sound-alike text can be trained by data enhancement, sound-like errors can be corrected in dialogue scenes, the accuracy of error detection and correction is effectively improved, the interactive experience of the vehicle is improved, and users' needs are effectively met.
Optionally, in an embodiment of the present application, the constructing rule data according to a plurality of entities in the car-machine conversation field includes: and randomly filling the entities into word slots of the preset semantic protocol according to different intentions in the preset semantic protocol.
According to the technical means, the embodiment of the application can improve the accuracy of error correction in the specific scene and improve the user experience.
Optionally, in an embodiment of the present application, modifying the mask rule in the MLM (Masked Language Model) module, masking with sound-alike characters instead of random characters, and training the MLM module includes: modifying the replacement rule for the random-character share, and replacing original words with near-sound words and confusable words to construct the mask rule.
According to the technical means, the embodiment of the application adds sound-like error correction capability to BERT by adjusting the replacement ratio and substituting sound-alike character replacement for random replacement.
Optionally, in an embodiment of the application, training the language model with added near-sound noise based on the pre-constructed pinyin rule table and the error correction corpus, to obtain a language model biased toward sound-alike text, includes: constructing multi-dimensional features of the corrected characters of the error correction corpus, and building a sound-like classification machine learning model to classify corrections.
According to the technical means, the machine learning classification model can be constructed, and the accuracy of error correction and detection is improved.
An embodiment of the second aspect of the present application provides a sound-like error correction apparatus for dialogue scenes, comprising: a construction module, configured to construct rule data from a plurality of entities in the in-vehicle dialogue domain based on a preset semantic protocol; a generation module, configured to perform generalization enhancement on the rule data based on a pre-trained generalization model to generate a generalization corpus; a processing module, configured to collect confusable words and near-sound characters in the in-vehicle dialogue domain, randomly substitute them into the generalization corpus at a preset ratio, construct error correction data, and generate an error correction corpus; a modification module, configured to modify the masking rule of the language model, use sound-alike characters rather than random characters for masking, and train the language model; and an error correction module, configured to add near-sound noise to train the language model based on the pre-constructed pinyin rule table and the error correction corpus, obtaining a language model biased toward sound-alike text for sound-like error correction in dialogue scenes.
Optionally, in an embodiment of the present application, the constructing module is further configured to randomly fill the plurality of entities into word slots of the preset semantic protocol according to different intents in the preset semantic protocol.
Optionally, in an embodiment of the present application, the modifying module is further configured to modify a replacement rule of other random characters, and replace an original word with a near word and a confusing word to construct the mask rule.
Optionally, in an embodiment of the present application, the error correction module is further configured to construct a multidimensional feature of the error correction characters of the error correction corpus, and construct a sound-like classification machine learning model to perform error correction classification.
An embodiment of a third aspect of the present application provides a vehicle, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor executing the program to implement the sound-like error correction method for dialogue scenes described in the above embodiments.
An embodiment of a fourth aspect of the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above sound-like error correction method for dialogue scenes.
The beneficial effects of this application:
(1) The embodiment of the application can improve the accuracy of error correction combined with a specific scene and improve the user experience.
(2) The embodiment of the application can add sound-like error correction capability to BERT by adjusting the replacement ratio and substituting sound-alike character replacement for random replacement.
(3) The embodiment of the application can train a language model whose distribution is biased toward sound-alike text through data enhancement, correct sound-like errors in dialogue scenes, effectively improve the accuracy of error detection and correction, improve the interactive experience of the vehicle, and effectively meet users' needs.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
The above and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a flowchart of a sound-like error correction method for dialogue scenes according to an embodiment of the present application;
FIG. 2 is a schematic illustration of a semantic protocol description of an embodiment of the present application;
FIG. 3 is a schematic diagram of Mask replacement according to an embodiment of the present application;
FIG. 4 is a diagram of a Pinyin spelling Table according to an embodiment of the present application;
FIG. 5 is a flow chart of a data construction phase of an embodiment of the present application;
FIG. 6 is a flowchart of the Bert language model construction process according to an embodiment of the present application;
FIG. 7 is a flowchart illustrating a Pinyin classification model according to an embodiment of the present application;
FIG. 8 is a flowchart illustrating an overall method for audio error correction for dialog scenes in accordance with an exemplary embodiment of the present application;
FIG. 9 is a schematic structural diagram of a sound-like error correction apparatus for dialogue scenes according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of a vehicle according to an embodiment of the present application.
Wherein, 10-sound-like error correction means for dialogue scenes; 100-a construction module, 200-a generation module, 300-a processing module, 400-a modification module and 500-an error correction module; 1001-memory, 1002-processor and 1003-communication interface.
Detailed Description
Reference will now be made in detail to the embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the drawings are exemplary and intended to be used for explaining the present application and should not be construed as limiting the present application.
The following describes the sound-like error correction method and apparatus for dialogue scenes according to embodiments of the present application with reference to the drawings. Aiming at the problems mentioned in the Background section, namely that the related art performs only semantic learning, so that a large number of homophonic hot words must be maintained in the pinyin model, the cost of hot-word construction increases, the pinyin features are single-dimensional, the error correction capability is reduced, and the use requirements of users cannot be met, the application provides the sound-like error correction method for dialogue scenes, thereby solving these technical problems.
Specifically, fig. 1 is a schematic flowchart of a sound-like error correction method for a dialog scene according to an embodiment of the present application.
As shown in fig. 1, the sound-like error correction method for dialogue scenes includes the following steps:
in step S101, rule data is constructed according to a plurality of entities in the in-vehicle machine conversation field based on a preset semantic protocol.
It can be understood that, in the embodiment of the present application, rule data may be constructed from a plurality of entities in the in-vehicle dialogue domain based on a preset semantic protocol, thereby ensuring comprehensive and accurate collection of the important entities in each domain and improving the practicability of sound-like error correction for dialogue scenes.
For example, in the in-vehicle dialogue setting, the main domains may include navigation, music, vehicle control, entertainment, and the like. The navigation domain may include entities such as place names and institution names; the music domain may include entities such as song titles, album titles, and singer names; the vehicle control domain may include entities such as automobile parts and common accessories; the entertainment domain is broad, and entertainment-related entities such as movie titles may be collected. A dialogue set can thus be constructed by rule through the semantic protocols of the respective domains.
It should be noted that the preset semantic protocol is set by a person skilled in the art according to actual situations, and is not limited in particular here.
Wherein, in an embodiment of the present application, constructing rule data according to a plurality of entities in the field of car-machine conversation includes: and randomly filling a plurality of entities into word slots of the preset semantic protocol according to different intentions in the preset semantic protocol.
In an actual execution process, as shown in fig. 2, a semantic protocol in the embodiment of the present application may include defined intents and word slots, and the plurality of entities may be randomly filled into the word slots of the preset semantic protocol according to its different intents to construct the rule data, which improves the accuracy of error correction in the specific scene and improves the user experience.
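As an illustrative sketch of this slot-filling step, the following Python snippet fills collected domain entities into the word slots of intent templates. The protocol, intent names, slots, and entity values here are invented stand-ins for illustration, not the patent's actual protocol.

```python
import random

# Hypothetical semantic protocol: each intent maps to templates with word slots.
SEMANTIC_PROTOCOL = {
    "navigate_to": ["navigate to {place}", "take me to {place}"],
    "play_music": ["play {song} by {singer}", "I want to hear {song}"],
}

# Hypothetical entity sets collected per slot type.
ENTITIES = {
    "place": ["People's Hospital", "Chaotianmen"],
    "song": ["Blue and White Porcelain"],
    "singer": ["Jay Chou"],
}

def build_rule_data(n_samples, seed=0):
    """Randomly fill domain entities into the word slots of each intent."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        intent = rng.choice(list(SEMANTIC_PROTOCOL))
        template = rng.choice(SEMANTIC_PROTOCOL[intent])
        filled = template
        # Fill every slot present in the template with a random entity.
        for slot, values in ENTITIES.items():
            if "{" + slot + "}" in filled:
                filled = filled.replace("{" + slot + "}", rng.choice(values))
        samples.append((intent, filled))
    return samples
```

Each `(intent, utterance)` pair then serves as one line of rule data for the later generalization and corruption steps.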
In step S102, the rule data is generalized and enhanced based on a previously trained generalization model, and a generalized corpus is generated.
It can be understood that, in the embodiment of the present application, the rule data may be generalized and enhanced by a pre-trained generalization model to generate a generalization corpus that reads more naturally. For example, a similar-text generation model may be built following the UNILM (UNIfied pre-trained Language Model) idea. Since the effect of such a model depends on the quality of similar corpora in the vertical domain, a data closed loop can be constructed in combination with an offline database, and similar texts can be retrieved by similarity-matching recall or clustering to build the generalization model's training set. A language model biased toward sound-alike text can then be trained by data enhancement, improving the user experience.
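A minimal sketch of the similar-text recall that could feed such a generalization training set, using simple character-overlap (Jaccard) similarity as an assumed, simplified stand-in for whatever matching or clustering recall is used in practice:

```python
def jaccard(a, b):
    """Character-level Jaccard similarity between two strings."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def recall_similar(seed_text, database, threshold=0.5):
    """Recall offline-database texts whose overlap with the seed utterance
    exceeds the threshold; recalled pairs form generalization training data."""
    return [t for t in database
            if t != seed_text and jaccard(seed_text, t) >= threshold]
```

For example, `recall_similar("play a song", db)` would recall paraphrases like "play one song" while rejecting unrelated texts.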
In step S103, confusion words and near-phonetic characters in the car-mounted dialog field are collected, and the generalized corpora are randomly replaced according to a preset ratio, so as to construct error correction data and generate error correction corpora.
It can be understood that, in the embodiment of the present application, confusing words and near-sound characters in the car-machine conversation field may be collected, for example, specific confusing words and near-sound characters in the vertical field are collected, generalized corpora are randomly replaced according to a certain proportion, error correction data is constructed, and error correction corpora are generated, so that accuracy of error correction and detection is effectively improved, and interaction of a vehicle is improved while intelligence level of the vehicle is improved.
It should be noted that the preset ratio is set by a person skilled in the art according to actual situations, and is not specifically limited herein.
As a possible implementation, for confusable-word replacement, the embodiment of the present application may collect a set of real user speech recognition errors (e.g. mailbox -> oil tank, start -> start, take down -> bluetooth; these are near-homophone pairs in the original Chinese, so some English glosses coincide) and perform replacement at the error distribution ratio of real corpora, e.g. 1-2 wrong words per 50 words. The confusable words may be added to the word-segmentation dictionary during replacement, so that each can be extracted whole and replaced at the segmentation step.
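A hedged sketch of this confusable-word injection; the pair table, the direction of the correct-to-error mapping, and the probability parameter are assumptions for illustration:

```python
import random

# Hypothetical confusion table: intended word -> ASR-style error form.
# The real table would come from collected recognition errors in the domain.
CONFUSION_PAIRS = {
    "oil tank": "mailbox",      # near-homophones in the original Chinese
    "bluetooth": "take down",
}

def inject_confusion_errors(tokens, p_error=0.04, seed=0):
    """Swap each confusable token for its error form with probability
    p_error, roughly mirroring the 1-2 errors per 50 words rate."""
    rng = random.Random(seed)
    out = []
    for tok in tokens:
        if tok in CONFUSION_PAIRS and rng.random() < p_error:
            out.append(CONFUSION_PAIRS[tok])
        else:
            out.append(tok)
    return out
```

The token list is assumed to come from a segmenter whose dictionary includes the confusable words, so each pair is swapped as a whole unit.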
In addition, for near-sound character replacement, the embodiment of the present application may collect the homophones and near-sound characters of every character. (The examples are Chinese characters grouped by pronunciation; in the original, the character glossed 'upper' has homophones glossed 'good' and 'late' and near-homophones glossed 'injury', 'business', and 'shang'.) Near-sound characters may also be found by computing the edit distance between pinyin strings; for example, the pinyin of the 'upper' character is close to 'tang', whose characters are glossed 'lying down', 'hall', 'trip', 'Tang', 'sugar', 'perm', 'chamber', 'soup', 'enamel', and 'pond'. Characters can then be randomly replaced following the error distribution.
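The pinyin edit-distance search described here can be sketched as follows; the character-to-pinyin table in the example is a tiny invented sample:

```python
def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming (one-row DP)."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # delete from a
                                     dp[j - 1] + 1,    # insert into a
                                     prev + (ca != cb))  # substitute
    return dp[-1]

def near_sound_candidates(pinyin, char_pinyin, max_dist=2):
    """Characters whose pinyin lies within max_dist edits of the query,
    e.g. a 'shang' query recalls the 'tang' family of characters."""
    return [ch for ch, py in char_pinyin.items()
            if 0 < edit_distance(pinyin, py) <= max_dist]
```

Here "shang" and "tang" are two edits apart, so the 'tang' characters fall inside a `max_dist=2` search while identical-pinyin homophones (distance 0) are handled by the separate homophone table.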
In step S104, the masking rule of the language model is modified so that sound-alike characters, rather than random characters, are used for masking, and the language model is trained.
It can be understood that, in the embodiment of the present application, the masking rule of a language model may be modified; for example, the mask rule in BERT's MLM module may be changed so that sound-alike characters replace random characters during masking, and the language model (e.g. BERT's MLM module) is then trained, which effectively improves the model's robustness to and recognition of sound-alike characters.
Wherein the initial mask rule of Bert is 15% of the words of the original mask sentence and is based on 8:1: the original word is replaced by mask, other random characters and original words according to the proportion of 1.
Optionally, in an embodiment of the present application, modifying the mask rule in the MLM module and masking with sound-alike characters instead of random characters to train the MLM module includes: modifying the replacement rule for the random-character share, and replacing original words with near-sound words and confusable words to construct the mask rule.
For example, as shown in fig. 3, in the original BERT, the 10% share of the 15% masked words that is replaced with random words constitutes a text error correction task, giving the BERT model a certain correction capability; however, because this replacement is random, the intrinsic characteristics of sound-alike confusable words are ignored. By adjusting the replacement ratio and substituting sound-alike replacement for random replacement, sound-like error correction capability can be added to BERT.
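A hedged sketch of this modified masking rule, assuming the 8:1:1 split above with the random-replacement share redirected to sound-alike substitution; the function and table names are invented:

```python
import random

def soundalike_mask(tokens, near_sound, mask_token="[MASK]",
                    select_rate=0.15, seed=0):
    """Select ~select_rate of positions; fill them 80% with [MASK],
    10% with a near-sound/confusable token, 10% with the original."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() >= select_rate:
            masked.append(tok)
            labels.append(None)        # position not selected for the MLM loss
            continue
        labels.append(tok)             # model must restore this token
        r = rng.random()
        if r < 0.8:
            masked.append(mask_token)                   # 80%: [MASK]
        elif r < 0.9 and tok in near_sound:
            masked.append(rng.choice(near_sound[tok]))  # 10%: sound-alike
        else:
            masked.append(tok)                          # 10%: keep original
    return masked, labels
```

Training then asks the model to predict each non-`None` label from the corrupted sequence, so sound-alike corruptions are seen during pretraining rather than arbitrary random tokens.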
BERT's robustness to sound-alike words derives mainly from the self-attention mechanism.
Three vectors Q, K, and V are generated for each word, all obtained from the word embedding by linear projection, where the attention is computed as follows:

Q = X·W_Q
K = X·W_K
V = X·W_V

Attention(Q, K, V) = softmax(Q·K^T / sqrt(d_k))·V
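A minimal single-head NumPy sketch of the scaled dot-product self-attention computed from Q, K, and V (no batching, no multi-head, no attention masking):

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence of embeddings X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax over the key positions (max-shifted for stability).
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V
```

Each output row is a context-weighted mixture of all value vectors, which is why the representation at a sound-alike position carries its full context.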
When the mask covers a sound-alike word, the output attention vector contains the full context of that word; because a sound-alike word and its correct counterpart tend to appear in the same contexts, the difference between them can be emphasized through the log-softmax cross-entropy, increasing robustness to sound-alike words.
For the output logit z_i, the log-softmax is calculated as follows:

p_i = z_i - log( Σ_{j=1}^{T} exp(z_j) )

The loss is calculated as follows:

loss = - Σ_{i=1}^{T} y_i · p_i

where T is the length of the dictionary vocabulary and y_i is the one-hot encoding of the true character: the position of the real label is 1 and the remaining positions are 0.
In step S105, based on the pre-constructed pinyin rule table and the error correction corpus, near-sound noise is added to train the language model, obtaining a language model biased toward sound-alike text, so as to perform sound-like error correction in dialogue scenes.
It can be understood that, in the embodiment of the present application, near-sound noise may be added to train the language model based on the pre-constructed pinyin rule table and the error correction corpus, yielding a model biased toward sound-alike text for sound-like error correction in dialogue scenes. A language model whose distribution leans toward sound-alike text can thus be trained by data enhancement, and a machine learning classification model can be built on multi-modal features, effectively improving the accuracy of error detection and correction and meeting users' needs.
For example, the pinyin rule table may consist of two parts: the collected pinyin of all characters in the dictionary, and a collected pinyin spelling table. The collected pinyin in the embodiment of the present application may include polyphone pinyin, Mandarin pinyin, dialect pinyin (e.g. Cantonese, Chongqing dialect), and the like; the pinyin spelling table, as shown in fig. 4, further splits each pinyin syllable to construct spelling features from the initials and finals.
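Splitting a pinyin syllable into initial (shengmu) and final (yunmu), as the spelling table does, can be sketched as follows; the initials list is the standard Mandarin set, and treating the whole remainder as the final is an assumption about the table's layout:

```python
# Standard Mandarin initials; two-letter initials must be tried first
# so that "zhang" matches "zh" rather than "z".
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]

def split_pinyin(syllable):
    """Split a pinyin syllable into (initial, final) spelling features."""
    for ini in INITIALS:
        if syllable.startswith(ini):
            return ini, syllable[len(ini):]
    return "", syllable  # zero-initial syllables such as "an", "er"
```

These (initial, final) pairs can then serve as per-character spelling features alongside the whole-syllable pinyin.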
Optionally, in an embodiment of the present application, adding near-sound noise to train the language model based on the pre-constructed pinyin rule table and the error correction corpus, to obtain a language model biased toward sound-alike text, includes: constructing multi-dimensional features of the corrected characters of the error correction corpus, and building a sound-like classification machine learning model to classify corrections.
In some embodiments, a language model biased toward sound-alike text can be obtained by adding near-sound noise when training BERT, which moves homophonic characters toward the front of the output character-classification confidences. To prevent false detections, a classifier may be added to judge whether detection and correction are warranted: the language model predicts the top-n confidence for each character, and features such as the pinyin edit-distance change, n-gram score change, word-frequency change, pinyin word-frequency change, the predicted character's embedding, the original character's embedding, and the initial and final are combined to build a machine learning classification model, improving the accuracy of error detection and correction.
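A hypothetical sketch of assembling such a feature vector; the feature names follow the list above, but their exact computation, scaling, and ordering are assumptions:

```python
def edit_distance(a, b):
    """Levenshtein distance between two pinyin strings (one-row DP)."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

def build_features(orig_py, pred_py, ngram_delta, freq_delta,
                   py_freq_delta, top_conf):
    """Assemble one candidate correction's feature vector."""
    return [
        edit_distance(orig_py, pred_py),  # pinyin edit-distance change
        ngram_delta,                      # n-gram score change
        freq_delta,                       # word-frequency change
        py_freq_delta,                    # pinyin word-frequency change
        top_conf,                         # language-model top-n confidence
    ]
```

In practice the embedding features would be appended as well, and the vectors could feed any standard classifier (e.g. gradient boosting); the text does not fix a specific model.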
In addition, the embodiment of the present application can distinguish 3 types of situations: a. false detection; b. correct detection with wrong correction; c. correct detection with correct correction.
For example, the embodiment of the application can input a test text into the trained BERT network. In general, the correct character appears among the top0-2 candidates. When a character identical to the input character appears in top0-2, the position is treated as needing no detection (it may still be wrong, but the error rate is low, so such cases are not processed). When no character identical to the input character appears in top0-2, the position is treated as containing an error; comparing against the real error correction data, if the position in fact has no error, the detection is marked wrong, and if the position does have an error, the detection is marked valid. In this way, labels for whether detection is correct are constructed.
Based on the detection-correct label, whether the correct character lies in top0-2 is then checked: when it does, the sample is labeled as correction correct (the topn value is recorded); when no correct character appears in top0-2, the sample is labeled as correction wrong. The remaining label is detection wrong.
Thus, 3 classes of data labels are constructed: a. detection wrong; b. detection correct, correction wrong; c. detection correct, correction correct. Classification scoring prioritizes precision, and the threshold is adjusted gradually so that recall improves while precision drops only slightly, allowing error detection and correction to be judged with high accuracy.
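The top0-2 check and the three-class labeling above can be sketched as follows. The function and label names are illustrative, not from the source.

```python
# Sketch of the top0-2 check: a position is flagged as an error only when
# the input character is absent from the model's top-3 candidates; the
# gold character then decides which of the three label classes applies.

def build_label(input_char, gold_char, top3):
    """Return 'detect_wrong', 'detect_ok_correct_wrong',
    'detect_ok_correct_ok', or None when no error is flagged."""
    if input_char in top3:
        return None  # same char appears in top0-2: position treated as correct
    if input_char == gold_char:
        return "detect_wrong"            # flagged, but position had no error
    if gold_char in top3:
        return "detect_ok_correct_ok"    # error found, correct char ranked
    return "detect_ok_correct_wrong"     # error found, correction missed
```

These labels then serve as the target classes of the pinyin classification model described later.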
The following describes the specific working principle of the present application in a specific embodiment.
As shown in fig. 5, fig. 5 is a flow chart of a data construction phase, and the specific steps are as follows:
Step S501: whether the domain data is sufficient.
That is, the embodiment of the present application may determine whether the domain data is sufficient, and perform step S504 when the domain data is sufficient, otherwise, perform step S502.
Step S502: and collecting important entities and common entities in the field and hot words, and constructing rule data.
That is, the present embodiment may collect domain-important and common entities, hotwords, and construct rule data.
Step S503: based on the existing offline data, the generalized model is trained.
In other words, the embodiment of the application can train a generalization model by collecting utterances with the same intention and constructing a similar text data set based on the existing data.
Step S504: and constructing similar texts of true users through a language generalization model based on the rule data.
That is to say, the embodiment of the application can construct similar texts of real users through a language generalization model based on the rule data, thereby improving the user experience.
Step S505: whether the error correction corpus is sufficient.
That is, the embodiment of the present application may determine whether the error correction corpus is sufficient, and execute step S507 if the error correction corpus is sufficient, otherwise execute step S506.
Step S506: the confusion word and the similar pronunciation character are substituted for the generalization linguistic data to expand the data.
That is to say, the embodiment of the application can collect confusion words and near-sound characters in the domain, count the confusion word frequency, word frequency, and wrong-word frequency, and randomly replace correct characters in the generalized corpus in proportion to expand the error correction data, effectively alleviating problems such as the difficulty and high labor cost of error correction annotation.
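The corpus-expansion step above can be sketched as follows. The confusion table is a tiny illustrative stand-in for the collected domain table, and the seeded generator is only there to make the sketch reproducible.

```python
import random

# Sketch: correct characters in a generalized utterance are randomly
# replaced with collected confusion / near-sound characters at a preset
# ratio, yielding (possibly-wrong text, gold text) pairs as error
# correction data.
CONFUSION = {"航": ["杭", "行"], "乐": ["月", "叻"]}

def corrupt(text, ratio=0.3, rng=None):
    """Return (possibly-corrupted text, gold text)."""
    rng = rng or random.Random(0)  # seeded so the sketch is reproducible
    chars = list(text)
    for i, ch in enumerate(chars):
        if ch in CONFUSION and rng.random() < ratio:
            chars[i] = rng.choice(CONFUSION[ch])
    return "".join(chars), text
```

Run over the whole generalized corpus, this produces labeled error correction pairs without manual annotation.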
Step S507: modify the mask rule in the MLM module of BERT.
That is to say, in the embodiment of the present application, 10% of characters may be randomly selected and replaced with sound-like characters and confusion words for masking. When there is enough real data, the real data is masked directly; when real data is scarce, the generalized corpus (containing wrong characters) from step S506 is masked, with the mask positions being the wrong-character positions.
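The modified mask rule in step S507 can be sketched as follows. The near-sound table and function name are illustrative assumptions; in practice the replacement candidates would come from the collected confusion and near-sound tables.

```python
import random

# Sketch: instead of the standard [MASK] token, about `rate` of the
# characters (or, when corrupted corpora are used, exactly the known
# wrong-character positions) are replaced with a near-sound character,
# and the MLM target at those positions is the original character.
NEAR_SOUND = {"机": ["几", "基"], "话": ["华", "划"]}

def phonetic_mask(tokens, wrong_positions=None, rate=0.10, seed=0):
    rng = random.Random(seed)
    tokens = list(tokens)
    labels = [None] * len(tokens)  # MLM target at masked positions
    if wrong_positions is None:  # no real data: sample ~rate of positions
        wrong_positions = [i for i, t in enumerate(tokens)
                           if t in NEAR_SOUND and rng.random() < rate]
    for i in wrong_positions:
        if tokens[i] in NEAR_SOUND:
            labels[i] = tokens[i]          # model must recover the original
            tokens[i] = rng.choice(NEAR_SOUND[tokens[i]])
    return tokens, labels
```

Training on such pairs teaches the model to recover the original character specifically under phonetic noise, which is what biases it toward sound similarity.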
In addition, when the data is rich enough and covers most scenes in the vertical domain, step S502, step S503, and step S504 may be skipped, and when the quality of the error correction corpus is high enough, step S506 may be skipped.
As shown in fig. 6, fig. 6 is a flowchart of Bert language model construction, and the specific steps are as follows:
Step S601: input the mask training text and fine-tune the learning rate with warmup.
That is, in the embodiment of the present application, the mask training text may be input, and the learning rate may be adjusted with a warmup schedule.
Step S602: losses are calculated based on log _ softmax Cross entry, output as confidence for each word of the dictionary.
That is, the embodiment of the present application may produce an output of dictionary dimension, calculate via softmax the confidence of each dictionary word appearing at the mask position, and compute the loss based on log_softmax cross entropy.
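The loss in step S602 can be written out in a few lines. This is the standard log_softmax cross-entropy computation, shown here in plain Python for one masked position rather than as the framework call a real implementation would use.

```python
import math

# Sketch: the model emits one score per dictionary entry at a masked
# position; log_softmax turns scores into log-confidences, and the
# cross-entropy loss is the negative log-confidence of the true character.

def log_softmax(scores):
    m = max(scores)  # subtract max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

def cross_entropy(scores, target_index):
    return -log_softmax(scores)[target_index]
```

Minimizing this loss pushes the confidence of the original character at each masked position toward 1.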
Step S603: and testing the error detection accuracy and recall rate of the top pn, and stopping training after the effect is achieved.
That is to say, the embodiment of the present application may test the BERT model on the constructed or real error correction data, flag an error when the input character is not in the output topn, test the precision, recall, and f1 of topn error detection, and stop training once the target effect is reached.
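The topn detection test in step S603 can be sketched as the following evaluation loop; the example data format is an assumption for illustration.

```python
# Sketch: a position is predicted as an error when the input character is
# not among the model's topn candidates; precision, recall, and f1 are
# computed against the gold is-wrong flags.

def detection_metrics(examples, n=3):
    """examples: list of (input_char, is_really_wrong, topk_candidates)."""
    tp = fp = fn = 0
    for ch, wrong, topk in examples:
        predicted_wrong = ch not in topk[:n]
        if predicted_wrong and wrong:
            tp += 1
        elif predicted_wrong:
            fp += 1
        elif wrong:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Training stops once these metrics reach the desired levels on the held-out error correction data.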
As shown in fig. 7, fig. 7 is a flowchart of pinyin classification model construction, which includes the following steps:
step S701: and constructing a pinyin dictionary and an initial and final dictionary.
Step S702: marking is carried out on the pinyin classification model data set, namely a pinyin classification model training set, a verification set and a test set can be constructed according to the embodiment of the application, and 3 types of data labels with error detection, correction and error correction and correct error detection and correction can be constructed.
Step S703: and (4) feature engineering, namely after marking construction is completed, starting feature engineering, collecting characteristics of pinyin editing distance, initial and final characteristics, original character embedding, output embedding, n-gram score (n can be 1,2 and 3), word frequency, pinyin word frequency, whether the word is in an entity, a word in a sliding window range with the word as a center, whether the word is in a participle word list, participle length and the like, constructing features of the original character and the error correction character, and enabling label to correspond to the three types of labels in the step S702.
Step S704: the method comprises the steps of constructing a machine learning classification model, adjusting a classification threshold, finding out the optimal parameters through cross validation, namely the embodiment of the application can improve the accuracy rate through the machine learning classification model by using an integration model or stacking and the like, finding out the most appropriate parameters through cross validation, setting the classification threshold, adjusting the type of error detection, correction and correctness, and keeping the accuracy rate and the recall rate of the classification threshold on a high score.
As shown in fig. 8, fig. 8 is a flowchart of the overall testing stage, which includes the following specific steps:
step S801: a data construction phase.
Step S802: and constructing a Bert language model.
Step S803: and (5) constructing a pinyin classification model.
Step S804: and (4) a general test stage.
To sum up, the embodiment of the present application may input the text to be corrected into the BERT model and compute the top-2 candidate characters for each position. When the original character appears among the candidates, the position is considered to need no correction; when it does not, the candidate characters are fed, together with the engineered features, into the constructed classification model, and a correction is recorded preferentially when the "detection correct, correction correct" class appears; otherwise no correction is made.
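The overall test flow above can be sketched end to end, with both models stubbed out. The function names and the verdict string are illustrative assumptions.

```python
# Sketch: for each character, take the language model's top candidates;
# if the original character is among them, leave it alone, otherwise ask
# the classifier and accept a candidate only when it is judged
# "detection correct, correction correct".

def correct_text(text, topk_fn, classify_fn):
    out = []
    for i, ch in enumerate(text):
        candidates = topk_fn(text, i)
        if ch in candidates:
            out.append(ch)  # original char ranked: no correction needed
            continue
        accepted = [c for c in candidates
                    if classify_fn(text, i, c) == "detect_ok_correct_ok"]
        out.append(accepted[0] if accepted else ch)  # else leave unchanged
    return "".join(out)
```

With stub models that rank "机" highest for a misheard "鸡", `correct_text("打开鸡器", ...)` would yield "打开机器" while leaving already-correct positions untouched.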
According to the sound-like error correction method for dialogue scenes provided by the embodiment of the present application, rule data can be constructed from a plurality of entities in the car-machine dialogue domain based on a semantic protocol; the rule data is generalized and enhanced with a pre-trained generalization model to generate a generalized corpus; confusion words and near-sound characters in the car-machine dialogue domain are collected and used to randomly replace characters in the generalized corpus in a certain proportion, constructing error correction data and generating an error correction corpus; the mask rule of the language model is modified so that random characters are replaced with sound-like characters for masking, and the language model is trained; and based on the pre-constructed pinyin rule table and the error correction corpus, near-sound noise is added to train the language model, obtaining a sound-similarity-biased language model for sound-like error correction in dialogue scenes. This effectively improves the accuracy of error correction and detection, improves the interaction experience of the vehicle, and effectively meets users' needs. It thereby solves the technical problems in the related art that only semantics is learned, that a pinyin model must maintain a large number of homophonic hot words, increasing the cost of constructing hot words, and that single pinyin features reduce the error correction capability and fail to meet users' needs.
Next, a proposed speech-like error correction apparatus for a dialogue scene according to an embodiment of the present application is described with reference to the drawings.
Fig. 9 is a block diagram of a speech-like error correction apparatus for dialog scenes according to an embodiment of the present application.
As shown in fig. 9, the audio-like error correction apparatus 10 for dialogue scenes includes: a construction module 100, a generation module 200, a processing module 300, a modification module 400 and an error correction module 500.
Specifically, the constructing module 100 is configured to construct rule data according to a plurality of entities in the car machine conversation field based on a preset semantic protocol.
And the generating module 200 is configured to perform generalization enhancement on the rule data based on a pre-trained generalization model to generate a generalized corpus.
The processing module 300 is configured to collect confusion words and near-sound characters in the car-machine dialogue domain, randomly replace characters in the generalized corpus according to a preset proportion, construct error correction data, and generate an error correction corpus.
The modifying module 400 is configured to modify the mask rule of the language model, replace random characters with sound-like characters for masking, and train the language model.
The error correction module 500 is configured to add near-sound noise to train the language model based on the pre-constructed pinyin rule table and the error correction corpus, obtaining a sound-similarity-biased language model for sound-like error correction in dialogue scenes.
Optionally, in an embodiment of the present application, the constructing module 100 is further configured to randomly fill the word slots of the preset semantic protocol with a plurality of entities according to different intents in the preset semantic protocol.
Optionally, in an embodiment of the present application, the modification module 400 is further configured to modify a replacement rule of random other characters, replace an original word with a near word and a confusing word, and construct a mask rule.
Optionally, in an embodiment of the present application, the error correction module 500 is further configured to construct a multidimensional feature of the error correction character of the error correction corpus, and construct a sound-like classification machine learning model to perform error correction classification.
It should be noted that the foregoing explanation on the embodiment of the sound-like error correction method for a dialog scene is also applicable to the sound-like error correction apparatus for a dialog scene in this embodiment, and details are not repeated here.
According to the sound-like error correction device for dialogue scenes provided by the embodiment of the present application, rule data can be constructed from a plurality of entities in the car-machine dialogue domain based on a semantic protocol; the rule data is generalized and enhanced with a pre-trained generalization model to generate a generalized corpus; confusion words and near-sound characters in the car-machine dialogue domain are collected and used to randomly replace characters in the generalized corpus in a certain proportion, constructing error correction data and generating an error correction corpus; the mask rule of the language model is modified so that random characters are replaced with sound-like characters for masking, and the language model is trained; and based on the pre-constructed pinyin rule table and the error correction corpus, near-sound noise is added to train the language model, obtaining a sound-similarity-biased language model for sound-like error correction in dialogue scenes. This effectively improves the accuracy of error correction and detection, improves the interaction experience of the vehicle, and effectively meets users' needs. It thereby solves the technical problems in the related art that only semantics is learned, that a pinyin model must maintain a large number of homophonic hot words, increasing the cost of constructing hot words, and that single pinyin features reduce the error correction capability and fail to meet users' needs.
Fig. 10 is a schematic structural diagram of a vehicle according to an embodiment of the present application. The vehicle may include:
memory 1001, processor 1002, and computer programs stored on memory 1001 and executable on processor 1002.
The processor 1002, when executing the program, implements the sound-likeness correction method for the dialogue scene provided in the above-described embodiment.
Further, the vehicle further includes:
a communication interface 1003 for communicating between the memory 1001 and the processor 1002.
A memory 1001 for storing computer programs that may be run on the processor 1002.
Memory 1001 may include high-speed RAM memory and may also include non-volatile memory, such as at least one disk memory.
If the memory 1001, the processor 1002, and the communication interface 1003 are implemented independently, the communication interface 1003, the memory 1001, and the processor 1002 may be connected to each other through a bus and perform communication with each other. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 10, but this is not intended to represent only one bus or type of bus.
Alternatively, in specific implementation, if the memory 1001, the processor 1002 and the communication interface 1003 are integrated into one chip, the memory 1001, the processor 1002 and the communication interface 1003 may complete communication with each other through an internal interface.
The processor 1002 may be a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits configured to implement embodiments of the present Application.
The present embodiment also provides a computer-readable storage medium on which a computer program is stored, which, when executed by a processor, implements the above sound-like error correction method for dialogue scenes.
In the description of the present specification, reference to the description of "one embodiment," "some embodiments," "an example," "a specific example," or "some examples" or the like means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or N embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first", "second" and "first" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or to implicitly indicate the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "N" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or N executable instructions for implementing steps of a custom logic function or process, and alternate implementations are included within the scope of the preferred embodiment of the present application in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the embodiments of the present application.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or N wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the N steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried out in the method of implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and the program, when executed, includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a separate product, may also be stored in a computer-readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. While embodiments of the present application have been shown and described above, it will be understood that the above embodiments are exemplary and should not be construed as limiting the present application and that changes, modifications, substitutions and alterations in the above embodiments may be made by those of ordinary skill in the art within the scope of the present application.

Claims (10)

1. A method for sound-like error correction for conversational scenes, comprising the steps of:
based on a preset semantic protocol, constructing rule data according to a plurality of entities in the vehicle machine conversation field;
carrying out generalization enhancement on the rule data based on a pre-trained generalization model to generate a generalization corpus;
collecting confusion words and near-sound characters in the vehicle-mounted device conversation field, randomly replacing the generalized corpora according to a preset proportion, constructing error correction data, and generating error correction corpora;
modifying a mask rule of a language model, replacing random characters with sound-like characters to carry out mask, and training the language model; and
based on a pre-constructed pinyin rule table and the error correction corpus, adding near-sound noise to train the language model to obtain a sound-similarity-biased language model, so as to perform sound-like error correction in a dialogue scene.
2. The method of claim 1, wherein constructing rule data from a plurality of entities in the car-machine dialog domain comprises:
and randomly filling the entities into word slots of the preset semantic protocol according to different intentions in the preset semantic protocol.
3. The method of claim 1, wherein modifying the mask rule in the masked language model (MLM) module to mask with sound-like characters instead of random characters and training the MLM module comprises:
and modifying the replacement rules of other random characters, and replacing the original words with the near-sound words and the confusing words to construct the mask rules.
4. The method according to claim 1, wherein adding near-sound noise to train the language model based on the pre-constructed pinyin rule table and the error correction corpus to obtain a sound-similarity-biased language model comprises:
and constructing multi-dimensional characteristics of the error correction characters of the error correction corpus, and constructing a sound-like classification machine learning model to carry out error correction classification.
5. A speech-like error correction apparatus for conversational scenes, comprising:
the construction module is used for constructing rule data according to a plurality of entities in the vehicle machine conversation field based on a preset semantic protocol;
the generating module is used for carrying out generalization and enhancement on the rule data based on a pre-trained generalization model to generate a generalization corpus;
the processing module is used for collecting confusion words and near-sound characters in the vehicle-mounted device conversation field, randomly replacing the generalized linguistic data according to a preset proportion, constructing error correction data and generating error correction linguistic data;
the modifying module is used for modifying the mask rule of the language model, replacing random characters with sound-like characters to carry out mask and training the language model; and
and the error correction module is used for adding near-sound noise to train the language model based on the pre-constructed pinyin rule table and the error correction corpus, obtaining a sound-similarity-biased language model for sound-like error correction in a dialogue scene.
6. The apparatus of claim 5, wherein the construction module is further configured to randomly fill word slots of the predetermined semantic protocol with the plurality of entities according to different intents of the predetermined semantic protocol.
7. The apparatus of claim 5, wherein the modification module is further configured to modify a replacement rule of random other characters to replace an original word with a near word and a confusing word to construct the mask rule.
8. The apparatus of claim 5, wherein the error correction module is further configured to construct multidimensional features of error correction characters of the error correction corpus, and construct a phonetically-classified machine learning model for error correction classification.
9. A vehicle, characterized by comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor executing the program to implement the sound-like error correction method for dialogue scenes according to any one of claims 1 to 4.
10. A computer-readable storage medium, on which a computer program is stored, the program being executable by a processor to implement the sound-like error correction method for dialogue scenes according to any one of claims 1 to 4.
CN202211196704.1A 2022-09-29 2022-09-29 Sound-like error correction method and device for conversation scene Pending CN115545013A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211196704.1A CN115545013A (en) 2022-09-29 2022-09-29 Sound-like error correction method and device for conversation scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211196704.1A CN115545013A (en) 2022-09-29 2022-09-29 Sound-like error correction method and device for conversation scene

Publications (1)

Publication Number Publication Date
CN115545013A true CN115545013A (en) 2022-12-30

Family

ID=84732211

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211196704.1A Pending CN115545013A (en) 2022-09-29 2022-09-29 Sound-like error correction method and device for conversation scene

Country Status (1)

Country Link
CN (1) CN115545013A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116861903A (en) * 2023-09-04 2023-10-10 成都赛力斯科技有限公司 Error correction method and device for vehicle-mounted text data


Similar Documents

Publication Publication Date Title
US9286886B2 (en) Methods and apparatus for predicting prosody in speech synthesis
US10388274B1 (en) Confidence checking for speech processing and query answering
US10332508B1 (en) Confidence checking for speech processing and query answering
US9818401B2 (en) Systems and methods for adaptive proper name entity recognition and understanding
US8548806B2 (en) Voice recognition device, voice recognition method, and voice recognition program
EP2259252B1 (en) Speech recognition method for selecting a combination of list elements via a speech input
US8380505B2 (en) System for recognizing speech for searching a database
WO2016067418A1 (en) Conversation control device and conversation control method
US20080177541A1 (en) Voice recognition device, voice recognition method, and voice recognition program
US8126714B2 (en) Voice search device
AU2017326987B2 (en) Systems and methods for adaptive proper name entity recognition and understanding
US20060241936A1 (en) Pronunciation specifying apparatus, pronunciation specifying method and recording medium
JP5073024B2 (en) Spoken dialogue device
CN115545013A (en) Sound-like error correction method and device for conversation scene
JP5097802B2 (en) Japanese automatic recommendation system and method using romaji conversion
KR101483947B1 (en) Apparatus for discriminative training acoustic model considering error of phonemes in keyword and computer recordable medium storing the method thereof
Beaufays et al. Language model capitalization
Wang et al. RNN-based prosodic modeling for mandarin speech and its application to speech-to-text conversion
JP3581044B2 (en) Spoken dialogue processing method, spoken dialogue processing system, and storage medium storing program
Saychum et al. Efficient Thai Grapheme-to-Phoneme Conversion Using CRF-Based Joint Sequence Modeling.
Shi An investigation of linguistic information for speech recognition error detection
Allauzen et al. Voice query refinement
JP2006040150A (en) Voice data search device
CN116246611A (en) Method for determining a vehicle domain and speech recognition system for a vehicle
CN117275467A (en) Voice instruction recognition method and device in noise environment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination