WO2020014890A1 - Accent-based speech recognition processing method, electronic device, and storage medium


Info

Publication number
WO2020014890A1
WO2020014890A1 (PCT/CN2018/096131; CN2018096131W)
Authority
WO
WIPO (PCT)
Prior art keywords
voice instruction
training
accent
voice
instruction
Prior art date
Application number
PCT/CN2018/096131
Other languages
English (en)
French (fr)
Inventor
谢冠宏
廖明进
高铭坤
Original Assignee
深圳魔耳智能声学科技有限公司
Priority date
Filing date
Publication date
Application filed by 深圳魔耳智能声学科技有限公司 filed Critical 深圳魔耳智能声学科技有限公司
Priority to PCT/CN2018/096131 priority Critical patent/WO2020014890A1/zh
Priority to CN201880000936.0A priority patent/CN109074804B/zh
Publication of WO2020014890A1 publication Critical patent/WO2020014890A1/zh

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/26 - Speech to text systems

Definitions

  • the present application relates to the technical field of speech recognition, and in particular, to an accent-based speech recognition processing method, an electronic device, and a storage medium.
  • An accent-based speech recognition processing method, an electronic device, and a storage medium capable of improving the accuracy of speech recognition are provided.
  • An accent-based speech recognition processing method includes:
  • fuzzy matching is performed between the speech recognition result and the standard speech instruction to obtain a candidate standard speech instruction
  • an accent feature of the training voice instruction is determined, and the accent feature is used for correcting and identifying a voice instruction to be recognized that carries a corresponding accent feature.
  • An electronic device includes a memory and a processor. The memory stores computer-readable instructions that, when executed by the processor, cause the processor to perform the following steps:
  • fuzzy matching is performed between the speech recognition result and the standard speech instruction to obtain a candidate standard speech instruction
  • an accent feature of the training voice instruction is determined, and the accent feature is used for correcting and identifying a voice instruction to be recognized that carries a corresponding accent feature.
  • One or more non-volatile storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps:
  • fuzzy matching is performed between the speech recognition result and the standard speech instruction to obtain a candidate standard speech instruction
  • an accent feature of the training voice instruction is determined, and the accent feature is used for correcting and identifying a voice instruction to be recognized that carries a corresponding accent feature.
  • FIG. 1 is an application environment diagram of an accent-based speech recognition processing method according to an embodiment
  • FIG. 2 is a schematic flowchart of an accent-based speech recognition processing method according to an embodiment
  • FIG. 3 is a schematic flowchart of steps for triggering accent training and comparison in an embodiment
  • FIG. 4 is a schematic flowchart of an accent feature generation step in another embodiment
  • FIG. 5 is a schematic flowchart of an accent correction recognition step in an embodiment
  • FIG. 6 is a schematic flowchart of an accent-based speech recognition processing method according to an embodiment
  • FIG. 7 is a structural block diagram of an accent-based speech recognition processing device according to an embodiment
  • FIG. 8 is a structural block diagram of an electronic device in an embodiment.
  • the accent-based speech recognition processing method provided in this application can be applied to the application environment shown in FIG. 1.
  • the user interacts with the electronic device 102 by sending a sound signal.
  • the user sends a sound signal
  • the electronic device 102 collects the sound signal sent by the user through the microphone array to obtain a voice instruction carrying relevant information, and analyzes the voice instruction.
  • the electronic device 102 collects, through the microphone array, multiple repeated sound signals continuously issued by the user to obtain training voice instructions carrying relevant information, and performs preliminary recognition on the training voice instructions to obtain the speech recognition result corresponding to each training voice instruction.
  • the electronic device 102 is then triggered to enter the accent training state, and through accent training a standard voice command matching the training voice command is determined. Furthermore, the accent characteristics of the training voice command are determined according to the training voice command and the matched standard voice command.
  • the electronic device 102 uses the accent feature to correct and recognize the voice command to be recognized, so as to accurately obtain a standard voice command matching the voice command.
  • the electronic device 102 may be an electronic device with a voice recognition function, including, but not limited to, various smart home devices, personal computers, smart phones, voice interactive robots, and the like. Among them, smart home devices are devices that perform corresponding operations through voice instructions, such as smart speakers, smart home appliances, and on-board voice control systems that can implement voice control.
  • an accent-based speech recognition processing method is provided.
  • the method is applied to the electronic device in FIG. 1 as an example, and includes the following steps:
  • S202 Receive and recognize a preset number of training voice commands, and obtain a voice recognition result corresponding to each training voice command.
  • a voice instruction is a voice signal carrying the text content of a control instruction, obtained by collecting, through the microphone array, the sound signal sent by the user. For example, when the control command is a wake-up command, its corresponding voice command is a voice signal carrying the text content of "play"; when the control command is a switch command, its corresponding voice command is a voice signal carrying the corresponding control instruction text content.
  • the electronic device is provided with a microphone array to collect sound signals. It can be understood that, in order to achieve a better effect of collecting voice signals, any one of a ring microphone array, a linear microphone array, or a stereo microphone array may be used according to the application scenario of the electronic device. For example, for a smart speaker, a ring microphone array may be used in order to collect sound source signals within a 360-degree range.
  • the training voice instruction is a voice instruction obtained by the microphone array of the electronic device collecting a user continuously issuing a preset number of sounds carrying specific text content.
  • the training voice instructions carry the accent features of the user.
  • the preset number can be set in advance according to demand. Taking the intelligent voice device as an example, assuming that the preset number is three, when the intelligent voice device is in a standby or normal working state, it receives a voice instruction corresponding to three consecutive sound signals issued by the user.
  • in different accent training states, users can utter different specific text content. For example, for smart speakers or smart home appliances, the specific text content can be control instructions for these devices.
  • in the accent training state of a smart speaker, the specific text can be "on", "next song", "loop play", etc.; in the accent training state of an intelligent air conditioner, the specific text can be "cooling", "ventilation", or a specific temperature value, such as "27 degrees", and so on.
  • the accent training state refers to a state in which the received training voice instructions are processed to finally obtain a standard voice instruction that matches the training voice instructions.
  • the microphone array of the electronic device collects the sound signal to obtain a voice instruction, and the electronic device recognizes the received voice instruction to obtain the corresponding speech recognition result and stores it.
  • the number of received voice instructions is judged. When the number of received voice instructions reaches a preset number, it is determined that the preset number of voice instructions is a training voice instruction.
  • the recognition method is a preset speech recognition algorithm.
  • the preset speech recognition algorithm is a traditional speech recognition algorithm, for example, a speech recognition algorithm based on a neural network or one based on DTW (Dynamic Time Warping); a sketch of DTW-based matching follows.
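To make the DTW-based matching concrete, here is a minimal sketch in Python, assuming each utterance has already been converted into a sequence of feature vectors; the function names and the Euclidean local distance are illustrative assumptions, not part of the patent.

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """Dynamic Time Warping distance between two feature-vector sequences."""
    n, m = len(seq_a), len(seq_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])  # local distance
            cost[i, j] = d + min(cost[i - 1, j],       # insertion
                                 cost[i, j - 1],       # deletion
                                 cost[i - 1, j - 1])   # match
    return cost[n, m]

def recognize(features, templates):
    """Return the label of the stored template sequence closest to the input."""
    return min(templates, key=lambda label: dtw_distance(features, templates[label]))
```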
  • a training trigger condition is a condition that can be used to trigger an accent training state. For example, whether the received training voice instruction is issued within a specified time.
  • the comparison result refers to the similarity between the speech recognition results.
  • Consistency condition refers to whether the training voice instructions corresponding to the speech recognition results are the same voice instruction, that is, whether each training voice instruction carries the same information; for example, the training voice instructions are a preset number of "on" voice commands.
  • the consistency condition is that the similarity between the speech recognition results reaches a similarity threshold.
  • if there is a control instruction corresponding to the voice instruction, the operation corresponding to the control instruction is performed; otherwise, the device switches to the standby or working state it was in before receiving the training voice instructions. For example, it switches to standby mode and waits for voice instructions, or switches to the working state and continues the work it was performing before receiving the training voice instructions.
  • standard voice instructions refer to pre-stored voice information that can be accurately identified.
  • a speech recognition algorithm is a recognition algorithm established based on standard Mandarin
  • the standard voice instruction refers to voice information that complies with the pronunciation rules of standard Mandarin.
  • the standard voice instruction is voice information carrying the text content of the control instruction, and the standard voice instruction can be accurately recognized by a preset voice recognition algorithm.
  • the candidate standard voice instruction refers to the result output by the standard voice instruction matching model.
  • fuzzy recognition is performed on each voice recognition result to obtain a fuzzy recognition result.
  • the fuzzy recognition result is matched with the pre-stored standard voice instructions to obtain candidate standard voice instructions that match the training voice instructions. Fuzzy matching includes replacement of easily confused Pinyin, simple grammatical analysis, and so on; a sketch follows.
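A minimal sketch of this idea, assuming syllables are already available as Pinyin strings; the confusion pairs, helper names, and the example below are illustrative assumptions rather than the patent's actual rules.

```python
# Easily confused Mandarin sounds mapped to a canonical form (assumed pairs).
CONFUSED_PAIRS = [("zh", "z"), ("ch", "c"), ("sh", "s"), ("ng", "n")]

def normalize_pinyin(syllables):
    """Normalize easily confused sounds so accented variants compare equal."""
    out = []
    for s in syllables:
        if s.startswith("l"):              # n/l initial confusion
            s = "n" + s[1:]
        for a, b in CONFUSED_PAIRS:
            s = s.replace(a, b)
        out.append(s)
    return out

def fuzzy_match(recognized, standard_instructions):
    """Return standard instructions whose normalized Pinyin equals the input's."""
    target = normalize_pinyin(recognized)
    return [cmd for cmd, pinyin in standard_instructions.items()
            if normalize_pinyin(pinyin) == target]

# Example: "shi" mis-heard as "si" still matches the stored instruction.
standard = {"随机模式": ["sui", "ji", "mo", "shi"]}
print(fuzzy_match(["sui", "ji", "mo", "si"], standard))  # ['随机模式']
```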
  • the candidate standard voice instruction is confirmed based on a preset confirmation method.
  • the candidate voice instruction is used as the standard voice instruction matching the training voice instruction.
  • the preset confirmation method can be either confirmation based on user feedback, or confirmation based on a set automatic confirmation rule.
  • the automatic confirmation rule may be that when the similarity between the candidate standard voice instruction and the training voice instruction reaches a preset value, the candidate voice instruction is considered to be the same as the training voice instruction.
  • S210 Determine an accent feature of the training voice command according to the training voice command and the matched standard voice command, and the accent feature is used to correct and identify the voice command to be recognized that carries the corresponding accent feature.
  • Accent features refer to the unique features of voice instructions compared to standard voice instructions.
  • the accent features include the sound features of the training voice instruction itself and the correction coefficients relative to the standard voice instruction.
  • the voices spoken by different users usually have different accent characteristics. Sound characteristics, as the name suggests, refer to the characteristic information contained in sounds, such as tone color, pitch, and speed of speech.
  • the models of the speech recognition system usually include an acoustic model and a language model, which correspond to the calculation of speech-to-syllable probability and the calculation of syllable-to-word probability, respectively.
  • the acoustic features can be extracted through the acoustic model.
  • the correction coefficient, also known as the accent recognition correction coefficient, refers to a correction coefficient for the difference between the training voice command and the standard voice command, and includes, for example, an accent coefficient and an error coefficient.
  • the electronic device compares the acquired training voice instruction with the standard voice instruction corresponding to the specific text content to obtain a matched standard voice instruction, and further analyzes the difference between them to obtain the accent recognition correction coefficient.
  • a difference analysis is performed on the training voice instruction and the matched standard voice instruction to determine the accent features of the training voice instruction, so that in the subsequent voice recognition process the accent features can be applied to the voice recognition algorithm to correct and recognize the voice instruction and thus obtain accurate speech recognition results. Since the accent feature is obtained based on the difference between the training voice command and the standard voice command, a voice command carrying the corresponding accent can be effectively identified based on the accent feature.
  • the above-mentioned accent-based speech recognition processing method obtains a speech recognition result corresponding to each training speech instruction by receiving and recognizing a preset number of training speech instructions.
  • the system enters the accent training state and compares the speech recognition results of each training voice instruction.
  • when the comparison results meet the consistency conditions, the speech recognition results are fuzzy-matched with the standard voice instructions to obtain candidate standard voice instructions.
  • the candidate standard voice instructions are confirmed, and standard voice instructions that match the training voice instructions are determined.
  • the accent features used to correct and recognize the voice command to be recognized are determined.
  • in this way, accent features are obtained through accent training and are used to correct the recognition of the speech instructions to be recognized, optimizing the speech recognition results and thereby improving the accuracy of speech recognition.
  • S302 Obtain a receiving duration of receiving a training voice instruction.
  • the receiving duration refers to a time interval between the first time a training voice instruction is received and the last time a training voice instruction is received.
  • the receiving duration can be obtained by recording the time point at which each training voice instruction is received and calculating from those time points; alternatively, a timer is started when the first training voice instruction is received and stopped when the last one is received, and the receiving duration is obtained from the timer result.
  • the preset duration refers to a preset duration based on the estimated duration of the training voice instruction.
  • the receiving duration of the training voice instruction is less than or equal to the preset duration, it indicates that the accent training is currently required; when the receiving duration of the training voice instruction is longer than the preset duration, it indicates that the accent training is not currently required.
  • an accent training state is triggered to perform accent training. It can be understood that when the receiving duration is longer than the preset duration, it is determined whether there is a control command corresponding to the last received voice command, that is, whether the recognition result of the voice command is the same as the text content of a control command; if so, the operation corresponding to the control command is performed; otherwise, the device switches to the standby or working state it was in before receiving the training voice instructions. A sketch of this trigger check follows.
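A sketch of the duration check, assuming a preset count of 3 and the 30-second threshold used in the smart-speaker example later in the text; the state-handling callbacks are hypothetical stubs.

```python
import time

PRESET_COUNT = 3        # preset number of training voice instructions
PRESET_DURATION = 30.0  # preset duration in seconds (example value)

received_times = []

def enter_accent_training_state():
    print("accent training state triggered")        # stub

def resume_previous_state():
    print("resume standby/previous working state")  # stub

def on_voice_instruction(instruction):
    """Record arrival time; trigger training if the batch arrived fast enough."""
    received_times.append(time.monotonic())
    if len(received_times) == PRESET_COUNT:
        duration = received_times[-1] - received_times[0]  # receiving duration
        received_times.clear()
        if duration <= PRESET_DURATION:
            enter_accent_training_state()
        else:
            resume_previous_state()
```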
  • comparing the speech recognition results of the training voice instructions includes: performing a similarity calculation on the speech recognition results to obtain the similarity between them; when the similarity between the speech recognition results reaches the similarity threshold, it is determined that the comparison result meets the consistency condition, that is, the speech recognition results meet the consistency condition.
  • the similarity threshold refers to the minimum similarity value that can be determined when the corresponding voice instruction of each speech recognition result is the same. It can be understood that when the similarity between the speech recognition results reaches the similarity threshold, it is considered that the training speech instruction corresponding to each speech recognition result is the same speech instruction repeated.
  • the similarity calculation is performed on the speech recognition results to obtain the similarity between them, and it is determined whether that similarity reaches the similarity threshold; if it does, the speech recognition results are determined to meet the consistency condition, as in the sketch below.
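A sketch of the pairwise consistency check, using SequenceMatcher as one possible similarity measure (the patent does not prescribe one); the 0.99 threshold mirrors the 99% figure in the smart-speaker example below.

```python
from difflib import SequenceMatcher
from itertools import combinations

SIMILARITY_THRESHOLD = 0.99

def meets_consistency_condition(results):
    """True if every pair of recognition results reaches the threshold."""
    return all(SequenceMatcher(None, a, b).ratio() >= SIMILARITY_THRESHOLD
               for a, b in combinations(results, 2))

# e.g. three recognitions of the same repeated command:
print(meets_consistency_condition(["random mode", "random mode", "random mode"]))  # True
```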
  • the step of confirming the candidate standard voice instruction and determining the standard voice instruction matching the training voice instruction includes: outputting the candidate standard voice instruction; and determining, according to the user's feedback on the candidate standard voice instruction, the standard voice instruction that matches the training voice instruction.
  • the speech recognition result is fuzzy-matched with the pre-stored standard voice instructions to obtain a standard voice instruction that fuzzy-matches the speech recognition result, and that standard voice instruction is output as the candidate standard voice instruction. When the user obtains the candidate standard voice instruction from the output information, the user determines whether it is a standard voice instruction that matches the training voice instruction, that is, whether it carries the same text content as the training voice instruction. If so, confirmation information is fed back, and according to the fed-back confirmation information it is determined that the candidate standard voice instruction is a standard voice instruction that matches the training voice instruction.
  • the output method may be a method of displaying text on a display screen, or a voice broadcast method.
  • the step of determining a standard voice instruction matching the speech recognition result according to the user's feedback on the candidate standard voice instruction includes: receiving the user's feedback information on the candidate standard voice instruction; and, when the feedback information includes a result that the voice recognition result matches the candidate standard voice instruction, determining that the candidate standard voice instruction is a standard voice instruction that matches the training voice instruction.
  • the feedback information refers to information that is fed back by the user according to the output candidate standard voice instruction, and includes a result of matching the voice recognition result and the candidate standard voice instruction, or a result of the voice recognition result not matching the candidate standard voice instruction.
  • confirmation information (such as "yes") may be input through the displayed instruction information to indicate that the speech recognition result matches the candidate standard voice instruction, or non-confirmation information (such as "no") may be input to indicate that the speech recognition result does not match the candidate standard voice instruction.
  • when the feedback includes a matching result, the candidate standard voice instruction is determined to be a standard voice instruction that matches the training voice instruction. It can be understood that a standard voice instruction matching the voice recognition result also matches the training voice instruction corresponding to that result.
  • in this way, the fuzzy recognition result of the training voice instruction is matched with the standard voice instructions to obtain candidate standard voice instructions, and the user then confirms the matching result, which improves the accuracy of the matching result and ensures that the training voice instruction and the corresponding standard voice instruction are correctly matched.
  • when the comparison result does not satisfy the consistency condition, or when the feedback information includes a result that the speech recognition result and the candidate standard voice instruction do not match, the accent training state is exited and the device switches to the standby or working state it was in before receiving the training voice instructions.
  • the method further includes: associating and storing the training voice instruction and the standard voice instruction matching the training voice instruction.
  • the training voice instruction and the standard voice instruction matching it are stored in association, so that when the accent feature determination condition is satisfied, the stored training voice instruction and the matched standard voice instruction are obtained and the accent feature determination step is performed.
  • the method further includes: exiting the accent training state, and switching to a standby or working state before receiving the training voice instruction.
  • when this accent training is completed, the device exits the accent training state and switches to the standby or working state it was in before receiving the training voice instructions.
  • the method further includes: generating and outputting prompt information asking whether to perform the operation corresponding to the training voice instruction. According to the prompt information, the user feeds back whether the operation corresponding to the training voice instruction should be performed. If the feedback indicates that it should, the operation corresponding to the standard voice instruction that matches the training voice instruction is performed.
  • the accent features include: the sound features of the training voice instruction and the accent recognition correction coefficient. As shown in FIG. 4, according to the training voice instruction and the matched standard voice instruction, determining the accent characteristics of the training voice instruction includes:
  • the accent feature determination condition refers to the accent training of the same user having reached a preset number of times.
  • the stored training voice instruction of the user and the standard voice instruction matching the training voice instruction are obtained.
  • the standard voice command is voice information that does not carry an accent, while the corresponding training voice command is voice information that carries the same specific text content together with an accent; there is therefore a difference between the sound characteristics of the two.
  • based on a sound feature extraction method, the sound features of the training voice instruction and the sound features of the standard voice instruction are extracted respectively.
  • the sound features can be extracted using traditional acoustic models, such as the commonly used acoustic model based on hidden Markov models or the acoustic model based on recurrent neural networks.
  • the difference between the sound characteristics of the training voice instruction and the standard voice instruction is analyzed, and the accent recognition correction coefficient corresponding to the training voice instruction is determined based on the obtained difference coefficient, so that the accent recognition correction coefficient can be used during speech recognition to optimize the recognition results; a sketch follows.
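One way to perform such a difference analysis is a per-dimension linear fit between the two feature sequences, consistent with the product-plus-error relation described later; the feature representation and fitting method here are assumptions for illustration.

```python
import numpy as np

def estimate_correction(train_feats, std_feats):
    """Least-squares fit of train ~= a * std + e, per feature dimension.

    Both inputs are (T, D) arrays assumed already time-aligned (e.g. by DTW)
    to the same length T. Returns the accent coefficient a and the error
    coefficient e, one value per feature dimension.
    """
    dims = std_feats.shape[1]
    a = np.empty(dims)
    e = np.empty(dims)
    for d in range(dims):
        # polyfit of degree 1 returns [slope, intercept]
        a[d], e[d] = np.polyfit(std_feats[:, d], train_feats[:, d], 1)
    return a, e
```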
  • the accent-based speech recognition processing method further includes:
  • the voice instruction is a signal carrying the text content of the control instruction obtained by collecting a voice signal issued by the user through the microphone array, and the corresponding intelligent voice device can be controlled by the voice instruction.
  • the voice command to be recognized refers to a voice command that needs to be currently recognized.
  • for example, the voice command to be recognized may be a voice signal carrying the control instruction text content of "play", or a voice signal carrying the control instruction text content of "next".
  • the microphone array of the electronic device collects a voice instruction to be recognized.
  • the received voice instruction is analyzed by using an acoustic model in a preset voice recognition algorithm to extract a sound feature of the voice signal.
  • the preset speech recognition algorithm is a traditional speech recognition algorithm, for example, a neural network-based speech recognition algorithm, a DTW (Dynamic Time Warping)-based speech recognition algorithm, and the like.
  • the accent feature refers to the accent feature corresponding to the training voice instruction obtained by the electronic device based on the accent training.
  • the accent feature includes the sound feature of the training voice instruction itself, for example, the tone color, tone, and speed of the voice instruction.
  • the accent feature also includes an accent recognition correction coefficient for correcting a voice instruction to be recognized.
  • the sound feature of the voice command to be recognized is matched with the sound features of the stored accent features to obtain the accent feature that matches it, and the accent recognition correction coefficient in the matched accent feature is then obtained, as in the sketch below.
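A sketch of this lookup, assuming each stored accent profile holds a representative sound-feature vector together with its correction coefficients; the profile structure and the Euclidean distance are illustrative assumptions.

```python
import numpy as np

def find_correction(sound_feature, accent_profiles):
    """Return (a, e) of the profile whose sound feature is closest to the input.

    accent_profiles: list of dicts like
        {"sound_feature": np.ndarray, "a": np.ndarray, "e": np.ndarray}
    """
    best = min(accent_profiles,
               key=lambda p: np.linalg.norm(sound_feature - p["sound_feature"]))
    return best["a"], best["e"]  # accent recognition correction coefficients
```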
  • S508 Recognize the voice instruction according to the accent recognition correction coefficient, and obtain a voice recognition result.
  • the accent recognition correction coefficient is applied to a speech recognition algorithm, and the speech instruction is modified and recognized, thereby obtaining a speech recognition result. Since the accent recognition correction coefficient is a difference correction coefficient obtained based on the training voice instruction and the standard voice instruction, the voice instruction carrying the corresponding accent can be effectively identified based on the difference correction coefficient.
  • the speech recognition result of the speech instruction to be recognized is obtained, a corresponding operation can be performed based on the speech recognition result.
  • the voice recognition result is a "play" instruction, then the smart speaker is controlled to perform a playback operation.
  • the above-mentioned accent-based speech recognition processing method analyzes the speech instruction to be recognized to obtain a sound feature.
  • when the sound feature matches a stored accent feature, the accent recognition correction coefficient corresponding to that accent feature is obtained, and the voice command is then recognized according to the accent recognition correction coefficient to obtain the voice recognition result.
  • in this way, the sound features of the voice command to be recognized are matched with the stored accent features, the accent recognition correction coefficient corresponding to the matched accent feature is obtained, and the voice command to be recognized is then recognized using that correction coefficient, optimizing the speech recognition results and thereby improving the accuracy of speech recognition.
  • recognizing a voice instruction according to an accent recognition correction coefficient to obtain a voice recognition result includes: correcting the voice instruction according to the accent recognition correction coefficient; and recognizing the modified voice instruction to obtain a voice recognition result.
  • the accent recognition correction coefficient is a difference correction coefficient obtained from the training voice instruction and the standard voice instruction. Based on it, a correction relationship between the training voice instruction and the standard voice instruction can be established; using this relationship and the coefficient, the received voice instruction is corrected, and the corrected voice instruction is then recognized with the preset voice recognition algorithm to obtain the voice recognition result.
  • the accent recognition correction coefficient includes an accent coefficient and an error coefficient
  • a training voice instruction can be described as the product of the matched standard voice instruction and the accent coefficient, plus the error coefficient. Therefore, based on this relationship and the obtained accent coefficient and error coefficient, the voice command to be recognized can be corrected so that the corrected voice command conforms to the standard voice command as closely as possible; the relation and its inversion are written out below.
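In symbols (our notation, not the patent's): with x_train the accented training instruction features, x_std the matched standard instruction, a the accent coefficient, and e the error coefficient,

```latex
x_{\mathrm{train}} = a \cdot x_{\mathrm{std}} + e,
\qquad
\hat{x}_{\mathrm{std}} = \frac{x - e}{a},
```

where the second expression is the correction applied to a voice command x to be recognized before it is passed to the preset recognition algorithm.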
  • the voice recognition result is optimized to a certain extent and the accuracy of the voice recognition is improved.
  • the method includes the following steps:
  • when the smart speaker is in a standby or working state, it receives multiple training voice instructions continuously collected by the microphone. For example, three "random mode" voice commands issued by a user are collected through the microphone, each received "random mode" command is recognized, and the recognition result is stored in memory. Due to the interference of accent features, it is difficult to achieve a completely accurate recognition result.
  • for example, the recognition result of the first "random mode" instruction is the data corresponding to "who-machine mode", the recognition result of the second "random mode" instruction is the data corresponding to "random mode", and the recognition result of the third "random mode" instruction is the data corresponding to "random things".
  • the three consecutive instructions issued by the user collected through the microphone may be different instructions, and the corresponding recognition results are also different recognition results.
  • the number of received voice signals is determined; when the number of received voice signals reaches the preset number of 3, the voice signals are determined to be training voice instructions, and it is then determined whether the preset training trigger condition is satisfied.
  • S602 Obtain a receiving duration of receiving a training voice instruction.
  • the receiving duration can be obtained by recording the time point at which each training voice instruction is received and calculating from those time points; alternatively, a timer is started when the first training voice instruction is received and stopped when the last one is received, and the receiving duration is obtained from the timer result. For example, the time point when the "random mode" instruction is received for the first time and the time point when it is received for the third time are recorded, and the interval between the two time points is used as the receiving duration.
  • step S603. When the receiving duration is less than or equal to the preset duration, trigger an accent training state; otherwise, execute step S611.
  • assuming the preset duration is 30 seconds, it is determined whether the receiving duration is less than or equal to 30 seconds.
  • when the receiving duration is less than or equal to the preset duration, the accent training state of the smart speaker is triggered for accent training; when the receiving duration is longer than the preset duration, it is determined whether there is a control command corresponding to the last received voice command, that is, whether the recognition result of the voice command is the same as the text content of a control command; if so, the operation corresponding to the control command is performed; otherwise, the accent training state is exited and the device switches to the standby or working state it was in before receiving the training voice instructions.
  • for example, if the smart speaker was in the playing state, it switches back to the playing state and continues playing the song.
  • a similarity calculation is performed on the speech recognition results to obtain the similarity between them, so as to determine whether the similarity between the speech recognition results reaches the similarity threshold. For example, the similarity between the data corresponding to "who-machine mode" and the data corresponding to "random mode", the similarity between the data corresponding to "who-machine mode" and the data corresponding to "random things", and the similarity between the data corresponding to "random mode" and the data corresponding to "random things" are calculated.
  • it is determined whether the similarity between the speech recognition results reaches the similarity threshold; if it does, the speech recognition results are determined to meet the consistency condition. For example, when the similarity between the data corresponding to "who-machine mode" and the data corresponding to "random mode", the similarity between the data corresponding to "who-machine mode" and the data corresponding to "random things", and the similarity between the data corresponding to "random mode" and the data corresponding to "random things" all reach 99%, the comparison result is considered to satisfy the consistency condition.
  • when the consistency condition is met, the speech recognition result is fuzzy-matched with the pre-stored standard voice instructions to obtain a standard voice instruction that fuzzy-matches the speech recognition result, and that standard voice instruction serves as the candidate standard voice instruction. If the consistency condition is not met, the accent training state is exited and the device switches to the standby or working state it was in before receiving the training voice instructions.
  • the smart speaker stores standard voice instructions that can be executed, and it is assumed that the standard voice instructions of the "random mode" are included.
  • the speech recognition result is fuzzy-matched with the pre-stored standard voice instructions to obtain the "random mode" standard voice instruction that fuzzy-matches the speech recognition result, and "random mode" is output as the candidate standard voice instruction through the smart speaker, for example through the speaker of the smart speaker. If any of the three similarities is less than 99%, the accent training state is exited and the device switches back to the playing state to continue playing the song.
  • the output mode is the voice broadcast mode.
  • step S609: when the feedback information includes a result that the speech recognition result matches the candidate standard voice instruction, determine that the candidate standard voice instruction is a standard voice instruction that matches the training voice instruction; otherwise, perform step S611.
  • after the candidate standard voice instruction is output, the user's feedback information on the candidate standard voice instruction is received and analyzed.
  • when the feedback information includes a result that the speech recognition result matches the candidate standard voice instruction, the candidate standard voice instruction is determined to be a standard voice instruction that matches the training voice instruction. It can be understood that a standard voice instruction matching the speech recognition result also matches the training voice instruction corresponding to that result.
  • when the feedback information includes a result that the voice recognition result and the candidate standard voice instruction do not match, the accent training state is exited and the device switches to the standby or working state it was in before receiving the voice instructions.
  • the feedback information may be “yes” or “no” voice information.
  • for example, if the received voice message is "yes", the candidate standard voice instruction "random mode" is determined to be a standard voice instruction that matches the training voice instructions; if the received voice message is "no", the accent training state is exited and the device switches back to the playing state to continue playing the song.
  • step S610: store the training voice instruction and the standard voice instruction matching the training voice instruction in association; then step S611 is executed.
  • the training voice instruction and the standard voice instruction matching it are stored in association, so that when the condition for determining the correction coefficient of the training voice instruction is met, the stored training voice instruction and the matching standard voice instruction are acquired and the accent recognition correction coefficient extraction step is executed.
  • the three received "random mode" training voice instructions are associated with the "random mode” standard voice instructions and stored in the memory of the smart speaker.
  • the stored training voice instruction of the user and the standard voice instruction matching the training voice instruction are obtained.
  • assuming the training voice instructions for the 7 accent training sessions are "play", "pause", "off", "standby", "next", "random mode", and "sequential play", the 7 training voice instructions and their matching standard voice instructions are obtained.
  • the sound features of the training voice instruction and the standard voice instruction are extracted.
  • the difference between the sound characteristics of the training voice instruction and the standard voice instruction is analyzed, and the accent recognition correction coefficient of the training voice instruction is determined based on the obtained difference coefficient, so that the accent recognition correction coefficient can be used in the speech recognition process to optimize the speech recognition result.
  • the microphone array of the electronic device collects a voice signal to be recognized.
  • the smart speaker collects the "single cycle" instruction issued by the user through the microphone.
  • S616 Analyze the voice instructions to obtain the sound characteristics.
  • the received voice instruction is analyzed by a preset voice recognition algorithm to extract the sound characteristics of the voice instruction.
  • the received "single cycle" command is analyzed to obtain accent features such as tone, pitch, and speed of speech.
  • the intelligent voice device stores the accent features obtained through accent training in advance, and the accent features include a sound feature and an accent recognition correction coefficient. Match the sound feature of the voice command to be recognized with the sound feature in the stored accent feature to obtain the matched accent feature, and obtain the accent recognition correction coefficient corresponding to the matched accent feature.
  • the accent recognition correction coefficient is a difference correction coefficient obtained from the training voice instruction and the standard voice instruction. Based on it, a correction relationship between the training voice instruction and the standard voice instruction can be established; using this relationship and the coefficient, the received voice instruction is corrected, and the corrected voice instruction is then recognized with the preset voice recognition algorithm to obtain the voice recognition result. For example, the accent recognition correction coefficient is used to correct the "single cycle" instruction to be recognized, and the corrected "single cycle" instruction is then recognized to obtain the recognition result. Because the accented "single cycle" instruction is corrected before recognition, it is ensured that the "single cycle" instruction is accurately identified.
  • the above-mentioned accent-based speech recognition processing method fully considers the influence of accent features on speech recognition results, matches the sound features of the voice command to be recognized with the stored accent features, and obtains the accent recognition correction coefficient corresponding to the matched accent features. Furthermore, based on the accent recognition correction coefficient corresponding to the accent features, the speech instruction to be recognized is identified. Since the accent recognition correction coefficient is a difference correction coefficient obtained based on the training voice instruction and the standard voice instruction, the voice instruction carrying the corresponding accent can be effectively identified based on the difference correction coefficient.
  • an accent-based speech recognition processing device includes a speech recognition module 702, a comparison module 704, a matching module 706, a standard instruction confirmation module 708, and an accent feature determination module 710.
  • the voice recognition module 702 is configured to receive and recognize a preset number of training voice instructions, and obtain a voice recognition result corresponding to each training voice instruction.
  • specifically, the microphone array of the electronic device collects the sound signal to obtain a voice instruction; the voice recognition module 702 receives the voice instruction, recognizes it, and obtains and stores the corresponding speech recognition result. The number of received voice instructions is determined, and when it reaches the preset number, the preset number of voice instructions are determined to be training voice instructions.
  • the recognition method is a preset speech recognition algorithm.
  • the preset speech recognition algorithm is a traditional speech recognition algorithm, for example, a speech recognition algorithm based on a neural network or one based on DTW (Dynamic Time Warping).
  • a comparison module 704 is configured to trigger an accent training state when a preset training trigger condition is met, and compare the speech recognition results of each training voice instruction to obtain a comparison result.
  • specifically, when a preset number of training voice instructions are received, it is determined whether the preset training trigger condition is satisfied; when it is met, the accent training state is triggered, the stored speech recognition result of each training voice instruction is obtained, and the speech recognition results are compared to determine whether they meet the consistency condition.
  • the comparison result refers to the similarity between the speech recognition results.
  • Consistency condition refers to whether the training voice instructions corresponding to the speech recognition results are the same voice instruction, that is, whether each training voice instruction carries the same information; for example, the training voice instructions are a preset number of "on" voice signals.
  • the consistency condition is that the similarity between the speech recognition results reaches a similarity threshold.
  • the matching module 706 is configured to perform fuzzy matching between the speech recognition result and the standard voice instruction to obtain candidate standard voice instructions when the comparison result meets the consistency condition.
  • the matching module 706 performs fuzzy matching between the speech recognition result and the pre-stored standard voice instruction, and determines the standard voice instruction that matches the training voice instruction based on the matching result.
  • the standard instruction confirmation module 708 is configured to confirm the candidate standard voice instruction and determine the standard voice instruction that matches the training voice instruction.
  • the candidate standard voice instruction is confirmed based on a preset confirmation method.
  • the candidate voice instruction is used as a standard voice instruction that matches the training voice instruction.
  • the preset confirmation method can be either confirmation based on user feedback, or confirmation based on a set automatic confirmation rule.
  • the automatic confirmation rule may be that when the similarity between the candidate standard voice instruction and the training voice instruction reaches a preset value, the candidate voice instruction is considered to be the same as the training voice instruction.
  • the accent feature determining module 710 is configured to determine the accent feature of the training voice command according to the training voice command and the matched standard voice command, and the accent feature is used to correct and identify the voice command to be recognized that carries the corresponding accent feature.
  • specifically, the accent feature determination module 710 performs a difference analysis on the training voice instruction and the matched standard voice instruction to determine the accent feature of the training voice instruction, so that in subsequent speech recognition the voice instructions can be corrected and recognized to obtain the speech recognition result. Since the accent feature is obtained based on the difference between the training voice command and the standard voice command, a voice command carrying the corresponding accent can be effectively identified based on the accent feature.
  • the above-mentioned accent-based voice recognition processing device receives and recognizes a preset number of training voice instructions to obtain a voice recognition result corresponding to each training voice instruction.
  • the system enters the accent training state and compares the speech recognition results of each training voice instruction.
  • when the comparison results meet the consistency condition, the speech recognition results are fuzzy-matched with the standard voice instructions to obtain candidate standard voice instructions.
  • the candidate standard voice instructions are confirmed, and standard voice instructions that match the training voice instructions are determined.
  • the accent features used to correct and recognize the voice command to be recognized are determined.
  • comparison module 704 includes a trigger module and a comparison execution module.
  • the triggering module is used to obtain the receiving duration of the training voice instructions and, when the receiving duration is less than or equal to the preset duration, trigger entry into the accent training state.
  • the receiving duration can be obtained by recording the time point at which each training voice instruction is received and calculating from those time points; alternatively, a timer is started when the first training voice instruction is received and stopped when the last one is received, and the receiving duration is obtained from the timer result. It is determined whether the receiving duration is less than or equal to the preset duration; when it is, the accent training state is triggered to perform accent training. It can be understood that when the receiving duration is longer than the preset duration, the device switches to the standby or working state it was in before receiving the training voice instructions.
  • the comparison execution module is used to compare the speech recognition results of each training voice instruction to obtain the comparison result. Specifically, the stored speech recognition result of each training voice instruction is acquired, and the speech recognition results are compared to determine whether each speech recognition result meets a consistency condition.
  • the comparison execution module further includes a similarity calculation module and a consistency determination module.
  • the similarity calculation module is used to calculate the similarity of the speech recognition results to obtain the similarity between them; the consistency determination module is used to determine that the comparison result meets the consistency condition when the similarity between the speech recognition results reaches the similarity threshold.
  • the matching module 706 includes: an output module and a feedback determination module.
  • the output module is configured to output the candidate standard voice instruction;
  • the feedback determination module is configured to determine a standard voice instruction that matches the training voice instruction according to the user's feedback on the candidate standard voice instruction.
  • specifically, the voice recognition result is fuzzy-matched with the pre-stored standard voice instructions to obtain a standard voice instruction that fuzzily matches the voice recognition result, and that standard voice instruction is output as the candidate standard voice instruction. The user determines whether the candidate standard voice instruction is a standard voice instruction that matches the training voice instruction, that is, whether it carries the same text content as the training voice instruction. If so, confirmation information is fed back, and the feedback determination module determines, according to the confirmation information, that the candidate standard voice instruction is a standard voice instruction that matches the training voice instruction.
  • the feedback determination module is further configured to receive the user's feedback information on the candidate standard voice instruction and, when the feedback information includes a result that the speech recognition result matches the candidate standard voice instruction, determine that the candidate standard voice instruction is a standard voice instruction that matches the training voice instruction.
  • specifically, the feedback determination module receives the user's feedback information on the candidate standard voice instruction and analyzes it; when the feedback information includes a result that the speech recognition result matches the candidate standard voice instruction, it determines that the candidate standard voice instruction is a standard voice instruction that matches the training voice instruction. It can be understood that a standard voice instruction matching the speech recognition result also matches the training voice instruction corresponding to that result.
  • in this way, the fuzzy recognition result of the training voice instruction is matched with the standard voice instructions to obtain candidate standard voice instructions, and the user then confirms the matching result, which improves the accuracy of the matching result and ensures that the training voice instruction and the corresponding standard voice instruction are correctly matched.
  • the accent feature determination module 710 includes: a signal acquisition module, a sound feature module, and a coefficient determination module, where:
  • the signal acquisition module is configured to acquire a training voice instruction and a standard voice instruction matching the training voice instruction when the accent feature determination condition is satisfied.
  • the signal acquisition module acquires a stored training voice instruction of the user and a standard voice instruction matching the training voice instruction.
  • the sound feature module is used to obtain the sound features of the training voice instruction and the standard voice instruction, respectively.
  • the sound feature module extracts the sound features of the training voice instruction and the standard voice instruction based on the sound feature extraction method.
  • the coefficient determining module is configured to determine an accent recognition correction coefficient corresponding to the training voice instruction according to the difference between the sound characteristics of the training voice instruction and the standard voice instruction.
  • specifically, the coefficient determination module analyzes the difference between the sound characteristics of the training voice instruction and the standard voice instruction and determines the accent recognition correction coefficient corresponding to the training voice instruction based on the obtained difference coefficient, so that the coefficient can be used in the speech recognition process to optimize the recognition results.
  • the accent-based speech recognition processing device further includes a storage module for associating and storing the training voice instruction and the standard voice instruction matching the training voice instruction.
  • the training voice instruction and the standard voice instruction matching it are stored in association, so that when the condition for determining the correction coefficient of the training voice instruction is satisfied, the stored training voice instruction and the matched standard voice instruction are obtained to perform the accent feature determination operation.
  • the accent-based speech recognition processing device further includes a state switching module for exiting the accent training state and switching to a standby or working state before receiving a training voice instruction.
  • the accent-based speech recognition processing device further includes: a correction coefficient acquisition module and a correction recognition module.
  • the voice recognition module is further configured to receive a voice instruction to be recognized, analyze the voice instruction, and obtain a sound feature.
  • the voice recognition module receives a voice instruction to be recognized, analyzes the received voice instruction through an acoustic model in a preset voice recognition algorithm, and extracts a voice feature of the voice instruction.
  • the preset speech recognition algorithm is a traditional speech recognition algorithm, for example, a neural network-based speech recognition algorithm, a DTW (Dynamic Time Warping)-based speech recognition algorithm, and the like.
  • the correction coefficient obtaining module is configured to obtain an accent recognition correction coefficient corresponding to the matched accent feature when the sound feature matches the stored accent feature.
  • the accent-based speech recognition processing device previously stores accent features obtained through accent training, and the accent features include an accent recognition correction coefficient.
  • the sound features of the voice instruction to be recognized are matched against the stored accent features.
  • when they match, the correction coefficient acquisition module obtains the accent recognition correction coefficient corresponding to the matched accent feature.
  • the correction recognition module is configured to recognize the voice instruction according to the accent recognition correction coefficient and obtain a speech recognition result.
  • the correction recognition module applies the accent recognition correction coefficient in the speech recognition algorithm, corrects and recognizes the voice instruction, and then obtains the speech recognition result. Since the accent recognition correction coefficient is a difference correction coefficient derived from the training voice instruction and the standard voice instruction, voice instructions carrying the corresponding accent can be effectively recognized based on it.
  • the correction recognition module is further configured to correct the voice instruction according to the accent recognition correction coefficient, and to recognize the corrected voice instruction to obtain a speech recognition result.
  • the accent recognition correction coefficient is a difference correction coefficient derived from the training voice instruction and the standard voice instruction; based on it, a correction relationship between the training voice instruction and the standard voice instruction can be established. Using this correction relationship and the accent recognition correction coefficient, the received voice instruction is corrected, and the corrected voice instruction is then recognized based on a preset speech recognition algorithm to obtain a speech recognition result.
  • by correcting the voice instruction so that it conforms to the standard voice instruction as closely as possible before recognition, the speech recognition result is optimized to a certain extent and the accuracy of speech recognition is improved.
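  • Continuing the same assumed linear model, here is a sketch of how the correction recognition module could apply the stored coefficients before recognition; `recognize` is a placeholder for whatever preset speech recognition algorithm the device uses, not an API named by the patent.

```python
import numpy as np

def correct_and_recognize(feats, a, b, recognize):
    """Invert feats ~= a * standard + b, then run the preset recognizer.

    feats: (frames, dims) sound features of the instruction to recognize.
    a, b:  per-dimension accent and error coefficients from accent training
           (a is assumed nonzero in every dimension).
    recognize: callable implementing the preset speech recognition algorithm.
    """
    corrected = (feats - b) / a  # map accented features toward standard ones
    return recognize(corrected)
```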
  • each module in the above-mentioned accent-based speech recognition processing device may be implemented in whole or in part by software, hardware, and a combination thereof.
  • the above modules may be embedded, in hardware form, in or independent of the processor of the computer device, or stored in software form in the memory of the computer device, so that the processor can invoke them and perform the operations corresponding to each module.
  • in one embodiment, an electronic device is provided, and an internal structure diagram of the electronic device may be as shown in FIG. 8.
  • the electronic device includes a processor, a memory, a network interface, a display screen, an input device, and a microphone array connected through a system bus.
  • the processor of the electronic device is used to provide computing and control capabilities.
  • the memory of the electronic device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system and a computer program.
  • the internal memory provides an environment for the running of the operating system and the computer program stored in the non-volatile storage medium.
  • the network interface of the electronic device is used to communicate with an external terminal through a network connection.
  • the computer program is executed by a processor to implement a speech recognition method.
  • the display screen of the electronic device may be a liquid crystal display screen or an electronic ink display screen.
  • the input device of the electronic device may be a touch layer covering the display screen, a button, trackball, or touchpad provided on the casing of the electronic device, or an external keyboard, touchpad, or mouse.
  • FIG. 8 is only a block diagram of part of the structure related to the solution of the present application and does not limit the electronic devices to which the solution applies; a specific electronic device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
  • an electronic device is provided, including a memory and a processor; the memory stores computer-readable instructions that, when executed by the processor, cause the processor to perform the following steps:
  • receiving and recognizing a preset number of training voice instructions to obtain a speech recognition result corresponding to each training voice instruction;
  • when the preset training trigger condition is met, entering the accent training state and comparing the speech recognition results of the training voice instructions to obtain a comparison result;
  • when the comparison result satisfies the consistency condition, fuzzy-matching the speech recognition results with standard voice instructions to obtain a candidate standard voice instruction;
  • confirming the candidate standard voice instruction to determine the standard voice instruction matching the training voice instructions;
  • determining the accent features of the training voice instructions according to the training voice instructions and the matched standard voice instruction, the accent features being used to correct and recognize a to-be-recognized voice instruction carrying the corresponding accent features.
  • the computer-readable instructions further cause the processor to perform the following steps:
  • the receiving duration of the training voice instructions is obtained; when the receiving duration is less than or equal to a preset duration, the accent training state is triggered; and the speech recognition results of the training voice instructions are compared to obtain a comparison result.
  • the computer-readable instructions further cause the processor to perform the following steps:
  • the candidate standard voice instruction is output; and according to the user's feedback on the candidate standard voice instruction, a standard voice instruction matching the training voice instruction is determined.
  • the computer-readable instructions further cause the processor to perform the following steps:
  • the user's feedback information on the candidate standard voice instruction is received; when the feedback information includes a result that the speech recognition result matches the candidate standard voice instruction, the candidate standard voice instruction is determined to be a standard voice instruction matching the training voice instruction.
  • the computer-readable instructions further cause the processor to perform the following steps:
  • when the preset training trigger condition is met, the accent training state is triggered; similarity among the speech recognition results is computed; and when the similarity among the speech recognition results reaches a similarity threshold, the comparison result is determined to satisfy the consistency condition.
  • the computer-readable instructions further cause the processor to perform the following steps:
  • the training voice instructions and the standard voice instructions matching the training voice instructions are stored in association.
  • the computer-readable instructions further cause the processor to perform the following steps:
  • the accent training state is exited, and the device switches to the standby or working state that preceded receipt of the training voice instructions.
  • the computer-readable instructions further cause the processor to perform the following steps:
  • when the accent feature determination condition is met, the training voice instruction and the standard voice instruction matching it are obtained; the sound features of the training voice instruction and of the standard voice instruction are obtained respectively; and the accent recognition correction coefficient corresponding to the training voice instruction is determined according to the difference between those sound features.
  • the computer-readable instructions further cause the processor to perform the following steps:
  • a voice instruction to be recognized is received and analyzed to obtain its sound features; when the sound features match a stored accent feature, the accent recognition correction coefficient corresponding to the matched accent feature is obtained; and the voice instruction is recognized according to the accent recognition correction coefficient to obtain a speech recognition result.
  • the computer-readable instructions further cause the processor to perform the following steps:
  • the voice instruction is corrected according to the accent recognition correction coefficient; and the corrected voice instruction is recognized to obtain a speech recognition result.
  • one or more non-volatile storage media storing computer-readable instructions are provided.
  • when the computer-readable instructions are executed by one or more processors, the one or more processors perform the following steps:
  • receiving and recognizing a preset number of training voice instructions to obtain a speech recognition result corresponding to each training voice instruction;
  • when the preset training trigger condition is met, entering the accent training state and comparing the speech recognition results of the training voice instructions to obtain a comparison result;
  • when the comparison result satisfies the consistency condition, fuzzy-matching the speech recognition results with standard voice instructions to obtain a candidate standard voice instruction;
  • confirming the candidate standard voice instruction to determine the standard voice instruction matching the training voice instructions;
  • determining the accent features of the training voice instructions according to the training voice instructions and the matched standard voice instruction, the accent features being used to correct and recognize a to-be-recognized voice instruction carrying the corresponding accent features.
  • the one or more processors execute the following steps:
  • the receiving duration of the training voice instructions is obtained; when the receiving duration is less than or equal to a preset duration, the accent training state is triggered; and the speech recognition results of the training voice instructions are compared to obtain a comparison result.
  • the one or more processors execute the following steps:
  • the candidate standard voice instruction is output; and according to the user's feedback on the candidate standard voice instruction, a standard voice instruction matching the training voice instruction is determined.
  • the one or more processors execute the following steps:
  • the user's feedback information on the candidate standard voice instruction is received; when the feedback information includes a result that the speech recognition result matches the candidate standard voice instruction, the candidate standard voice instruction is determined to be a standard voice instruction matching the training voice instruction.
  • the one or more processors execute the following steps:
  • when the preset training trigger condition is met, the accent training state is triggered; similarity among the speech recognition results is computed; and when the similarity among the speech recognition results reaches a similarity threshold, the comparison result is determined to satisfy the consistency condition.
  • the one or more processors execute the following steps:
  • the training voice instructions and the standard voice instructions matching the training voice instructions are stored in association.
  • the one or more processors execute the following steps:
  • the accent training state is exited, and the device switches to the standby or working state that preceded receipt of the training voice instructions.
  • the one or more processors execute the following steps:
  • when the accent feature determination condition is met, the training voice instruction and the standard voice instruction matching it are obtained; the sound features of the training voice instruction and of the standard voice instruction are obtained respectively; and the accent recognition correction coefficient corresponding to the training voice instruction is determined according to the difference between those sound features.
  • the one or more processors execute the following steps:
  • a voice instruction to be recognized is received and analyzed to obtain its sound features; when the sound features match a stored accent feature, the accent recognition correction coefficient corresponding to the matched accent feature is obtained; and the voice instruction is recognized according to the accent recognition correction coefficient to obtain a speech recognition result.
  • the one or more processors execute the following steps:
  • the voice instruction is corrected according to the accent recognition correction coefficient; and the corrected voice instruction is recognized to obtain a speech recognition result.
  • the steps in the embodiments of the present application are not necessarily performed sequentially in the order indicated by the step numbers. Unless explicitly stated herein, there is no strict ordering restriction on the execution of these steps, and they can be performed in other orders. Moreover, at least some of the steps in each embodiment may include multiple sub-steps or stages; these are not necessarily completed at the same moment but may be performed at different times, and their execution order is also not necessarily sequential, as they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
  • Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory can include random access memory (RAM) or external cache memory.
  • RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

An accent-based speech recognition processing method, an electronic device, and a storage medium. The method comprises: receiving and recognizing a preset number of training voice instructions to obtain a speech recognition result corresponding to each training voice instruction (S202); when a preset training trigger condition is met, triggering entry into an accent training state, and comparing the speech recognition results of the training voice instructions to obtain a comparison result (S204); when the comparison result satisfies a consistency condition, performing fuzzy matching between the speech recognition results and standard voice instructions to obtain a candidate standard voice instruction (S206); confirming the candidate standard voice instruction to determine a standard voice instruction matching the training voice instructions (S208); and determining an accent feature of the training voice instructions according to the training voice instructions and the matched standard voice instruction, the accent feature being used to correct and recognize a to-be-recognized voice instruction carrying the corresponding accent feature (S210). Accent training yields accent features with which voice instructions are corrected, thereby improving recognition accuracy.

Description

Accent-based speech recognition processing method, electronic device and storage medium
Technical Field
The present application relates to the technical field of speech recognition, and in particular to an accent-based speech recognition processing method, an electronic device, and a storage medium.
Background
With the development of the mobile Internet, the Internet of Vehicles, and smart homes, speech recognition plays an increasingly important role, for example, interacting with in-vehicle infotainment systems by voice or controlling smart home appliances through voice instructions. As speech recognition technology becomes widely deployed, improving its accuracy has become a key and difficult problem in its development.
In conventional technology, speech recognition research and development is essentially based on standard Mandarin. In practice, however, users' pronunciation rarely reaches the level of standard Mandarin and usually carries various accents. Because conventional speech recognition algorithms are built on standard Mandarin, they cannot recognize accented speech, and recognition accuracy is therefore very low.
Summary
According to various embodiments of the present application, an accent-based speech recognition processing method, an electronic device, and a storage medium capable of improving the accuracy of speech recognition are provided.
An accent-based speech recognition processing method comprises:
receiving and recognizing a preset number of training voice instructions to obtain a speech recognition result corresponding to each of the training voice instructions;
when a preset training trigger condition is met, triggering entry into an accent training state, and comparing the speech recognition results of the training voice instructions to obtain a comparison result;
when the comparison result satisfies a consistency condition, performing fuzzy matching between the speech recognition results and standard voice instructions to obtain a candidate standard voice instruction;
confirming the candidate standard voice instruction to determine a standard voice instruction matching the training voice instructions; and
determining an accent feature of the training voice instructions according to the training voice instructions and the matched standard voice instruction, the accent feature being used to correct and recognize a to-be-recognized voice instruction carrying the corresponding accent feature.
An electronic device comprises a memory and a processor, the memory storing computer-readable instructions which, when executed by the processor, cause the processor to perform the following steps:
receiving and recognizing a preset number of training voice instructions to obtain a speech recognition result corresponding to each of the training voice instructions;
when a preset training trigger condition is met, triggering entry into an accent training state, and comparing the speech recognition results of the training voice instructions to obtain a comparison result;
when the comparison result satisfies a consistency condition, performing fuzzy matching between the speech recognition results and standard voice instructions to obtain a candidate standard voice instruction;
confirming the candidate standard voice instruction to determine a standard voice instruction matching the training voice instructions; and
determining an accent feature of the training voice instructions according to the training voice instructions and the matched standard voice instruction, the accent feature being used to correct and recognize a to-be-recognized voice instruction carrying the corresponding accent feature.
One or more non-volatile storage media storing computer-readable instructions are provided; when executed by one or more processors, the computer-readable instructions cause the one or more processors to perform the following steps:
receiving and recognizing a preset number of training voice instructions to obtain a speech recognition result corresponding to each of the training voice instructions;
when a preset training trigger condition is met, triggering entry into an accent training state, and comparing the speech recognition results of the training voice instructions to obtain a comparison result;
when the comparison result satisfies a consistency condition, performing fuzzy matching between the speech recognition results and standard voice instructions to obtain a candidate standard voice instruction;
confirming the candidate standard voice instruction to determine a standard voice instruction matching the training voice instructions; and
determining an accent feature of the training voice instructions according to the training voice instructions and the matched standard voice instruction, the accent feature being used to correct and recognize a to-be-recognized voice instruction carrying the corresponding accent feature.
Details of one or more embodiments of the present application are set forth in the drawings and the description below. Other features, objects, and advantages of the present application will become apparent from the specification, the drawings, and the claims.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present application more clearly, the following briefly introduces the drawings required in the description of the embodiments. Evidently, the drawings described below are merely some embodiments of the present application, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is a diagram of an application environment of an accent-based speech recognition processing method in one embodiment;
FIG. 2 is a schematic flowchart of an accent-based speech recognition processing method in one embodiment;
FIG. 3 is a schematic flowchart of the steps of triggering accent training and performing comparison in one embodiment;
FIG. 4 is a schematic flowchart of an accent feature generation step in another embodiment;
FIG. 5 is a schematic flowchart of an accent-corrected recognition step in one embodiment;
FIG. 6 is a schematic flowchart of an accent-based speech recognition processing method in one embodiment;
FIG. 7 is a structural block diagram of an accent-based speech recognition processing apparatus in one embodiment;
FIG. 8 is a structural block diagram of an electronic device in one embodiment.
Detailed Description
To make the objectives, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here merely explain the present application and do not limit its scope of protection.
The accent-based speech recognition processing method provided by the present application can be applied in the application environment shown in FIG. 1, where a user interacts with the electronic device 102 by emitting sound signals. Specifically, the user emits a sound signal; the electronic device 102 collects it through a microphone array to obtain a voice instruction carrying the relevant information and analyzes the voice instruction. Taking a sound signal for accent training as an example, the electronic device 102 collects, through the microphone array, the sound signal repeated several times in succession by the user to obtain training voice instructions carrying the relevant information, and then performs preliminary recognition on the training voice instructions to obtain a speech recognition result corresponding to each training voice instruction. When a preset training trigger condition is met, the device enters an accent training state and, through accent training, determines the standard voice instruction matching the training voice instructions. From the training voice instructions and the matched standard voice instruction, it then determines the accent feature of the training voice instructions. In subsequent speech recognition, the electronic device 102 uses the accent feature to correct and recognize a to-be-recognized voice instruction so as to accurately obtain the matching standard voice instruction. The electronic device 102 may be any electronic device with a speech recognition function, including but not limited to various smart home devices, personal computers, smartphones, and voice interaction robots. A smart home device is a device that performs corresponding operations in response to voice instructions, such as a voice-controlled smart speaker, a smart home appliance, or an in-vehicle voice control system.
In one embodiment, as shown in FIG. 2, an accent-based speech recognition processing method is provided. The method is described below using its application to the electronic device in FIG. 1 as an example, and includes the following steps:
S202: receive and recognize a preset number of training voice instructions to obtain a speech recognition result corresponding to each training voice instruction.
A voice instruction is a speech signal, carrying the text content of a control instruction, obtained by collecting the user's sound signal through the microphone array. Taking a smart speaker as the electronic device, when the control instruction is a wake-up instruction, the corresponding voice instruction is a speech signal carrying the text "play"; when the control instruction is a switch instruction, the corresponding voice instruction is a speech signal carrying the text "next track". The electronic device is provided with a microphone array that collects sound signals. It can be understood that, to achieve a good collection effect, a circular, linear, or stereo microphone array may be used according to the application scenario of the electronic device. For a smart speaker, for example, a circular microphone array may be used to collect sound sources within a 360-degree range.
Further, training voice instructions are voice instructions obtained when the microphone array of the electronic device collects a preset number of consecutive utterances of specific text content from the user; they carry the user's accent features. The preset number can be set in advance as needed. Taking a smart voice device with a preset number of 3 as an example, when the device is in a standby or normal working state, it receives the voice instructions corresponding to three consecutive sound signals from the user. Depending on the electronic device, the user may utter different specific text content: for a smart speaker or a smart home appliance, the specific text can be the control instructions of the device. In the accent training state of a smart speaker, the specific text can be "turn on", "next track", "loop playback", and so on; in the accent training state of a smart air conditioner, it can be "cooling", "ventilation", or a specific temperature value such as "27 degrees". The accent training state refers to the state of processing the received training voice instructions to finally obtain the standard voice instruction matching them.
In this embodiment, whenever the user emits a sound signal within the receivable range of the electronic device's microphone array, the microphone array collects the sound signal to obtain a voice instruction; the electronic device recognizes the received voice instruction, obtains the corresponding speech recognition result, and stores it. The number of received voice instructions is checked; when it reaches the preset number, those voice instructions are determined to be the training voice instructions. The recognition method is a preset speech recognition algorithm, which is a conventional algorithm such as a neural-network-based speech recognition algorithm or a DTW (Dynamic Time Warping)-based speech recognition algorithm.
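DTW, named above as one conventional choice, aligns two feature sequences of possibly different lengths by minimizing accumulated frame distance. The following is a textbook sketch of the distance computation, with Euclidean frame distance as an assumed choice; it illustrates the technique generally rather than the patent's implementation.

```python
import numpy as np

def dtw_distance(x, y):
    """Classic dynamic time warping distance between feature sequences.

    x: (n, dims) and y: (m, dims) feature matrices. Returns the cost of the
    best monotonic alignment; a DTW-based recognizer would compare this cost
    across stored instruction templates and pick the closest one.
    """
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(x[i - 1] - y[j - 1])  # frame distance
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]
```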
S204: when a preset training trigger condition is met, trigger entry into an accent training state, and compare the speech recognition results of the training voice instructions to obtain a comparison result.
The training trigger condition is a condition that can trigger entry into the accent training state, for example, whether the received training voice instructions were issued within a specified duration.
In this embodiment, when the preset number of training voice instructions has been received, it is determined whether the preset training trigger condition is met. If it is, the device enters the accent training state, fetches the stored speech recognition results of the training voice instructions, and compares them to determine whether they satisfy a consistency condition. The comparison result is the similarity among the speech recognition results. The consistency condition indicates whether the training voice instructions corresponding to the speech recognition results are the same voice instruction, that is, whether they carry the same information; for example, the training voice instructions are a preset number of "turn on" voice instructions repeated by the same user. Specifically, the consistency condition is that the similarity among the speech recognition results reaches a similarity threshold. Performing accent training on repeated voice instructions ensures that the resulting accent feature adequately represents the user's accent.
In addition, when the preset training trigger condition is judged not to be met: if there is a control instruction corresponding to the last received voice instruction, that is, the recognition result of that voice instruction is identical to the text content of a control instruction, the operation corresponding to that control instruction is performed; otherwise, the device switches back to the standby or working state it was in before receiving the training voice instructions, for example, switching to standby mode to await voice instructions, or switching to the working state to resume the work that preceded the training voice instructions.
S206: when the comparison result satisfies the consistency condition, perform fuzzy matching between the speech recognition results and standard voice instructions to obtain a candidate standard voice instruction.
A standard voice instruction is pre-stored speech information that can be recognized precisely. Typically, since the speech recognition algorithm is built on standard Mandarin, a standard voice instruction is speech information that conforms to standard Mandarin pronunciation. In a smart voice device, a standard voice instruction is speech information carrying the text content of a control instruction and can be accurately recognized by the preset speech recognition algorithm. The candidate standard voice instruction is the output of the standard voice instruction matching model.
Specifically, when the comparison result satisfies the consistency condition, fuzzy recognition is performed on the speech recognition results to obtain a fuzzy recognition result, which is then matched against the pre-stored standard voice instructions to obtain the candidate standard voice instruction matching the training voice instructions. Fuzzy recognition includes substituting easily confused pinyin, simple grammatical analysis, and the like.
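As an illustration of this kind of fuzzy matching (easily-confused pinyin substitution plus a similarity measure), the sketch below normalizes a recognition result through an assumed confusion table and then picks the closest stored standard instruction. The confusion pairs, the threshold, and the function names are invented for the example; they are not taken from the patent.

```python
import difflib

# Hypothetical table of easily confused pinyin (illustrative only).
# Replacement order matters: whole-syllable pairs come before initials.
PINYIN_CONFUSIONS = {"shei": "shui", "zh": "z", "ch": "c", "sh": "s"}

def normalize(pinyin: str) -> str:
    """Collapse easily confused pinyin so accented variants compare equal."""
    for confusable, canonical in PINYIN_CONFUSIONS.items():
        pinyin = pinyin.replace(confusable, canonical)
    return pinyin

def fuzzy_match(result_pinyin: str, standard_instructions: dict) -> str | None:
    """Return the best-matching standard instruction, or None if too far.

    standard_instructions maps instruction text to its pinyin form.
    """
    best, best_score = None, 0.0
    for text, std_pinyin in standard_instructions.items():
        score = difflib.SequenceMatcher(
            None, normalize(result_pinyin), normalize(std_pinyin)).ratio()
        if score > best_score:
            best, best_score = text, score
    return best if best_score >= 0.6 else None  # assumed threshold
```

With these pairs, a misrecognition such as "谁机模式" (sheiji moshi) and the standard "随机模式" (suiji moshi) normalize to the same string and therefore match.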
S208: confirm the candidate standard voice instruction, and determine the standard voice instruction matching the training voice instructions.
Specifically, the candidate standard voice instruction is confirmed using a preset confirmation method; when the candidate is confirmed to be the same as the training voice instructions, it is taken as the standard voice instruction matching them. The preset confirmation method can be confirmation based on user feedback, or confirmation based on a configured automatic confirmation rule, for example: when the similarity between the candidate standard voice instruction and the training voice instructions reaches a preset value, the candidate is considered to be the same as the training voice instructions.
S210: determine the accent feature of the training voice instructions according to the training voice instructions and the matched standard voice instruction; the accent feature is used to correct and recognize a to-be-recognized voice instruction carrying the corresponding accent feature.
An accent feature is a distinctive characteristic of a voice instruction relative to the standard voice instruction; for example, it includes the sound features of the training voice instructions themselves and the correction coefficients relative to the standard voice instruction. Voices uttered by different users usually have different accent features. Sound features are, as the name suggests, the feature information contained in a sound, such as timbre, pitch, and speaking rate. A speech recognition system typically includes an acoustic model and a language model, which respectively compute speech-to-syllable and syllable-to-word probabilities; sound features can be extracted through the acoustic model. The correction coefficient, also called the accent recognition correction coefficient, is a difference correction coefficient between the training voice instructions and the standard voice instruction, and includes, for example, an accent coefficient and an error coefficient. In the accent training state, the electronic device compares the collected training voice instructions against the standard voice instruction corresponding to the specific text content to obtain the matched standard voice instruction, and then analyzes the difference between the two to obtain the accent recognition correction coefficient.
Specifically, a difference analysis is performed on the training voice instructions and the matched standard voice instruction to determine the accent feature of the training voice instructions, so that in subsequent speech recognition the accent feature can be applied in the speech recognition algorithm to correct and recognize voice instructions and obtain accurate recognition results. Because the accent feature is derived from the difference between the training voice instructions and the standard voice instruction, voice instructions carrying the corresponding accent can be effectively recognized based on it.
In the accent-based speech recognition processing method above, a preset number of training voice instructions is received and recognized to obtain a speech recognition result for each. When the preset training trigger condition is met, the device enters the accent training state and compares the speech recognition results; when the comparison result satisfies the consistency condition, the speech recognition results are fuzzy-matched against standard voice information to obtain a candidate standard voice instruction, which is then confirmed to determine the standard voice instruction matching the training voice instructions. The accent feature used to correct and recognize to-be-recognized voice instructions is then determined from the training voice instructions and the matched standard voice instruction. By fully accounting for the influence of accent features on recognition results and obtaining accent features through accent training, voice instructions to be recognized can be corrected and recognized, optimizing the recognition result and improving recognition accuracy.
In one embodiment, as shown in FIG. 3, the step of triggering entry into the accent training state when the preset training trigger condition is met and comparing the speech recognition results of the voice instructions to obtain a comparison result includes:
S302: obtain the receiving duration of the training voice instructions.
The receiving duration is the length of the interval from the first received training voice instruction to the last. Specifically, it can be obtained by recording the time point at which each training voice instruction is received and computing from those time points; or a timer is started when the first training voice instruction is received and stopped when the last is received, and the receiving duration is obtained from the timer.
S304: when the receiving duration is less than or equal to a preset duration, trigger entry into the accent training state.
The preset duration is a length of time set in advance based on the estimated duration of the training voice instructions. A receiving duration less than or equal to the preset duration indicates that accent training is currently required; a longer one indicates that it is not.
In this embodiment, it is determined whether the receiving duration is less than or equal to the preset duration; if so, the accent training state is triggered so that accent training can proceed. It can be understood that, when the receiving duration exceeds the preset duration, it is determined whether there is a control instruction corresponding to the last received voice instruction, that is, whether its recognition result is identical to the text content of a control instruction; if so, the corresponding operation is performed; otherwise, the device switches back to the standby or working state it was in before receiving the training voice instructions.
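A minimal sketch of this timing check, assuming the device records one timestamp per received training instruction (the names and the 30-second value are illustrative, not prescribed by the patent):

```python
import time

PRESET_COUNT = 3          # preset number of training voice instructions
PRESET_DURATION = 30.0    # preset duration in seconds (example value)

timestamps = []           # one entry per received training instruction

def on_training_instruction_received():
    """Record arrival time; report whether accent training should trigger."""
    timestamps.append(time.monotonic())
    if len(timestamps) < PRESET_COUNT:
        return False                      # keep collecting instructions
    receiving_duration = timestamps[-1] - timestamps[0]
    timestamps.clear()                    # reset for the next round
    return receiving_duration <= PRESET_DURATION
```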
S306: compare the speech recognition results of the training voice instructions to obtain a comparison result.
The stored speech recognition results of the training voice instructions are fetched and compared to determine whether they satisfy the consistency condition.
In a specific embodiment, comparing the speech recognition results of the training voice instructions includes: computing the similarity among the speech recognition results; and when the similarity among them reaches the similarity threshold, determining that the comparison result satisfies the consistency condition, that is, the speech recognition results satisfy the consistency condition.
The similarity threshold is the minimum similarity required to determine that the voice instructions corresponding to the speech recognition results are the same. It can be understood that, when the similarity among the speech recognition results reaches the threshold, the corresponding training voice instructions are considered to be repetitions of the same voice instruction.
Specifically, similarity is computed among the speech recognition results; it is then determined whether the similarity among them reaches the similarity threshold, and if so, the speech recognition results are determined to satisfy the consistency condition.
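For instance, the pairwise-similarity check could be sketched as follows, with difflib's ratio standing in for whatever similarity measure the device actually uses and 0.99 for the similarity threshold:

```python
import difflib
from itertools import combinations

def satisfies_consistency(results, threshold=0.99):
    """True if every pair of recognition results reaches the threshold."""
    return all(
        difflib.SequenceMatcher(None, a, b).ratio() >= threshold
        for a, b in combinations(results, 2)
    )
```

The patent computes similarity over the stored recognition-result data rather than final text strings, so SequenceMatcher and the 0.99 value here are stand-ins chosen only to keep the sketch self-contained.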
In one embodiment, the step of confirming the candidate standard voice instruction and determining the standard voice instruction matching the training voice instructions includes: outputting the candidate standard voice instruction; and determining the standard voice instruction matching the training voice instructions according to the user's feedback on the candidate standard voice instruction.
Specifically, when the comparison result satisfies the consistency condition, the speech recognition results are fuzzy-matched against the pre-stored standard voice instructions to obtain one standard voice instruction that fuzzily matches, and this is output as the candidate standard voice instruction. When the user obtains the candidate through the output information, the user judges whether it is the standard voice instruction matching the training voice instructions, that is, whether it carries the same text content; if so, the user feeds back confirmation information, based on which the candidate is determined to be the matching standard voice instruction. The output may take the form of text on a display screen or of a voice announcement.
In one embodiment, the step of determining the standard voice instruction matching the speech recognition results according to the user's feedback on the candidate standard voice instruction includes: receiving the user's feedback information on the candidate; and when the feedback information includes a result that the speech recognition results match the candidate, determining the candidate to be the standard voice instruction matching the training voice instructions.
Feedback information is the information the user returns in response to the output candidate standard voice instruction, and includes either a result that the speech recognition results match the candidate or a result that they do not. Specifically, confirmation information (for example, "yes") can be entered via the displayed prompts to indicate a match, or non-confirmation information (for example, "no") to indicate no match.
The user's feedback information on the candidate standard voice instruction is received and analyzed; when it includes a match result, the candidate is determined to be the standard voice instruction matching the training voice instructions. It can be understood that a standard voice instruction matching the speech recognition results also matches the training voice instructions corresponding to those results.
By fuzzy-matching the speech recognition results of the training voice instructions against standard voice instructions to obtain a candidate, and then having the user confirm the matching result, the accuracy of the matching result is improved and it is ensured that the training voice instructions are correctly matched with the corresponding standard voice instruction.
Further, when the comparison result does not satisfy the consistency condition, or when the feedback information includes a result that the speech recognition results do not match the candidate, the device exits the accent training state and switches back to the standby or working state it was in before receiving the training voice instructions.
In one embodiment, after confirming the candidate standard voice instruction and determining the standard voice instruction matching the training voice instructions, the method further includes: storing the training voice instructions in association with the matched standard voice instruction.
The training voice instructions and the matched standard voice instruction are stored in association so that, when the accent feature determination condition is met, the stored training voice instructions and the matched standard voice instruction can be fetched to perform the accent feature determination step.
Further, after the step of storing the training voice instructions in association with the corresponding standard voice instruction, the method further includes: exiting the accent training state and switching to the standby or working state that preceded receipt of the training voice instructions. When the current round of accent training is completed, the device exits the accent training state and switches back to that state.
In one embodiment, after the step of storing the training voice instructions in association with the corresponding standard voice instruction, the method further includes: generating and outputting prompt information asking whether to perform the operation corresponding to the training voice instructions. The user feeds back, in response to the prompt, whether to perform the operation; if the feedback is affirmative, the operation corresponding to the standard voice instruction matching the training voice instructions is performed.
In one embodiment, the accent feature includes the sound features of the training voice instructions and an accent recognition correction coefficient. As shown in FIG. 4, determining the accent feature of the training voice instructions from the training voice instructions and the matched standard voice instruction includes:
S402: when the accent feature determination condition is met, obtain the training voice instructions and the standard voice instruction matching them.
The accent feature determination condition is that the accent training of the same user has reached a preset number of rounds. When it has, the stored training voice instructions of that user and the standard voice instructions matching them are obtained.
S404: obtain the sound features of the training voice instructions and of the standard voice instruction, respectively.
The standard voice instruction is speech information without an accent, while the corresponding training voice instructions carry the same specific text information as the standard voice instruction together with an accent; there is therefore a difference between their sound features.
In this embodiment, the sound features of the training voice instructions and of the standard voice instruction are extracted separately using a sound feature extraction method. Extraction can use a conventional acoustic model, such as a commonly used acoustic model built on a hidden Markov model or one built on a recurrent neural network.
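As one concrete (assumed) choice of sound feature extraction, MFCCs can be pulled from a waveform as below. The patent only requires "a sound feature extraction method", such as an HMM- or RNN-based acoustic front end, so librosa and MFCCs here are illustrative rather than prescribed.

```python
import librosa

def extract_sound_features(wav_path, n_mfcc=13):
    """Load audio and return a (frames, n_mfcc) MFCC matrix."""
    signal, sr = librosa.load(wav_path, sr=16000)        # mono, 16 kHz
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T                                        # frames along axis 0
```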
S406: determine the accent recognition correction coefficient corresponding to the training voice instructions according to the difference between the sound features of the training voice instructions and of the standard voice instruction.
Specifically, the difference between the sound features of the training voice instructions and of the standard voice instruction is analyzed, and the accent recognition correction coefficient corresponding to the training voice instructions is determined from the resulting difference coefficients, so that the coefficient can be used to optimize speech recognition results during recognition.
In one embodiment, as shown in FIG. 5, the accent-based speech recognition processing method further includes:
S502: receive a voice instruction to be recognized.
A voice instruction is a signal, carrying the text content of a control instruction, obtained by collecting the user's sound signal through the microphone array; the corresponding smart voice device can be controlled through it. It can be understood that the voice instruction to be recognized is the one that currently needs recognition. Taking a smart speaker as the electronic device, the voice instruction to be recognized may be a speech signal carrying the control instruction text "play", or one carrying the control instruction text "next track".
In this embodiment, when the user emits a sound signal within the receivable range of the electronic device's microphone array, the microphone array collects it to obtain the voice instruction to be recognized.
S504: analyze the voice instruction to obtain its sound features.
Specifically, the received voice instruction is analyzed through the acoustic model in the preset speech recognition algorithm, and the sound features of the speech signal are extracted. The preset speech recognition algorithm is a conventional one, such as a neural-network-based speech recognition algorithm or a DTW (Dynamic Time Warping)-based speech recognition algorithm.
S506: when the sound features match a stored accent feature, obtain the accent recognition correction coefficient corresponding to the matched accent feature.
An accent feature here is one that the electronic device obtained through accent training in correspondence with training voice instructions; it includes the sound features of the training voice instructions themselves, such as their timbre, pitch, and speaking rate, as well as the accent recognition correction coefficient used to correct voice instructions to be recognized.
Specifically, the sound features of the voice instruction to be recognized are matched against the sound features within the stored accent features to find the matching accent feature, and the accent recognition correction coefficient within that accent feature is then obtained.
S508: recognize the voice instruction according to the accent recognition correction coefficient to obtain a speech recognition result.
Specifically, the accent recognition correction coefficient is applied in the speech recognition algorithm to correct and recognize the voice instruction, yielding the speech recognition result. Because the accent recognition correction coefficient is a difference correction coefficient derived from the training voice instructions and the standard voice instruction, voice instructions carrying the corresponding accent can be effectively recognized based on it.
After the speech recognition result of the voice instruction to be recognized is obtained, the corresponding operation can be performed based on it. For a smart speaker, if the recognition result is the "play" instruction, the speaker is controlled to perform the playback operation.
In the accent-based speech recognition processing method above, the voice instruction to be recognized is analyzed to obtain its sound features; when these match a stored accent feature, the accent recognition correction coefficient corresponding to that accent feature is obtained, and the voice instruction is recognized according to it to obtain the recognition result. By fully accounting for the influence of accent features on recognition, matching the accent features of the voice instruction against stored accent features, obtaining the corresponding correction coefficient, and recognizing on that basis, the recognition result is optimized and recognition accuracy improved.
In one embodiment, recognizing the voice instruction according to the accent recognition correction coefficient to obtain a speech recognition result includes: correcting the voice instruction according to the accent recognition correction coefficient; and recognizing the corrected voice instruction to obtain the speech recognition result.
The accent recognition correction coefficient is a difference correction coefficient derived from the training voice instructions and the standard voice instruction; based on it, a correction relationship between the two can be established. Using this relationship and the coefficient, the received voice instruction is corrected, and the corrected instruction is then recognized with the preset speech recognition algorithm to obtain the recognition result.
Specifically, the accent recognition correction coefficient includes an accent coefficient and an error coefficient: a training voice instruction can be described as equivalent to the matched standard voice instruction multiplied by the accent coefficient, plus the error coefficient. Based on this relationship and the obtained accent coefficient and error coefficient, the voice instruction to be recognized can be corrected so that it conforms to the standard voice instruction as closely as possible.
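Written out, this stated relationship and its inversion are, with $\alpha$ the accent coefficient and $\varepsilon$ the error coefficient:

$$x_{\text{train}} = \alpha \, x_{\text{standard}} + \varepsilon \qquad\Longrightarrow\qquad \hat{x}_{\text{standard}} = \frac{x_{\text{observed}} - \varepsilon}{\alpha}$$

As a worked example with assumed values $\alpha = 1.1$ and $\varepsilon = 0.2$ on one feature dimension, an observed value of 2.4 is corrected to (2.4 - 0.2)/1.1 = 2.0 before being passed to the preset recognition algorithm.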
By correcting the voice instruction with the accent recognition correction coefficient so that it conforms to the standard voice instruction as closely as possible, and then recognizing the corrected instruction, the recognition result is optimized to a certain extent and recognition accuracy improved.
The accent-based speech recognition processing method of the present application is described below taking a smart speaker as an example. As shown in FIG. 6, the method includes the following steps:
S601: receive and recognize a preset number of training voice instructions to obtain the speech recognition result corresponding to each.
Specifically, when the smart speaker is in the standby or working state, it receives several consecutive training voice instructions collected by the microphone, for example three consecutive "随机模式" ("shuffle mode") voice instructions from the user. Each time a "随机模式" instruction is received, it is recognized and the result stored in the memory. Because of interference from accent features, the recognition results can hardly be fully accurate: for example, the first "随机模式" instruction may be recognized as the data corresponding to the near-homophone "谁机模式", the second as "随机模式", and the third as "随机么事". In other embodiments, the three consecutive instructions collected by the microphone may be different instructions, in which case the corresponding recognition results also differ. The number of received speech signals is checked; when it reaches the preset number of three, those signals are determined to be the training voice instructions, and it is then determined whether the preset training trigger condition is met.
S602: obtain the receiving duration of the training voice instructions.
Specifically, the receiving duration can be obtained by recording the time point of each received training voice instruction and computing from those points, or by starting a timer at the first received training voice instruction and stopping it at the last. For example, the time point of the first received "随机模式" instruction and that of the third are recorded, and the interval between them is taken as the receiving duration.
S603: when the receiving duration is less than or equal to the preset duration, trigger entry into the accent training state; otherwise, perform step S611.
Assuming the preset duration is 30 seconds, it is determined whether the receiving duration is at most 30 seconds. If so, the smart speaker's accent training state is triggered so that accent training can proceed. If the receiving duration exceeds the preset duration, it is determined whether there is a control instruction corresponding to the last received voice instruction, that is, whether its recognition result is identical to the text of a control instruction; if so, the corresponding operation is performed; otherwise, the accent training state is exited and the device switches back to the standby or working state preceding receipt of the training voice instructions. If there is no corresponding control instruction and the smart speaker was playing before receiving the training voice instructions, it switches back to the playing state and continues playing songs.
S604: compute the similarity among the speech recognition results.
In this embodiment, similarity is computed among the speech recognition results to determine whether it reaches the similarity threshold: for example, the similarity between the data corresponding to "谁机模式" and the data corresponding to "随机模式", between "谁机模式" and "随机么事", and between "随机模式" and "随机么事" are each computed.
S605: when the similarity among the speech recognition results reaches the similarity threshold, determine that the comparison result satisfies the consistency condition.
It is determined whether the similarity among the speech recognition results reaches the similarity threshold; if so, the results satisfy the consistency condition. For example, when the similarities between the data corresponding to "谁机模式" and "随机模式", between "谁机模式" and "随机么事", and between "随机模式" and "随机么事" all reach 99%, the comparison result is considered to satisfy the consistency condition.
S606: when the comparison result satisfies the consistency condition, fuzzy-match the speech recognition results against the standard voice instructions to obtain a candidate standard voice instruction; otherwise, perform step S611.
Specifically, when the comparison of the speech recognition results satisfies the consistency condition, the results are fuzzy-matched against the pre-stored standard voice instructions to obtain one fuzzily matching standard voice instruction as the candidate. If the consistency condition is not satisfied, the accent training state is exited and the device switches back to the standby or working state preceding receipt of the training voice instructions.
The smart speaker stores executable standard voice instructions, assumed to include the standard voice instruction "随机模式". When the speech recognition results satisfy the consistency condition, they are fuzzy-matched against the pre-stored standard voice instructions to obtain the fuzzily matching "随机模式" standard voice instruction, and "随机模式" is output as the candidate through the smart speaker, for example through its loudspeaker. If any of the three similarities is below 99%, the accent training state is exited and the speaker switches back to the playing state and continues playing songs.
S607: output the candidate standard voice instruction, where the output takes the form of a voice announcement.
S608: receive the user's feedback information on the candidate standard voice instruction.
S609: when the feedback information includes a result that the speech recognition results match the candidate standard voice instruction, determine the candidate to be the standard voice instruction matching the training voice instructions; otherwise, perform step S611.
The user's feedback information on the candidate is received and analyzed; when it includes a match result, the candidate is determined to be the standard voice instruction matching the voice instructions. It can be understood that a standard voice instruction matching the speech recognition results also matches the corresponding training voice instructions. When the feedback includes a non-match result, the accent training state is exited and the device switches back to the standby or working state preceding receipt of the voice instructions.
For example, the feedback may be the voice message "yes" or "no". When the smart speaker receives "yes" within a preset time after outputting the candidate, it determines the candidate "随机模式" to be the standard voice instruction matching the training voice instructions. If "no" is received, the accent training state is exited and the speaker switches back to the playing state and continues playing songs.
S610: store the training voice instructions in association with the matched standard voice information, then perform step S611.
The training voice instructions and the matched standard voice instruction are stored in association so that, when the condition for determining the correction coefficient of the training voice instructions is met, the stored training voice instructions and the matched standard voice instruction can be fetched to perform the accent recognition correction coefficient extraction step.
For example, the three received "随机模式" training voice instructions are stored in the smart speaker's memory in association with the "随机模式" standard voice instruction.
S611: exit the accent training state and perform the operation corresponding to the voice instruction, or switch to the standby or working state that preceded receipt of the voice instruction.
S612: when the accent feature determination condition is met, obtain the training voice instructions and the standard voice instructions matching them.
When the accent training of the same user has reached the preset number of rounds, that user's stored training voice instructions and the matching standard voice instructions are obtained. Suppose the smart speaker has performed seven rounds of accent training for the same user, with training voice instructions "播放" (play), "暂停" (pause), "关闭" (off), "待机" (standby), "下一首" (next track), "随机模式" (shuffle mode), and "顺序播放" (sequential playback); the seven training voice instructions and their matching standard voice instructions are obtained.
S613: obtain the sound features of the training voice instructions and of the standard voice instructions, respectively.
The sound features of the training voice instructions and of the standard voice instructions are extracted separately using a sound feature extraction method.
S614: determine the accent recognition correction coefficient corresponding to the accent feature according to the difference between the sound features of the training voice instructions and of the standard voice instructions.
Specifically, the difference between the sound features of the training voice instructions and of the standard voice instructions is analyzed, and the accent recognition correction coefficient of the training voice instructions is determined from the resulting difference coefficients, to be used to optimize recognition results during speech recognition.
S615: receive a voice instruction to be recognized.
When the user emits a sound signal within the receivable range of the electronic device's microphone array, the array collects the speech signal to be recognized; for example, the smart speaker collects the user's "单曲循环" (repeat-one) instruction through the microphone.
S616: analyze the voice instruction to obtain its sound features.
The received voice instruction is analyzed with the preset speech recognition algorithm and its sound features are extracted; for example, the received "单曲循环" instruction is analyzed to obtain accent features such as timbre, pitch, and speaking rate.
S617: when the sound features match a stored accent feature, obtain the accent recognition correction coefficient corresponding to the matched accent feature.
The smart voice device pre-stores accent features obtained through accent training, each including sound features and an accent recognition correction coefficient. The sound features of the voice instruction to be recognized are matched against the sound features within the stored accent features to obtain the matching accent feature, and the corresponding accent recognition correction coefficient is obtained.
S618: correct the voice instruction according to the accent recognition correction coefficient.
S619: recognize the corrected voice instruction to obtain the speech recognition result.
The accent recognition correction coefficient is a difference correction coefficient derived from the training voice instructions and the standard voice instructions; based on it, a correction relationship between the two can be established. Using this relationship and the coefficient, the received voice instruction is corrected, and the corrected instruction is then recognized with the preset speech recognition algorithm to obtain the recognition result. For example, the "单曲循环" instruction to be recognized is corrected using the obtained accent recognition correction coefficient and the corrected "单曲循环" instruction is then recognized to obtain the recognition result; correcting the accented "单曲循环" instruction before recognition ensures that it is recognized accurately.
The accent-based speech recognition processing method above fully accounts for the influence of accent features on recognition results: the sound features of the voice instruction to be recognized are matched against stored accent features, the accent recognition correction coefficient of the matching accent feature is obtained, and the voice instruction is recognized on that basis. Because the accent recognition correction coefficient is a difference correction coefficient derived from the training voice instructions and the standard voice instructions, voice instructions carrying the corresponding accent can be effectively recognized based on it.
In one embodiment, as shown in FIG. 7, an accent-based speech recognition processing apparatus is provided, including: a speech recognition module 702, a comparison module 704, a matching module 706, a standard instruction confirmation module 708, and an accent feature determination module 710.
The speech recognition module 702 is configured to receive and recognize a preset number of training voice instructions to obtain a speech recognition result corresponding to each training voice instruction.
In this embodiment, whenever the user emits a sound signal within the receivable range of the electronic device's microphone array, the array collects it to obtain a voice instruction; the speech recognition module 702 receives the voice instruction, recognizes it, and obtains and stores the corresponding speech recognition result. The number of received voice instructions is checked; when it reaches the preset number, those voice instructions are determined to be the training voice instructions. The recognition method is a preset speech recognition algorithm, a conventional algorithm such as a neural-network-based speech recognition algorithm or a DTW (Dynamic Time Warping)-based speech recognition algorithm.
The comparison module 704 is configured to, when the preset training trigger condition is met, trigger entry into the accent training state, and compare the speech recognition results of the training voice instructions to obtain a comparison result.
In this embodiment, when the preset number of training voice instructions has been received, it is determined whether the preset training trigger condition is met; if so, the accent training state is entered, and the stored speech recognition results of the training voice instructions are fetched and compared to determine whether they satisfy the consistency condition. The comparison result is the similarity among the speech recognition results. The consistency condition indicates whether the training voice instructions corresponding to the speech recognition results are the same voice instruction, that is, carry the same information, for example a preset number of "turn on" voice signals repeated by the same user. Specifically, the consistency condition is that the similarity among the speech recognition results reaches the similarity threshold. Performing accent training on repeated voice instructions ensures that the resulting accent feature adequately represents the user's accent.
The matching module 706 is configured to, when the comparison result satisfies the consistency condition, fuzzy-match the speech recognition results against standard voice instructions to obtain a candidate standard voice instruction.
When the comparison result satisfies the consistency condition, the matching module 706 fuzzy-matches the speech recognition results against the pre-stored standard voice instructions, and determines the standard voice instruction matching the training voice instructions based on the matching result.
The standard instruction confirmation module 708 is configured to confirm the candidate standard voice instruction and determine the standard voice information matching the training voice instructions.
Specifically, the candidate standard voice instruction is confirmed using a preset confirmation method; when the candidate is confirmed to be the same as the training voice instructions, it is taken as the standard voice instruction matching them. The preset confirmation method can be confirmation based on user feedback, or confirmation based on a configured automatic confirmation rule, for example that the candidate is considered the same as the training voice instructions when the similarity between the two reaches a preset value.
The accent feature determination module 710 is configured to determine the accent feature of the training voice instructions from the training voice instructions and the matched standard voice instruction; the accent feature is used to correct and recognize a to-be-recognized voice instruction carrying the corresponding accent feature.
Specifically, the accent feature determination module 710 performs a difference analysis on the training voice instructions and the matched standard voice instruction to determine the accent feature of the training voice instructions, so that in subsequent speech recognition the accent feature can be applied in the speech recognition algorithm to correct and recognize voice instructions and obtain the recognition result. Because the accent feature derives from the difference between the training voice instructions and the standard voice instruction, voice instructions carrying the corresponding accent can be effectively recognized based on it.
The accent-based speech recognition processing apparatus above receives and recognizes a preset number of training voice instructions to obtain a recognition result for each. When the preset training trigger condition is met, it enters the accent training state and compares the results; when the comparison satisfies the consistency condition, it fuzzy-matches the results against standard voice information to obtain a candidate standard voice instruction and confirms it, determining the standard voice instruction matching the training voice instructions. It then determines, from the training voice instructions and the matched standard voice instruction, the accent feature used to correct and recognize voice instructions to be recognized. By fully accounting for the influence of accent features on recognition results and obtaining accent features through accent training, to-be-recognized voice instructions are corrected and recognized based on the accent feature, optimizing recognition results and improving recognition accuracy.
Further, the comparison module 704 includes a trigger module and a comparison execution module.
The trigger module is configured to obtain the receiving duration of the training voice instructions, and to trigger entry into the accent training state when the receiving duration is less than or equal to the preset duration.
Specifically, the receiving duration can be obtained by recording the time point of each received training voice instruction and computing from those points, or with a timer started at the first and stopped at the last received training voice instruction. Whether the receiving duration is at most the preset duration is determined; if so, the accent training state is triggered so that accent training can proceed. It can be understood that, when the receiving duration exceeds the preset duration, the device switches back to the standby or working state preceding receipt of the training voice instructions.
The comparison execution module is configured to compare the speech recognition results of the training voice instructions to obtain the comparison result. Specifically, the stored speech recognition results of the training voice instructions are fetched and compared to determine whether they satisfy the consistency condition.
In one embodiment, the comparison execution module further includes a similarity computation module configured to compute the similarity among the speech recognition results, and a consistency determination module configured to determine that the comparison result satisfies the consistency condition when the similarity among the results reaches the similarity threshold.
Further, the matching module 706 includes an output module configured to output the candidate standard voice instruction, and a feedback determination module configured to determine the standard voice instruction matching the training voice instructions according to the user's feedback on the candidate.
Specifically, when the comparison result satisfies the consistency condition, the speech recognition results are fuzzy-matched against the pre-stored standard voice instructions to obtain one fuzzily matching standard voice instruction, which is output as the candidate. When the user obtains the candidate through the output information, the user judges whether it is the standard voice information matching the training voice instructions, that is, whether it carries the same text content, and if so feeds back confirmation information; based on the confirmation information, the feedback determination module determines the candidate to be the standard voice information matching the training voice instructions.
In a specific embodiment, the feedback determination module is further configured to receive the user's feedback information on the candidate, and to determine the candidate to be the standard voice instruction matching the training voice instructions when the feedback information includes a result that the speech recognition results match the candidate.
The feedback determination module receives and analyzes the user's feedback information on the candidate; when it includes a match result, the candidate is determined to be the standard voice instruction matching the training voice instructions. It can be understood that a standard voice instruction matching the speech recognition results also matches the corresponding training voice instructions.
By fuzzy-matching the speech recognition results of the training voice instructions against standard voice instructions to obtain a candidate, and then having the user confirm the matching result, the accuracy of the matching result is improved and correct matching between the training voice instructions and the corresponding standard voice instruction is ensured.
Further, the accent feature determination module 710 includes a signal acquisition module, a sound feature module, and a coefficient determination module, where:
the signal acquisition module is configured to obtain, when the accent feature determination condition is met, the training voice instructions and the standard voice instruction matching them.
Specifically, when the accent training of the same user has reached the preset number of rounds, the signal acquisition module obtains that user's stored training voice instructions and the standard voice instructions matching them.
The sound feature module is configured to obtain the sound features of the training voice instructions and of the standard voice instruction, respectively.
Specifically, the sound feature module extracts the sound features of the training voice instructions and of the standard voice instruction separately using a sound feature extraction method.
The coefficient determination module is configured to determine the accent recognition correction coefficient corresponding to the training voice instructions according to the difference between the sound features of the training voice instructions and of the standard voice instruction.
The coefficient determination module analyzes the difference between those sound features and determines the accent recognition correction coefficient corresponding to the training voice instructions from the resulting difference coefficients, so that the coefficient can be used to optimize speech recognition results during recognition.
In one embodiment, the accent-based speech recognition processing apparatus further includes a storage module configured to store the training voice instructions in association with the matching standard voice instruction, so that when the condition for determining the correction coefficient of the training voice instructions is met, the stored training voice instructions and the matched standard voice instruction can be obtained to perform the accent feature determination operation.
Further, the accent-based speech recognition processing apparatus also includes a state switching module configured to exit the accent training state and switch to the standby or working state that preceded receipt of the training voice instructions.
In one embodiment, the accent-based speech recognition processing apparatus further includes a correction coefficient acquisition module and a correction recognition module.
In this embodiment, the speech recognition module is further configured to receive a voice instruction to be recognized and analyze it to obtain sound features.
Specifically, the speech recognition module receives the voice instruction to be recognized and analyzes it through the acoustic model in the preset speech recognition algorithm, extracting its sound features. The preset speech recognition algorithm is a conventional one, such as a neural-network-based speech recognition algorithm or a DTW (Dynamic Time Warping)-based speech recognition algorithm.
The correction coefficient acquisition module is configured to obtain, when the sound features match a stored accent feature, the accent recognition correction coefficient corresponding to the matched accent feature.
The accent-based speech recognition processing apparatus pre-stores accent features obtained through accent training, each including an accent recognition correction coefficient. The sound features of the voice instruction to be recognized are matched against the stored accent features; when they match, the correction coefficient acquisition module obtains the accent recognition correction coefficient corresponding to the matched accent feature.
The correction recognition module is configured to recognize the voice instruction according to the accent recognition correction coefficient to obtain a speech recognition result.
The correction recognition module applies the accent recognition correction coefficient in the speech recognition algorithm to correct and recognize the voice instruction and obtain the recognition result. Because the coefficient is a difference correction coefficient derived from the training voice instructions and the standard voice instruction, speech signals carrying the corresponding accent can be effectively recognized based on it.
In one embodiment, the correction recognition module is further configured to correct the voice instruction according to the accent recognition correction coefficient, and to recognize the corrected voice instruction to obtain the speech recognition result.
The accent recognition correction coefficient is a difference correction coefficient derived from the training voice instructions and the standard voice instruction; based on it, a correction relationship between the two can be established. Using this relationship and the coefficient, the received voice instruction is corrected, and the corrected instruction is then recognized with the preset speech recognition algorithm to obtain the recognition result.
By correcting the voice instruction with the accent recognition correction coefficient so that it conforms to the standard voice instruction as closely as possible, and then recognizing the corrected instruction, the recognition result is optimized to a certain extent and recognition accuracy improved.
For specific limitations of the accent-based speech recognition processing apparatus, refer to the limitations of the accent-based speech recognition processing method above, which are not repeated here. The modules of the apparatus may be implemented in whole or in part by software, hardware, or a combination thereof; they may be embedded in hardware form in or independent of the processor of a computer device, or stored in software form in the memory of the computer device, so that the processor can invoke them and perform the operations corresponding to each module.
In one embodiment, an electronic device is provided whose internal structure may be as shown in FIG. 8. The electronic device includes a processor, a memory, a network interface, a display screen, an input device, and a microphone array connected through a system bus. The processor of the electronic device provides computing and control capabilities. The memory includes a non-volatile storage medium and an internal memory; the non-volatile storage medium stores an operating system and a computer program, and the internal memory provides an environment for running them. The network interface communicates with external terminals over a network connection. The computer program, when executed by the processor, implements a speech recognition method. The display screen may be a liquid crystal display screen or an electronic ink display screen; the input device may be a touch layer covering the display screen, a button, trackball, or touchpad provided on the casing of the electronic device, or an external keyboard, touchpad, or mouse.
A person skilled in the art can understand that the structure shown in FIG. 8 is only a block diagram of part of the structure related to the solution of the present application and does not limit the electronic devices to which the solution applies; a specific electronic device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
In one embodiment, an electronic device is provided, including a memory and a processor, the memory storing computer-readable instructions which, when executed by the processor, cause the processor to perform the following steps:
receiving and recognizing a preset number of training voice instructions to obtain a speech recognition result corresponding to each training voice instruction;
when the preset training trigger condition is met, triggering entry into the accent training state, and comparing the speech recognition results of the training voice instructions to obtain a comparison result;
when the comparison result satisfies the consistency condition, performing fuzzy matching between the speech recognition results and standard voice instructions to obtain a candidate standard voice instruction;
confirming the candidate standard voice instruction to determine the standard voice instruction matching the training voice instructions; and
determining the accent feature of the training voice instructions according to the training voice instructions and the matched standard voice instruction, the accent feature being used to correct and recognize a to-be-recognized voice instruction carrying the corresponding accent feature.
In one embodiment, the computer-readable instructions further cause the processor to perform the following steps:
obtaining the receiving duration of the training voice instructions;
when the receiving duration is less than or equal to the preset duration, triggering entry into the accent training state; and
comparing the speech recognition results of the training voice instructions to obtain the comparison result.
In one embodiment, the computer-readable instructions further cause the processor to perform the following steps:
outputting the candidate standard voice instruction; and
determining the standard voice instruction matching the training voice instructions according to the user's feedback on the candidate standard voice instruction.
In one embodiment, the computer-readable instructions further cause the processor to perform the following steps:
receiving the user's feedback information on the candidate standard voice instruction; and
when the feedback information includes a result that the speech recognition results match the candidate standard voice instruction, determining the candidate to be the standard voice instruction matching the training voice instructions.
In one embodiment, the computer-readable instructions further cause the processor to perform the following steps:
when the preset training trigger condition is met, triggering entry into the accent training state;
computing the similarity among the speech recognition results; and
when the similarity among the speech recognition results reaches the similarity threshold, determining that the comparison result satisfies the consistency condition.
In one embodiment, the computer-readable instructions further cause the processor to perform the following step:
storing the training voice instructions in association with the standard voice instruction matching the training voice instructions.
In one embodiment, the computer-readable instructions further cause the processor to perform the following step:
exiting the accent training state, and switching to the standby or working state preceding receipt of the training voice instructions.
In one embodiment, the computer-readable instructions further cause the processor to perform the following steps:
when the accent feature determination condition is met, obtaining the training voice instructions and the standard voice instruction matching them;
obtaining the sound features of the training voice instructions and of the standard voice instruction, respectively; and
determining the accent recognition correction coefficient corresponding to the training voice instructions according to the difference between the sound features of the training voice instructions and of the standard voice instruction.
In one embodiment, the computer-readable instructions further cause the processor to perform the following steps:
receiving a voice instruction to be recognized;
analyzing the voice instruction to obtain sound features;
when the sound features match a stored accent feature, obtaining the accent recognition correction coefficient corresponding to the matched accent feature; and
recognizing the voice instruction according to the accent recognition correction coefficient to obtain a speech recognition result.
In one embodiment, the computer-readable instructions further cause the processor to perform the following steps:
correcting the voice instruction according to the accent recognition correction coefficient; and
recognizing the corrected voice instruction to obtain the speech recognition result.
In one embodiment, one or more non-volatile storage media storing computer-readable instructions are provided; when executed by one or more processors, the computer-readable instructions cause the one or more processors to perform the following steps:
receiving and recognizing a preset number of training voice instructions to obtain a speech recognition result corresponding to each training voice instruction;
when the preset training trigger condition is met, triggering entry into the accent training state, and comparing the speech recognition results of the training voice instructions to obtain a comparison result;
when the comparison result satisfies the consistency condition, performing fuzzy matching between the speech recognition results and standard voice instructions to obtain a candidate standard voice instruction;
confirming the candidate standard voice instruction to determine the standard voice instruction matching the training voice instructions; and
determining the accent feature of the training voice instructions according to the training voice instructions and the matched standard voice instruction, the accent feature being used to correct and recognize a to-be-recognized voice instruction carrying the corresponding accent feature.
In one embodiment, when executed by one or more processors, the computer-readable instructions cause the one or more processors to perform the following steps:
obtaining the receiving duration of the training voice instructions;
when the receiving duration is less than or equal to the preset duration, triggering entry into the accent training state; and
comparing the speech recognition results of the training voice instructions to obtain the comparison result.
In one embodiment, when executed by one or more processors, the computer-readable instructions cause the one or more processors to perform the following steps:
outputting the candidate standard voice instruction; and
determining the standard voice instruction matching the training voice instructions according to the user's feedback on the candidate standard voice instruction.
In one embodiment, when executed by one or more processors, the computer-readable instructions cause the one or more processors to perform the following steps:
receiving the user's feedback information on the candidate standard voice instruction; and
when the feedback information includes a result that the speech recognition results match the candidate standard voice instruction, determining the candidate to be the standard voice instruction matching the training voice instructions.
In one embodiment, when executed by one or more processors, the computer-readable instructions cause the one or more processors to perform the following steps:
when the preset training trigger condition is met, triggering entry into the accent training state;
computing the similarity among the speech recognition results; and
when the similarity among the speech recognition results reaches the similarity threshold, determining that the comparison result satisfies the consistency condition.
In one embodiment, when executed by one or more processors, the computer-readable instructions cause the one or more processors to perform the following step:
storing the training voice instructions in association with the standard voice instruction matching the training voice instructions.
In one embodiment, when executed by one or more processors, the computer-readable instructions cause the one or more processors to perform the following step:
exiting the accent training state, and switching to the standby or working state preceding receipt of the training voice instructions.
In one embodiment, when executed by one or more processors, the computer-readable instructions cause the one or more processors to perform the following steps:
when the accent feature determination condition is met, obtaining the training voice instructions and the standard voice instruction matching them;
obtaining the sound features of the training voice instructions and of the standard voice instruction, respectively; and
determining the accent recognition correction coefficient corresponding to the training voice instructions according to the difference between the sound features of the training voice instructions and of the standard voice instruction.
In one embodiment, when executed by one or more processors, the computer-readable instructions cause the one or more processors to perform the following steps:
receiving a voice instruction to be recognized;
analyzing the voice instruction to obtain sound features;
when the sound features match a stored accent feature, obtaining the accent recognition correction coefficient corresponding to the matched accent feature; and
recognizing the voice instruction according to the accent recognition correction coefficient to obtain a speech recognition result.
In one embodiment, when executed by one or more processors, the computer-readable instructions cause the one or more processors to perform the following steps:
correcting the voice instruction according to the accent recognition correction coefficient; and
recognizing the corrected voice instruction to obtain the speech recognition result.
It should be understood that the steps in the embodiments of the present application are not necessarily performed sequentially in the order indicated by the step numbers. Unless explicitly stated herein, there is no strict ordering restriction on the execution of these steps, and they can be performed in other orders. Moreover, at least some of the steps in each embodiment may include multiple sub-steps or stages; these are not necessarily completed at the same moment but may be performed at different times, and their execution order is not necessarily sequential, as they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
A person of ordinary skill in the art can understand that all or part of the processes in the method embodiments above can be implemented by a computer program instructing the relevant hardware. The program can be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the method embodiments above. Any reference to memory, storage, database, or other media used in the embodiments provided in the present application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features contains no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention patent. It should be noted that a person of ordinary skill in the art can make several variations and improvements without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (20)

  1. An accent-based speech recognition processing method, wherein the method comprises:
    receiving and recognizing a preset number of training voice instructions to obtain a speech recognition result corresponding to each of the training voice instructions;
    when a preset training trigger condition is met, triggering entry into an accent training state, and comparing the speech recognition results of the training voice instructions to obtain a comparison result;
    when the comparison result satisfies a consistency condition, performing fuzzy matching between the speech recognition results and standard voice instructions to obtain a candidate standard voice instruction;
    confirming the candidate standard voice instruction to determine a standard voice instruction matching the training voice instructions; and
    determining an accent feature of the training voice instructions according to the training voice instructions and the matched standard voice instruction, wherein the accent feature is used to correct and recognize a to-be-recognized voice instruction carrying the corresponding accent feature.
  2. The method according to claim 1, wherein the triggering entry into an accent training state when a preset training trigger condition is met, and comparing the speech recognition results of the training voice instructions to obtain a comparison result, comprises:
    obtaining a receiving duration of the training voice instructions;
    when the receiving duration is less than or equal to a preset duration, triggering entry into the accent training state; and
    comparing the speech recognition results of the training voice instructions to obtain the comparison result.
  3. The method according to claim 1, wherein the confirming the candidate standard voice instruction to determine a standard voice instruction matching the training voice instructions comprises:
    outputting the candidate standard voice instruction; and
    determining the standard voice instruction matching the training voice instructions according to a user's feedback on the candidate standard voice instruction.
  4. The method according to claim 3, wherein the determining the standard voice instruction matching the training voice instructions according to a user's feedback on the candidate standard voice instruction comprises:
    receiving the user's feedback information on the candidate standard voice instruction; and
    when the feedback information includes a result that the speech recognition results match the candidate standard voice instruction, determining the candidate standard voice instruction to be the standard voice instruction matching the training voice instructions.
  5. The method according to claim 1, wherein the triggering entry into an accent training state when a preset training trigger condition is met, and comparing the speech recognition results of the training voice instructions to obtain a comparison result, comprises:
    when the preset training trigger condition is met, triggering entry into the accent training state;
    performing similarity computation on the speech recognition results to obtain the similarity among the speech recognition results; and
    when the similarity among the speech recognition results reaches a similarity threshold, determining that the comparison result satisfies the consistency condition.
  6. The method according to claim 1, wherein after the confirming the candidate standard voice instruction to determine a standard voice instruction matching the training voice instructions, the method further comprises:
    storing the training voice instructions in association with the standard voice instruction matching the training voice instructions.
  7. The method according to claim 6, wherein after the storing the training voice instructions in association with the standard voice instruction corresponding to the training voice instructions, the method further comprises:
    exiting the accent training state, and switching to a standby or working state preceding receipt of the training voice instructions.
  8. The method according to claim 1, wherein the accent feature comprises sound features of the training voice instructions and an accent recognition correction coefficient, and the determining an accent feature of the training voice instructions according to the training voice instructions and the matched standard voice instruction comprises:
    when an accent feature determination condition is met, obtaining the training voice instructions and the standard voice instruction matching the training voice instructions;
    obtaining the sound features of the training voice instructions and of the standard voice instruction, respectively; and
    determining the accent recognition correction coefficient corresponding to the training voice instructions according to the difference between the sound features of the training voice instructions and of the standard voice instruction.
  9. The method according to claim 8, wherein after the determining an accent feature of the training voice instructions according to the training voice instructions and the matched standard voice instruction, the method further comprises:
    receiving a voice instruction to be recognized;
    analyzing the voice instruction to obtain sound features;
    when the sound features match a stored accent feature, obtaining the accent recognition correction coefficient corresponding to the matched accent feature; and
    recognizing the voice instruction according to the accent recognition correction coefficient to obtain a speech recognition result.
  10. The method according to claim 9, wherein the recognizing the voice instruction according to the accent recognition correction coefficient to obtain a speech recognition result comprises:
    correcting the voice instruction according to the accent recognition correction coefficient; and
    recognizing the corrected voice instruction to obtain the speech recognition result.
  11. An electronic device, comprising a memory and a processor, the memory storing computer-readable instructions, wherein the computer-readable instructions, when executed by the processor, cause the processor to perform the following steps:
    receiving and recognizing a preset number of training voice instructions to obtain a speech recognition result corresponding to each of the training voice instructions;
    when a preset training trigger condition is met, triggering entry into an accent training state, and comparing the speech recognition results of the training voice instructions to obtain a comparison result;
    when the comparison result satisfies a consistency condition, performing fuzzy matching between the speech recognition results and standard voice instructions to obtain a candidate standard voice instruction;
    confirming the candidate standard voice instruction to determine a standard voice instruction matching the training voice instructions; and
    determining an accent feature of the training voice instructions according to the training voice instructions and the matched standard voice instruction, wherein the accent feature is used to correct and recognize a to-be-recognized voice instruction carrying the corresponding accent feature.
  12. The electronic device according to claim 11, wherein the computer-readable instructions further cause the processor to perform the following steps:
    obtaining a receiving duration of the training voice instructions;
    when the receiving duration is less than or equal to a preset duration, triggering entry into the accent training state; and
    comparing the speech recognition results of the training voice instructions to obtain the comparison result.
  13. The electronic device according to claim 11, wherein the computer-readable instructions further cause the processor to perform the following steps:
    outputting the candidate standard voice instruction; and
    determining the standard voice instruction matching the training voice instructions according to a user's feedback on the candidate standard voice instruction.
  14. The electronic device according to claim 11, wherein the computer-readable instructions further cause the processor to perform the following steps:
    when an accent feature determination condition is met, obtaining the training voice instructions and the standard voice instruction matching the training voice instructions;
    obtaining the sound features of the training voice instructions and of the standard voice instruction, respectively; and
    determining the accent recognition correction coefficient corresponding to the training voice instructions according to the difference between the sound features of the training voice instructions and of the standard voice instruction.
  15. The electronic device according to claim 14, wherein the computer-readable instructions further cause the processor to perform the following steps:
    receiving a voice instruction to be recognized;
    analyzing the voice instruction to obtain sound features;
    when the sound features match a stored accent feature, obtaining the accent recognition correction coefficient corresponding to the accent feature; and
    recognizing the voice instruction according to the accent recognition correction coefficient to obtain a speech recognition result.
  16. One or more non-volatile storage media storing computer-readable instructions, wherein the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
    receiving and recognizing a preset number of training voice instructions to obtain a speech recognition result corresponding to each of the training voice instructions;
    when a preset training trigger condition is met, triggering entry into an accent training state, and comparing the speech recognition results of the training voice instructions to obtain a comparison result;
    when the comparison result satisfies a consistency condition, performing fuzzy matching between the speech recognition results and standard voice instructions to obtain a candidate standard voice instruction;
    confirming the candidate standard voice instruction to determine a standard voice instruction matching the training voice instructions; and
    determining an accent feature of the training voice instructions according to the training voice instructions and the matched standard voice instruction, wherein the accent feature is used to correct and recognize a to-be-recognized voice instruction carrying the corresponding accent feature.
  17. The storage media according to claim 16, wherein the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
    obtaining a receiving duration of the training voice instructions;
    when the receiving duration is less than or equal to a preset duration, triggering entry into the accent training state; and
    comparing the speech recognition results of the training voice instructions to obtain the comparison result.
  18. The storage media according to claim 16, wherein the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
    outputting the candidate standard voice instruction; and
    determining the standard voice instruction matching the training voice instructions according to a user's feedback on the candidate standard voice instruction.
  19. The storage media according to claim 16, wherein the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
    when an accent feature determination condition is met, obtaining the training voice instructions and the standard voice instruction matching the training voice instructions;
    obtaining the sound features of the training voice instructions and of the standard voice instruction, respectively; and
    determining the accent recognition correction coefficient corresponding to the training voice instructions according to the difference between the sound features of the training voice instructions and of the standard voice instruction.
  20. The storage media according to claim 19, wherein the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
    receiving a voice instruction to be recognized;
    analyzing the voice instruction to obtain sound features;
    when the sound features match a stored accent feature, obtaining the accent recognition correction coefficient corresponding to the accent feature; and
    recognizing the voice instruction according to the accent recognition correction coefficient to obtain a speech recognition result.
PCT/CN2018/096131 2018-07-18 2018-07-18 Accent-based speech recognition processing method, electronic device and storage medium WO2020014890A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2018/096131 WO2020014890A1 (zh) 2018-07-18 2018-07-18 Accent-based speech recognition processing method, electronic device and storage medium
CN201880000936.0A CN109074804B (zh) 2018-07-18 2018-07-18 Accent-based speech recognition processing method, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/096131 WO2020014890A1 (zh) 2018-07-18 2018-07-18 Accent-based speech recognition processing method, electronic device and storage medium

Publications (1)

Publication Number Publication Date
WO2020014890A1 true WO2020014890A1 (zh) 2020-01-23

Family

ID=64789402

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/096131 WO2020014890A1 (zh) 2018-07-18 2018-07-18 Accent-based speech recognition processing method, electronic device and storage medium

Country Status (2)

Country Link
CN (1) CN109074804B (zh)
WO (1) WO2020014890A1 (zh)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109686362B * 2019-01-02 2021-04-02 Baidu Online Network Technology (Beijing) Co., Ltd. Voice broadcast method and apparatus, and computer-readable storage medium
CN109767775A * 2019-02-26 2019-05-17 Gree Electric Appliances, Inc. of Zhuhai Voice control method and apparatus, and air conditioner
CN110211609A * 2019-06-03 2019-09-06 Sichuan Changhong Electric Co., Ltd. Method for improving speech recognition accuracy
CN110299139A * 2019-06-29 2019-10-01 Lenovo (Beijing) Co., Ltd. Voice control method and apparatus, and electronic device
CN112770154A * 2021-01-19 2021-05-07 Shenzhen Ximi Communication Co., Ltd. Smart set-top box with voice interaction function and interaction method thereof
CN112967717B * 2021-03-01 2023-08-22 Zhengzhou Railway Vocational & Technical College High-accuracy fuzzy matching training method for English speech translation


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1162365A (zh) * 1994-11-01 1997-10-15 British Telecommunications plc Speech recognition
CN106663422A (zh) * 2014-07-24 2017-05-10 Harman International Industries, Incorporated Text rule based multi-accent speech recognition with single acoustic model and automatic accent detection
CN106548774A (zh) * 2015-09-18 2017-03-29 Samsung Electronics Co., Ltd. Apparatus and method for speech recognition, and apparatus and method for training transformation parameters
CN106131173A (zh) * 2016-07-01 2016-11-16 Beijing Qihoo Technology Co., Ltd. Mobile terminal, and remote assistance method and apparatus for a mobile terminal
CN106875942A (zh) * 2016-12-28 2017-06-20 Institute of Automation, Chinese Academy of Sciences Acoustic model adaptation method based on accent bottleneck features
CN107146607A (zh) * 2017-04-10 2017-09-08 Beijing Orion Star Technology Co., Ltd. Method, apparatus and system for correcting interaction information of a smart device
CN107065679A (zh) * 2017-05-15 2017-08-18 Foshan Shunde Midea Washing Appliances Manufacturing Co., Ltd. Dishwasher, and control apparatus and control method therefor
CN108053823A (zh) * 2017-11-28 2018-05-18 Guangxi Vocational & Technical College Speech recognition system and method

Also Published As

Publication number Publication date
CN109074804B (zh) 2021-04-06
CN109074804A (zh) 2018-12-21

Similar Documents

Publication Publication Date Title
WO2020014890A1 Accent-based speech recognition processing method, electronic device and storage medium
US11900948B1 (en) Automatic speaker identification using speech recognition features
US11776540B2 (en) Voice control of remote device
US11887582B2 (en) Training and testing utterance-based frameworks
US10600414B1 (en) Voice control of remote device
US20200258506A1 (en) Domain and intent name feature identification and processing
KR102339594B1 Object recognition method, computer device, and computer-readable storage medium
US10593328B1 (en) Voice control of remote device
CN109643549B Speech recognition method and apparatus based on speaker recognition
CN110136692B Speech synthesis method and apparatus, device, and storage medium
Anguera et al. Speaker diarization: A review of recent research
US20140129222A1 (en) Speech recognition system, recognition dictionary registration system, and acoustic model identifier series generation apparatus
WO2023207472A1 Audio synthesis method, electronic device, and readable storage medium
US11881224B2 (en) Multilingual speech recognition and translation method and related system for a conference which determines quantity of attendees according to their distances from their microphones
WO2019228135A1 Method and apparatus for adjusting a matching threshold, storage medium, and electronic device
KR20190119521A Electronic device and operation method thereof
CN112163084B Question feedback method and apparatus, medium, and electronic device
CN113192530B Model training and mouth movement parameter acquisition method and apparatus, device, and medium
CN115700871A Model training and speech synthesis method and apparatus, device, and medium
CN112712793A ASR error correction method based on a pre-trained model in voice interaction, and related device
CN111798849A Robot instruction recognition method and apparatus, electronic device, and storage medium
CN112802465A Voice control method and system
CN112908308B Audio processing method and apparatus, device, and medium
KR20230122470A Virtual AI model production and AI voice matching system
TW201618076A Voice control system and method thereof

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18926769

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18926769

Country of ref document: EP

Kind code of ref document: A1