WO2023207149A1 - A speech recognition method and electronic device - Google Patents

A speech recognition method and electronic device

Info

Publication number
WO2023207149A1
WO2023207149A1, PCT/CN2022/140339, CN2022140339W
Authority
WO
WIPO (PCT)
Prior art keywords: wake, word, terminal, speech, audio data
Prior art date
Application number
PCT/CN2022/140339
Other languages
English (en)
French (fr)
Inventor
陆彩霞
Original Assignee
荣耀终端有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 荣耀终端有限公司 (Honor Device Co., Ltd.)
Publication of WO2023207149A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033 Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 Prosody rules derived from text; Stress or intonation
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command

Definitions

  • the present application relates to the field of terminals, and in particular, to a speech recognition method and electronic device.
  • Users can set personalized wake-up words on mobile phones and other terminal devices according to their own needs; these are referred to as custom wake-up words.
  • A custom wake-up word lacks training samples covering various volumes, noises, and emotions, so the reliability of recognizing a custom wake-up word is much lower than that of the default wake-up word.
  • This application provides a speech recognition method.
  • In this method, terminal devices such as mobile phones can use a speech synthesizer to synthesize speech samples whose content is the custom wake-up word in various scenarios. Then, using these speech samples, the terminal device can optimize the currently used custom wake-up word recognition model, turning it into a recognition model that can recognize the custom wake-up word in various scenarios.
  • In a first aspect, this application provides a speech recognition method.
  • The method includes: determining a first wake-up word, which is set by the user; synthesizing a speech sample according to the first wake-up word and preset control parameters, where the speech sample is audio data whose speech content includes the first wake-up word, and the control parameters are used to control the speaking manner and/or speaking scene reflected in the synthesized speech sample; and using the synthesized speech sample to train the first speech recognition model to obtain a second speech recognition model.
  • The first speech recognition model is the speech recognition model used to recognize the first wake-up word before training.
  • The second speech recognition model is the speech recognition model used to recognize the first wake-up word after training.
  • Implementing the above method, a terminal device such as a mobile phone can receive a custom wake-up word input by the user and synthesize voice samples containing the custom wake-up word in various scenarios. The terminal device can then use these voice samples to optimize the currently used custom wake-up word recognition model, improving its recognition accuracy so that the optimized model can recognize the custom wake-up word spoken by the user in any background environment.
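  • As a non-limiting illustration of the flow described in the first aspect, the following Python-style sketch outlines the steps; the function names synthesize_speech and train_model, and the structure of the control parameters, are hypothetical placeholders rather than interfaces defined by this application.

      # Hypothetical sketch only; none of these names are defined by the application.
      def build_second_model(first_wake_word, control_parameters, first_model):
          # Synthesize speech samples whose content includes the first wake-up word,
          # with speaking manner/scene controlled by the preset control parameters.
          samples = [synthesize_speech(first_wake_word, params)
                     for params in control_parameters]
          # Train the first speech recognition model with the synthesized samples
          # to obtain the second speech recognition model.
          second_model = train_model(first_model, samples)
          return second_model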
  • The method further includes: determining that the audio data that successfully wakes up the terminal device, among the audio data collected by the microphone, is valid audio data; optimizing the second speech recognition model using the valid audio data and the synthesized speech samples to obtain a third speech recognition model; and using the third speech recognition model to process the audio data collected by the microphone.
  • When implementing custom wake-up word detection, the terminal device can also determine the environmental audio that includes the custom wake-up word and successfully wakes up the terminal device to be valid audio data.
  • The terminal device can then use the valid audio data and the synthesized speech samples to optimize the currently used speech recognition model, thereby periodically updating the speech recognition model, improving its recognition effect, and improving the user experience.
  • The method further includes: determining that the audio data that successfully wakes up the terminal device, among the audio data collected by the microphone, is valid audio data; optimizing the second speech recognition model using the valid audio data to obtain the third speech recognition model; and using the third speech recognition model to process the audio data collected by the microphone.
  • When implementing custom wake-up word detection, the terminal device can also determine the environmental audio that includes the custom wake-up word and successfully wakes up the terminal device to be valid audio data. The terminal device can then use the valid audio data to optimize the currently used speech recognition model, thereby periodically updating the speech recognition model, improving its recognition effect, and improving the user experience.
  • Before optimizing the second speech recognition model using the valid audio data and synthesized speech samples, the method further includes: confirming that the amount of valid audio data is greater than or equal to a first quantity threshold, where the first quantity threshold is preset.
  • The terminal device can accumulate valid audio data. Only after the amount of accumulated valid audio data reaches the preset first quantity threshold does the terminal device use the valid audio data and the synthesized speech samples to optimize the second speech recognition model, which avoids the waste of computing resources caused by immediately updating the currently used speech recognition model every time a piece of valid audio data is determined.
  • Before using the valid audio data to optimize the second speech recognition model, the method further includes: confirming that the amount of valid audio data is greater than or equal to a second quantity threshold, where the second quantity threshold is preset.
  • The terminal device can accumulate valid audio data. Only after the amount of accumulated valid audio data reaches the preset second quantity threshold does the terminal device use the valid audio data to optimize the second speech recognition model, which avoids the waste of computing resources caused by immediately updating the currently used speech recognition model every time a piece of valid audio data is determined.
  • Before optimizing the second speech recognition model, the method further includes: confirming that the current moment is within a preset update time range.
  • In this way, the terminal device can avoid updating the speech recognition model while the user is using the device, preventing overload that could cause system freezes or abnormalities and degrade the user experience.
  • The control parameters include prosodic features; the prosodic features are used to control the speaker's speaking manner in the synthesized speech sample, and the speaking manner includes one or more of the following: the speaker's emotions and pauses when speaking.
  • When the terminal device synthesizes a speech sample, it can control the speaking style and speaking situation of the speaker in the synthesized sample through prosodic features, to simulate audio of the speaker speaking the custom wake-up word in various emotional states.
  • Synthesizing a speech sample according to the first wake-up word and preset control parameters specifically includes: inputting the first wake-up word and preset prosodic features into a speech synthesizer, and using the speech synthesizer to synthesize N speech samples, N ≥ 1.
  • The method further includes: sequentially performing data enhancement processing on the N speech samples to obtain M speech samples, where M > N.
  • In this way, the terminal device can further expand the multiple synthesized speech samples through data enhancement processing to obtain a larger number of speech samples. There are slight differences in speed, volume, pitch, and so on between these speech samples, which further enriches the synthesized speech samples and simulates more audio of the custom wake-up word spoken by speakers in different scenarios.
  • control parameters also include noise parameters.
  • The noise parameters are used to control the speaking scene of the speaker in the synthesized speech samples. Performing data enhancement processing on the N speech samples in sequence specifically includes: adding noise to the N speech samples according to the noise parameters.
  • Through data noise addition, the terminal device can simulate audio data of a speaker speaking the custom wake-up word in different environments.
  • data enhancement processing includes data noise addition
  • the noise used in data noise addition includes one or more of the following: human voice noise, wind noise, construction noise, and traffic noise.
  • Alternatively, classified by spatial scene, the noise used for data noise addition includes one or more of the following: home noise, office noise, shopping mall noise, and park noise.
  • When the terminal device synthesizes speech samples, data noise addition further yields audio of speakers speaking the custom wake-up word in different usage environments, thereby providing richer training samples and improving the robustness of the speech recognition model.
  • The method further includes: extracting prosodic features from the valid audio data, and updating the synthesized speech samples using the first wake-up word, the prosodic features in the control parameters, and the extracted prosodic features.
  • the terminal device can also extract the speaker's prosodic features from the determined valid audio data that includes a custom wake-up word and successfully wakes up the terminal. Then, the terminal can combine the above-extracted prosodic features of the speaker with the preset prosodic feature parameters in the speech synthesizer to synthesize new speech samples, thereby making the synthesized speech samples richer. In this way, based on richer speech samples, the terminal can obtain a better wake word recognition model.
  • The input layer of the first speech recognition model and the input layer of the second speech recognition model include the same number of data processing layers, and the parameters of corresponding data processing layers in the input layers of the first and second speech recognition models are the same.
  • In this way, during the process of optimizing the model, the terminal device can keep the number and parameters of the front data processing layers unchanged, thereby saving algorithm costs, time costs, and so on in the optimization process and improving the efficiency of model optimization.
  • In a second aspect, the present application provides an electronic device, which includes one or more processors and one or more memories, where the one or more memories are coupled to the one or more processors and are used to store computer program code.
  • The computer program code includes computer instructions.
  • In a third aspect, the present application provides a computer-readable storage medium, including instructions.
  • When the instructions are run on an electronic device, the electronic device is caused to execute the method described in the first aspect and any possible implementation manner of the first aspect.
  • The electronic device provided in the second aspect and the computer storage medium provided in the third aspect are both used to execute the method provided in this application. Therefore, for the beneficial effects they can achieve, refer to the beneficial effects of the corresponding method; details are not repeated here.
  • Figure 1 is a flow chart of a speech recognition method provided by an embodiment of the present application.
  • Figure 2 is a schematic diagram of a speech synthesizer provided by an embodiment of the present application for synthesizing sample speech
  • Figure 3 is a schematic diagram of model optimization provided by the embodiment of the present application.
  • Figure 4A is a flow chart of another speech recognition method provided by an embodiment of the present application.
  • Figure 4B is a flow chart of another speech recognition method provided by an embodiment of the present application.
  • Figure 5A is a flow chart of another speech recognition method provided by an embodiment of the present application.
  • Figure 5B is a flow chart of another speech recognition method provided by an embodiment of the present application.
  • Figures 6A-6I are a set of user interface schematic diagrams provided by embodiments of the present application.
  • Figures 7A-7D are another set of user interface schematic diagrams provided by embodiments of the present application.
  • Figure 8 is a schematic system structure diagram of a terminal device provided by an embodiment of the present application.
  • Figure 9 is a schematic diagram of the hardware structure of a terminal device provided by an embodiment of the present application.
  • Terminal devices such as mobile phones and tablet computers can enter the voice control mode through preset wake-up words.
  • the voice control mode means that the user controls the terminal 100 to perform one or more operations by speaking.
  • the terminal 100 may detect a command of "play music" spoken by the user, and in response to the above command, the terminal 100 may open a music application to play music.
  • the above wake words used to trigger entry into voice control mode are generally set by developers, that is, the default wake words.
  • the terminal 100 now also supports the user to set a personalized wake-up word, that is, a customized wake-up word, during use of the terminal 100 .
  • the wake-up word used by the terminal 100 is the default wake-up word, such as "Hello, YOYO".
  • The user can replace the above default wake-up word with a custom wake-up word, such as "Xiaohua Xiaohua", through the setting interface provided on the terminal 100.
  • Subsequently, the terminal 100 can confirm whether to wake up and enter the voice control mode by detecting the wake-up word "Xiaohua Xiaohua".
  • The terminal 100 can also be a desktop computer, a laptop computer, a handheld computer, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a cellular phone, a personal digital assistant (PDA), an augmented reality (AR) device, a virtual reality (VR) device, an artificial intelligence (AI) device, a wearable device, a vehicle-mounted device, a smart home device, and/or a smart city device.
  • However, the custom wake-up words set by users are highly random. The wake-word recognition model preset in the terminal 100 therefore cannot be optimized for these wake-up words in advance, and it lacks training samples covering various volumes, noises, and emotions. As a result, the model has low recognition accuracy and low robustness (the ability to adapt to different complex usage scenarios and users' different pronunciation habits) for custom wake-up words.
  • the terminal 100 has low recognition accuracy and low robustness for custom wake words, which degrades the user experience.
  • command words and other words used for voice control also face the above problems.
  • the terminal 100 can obtain from the cloud voice samples whose content is the custom wake-up word collected and uploaded by other terminal devices.
  • the above voice samples can cover various volume, noise, and emotional scenes.
  • For example, the terminal 100 can obtain a voice sample with the content "Xiaohua Xiaohua" from the cloud. If other terminal devices have used the custom wake-up word "Xiaohua Xiaohua" before, the cloud can store voice samples with that content collected and uploaded by those devices, and the terminal 100 can then obtain these voice samples from the cloud.
  • the terminal 100 can use the above speech sample to optimize the currently used wake word recognition model, so that the above wake word recognition model can be well adapted to customized wake word recognition.
  • However, for the cloud to provide such voice samples of custom wake-up words covering various volumes, noises, and emotions, it needs to obtain the custom wake-up words set by each user from the terminal devices it covers and frequently obtain audio data containing those wake-up words from each user's terminal device. This not only requires huge operation and maintenance costs, but also poses serious privacy issues.
  • Moreover, the cloud may not contain a voice sample whose content is the custom wake-up word set by the user, such as a voice sample of "Xiaohua Xiaohua". In this case, the terminal 100 cannot obtain a voice sample of the custom wake-up word from the cloud.
  • embodiments of this application provide a speech recognition method. This method can be applied to the terminal 100.
  • the terminal 100 may be preset with a generalized wake word recognition model, referred to as a rough model.
  • This coarse model can be used to identify any set of custom wake words.
  • the above-mentioned rough model can be the wake word recognition model originally used to identify the default wake word, or it can be a separate speech recognition model.
  • the recognition accuracy of the above-mentioned rough model is low, it is easily affected by the user's emotions, pauses and usage environment, and its reliability is low.
  • the terminal 100 can use the custom wake-up word text and prosodic features set by the user to synthesize a large number of speech samples (synthesized speech samples) of simulated speakers speaking the custom wake-up word in different scenarios.
  • Prosodic features are used to reflect the speaker's speaking style, including but not limited to the speaker's emotions, pauses and other features.
  • Synthetic speech samples can be used to augment the training set of custom wake words in the coarse model.
  • the terminal 100 can retrain the above coarse model to obtain a wake word recognition model suitable for detecting customized wake words in various contexts and environments, which is noted as a fine model.
  • the terminal 100 can identify whether the user speaks a custom wake-up word in various contexts and environments, thereby improving the accuracy of identifying the custom wake-up word and improving the user experience.
  • the terminal 100 can also add the valid audio data whose content is the custom wake-up word and successfully wakes up the terminal 100 to the training set, and continuously update the currently used fine model, thereby continuously improving Robustness of fine models.
  • the terminal 100 can also extract prosodic features from the above-mentioned valid audio data. Then, the terminal 100 can use the preset prosodic feature control parameters and the above-extracted prosodic features to control the prosodic effect of the synthesized speech sample. Among them, a specific value of the prosodic feature control parameter is a prosodic feature. Then, the terminal 100 uses the synthesized speech sample and/or valid audio data to update the currently used detailed model, thereby further improving the accuracy and robustness of the custom wake word recognition and improving the user experience.
  • Unlike terminal devices such as smart TVs and smart speakers used at home, mobile phones and tablets are usually used by only one person. Therefore, this type of terminal device often also has voiceprint recognition capability, that is, the ability to identify whether the speaker is the owner of the device. In some examples, the terminal 100 may therefore also perform voiceprint verification during custom wake-up word detection: only when the speaker is determined to be the owner will the terminal wake up and enter the voice control mode.
  • In this case, the valid audio data is further limited to audio data in which the owner speaks the custom wake-up word and successfully wakes up the terminal 100.
  • In this way, the ability to recognize the custom wake-up word can be further improved, accidental wake-ups by other people can be avoided, and the user experience is improved.
  • Figure 1 schematically shows a flow chart of a speech recognition method provided by an embodiment of the present application. The specific process of the terminal 100 implementing this method will be introduced in detail below with reference to FIG. 1 .
  • the terminal 100 determines the customized wake-up word.
  • the wake-up word used by the terminal 100 is the default wake-up word, such as "Hello, YOYO".
  • the terminal 100 may provide the user with an interface for setting custom wake words.
  • the user can set a custom wake-up word through the above interface.
  • For example, the user can use the above interface to replace the default wake-up word "Hello, YOYO" with a custom wake-up word, such as "Xiaohua Xiaohua".
  • the above interface can be an input text box.
  • the terminal 100 may receive a customized wake word input by the user through the above input text box.
  • the terminal 100 has a rough model preset.
  • the coarse model can be used to identify arbitrary wake words.
  • the terminal 100 may set the target recognition object of the rough model to the custom wake-up word. Then, the terminal 100 can use the above-mentioned rough model to detect whether the user speaks the custom wake-up word.
  • For example, the terminal 100 receives a custom wake-up word input by the user as "Xiaohua Xiaohua".
  • The target recognition object of the rough model in the terminal 100 may then be set to "Xiaohua Xiaohua".
  • The rough model can then be used to identify whether any audio data includes the speech content "Xiaohua Xiaohua".
  • However, the accuracy of the rough model is low. When the user's emotion, speaking speed, or environmental noise changes, the rough model may fail to recognize the custom wake-up word spoken by the user, or may misidentify other words spoken by the user as the custom wake-up word, affecting the user experience.
  • the terminal 100 synthesizes a speech sample according to the customized wake-up word text and preset prosodic features.
  • Training the rough model with a large amount of audio data whose content is the custom wake-up word can optimize the rough model, thereby obtaining a wake-word recognition model with higher recognition accuracy for the custom wake-up word and a better recognition effect, that is, a fine model.
  • the terminal 100 cannot obtain a large amount of audio data containing the custom wake-up words through microphone collection or downloading.
  • In this embodiment, the terminal 100 can use the text of the custom wake-up word input by the user and a speech synthesizer to synthesize a large amount of audio data whose content is the custom wake-up word. For example, after receiving the custom wake-up word "Xiaohua Xiaohua" input by the user, the terminal 100 may input the text "Xiaohua Xiaohua" to the speech synthesizer. The speech synthesizer can then generate N pieces of audio data with the content "Xiaohua Xiaohua". The terminal 100 can control the amount of synthesized audio data, that is, the value of N; for example, N can be 1200. In this way, the terminal 100 can obtain 1200 pieces of audio data with the content "Xiaohua Xiaohua".
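  • A minimal sketch of this synthesis step, assuming a generic text-to-speech call tts_synthesize(text, prosody) and a helper sample_prosody_parameters(); both names are illustrative assumptions, not the disclosed implementation:

      # Illustrative only: synthesize N audio clips whose content is the custom wake-up word.
      N = 1200                                   # amount of synthesized audio controlled by the terminal
      wake_word_text = "Xiaohua Xiaohua"
      synthetic_samples = []
      for _ in range(N):
          prosody = sample_prosody_parameters()  # hypothetical: pick emotion, pauses, cadence, etc.
          synthetic_samples.append(tts_synthesize(wake_word_text, prosody))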
  • FIG. 2 exemplarily shows a schematic diagram of a speech synthesizer synthesizing speech samples.
  • the terminal 100 needs to set the parameters of the speech synthesizer.
  • the above parameters include target speech content and prosodic feature control parameters.
  • the target speech content is used to instruct the speech synthesizer what speech content to synthesize.
  • the custom wake-up word text input to the speech synthesizer indicates the target speech content of the speech synthesizer.
  • Prosodic feature control parameters are used to set various prosodic features. Prosodic features are used to reflect the speaker's speaking style and context, including but not limited to the speaker's emotions, pauses, cadences and other characteristics when speaking.
  • the various prosodic features set by the prosodic feature control parameters make the N speech samples synthesized by the speech synthesizer rich and diverse, and can represent the customized wake-up words spoken by the speaker in various emotional states.
  • For example, the speech content of both synthetic speech sample 1 and synthetic speech sample 2 is "Xiaohua Xiaohua".
  • Synthetic speech sample 1 additionally carries the characteristics of a happy and fast (fewer pauses) speaking style; synthetic speech sample 2 additionally carries the characteristics of a crying and slower (more pauses) speaking style.
  • Thus, synthetic speech sample 1 can represent audio data of the speaker speaking the custom wake-up word happily and quickly.
  • Synthetic speech sample 2 can represent the audio data of a speaker crying and speaking a custom wake word slowly.
  • the speech synthesizer can synthesize the corresponding speech samples.
  • a set of parameter combinations can synthesize one or more speech samples.
  • a set of parameter combinations can synthesize multiple speech samples.
  • For example, based on one set of parameters, the terminal 100 can synthesize 20 voice samples with the content "Xiaohua Xiaohua" that simulate a user speaking in a happy, fast-speaking scenario. There are certain differences between these 20 synthetic speech samples; these differences are inherent in the synthesis process.
  • the terminal 100 can also set multiple sets of parameters at one time to quickly synthesize sample data covering more scenarios.
  • a speech synthesizer can receive 10 sets of parameters at once. The specific contents of these 10 sets of parameters will not be repeated here. Assume that the speech synthesizer can synthesize 20 speech samples for each set of parameters, so that the speech synthesizer can ultimately synthesize 200 speech samples.
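  • The example above (10 parameter sets, 20 samples each) could be sketched as follows; the parameter keys and the tts_synthesize call are assumptions carried over from the previous sketch:

      # Illustrative only: 10 prosodic parameter sets x 20 samples per set = 200 samples.
      parameter_sets = [
          {"emotion": "happy",  "pauses": "few",  "rate": "fast"},
          {"emotion": "crying", "pauses": "many", "rate": "slow"},
          # ... eight further combinations of emotion, pauses and speaking rate
      ]
      samples = []
      for params in parameter_sets:
          for _ in range(20):
              # repeated calls with the same parameters yield slightly different waveforms
              samples.append(tts_synthesize(wake_word_text, params))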
  • After processing by the speech synthesizer, the terminal 100 can obtain a large amount of audio data containing the custom wake-up word, simulating custom wake-up word audio collected from speakers in various scenarios.
  • the speech synthesizer can also perform data enhancement processing on the above-mentioned speech samples to obtain more speech samples.
  • the above-mentioned data enhancement technologies include, but are not limited to, time-frequency masking (time domain masking, frequency domain masking), speed enhancement, volume enhancement, pitch enhancement, data noise addition and other enhancement technologies.
  • data noise addition refers to adding noise effects to the generated speech samples, so that the noised speech samples also simulate the usage environment of the speaker.
  • According to the type of noise, the noise used in data noise addition includes but is not limited to human voice noise, wind noise, construction noise, traffic noise, and so on. Or, according to the spatial scene, the noise used in data noise addition can also be home noise, office noise, shopping mall noise, park noise, and so on.
  • various types of noise can include different intensities such as [I, II, III].
  • Type I vocal noise can mean that the vocal noise is low
  • Type III vocal noise can mean that the vocal noise is loud.
  • the speech synthesizer can perform one or more data enhancement processes on any of the above synthetic speech samples, thereby further expanding one synthetic speech sample to multiple synthetic speech samples.
  • the speech synthesizer can perform 5 data enhancement processes on the synthesized speech sample 001.
  • The above 5 data enhancement processes can be any combination of the time-domain masking, frequency-domain masking, speed enhancement, volume enhancement, pitch enhancement, and data noise addition (human voice, wind, construction, traffic, etc.) introduced above.
  • Synthetic speech sample 001 is any one of the above 200 synthetic speech samples. In this way, after data enhancement processing, the speech synthesizer can obtain another 5 synthesized speech samples based on the synthesized speech sample 001. Therefore, the above 200 synthetic speech samples can be further expanded to 1,200.
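  • A sketch of this expansion step, assuming augmentation helpers (time_mask, freq_mask, change_speed, shift_pitch, add_noise) exist; the helper names, their arguments, and the particular choice of five enhancements are illustrative assumptions:

      # Illustrative only: 5 enhanced copies per sample, so 200 x (1 + 5) = 1200 samples.
      # All helpers below are hypothetical placeholders for the enhancements named above.
      AUGMENTATIONS = [
          lambda x: time_mask(x),                                    # time-domain masking
          lambda x: freq_mask(x),                                    # frequency-domain masking
          lambda x: change_speed(x, factor=1.1),                     # speed enhancement
          lambda x: shift_pitch(x, semitones=1),                     # pitch enhancement
          lambda x: add_noise(x, noise_type="traffic", level="II"),  # data noise addition
      ]
      augmented = []
      for sample in samples:              # the 200 synthesized speech samples
          augmented.append(sample)
          for augment in AUGMENTATIONS:
              augmented.append(augment(sample))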
  • the terminal 100 optimizes the coarse model based on the synthesized speech sample and generates a fine model suitable for the customized wake-up word.
  • After the speech synthesizer's processing, the terminal 100 can obtain a large number of synthesized speech samples whose content is the custom wake-up word. At this time, the terminal 100 can use these synthesized speech samples to train the currently used rough model, so that it learns more features of audio data whose content is the custom wake-up word.
  • the trained model is a fine model suitable for identifying custom wake words.
  • Figure 3 exemplarily shows a schematic diagram of optimizing the rough model.
  • the left rectangular frame 31 may represent the network structure 31 of the coarse model.
  • the network structure 31 may include 7 data processing layers ("1" to "7").
  • the above-mentioned data processing layer may be a convolution layer in a convolutional neural network, which is not limited in the embodiments of the present application.
  • the number of data processing layers mentioned above is exemplary, and more or fewer processing layers may also be included.
  • the data processing layer in the network structure 31 can be divided into an input layer and an output layer.
  • The front data processing layers in the network structure can be called the input layer, and the later data processing layers can be called the output layer.
  • the first three layers in the exemplary network structure 31 are input layers (“1” to “3”), and the last four layers are output layers (“4” to “7”).
  • The terminal 100 will input the synthesized speech samples into the network structure 31 and adjust the data processing layers of the network structure 31, including adjusting the number of data processing layers and/or the parameters of the data processing layers.
  • the terminal 100 keeps the input layer of the original rough model unchanged, that is, it does not increase the number of input layers and does not change the parameters of the input layer.
  • the terminal 100 only adjusts the configuration of the output layer (the number of data processing layers, and/or the parameters of the data processing layer) to make it more suitable for identifying customized wake words in various scenarios. In this way, the terminal 100 can optimize the calculation cost and improve the model training efficiency in the model optimization process.
  • After training the rough model using the synthetic speech samples and adjusting the configuration of its output layer, the terminal 100 can obtain a fine model suitable for the custom wake-up word.
  • The right rectangular box 32 may represent the network structure 32 of the fine model. The configuration of the input layer ("1" to "3") in the network structure 32 is consistent with that of the input layer in the network structure 31, but the configuration of the output layer ("4'" to "6'") in the network structure 32 is different from that of the output layer in the network structure 31.
  • the above-mentioned difference in the output layer includes a difference in the number of data processing layers in the output layer, and/or different parameters of the data processing layer.
  • In this way, starting from the preset rough model, the terminal 100 can optimize the rough model with synthesized speech samples to obtain a fine model suitable for the custom wake-up word, thereby realizing the function of detecting whether the user speaks the custom wake-up word in various scenarios and providing users with better wake-word recognition services.
  • Moreover, the training method based on the preset rough model saves the terminal 100 a certain amount of training costs, including algorithm costs, time costs, and so on. This allows the terminal 100 to obtain a speech recognition model suitable for the custom wake-up word with fewer calculations.
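  • As an illustration of keeping the input layers fixed while retraining the output layers, the following PyTorch-style sketch uses a 7-layer stand-in network; the layer types, sizes, and training details are assumptions, since the application does not specify the coarse model's internals:

      import torch
      import torch.nn as nn

      # Stand-in for the coarse model: 7 data processing layers ("1" to "7").
      coarse_model = nn.Sequential(
          *[nn.Conv1d(16, 16, kernel_size=3, padding=1) for _ in range(7)])

      # Keep the first three (input) layers unchanged; only the output layers are retrained.
      for layer in list(coarse_model.children())[:3]:
          for p in layer.parameters():
              p.requires_grad = False

      optimizer = torch.optim.Adam(
          [p for p in coarse_model.parameters() if p.requires_grad], lr=1e-4)
      # A standard training loop over the synthesized speech samples would follow here,
      # after which the adapted network corresponds to the fine model.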
  • the terminal 100 obtains environmental audio through the microphone.
  • the microphone of the terminal 100 can collect environmental sounds in real time and generate environmental audio.
  • the terminal 100 can input the collected environmental audio into the detailed model, and identify whether the above-mentioned environmental audio includes a custom wake-up word, that is, detect whether the speaker speaks the custom wake-up word.
  • the terminal 100 determines whether the custom wake-up word is recognized.
  • If the custom wake-up word is recognized, the terminal 100 can wake itself up and enter the voice control mode.
  • the terminal 100 can light up the screen and display the voice assistant icon to indicate that the user has been awakened.
  • the terminal 100 may display the voice assistant icon and so on. In this way, the user can continue to issue voice commands to the terminal 100.
  • the terminal 100 can perform corresponding operations according to the recognized voice instructions.
  • When the input ambient audio does not include the custom wake-up word, the terminal 100 will continue to recognize the newly collected ambient audio from the microphone until the custom wake-up word is recognized.
  • In this way, the terminal 100 can automatically generate voice samples whose content is the custom wake-up word after the user sets it. Based on these voice samples, the terminal 100 can optimize the existing wake-word recognition model to obtain a wake-word recognition model with higher accuracy and stronger environmental adaptability, thereby realizing the function of accurately recognizing the custom wake-up word spoken by the user in various scenarios.
  • Further, when implementing custom wake-up word detection, the terminal 100 can also feed the actually collected audio data whose content is the custom wake-up word into the fine model to further optimize it, improving the recognition effect of the fine model and the user experience.
  • That is, the terminal 100 can also perform steps S106 and S107 to input the actually collected audio data whose content is the custom wake-up word into the fine model to further optimize the fine model.
  • the terminal 100 adds the valid audio data that wakes up the terminal 100 to the detailed model.
  • the ambient audio generated by the microphone collecting environmental sounds may or may not include custom wake words.
  • the fine model can recognize the custom wake-up word, and thus, the terminal 100 can confirm to wake itself up.
  • the above-mentioned environmental audio that includes a custom wake-up word and successfully wakes up the terminal 100 can be called valid audio data.
  • The valid audio data is audio data, actually collected by the terminal 100, of the speaker speaking the custom wake-up word in a certain scene. Using the valid audio data to train the currently used fine model generated from synthetic speech samples can further improve the recognition effect of the fine model, allowing the terminal 100 to detect more accurately and quickly whether the user speaks the custom wake-up word.
  • the terminal 100 can add the corresponding valid audio data to the training set of the currently used fine model.
  • the fine model can then be retrained using the speech samples synthesized by the speech synthesizer and the above-mentioned valid audio data, thereby updating the currently used fine model.
  • the terminal 100 can more quickly and accurately recognize the customized wake-up word spoken by the user.
  • The terminal 100 confirms whether the number of valid voice samples is sufficient and whether the current time is within the update time.
  • the terminal 100 can monitor the amount of newly added valid audio data. When the amount of newly added valid audio data meets the requirement of the quantity threshold, the terminal 100 can update the currently used detailed model using all the newly added valid audio data and synthesized speech samples. Optionally, the terminal 100 may also use all valid audio data to further optimize the currently used fine model.
  • the above quantity threshold may be 100.
  • the terminal 100 can start to optimize the currently used fine model.
  • the terminal 100 can optimize the currently used fine model using the above 100 newly added valid audio data and the above synthesized 1200 synthetic speech samples to obtain an updated fine model, such as fine model 2.0.
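  • A sketch of this accumulation-and-update logic, with the threshold of 100 taken from the example above; update_fine_model and the stored synthetic_samples are hypothetical placeholders:

      # Illustrative only: accumulate valid audio until the first quantity threshold is met.
      FIRST_QUANTITY_THRESHOLD = 100
      valid_audio = []

      def on_successful_wakeup(audio_clip):
          valid_audio.append(audio_clip)
          if len(valid_audio) >= FIRST_QUANTITY_THRESHOLD:
              # retrain the currently used fine model with the 100 valid clips plus
              # the 1200 synthesized speech samples (update_fine_model is hypothetical)
              update_fine_model(valid_audio + synthetic_samples)
              valid_audio.clear()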
  • the terminal 100 may detect the custom wake word using the updated fine model (fine model 2.0).
  • the terminal 100 can also directly use the accumulated valid audio data to optimize the currently used fine model to obtain an updated fine model.
  • the above quantity threshold can be 1000.
  • the terminal 100 can directly use the above 1,000 newly added valid audio data to optimize the currently used fine model, without using synthetic speech samples.
  • the terminal 100 can also detect whether the current time meets the update time requirement.
  • The update time refers to a preset idle time that will not affect the user's current experience, such as 1:00 to 4:00 in the morning.
  • the above update time can also be a time specified by the user.
  • the terminal 100 can avoid updating the detailed model when the user is using the device, thereby avoiding overload causing system freezes or abnormalities and affecting the user experience.
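  • The update-time gate could look like the following sketch, using the 1:00-4:00 a.m. window given as an example above; a user-specified window would simply replace these constants:

      from datetime import datetime, time

      UPDATE_WINDOW = (time(1, 0), time(4, 0))   # example idle window from the text

      def within_update_time(now=None):
          current = (now or datetime.now()).time()
          start, end = UPDATE_WINDOW
          return start <= current <= end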
  • the terminal 100 may display a pop-up window containing an update prompt on the screen. After seeing the above pop-up window, users can confirm that the model for identifying custom wake words has been updated, and can obtain better custom wake word recognition services in the future.
  • the terminal 100 may display a pop-up window containing update prompts and selection controls on the screen. Users can choose to update or not. Further, the user can choose to update immediately, update later, or set a time to update, so as to avoid updating when the terminal 100 is busy and affecting the user experience.
  • the terminal 100 may also extract the speaker's prosodic features from the valid audio data. Then, the terminal 100 can combine the extracted prosodic features of the speaker and the preset prosodic feature control parameters, and use the combined prosodic features to control the synthesized speech sample.
  • the terminal 100 may extract the speaker's prosodic features from the determined valid audio data that includes the custom wake-up word and successfully wakes up the terminal 100 . Then, the terminal 100 can combine the above-mentioned extracted prosodic features of the speaker with the preset prosodic feature control parameters in the speech synthesizer, and synthesize a new speech sample using the above-mentioned extracted and preset prosodic features.
  • the prosodic features in Figure 2 include both preset prosodic feature control parameters and extracted prosodic features. In this way, the terminal 100 can obtain more prosodic features, so that the synthesized speech samples cover more speaking styles with different emotions and different pauses.
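  • A sketch of combining the two sources of prosodic features before re-synthesis; extract_prosodic_features and tts_synthesize are assumed helpers, as in the earlier sketches:

      # Illustrative only: merge preset prosodic feature control parameters with
      # prosodic features extracted from valid audio data, then synthesize new samples.
      def refresh_speech_samples(wake_word_text, preset_prosody, valid_audio):
          extracted = [extract_prosodic_features(clip) for clip in valid_audio]  # hypothetical
          all_prosody = list(preset_prosody) + extracted
          return [tts_synthesize(wake_word_text, p) for p in all_prosody]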
  • the terminal 100 can use the above-mentioned new speech samples to train the currently used fine model, thereby achieving further fine model optimization.
  • the optimized fine model has higher accuracy and better recognition effect.
  • the terminal 100 also performs voiceprint recognition when recognizing voice commands such as wake-up words, that is, identifying whether the speaker is the owner of the device. The terminal 100 will be awakened only when the wake-up word is recognized and the wake-up word is confirmed to be spoken by the owner.
  • FIG. 5A exemplarily shows a flow chart of another speech recognition method provided by an embodiment of the present application.
  • the terminal 100 determines the customized wake-up word, performs user registration at the same time, and determines the voiceprint information of the owner.
  • the terminal 100 can also obtain the voiceprint information of the owner.
  • Voiceprint information refers to audio information describing the identity of the speaker. A user's unique voiceprint information is used to mark the user.
  • the terminal 100 may instruct the current user to perform user registration.
  • the terminal 100 may determine the user's voiceprint information (the owner's voiceprint information).
  • the terminal 100 may instruct the user to repeat the customized wake word three times.
  • the microphone of the terminal 100 can collect corresponding audio data, that is, registered voice data. The above registered voice data can be used to extract the voiceprint information of the phone owner.
  • the terminal 100 may also extract the voiceprint information of the owner from previously collected audio data of the default wake-up word.
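  • The registration step might be sketched as follows; extract_voiceprint is a hypothetical embedding extractor, and averaging the three embeddings is only one possible way of forming the owner's voiceprint information:

      # Illustrative only: the owner repeats the custom wake-up word three times;
      # the resulting clips (registered voice data) yield the owner's voiceprint.
      def register_owner(registration_clips):
          embeddings = [extract_voiceprint(clip) for clip in registration_clips]  # hypothetical
          return sum(embeddings) / len(embeddings)   # stored as the owner's voiceprint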
  • the terminal 100 synthesizes a speech sample according to the customized wake-up word text and preset prosodic feature control parameters.
  • the terminal 100 optimizes the coarse model based on the synthesized speech sample and generates a fine model suitable for the customized wake-up word.
  • the terminal 100 obtains environmental audio through the microphone.
  • For S202 to S204, reference can be made to the introduction of S102 to S104 in Figure 1; details are not repeated here.
  • the terminal 100 determines whether the custom wake-up word is recognized.
  • the fine model obtained after optimization can first identify whether the above-mentioned audio includes a custom wake-up word.
  • the terminal 100 will re-input the newly collected environmental audio into the detailed model and continue to recognize until the custom wake-up word is recognized.
  • the terminal 100 determines whether the speaker is the owner.
  • After identifying the custom wake-up word, the terminal 100 will also perform voiceprint verification to determine whether the custom wake-up word included in the collected environmental audio was spoken by the owner, that is, whether the speaker is the owner of the device.
  • When it is confirmed that the speaker is the owner, the terminal 100 wakes itself up and enters the voice control mode. When it is confirmed that the speaker is not the owner, the terminal 100 will not wake itself up; it will continue to collect the current environmental audio and keep identifying whether the newly collected audio includes the custom wake-up word and whether it is spoken by the owner, until a custom wake-up word spoken by the owner is recognized.
  • the order in which the terminal 100 performs S205 and S206 can also be exchanged, that is, first confirm whether the speaker is the owner, and then confirm whether the voice content includes a custom wake-up word.
  • Alternatively, after the terminal 100 recognizes the ambient audio input from the microphone, it simultaneously outputs the voiceprint recognition result and the custom wake-up word recognition result. When both results meet the above requirements, the terminal 100 wakes itself up; if either one does not meet the requirements, the terminal 100 does not wake itself up.
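  • Combining the two checks of S205 and S206 can be sketched as below; detect_wake_word, extract_voiceprint, similarity, and the 0.8 threshold are all illustrative assumptions:

      # Illustrative only: wake up only when both checks pass.
      def should_wake(ambient_audio, fine_model, owner_voiceprint, threshold=0.8):
          has_wake_word = fine_model.detect_wake_word(ambient_audio)   # hypothetical API
          is_owner = similarity(extract_voiceprint(ambient_audio),
                                owner_voiceprint) >= threshold         # hypothetical helpers
          return has_wake_word and is_owner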
  • the terminal 100 adds the valid audio data that wakes up the terminal 100 to the detailed model.
  • the environmental audio that can wake up the terminal 100 can be called valid audio data.
  • the terminal 100 can add the valid audio data to the fine model to expand the speech samples of the fine model, thereby improving the recognition effect of the fine model.
  • The terminal 100 confirms whether the number of valid voice samples is sufficient and whether the current time is within the update time.
  • the terminal 100 may monitor the amount of newly added valid audio data and the current time. When the amount of newly added valid audio data meets the requirement of the quantity threshold and is within the update time, the terminal 100 may determine to update the currently used fine model.
  • the terminal 100 can use the above-mentioned valid audio data and/or synthesized speech samples to update the currently used fine model to obtain a fine model with higher accuracy and better recognition effect.
  • Optionally, the terminal 100 can also extract prosodic features from the above-mentioned valid audio data, synthesize new speech samples based on them and the preset prosodic feature control parameters, and further update the currently used fine model; details are not repeated here.
  • the terminal 100 may not immediately use the synthesized speech sample to update the coarse model.
  • the rough model preset by the terminal 100 can identify whether the audio data is audio data whose content includes the above-mentioned custom wake-up word.
  • the terminal 100 can identify whether the environmental audio collected by the microphone is audio data whose content includes a custom wake-up word, that is, detect whether a user speaks a custom wake-up word.
  • After confirming to wake itself up, the terminal 100 can determine that the audio which contains the custom wake-up word and for which the speaker is confirmed to be the owner is valid audio data.
  • the terminal 100 can extract prosodic features from the above-mentioned valid audio data. Use the above extracted prosodic features, customized wake word text, and preset prosodic feature control parameters to synthesize speech samples. When the number of synthesized speech samples is sufficient and within the update time, the terminal 100 can update the original coarse model to a fine model with higher recognition accuracy and better robustness.
  • the terminal 100 can obtain a recognition model suitable for identifying user-defined wake-up words through fewer calculation operations.
  • For example, the calculation operations of synthesizing speech samples using the custom wake-up word text and preset prosodic feature control parameters, shown in S102 of Figure 1, are reduced; the calculation operations of using those speech samples to train the rough model, shown in S103 of Figure 1, are reduced; and so on.
  • In the embodiments of this application, a user-set custom wake-up word (for example, "Xiaohua Xiaohua") received by the terminal 100 may be called a first wake-up word.
  • The rough model that the terminal 100 first uses to recognize the custom wake-up word after the custom wake-up word is determined may be called the first speech recognition model, and the fine model adapted to recognizing the custom wake-up word, obtained after training with synthetic speech samples, may be called the second speech recognition model.
  • The fine model obtained by further training the currently used fine model with valid audio data and/or synthetic speech samples, such as fine model 2.0, may be called the third speech recognition model. Likewise, after the synthesized speech samples are updated using the prosodic features extracted from the valid audio data and the currently used fine model is then updated, the resulting fine model (fine model 2.0) can also be called the third speech recognition model.
  • the prosodic feature control parameters and noise in the speech synthesizer can be called control parameters.
  • the set valid audio data quantity threshold (for example, 100) may be called a first quantity threshold.
  • the set valid audio data quantity threshold (for example, 1000) may be called the second quantity threshold.
  • Figures 6A to 6I and Figures 7A to 7D exemplarily illustrate a set of user interfaces for the terminal 100 to implement the above speech recognition method.
  • FIG. 6A-FIG. 6I illustrate a set of user interfaces for the terminal 100 to provide users with a custom wake-up word function interface.
  • FIG. 6A exemplarily shows the user interface 61 on the terminal 100 for setting a custom wake word.
  • the interface may include multiple setting options, such as "application”, "battery” and other options.
  • the user interface 61 also includes a “smart assistant” option. This option can be used to set voice assistant, shortcut actions and other settings related to quick control.
  • the terminal 100 can detect user operations, such as click operations, on the above-mentioned "smart assistant" options. In response to the above operation, the terminal 100 may display the user interface 62 shown in FIG. 6B.
  • the user interface 62 may include multiple setting items for setting voice assistants, shortcut actions, and other setting items related to quick control, such as “smart voice,” “assisted vision,” “smart screen recognition,” and other setting items.
  • “Smart Voice” can be used to set wake-up words, command words, and other settings related to voice control.
  • the terminal 100 may detect the user operation on the above-mentioned "smart voice” option, and in response to the above operation, the terminal 100 may display the user interface 63 shown in FIG. 6C .
  • the user interface 63 may include "Voice Wake-up” and “Smart Service” setting items.
  • “Voice Wake” can be used to turn on or off the wake word recognition function, and set a default wake word or a custom wake word.
  • Smart Service can be used to set on or off the self-learning update function.
  • the above-mentioned self-learning update function refers to a function in which the terminal 100 adjusts the results and/or parameters of the speech recognition model according to the actual situation of the user using the wake word control function to improve the recognition accuracy.
  • the terminal 100 may detect a user operation on the above-mentioned "voice wake-up” option, and in response to the above operation, the terminal 100 may display the user interface 64 shown in FIG. 6D.
  • User interface 64 may include controls 641 .
  • the control 641 can be used to set the wake word recognition function to be turned on or off.
  • Initially, the control 641 may be turned off (OFF); refer to the control 631 in the user interface 63.
  • Turning off control 641 corresponds to turning off the "voice wake-up" function.
  • control 641 may become ON.
  • the terminal 100 enables the "voice wake-up" function, that is, the terminal 100 starts to recognize whether the user speaks the wake-up word.
  • User interface 64 also includes controls 642 and 643 .
  • Control 642 can be used to set a default wake word.
  • Control 643 can be used to set a custom wake word.
  • the terminal 100 may first select a default wake-up word. The user can switch the default wake-up word to a custom wake-up word by operating on control 643.
  • the terminal 100 may display the user interface 65 shown in FIG. 6E.
  • User interface 65 may include window 651.
  • Window 651 can be used to set a custom wake word.
  • Window 651 may include an input box 652, which can be used to receive a custom wake-up word input by the user. Window 651 also includes a cancel control 653 and a confirmation control 654. When detecting a user operation on the cancel control 653, the terminal 100 can cancel the use of the custom wake-up word and close the window 651. When a user operation on the confirmation control 654 is detected, the terminal 100 may determine to use the custom wake-up word and close the window 651.
  • The terminal 100 may receive the custom wake-up word "Xiaohua Xiaohua" input by the user and the operation confirming its use. In response to the above operation, the terminal 100 may display the user interface 66 shown in FIG. 6F.
  • At this time, the terminal 100 can show that the custom wake-up word has been selected, and display the specific content of the custom wake-up word ("Xiaohua Xiaohua") set by the user in the custom wake-up word control 643. Then, the terminal 100 may detect a user operation on the exit control 644, and in response to the above operation, the terminal 100 may display the user interface 67 shown in FIG. 6G.
  • the "Voice Wakeup” option may display “Enabled” to remind the user that the wake word recognition function has been turned on.
  • the terminal 100 may also detect a user operation on the "intelligent service" option in the user interface 67, and in response to the above operation, the terminal 100 may display the user interface 68 shown in FIG. 6H.
  • the control 631 changes from the closed (OFF) state to the open state (ON).
  • In this way, in the subsequent process of identifying the custom wake-up word, the terminal 100 can record the custom wake-up words spoken by the user and the user's operations after waking up the terminal 100, thereby improving the speech recognition model so that the terminal 100 can more accurately recognize the custom wake-up word.
  • the terminal 100 can display the custom wake-up word "Xiaohua Xiaohua" set by the user on the "Smart Assistant" setting interface, refer to the user interface 69 shown in Figure 6I. In this way, the user can clearly understand the currently used wake word every time he opens the above interface.
  • the terminal 100 can use the above-mentioned valid audio data to update the currently used wake-up word recognition model.
  • the terminal 100 can automatically update without requiring the user to confirm whether to update. After the update is completed, the terminal 100 may display an update completion notification to prompt the user to enjoy better wake word recognition services.
  • FIG. 7A exemplarily shows the user interface 71 of the terminal 100 displaying a notification of update completion. Notifications 711 may be included in user interface 71 .
  • the notification 711 may display "The speech recognition system has been optimized to version 2.0" to prompt the user.
  • terminal 100 may ask the user whether to update.
  • Figure 7B exemplarily shows the user interface 72 of the terminal 100 asking the user whether to update.
  • A notification 721 may be included in the user interface 72.
  • the notification 721 may display "It is detected that the speech recognition system 2.0 can be updated" to prompt the user that the wake word recognition model can be updated.
  • the notification 721 may also include a cancel control 722 and an update control 723 . When a user operation on the cancel control 722 is detected, the terminal 100 may cancel updating the wake word recognition model. When a user operation on the update control 723 is detected, the terminal 100 may start updating the wake word recognition model.
  • the terminal 100 may obtain the preferred update time from the user. After detecting a user operation on the update control 723, the terminal 100 may display the window 731 as shown in the user interface 73 of FIG. 7C. Window 731 may include options 732 and 733.
  • Option 732 allows the user to set a custom update time. For example, after detecting an operation on option 732, the terminal 100 may receive a 1-hour update delay set by the user. The terminal 100 may then start a one-hour timer and begin updating the wake word recognition model when the timer expires. Option 733 can be used to schedule the update during idle time (at night, for example, 1:00-4:00); see the scheduling sketch below.
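  • The following sketch is offered only as an assumption-laden example (the function name and default values are invented here, not taken from this application); it shows how the two scheduling choices could be expressed: a user-set delay for option 732 and the next idle window for option 733.

```python
import datetime
import threading

def schedule_model_update(update_fn, delay_hours=None, idle_window=(1, 4)):
    """Run update_fn after a fixed delay (option 732) or at the next idle window (option 733)."""
    now = datetime.datetime.now()
    if delay_hours is not None:
        wait_s = delay_hours * 3600                      # e.g. start timing for one hour
    else:
        start_hour, _ = idle_window                      # e.g. 1:00-4:00 at night
        next_run = now.replace(hour=start_hour, minute=0, second=0, microsecond=0)
        if next_run <= now:
            next_run += datetime.timedelta(days=1)       # window already passed today
        wait_s = (next_run - now).total_seconds()
    timer = threading.Timer(wait_s, update_fn)
    timer.daemon = True                                  # do not block device shutdown
    timer.start()
    return timer
```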
  • the terminal 100 can display the currently used version of the custom wake word recognition model on the "Smart Assistant" setting interface; refer to "Xiaohua Xiaohua V2.0, latest version" shown in the user interface 74 in Figure 7D.
  • the user can know the version of the wake word recognition model currently used and whether it is the latest.
  • the user can instruct the terminal 100 to update to the latest version to obtain better wake word recognition service.
  • FIG. 8 is a schematic system structure diagram of the terminal 100 provided by the embodiment of the present application.
  • the layered architecture divides the system into several layers, and each layer has clear roles and division of labor.
  • the layers communicate through software interfaces.
  • the system is divided into five layers, which are, from top to bottom: the application layer, the application framework layer (framework layer), the hardware abstraction layer, the driver layer, and the hardware layer.
  • the application layer can include a series of application packages, such as dial-up applications, gallery applications, and so on.
  • the application package also includes a speech recognition SDK (software development kit).
  • the system of the terminal 100 and third-party applications installed on the terminal 100 can obtain speech recognition functions, including wake word recognition, through the speech recognition SDK.
  • the framework layer provides application programming interface (API) and programming framework for applications in the application layer.
  • the framework layer includes some predefined functions.
  • the framework layer may include a microphone service interface and a wake-up word recognition service interface.
  • the wake word recognition service interface can provide an application programming interface and programming framework for applications that obtain wake word recognition services.
  • the microphone service can be used to provide an application programming interface and programming framework for applications that call the microphone.
  • the hardware abstraction layer is the interface layer between the framework layer and the driver layer, providing a virtual hardware platform for the operating system.
  • the hardware abstraction layer may include a microphone hardware abstraction layer and a wake word recognition algorithm library.
  • the microphone hardware abstraction layer can provide virtual hardware for microphone 1, microphone 2, or more microphone devices.
  • the wake word recognition algorithm library may include running code and data to implement the wake word recognition method provided by the embodiment of the present application.
  • the driver layer is the layer between hardware and software.
  • the driver layer includes drivers for various hardware.
  • the driver layer can include microphone device drivers, digital signal processor drivers, etc.
  • the microphone device driver is used to drive the microphone sensor to collect sound signals, and drive the audio signal processor to preprocess the sound signals to obtain audio digital signals.
  • the digital signal processor driver is used to drive the digital signal processor to process audio digital signals.
  • the hardware layer includes sensors and audio signal processors.
  • the sensor includes microphone 1 and microphone 2.
  • the microphones included in the sensor correspond to the virtual microphones included in the microphone hardware abstraction layer one-to-one.
  • An audio signal processor can be used to convert the sound signal collected by the microphone into an audio digital signal.
  • Digital signal processors can be used to process audio digital signals.
  • the wake-up word wake-up function is always on. Therefore, when the terminal 100 is powered on, the speech recognition SDK will be enabled. In response to enabling the speech recognition SDK, the speech recognition SDK may call the wake word recognition service interface to obtain the application programming interface and programming framework provided by the wake word recognition service.
  • the wake-up word recognition service can call the microphone service at the framework layer and collect sound signals in the environment through the microphone service.
  • the microphone service can send instructions for collecting sound signals to the microphone 1 sensor of the hardware layer by calling microphone 1 in the microphone hardware abstraction layer.
  • the microphone hardware abstraction layer sends this instruction to the microphone device driver of the driver layer.
  • the microphone device driver can start the microphone 1 according to the above instructions, thereby acquiring the sound signal in the environment, and generating a digital audio signal through the audio signal processor.
  • the wake word recognition service can initialize the wake word recognition algorithm.
  • the wake word recognition algorithm can obtain, through the microphone hardware abstraction layer, the digital audio signal generated by the audio signal processor. Then, according to the digital audio signal processing method stored in the wake-up word recognition algorithm, the algorithm can use the digital signal processor to compute over the acquired digital audio signal to determine whether the wake-up word (default wake-up word / custom wake-up word) is detected.
  • when the wake-up word is the default wake-up word, the wake-up word recognition model used by the above-mentioned wake-up word algorithm library is the default wake-up word recognition model; when the wake-up word is a custom wake-up word, the wake-up word recognition model used by the above-mentioned wake-up word algorithm library is the custom wake-up word recognition model, that is, the fine model introduced above that is suited to recognizing custom wake-up words.
  • the wake word recognition algorithm can pass the recognition results back to the wake word recognition service and then back to the application layer.
  • when the wake word is recognized, the speech recognition SDK can trigger waking up the terminal 100 and entering the voice control mode; otherwise, the speech recognition SDK will not trigger waking up the terminal 100. A minimal sketch of this detection loop is given below.
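  • The sketch below assumes three callables that are not defined in this application: read_frame() returns PCM audio from the microphone pipeline, score_frame() stands in for the wake word recognition model, and wake_device() enters the voice control mode; it is an illustrative loop, not the described implementation.

```python
import numpy as np

WINDOW_SAMPLES = 16000  # assume a 1-second scoring window of 16 kHz mono PCM

def wake_word_loop(read_frame, score_frame, wake_device, threshold=0.8):
    """Keep scoring the rolling audio window; wake the device only on a confident hit."""
    window = np.zeros(WINDOW_SAMPLES, dtype=np.int16)
    while True:
        chunk = read_frame()                                  # e.g. 20 ms of audio from the HAL
        window = np.concatenate([window, chunk])[-WINDOW_SAMPLES:]
        if score_frame(window) >= threshold:                  # wake word recognized
            wake_device()                                     # trigger voice control mode
        # otherwise nothing is triggered and the loop keeps listening
```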
  • Figure 9 shows a schematic diagram of the hardware structure of the terminal 100.
  • the terminal 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, a headphone interface 170D, a sensor module 180, a button 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a subscriber identification module (SIM) card interface 195, and the like.
  • the sensor module 180 may include a pressure sensor 180A, a gyro sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, etc.
  • the structure illustrated in the embodiment of the present invention does not constitute a specific limitation on the terminal 100.
  • the terminal 100 may include more or less components than shown in the figures, or combine some components, or split some components, or arrange different components.
  • the components illustrated may be implemented in hardware, software, or a combination of software and hardware.
  • the processor 110 may include one or more processing units.
  • the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc.
  • the processor 110 includes an application processor, an audio signal processor, and a digital signal processor.
  • the application processor can be used to maintain the normal operation of the operating system and various application programs on the terminal 100.
  • An audio signal processor can be used to convert the sound signal collected by the microphone into an audio digital signal.
  • the digital signal processor can be used to process audio digital signals to implement the speech recognition function provided by the embodiments of the present application.
  • the controller can generate operation control signals based on the instruction operation code and timing signals to complete the control of fetching and executing instructions.
  • the processor 110 may also be provided with a memory for storing instructions and data.
  • the memory in processor 110 is cache memory. This memory may hold instructions or data that have been recently used or recycled by processor 110 . If the processor 110 needs to use the instructions or data again, it can be called directly from the memory. Repeated access is avoided and the waiting time of the processor 110 is reduced, thus improving the efficiency of the system.
  • processor 110 may include one or more interfaces.
  • Interfaces may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (SIM) interface, and/or a universal serial bus (USB) interface, etc.
  • the interface connection relationships between the modules illustrated in the embodiment of the present invention are only schematic illustrations and do not constitute a structural limitation on the terminal 100 .
  • the terminal 100 may also adopt different interface connection methods in the above embodiments, or a combination of multiple interface connection methods.
  • the charging management module 140 is used to receive charging input from the charger. While the charging management module 140 charges the battery 142, it can also provide power to the electronic device through the power management module 141.
  • the power management module 141 is used to connect the battery 142, the charging management module 140 and the processor 110.
  • the power management module 141 receives input from the battery 142 and/or the charging management module 140, and supplies power to the processor 110, the internal memory 121, the display screen 194, the camera 193, the wireless communication module 160, and the like.
  • the wireless communication function of the terminal 100 can be implemented through the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modem processor and the baseband processor.
  • Antenna 1 and Antenna 2 are used to transmit and receive electromagnetic wave signals.
  • the mobile communication module 150 can provide wireless communication solutions including 2G/3G/4G/5G applied to the terminal 100.
  • a modem processor may include a modulator and a demodulator. The modulator is used to modulate the low-frequency baseband signal to be sent into a medium-high frequency signal.
  • the demodulator is used to demodulate the received electromagnetic wave signal into a low-frequency baseband signal.
  • the wireless communication module 160 can provide wireless communication solutions applied on the terminal 100, including wireless local area network (WLAN) (such as a wireless fidelity (Wi-Fi) network), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR), and the like.
  • the terminal 100 implements the display function through the GPU, the display screen 194, and the application processor.
  • the GPU is an image processing microprocessor and is connected to the display screen 194 and the application processor. GPUs are used to perform mathematical and geometric calculations for graphics rendering.
  • Processor 110 may include one or more GPUs that execute program instructions to generate or alter display information.
  • the display screen 194 is used to display images, videos, etc.
  • Display 194 includes a display panel.
  • the display panel can use a liquid crystal display (LCD).
  • the display panel can also use an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a mini-LED, a micro-LED, a micro-OLED, quantum dot light emitting diodes (QLED), etc.
  • the electronic device may include 1 or N display screens 194, where N is a positive integer greater than 1.
  • the terminal 100 lighting up the screen in response to recognizing the wake-up word, and the terminal 100 displaying the user interfaces shown in FIGS. 6A-6I and 7A-7D, rely on the display functions provided by the GPU, the display screen 194, the application processor, and the like.
  • the terminal 100 can implement the shooting function through the ISP, camera 193, video codec, GPU, display screen 194, application processor, etc.
  • Camera 193 is used to capture still images or video.
  • the ISP is used to process the data fed back by the camera 193.
  • Video codecs are used to compress or decompress digital video.
  • Terminal 100 may support one or more video codecs.
  • NPU is a neural network (NN) computing processor.
  • the NPU can realize intelligent cognitive applications of the terminal 100, such as image recognition, face recognition, speech recognition, text understanding, etc.
  • the terminal 100 uses the wake word audio data accumulated during use to update the wake word recognition model, which can be completed by the NPU serving as the neural-network computing processor.
  • the internal memory 121 may include one or more random access memories (RAM) and one or more non-volatile memories (NVM).
  • Random access memory can include static random-access memory (SRAM), dynamic random-access memory (DRAM), synchronous dynamic random-access memory (SDRAM), double data rate synchronous dynamic random-access memory (DDR SDRAM; for example, fifth-generation DDR SDRAM is generally called DDR5 SDRAM), etc.
  • Non-volatile memory can include disk storage devices and flash memory.
  • the executable code that implements the speech recognition method provided by the embodiment of the present application may be stored in the NVM of the terminal 100, such as an SD card, etc.
  • when the terminal 100 runs the above code to provide the wake word recognition function, the terminal 100 may load the above code into RAM.
  • the terminal 100 can store the audio signal data collected and generated by the microphone in the cache of RAM or NVM.
  • the audio determined by the terminal 100 to be valid audio data can be further stored in the NVM by the terminal 100 for subsequent use in optimizing the wake word recognition model.
  • the random access memory can be directly read and written by the processor 110, can be used to store executable programs (such as machine instructions) of the operating system or other running programs, and can also be used to store user and application data, etc.
  • the non-volatile memory can also store executable programs and user and application program data, etc., and can be loaded into the random access memory in advance for direct reading and writing by the processor 110.
  • the external memory interface 120 can be used to connect an external non-volatile memory to expand the storage capability of the terminal 100 .
  • the external non-volatile memory communicates with the processor 110 through the external memory interface 120 to implement the data storage function. For example, save music, video and other files in external non-volatile memory.
  • the terminal 100 can implement audio functions through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the headphone interface 170D, and the application processor. Such as music playback, recording, etc.
  • the audio module 170 is used to convert digital audio information into analog audio signal output, and is also used to convert analog audio input into digital audio signals. Audio module 170 may also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be provided in the processor 110 , or some functional modules of the audio module 170 may be provided in the processor 110 .
  • Speaker 170A also called “speaker” is used to convert audio electrical signals into sound signals.
  • the terminal 100 can listen to music through the speaker 170A, or listen to a hands-free call.
  • Receiver 170B also called “earpiece” is used to convert audio electrical signals into sound signals.
  • the voice can be heard by bringing the receiver 170B close to the human ear.
  • Microphone 170C, also called a "mic" or "sound transducer", is used to convert sound signals into electrical signals. When making a call or sending a voice message, the user can speak with the mouth close to the microphone 170C to input the sound signal into the microphone 170C.
  • the terminal 100 may be provided with at least one microphone 170C. In other embodiments, the terminal 100 may be provided with two microphones 170C, which in addition to collecting sound signals, may also implement a noise reduction function. In other embodiments, the terminal 100 can also be equipped with three, four or more microphones 170C to collect sound signals, reduce noise, identify sound sources, and implement directional recording functions, etc.
  • the terminal 100 can collect environmental audio through the microphone 170C. Based on the audio signal collected and generated by the microphone 170C, the terminal 100 can detect whether it contains a wake-up word, and then determine whether to wake itself up and enter the voice control mode.
  • the headphone interface 170D is used to connect wired headphones.
  • the headphone interface 170D may be a USB interface 130, or may be a 3.5mm open mobile terminal platform (OMTP) standard interface, or a Cellular Telecommunications Industry Association of the USA (CTIA) standard interface.
  • the pressure sensor 180A is used to sense pressure signals and can convert the pressure signals into electrical signals.
  • the gyro sensor 180B may be used to determine the movement posture of the terminal 100 .
  • Air pressure sensor 180C is used to measure air pressure.
  • Magnetic sensor 180D includes a Hall sensor.
  • the terminal 100 may use the magnetic sensor 180D to detect the opening and closing of the flip cover.
  • the acceleration sensor 180E can detect the acceleration of the terminal 100 in various directions (generally three axes).
  • Distance sensor 180F is used to measure distance.
  • the terminal 100 can measure distance by infrared or laser.
  • Proximity light sensor 180G may include, for example, a light emitting diode (LED) and a light detector, such as a photodiode.
  • the terminal 100 uses a photodiode to detect infrared reflected light from nearby objects to determine that there are no objects near the terminal 100 .
  • the ambient light sensor 180L is used to sense ambient light brightness.
  • Fingerprint sensor 180H is used to collect fingerprints.
  • Temperature sensor 180J is used to detect temperature.
  • Touch sensor 180K also known as "touch device”.
  • the touch sensor 180K can be disposed on the display screen 194.
  • the touch sensor 180K and the display screen 194 form a touch screen, which is also called a "touch screen”.
  • the touch sensor 180K is used to detect a touch operation on or near the touch sensor 180K.
  • the touch sensor can pass the detected touch operation to the application processor to determine the touch event type.
  • Visual output related to the touch operation may be provided through display screen 194 .
  • the touch sensor 180K may also be disposed on the surface of the terminal 100 in a position different from that of the display screen 194 .
  • the terminal 100 detects the user's click, slide and other operations on the screen of the terminal 100, relying on the touch sensor 180K.
  • Bone conduction sensor 180M can acquire vibration signals.
  • the buttons 190 include a power button, a volume button, etc.
  • the terminal 100 may receive key input and generate key signal input related to user settings and function control of the terminal 100.
  • the motor 191 can generate vibration prompts.
  • the motor 191 can be used for vibration prompts for incoming calls and can also be used for touch vibration feedback.
  • the indicator 192 may be an indicator light, which may be used to indicate charging status, power changes, or may be used to indicate messages, missed calls, notifications, etc.
  • the SIM card interface 195 is used to connect a SIM card.
  • the term "user interface (UI)” in the description, claims and drawings of this application is a media interface for interaction and information exchange between an application or operating system and a user, which implements the internal form of information. Conversion to and from a user-acceptable form.
  • the user interface of an application is source code written in specific computer languages such as Java and extensible markup language (XML).
  • the interface source code is parsed and rendered on the terminal device, and finally presented as content that the user can recognize.
  • A control, also called a widget, is the basic element of a user interface. Typical controls include toolbars, menu bars, text boxes, buttons, scroll bars, images, and text.
  • the properties and contents of controls in the interface are defined through tags or nodes.
  • XML specifies the controls contained in the interface through nodes such as ⁇ Textview>, ⁇ ImgView>, and ⁇ VideoView>.
  • a node corresponds to a control or property in the interface. After parsing and rendering, the node is rendered into user-visible content.
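  • As a purely illustrative example (the layout below is invented for this sketch, not taken from any real interface definition), the standard-library Python snippet shows how such nodes can be read, with each node mapping to one control and its attributes to that control's properties.

```python
import xml.etree.ElementTree as ET

layout = """\
<Layout>
    <Textview text="Hello"/>
    <ImgView src="icon.png"/>
    <VideoView src="clip.mp4"/>
</Layout>
"""

root = ET.fromstring(layout)
for node in root:
    # each node corresponds to one control; its attributes are the control's properties
    print(node.tag, node.attrib)
```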
  • applications such as hybrid applications, often include web pages in their interfaces.
  • a web page also known as a page, can be understood as a special control embedded in an application interface.
  • a web page is source code written in a specific computer language, such as hypertext markup language (HTML), cascading style sheets (CSS), JavaScript (JS), etc.
  • web page source code can be loaded and displayed as user-recognizable content by a browser or a web page display component with functions similar to the browser.
  • the specific content contained in the web page is also defined through tags or nodes in the web page source code.
  • HTML defines the elements and attributes of the web page through <p>, <img>, <video>, and <canvas>.
  • the commonly used form of user interface is graphical user interface (GUI), which refers to a user interface related to computer operations that is displayed graphically. It can be an icon, window, control and other interface elements displayed on the display screen of the terminal device.
  • the controls can include visual interface elements such as icons, buttons, menus, tabs, text boxes, dialog boxes, status bars, navigation bars, and widgets.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center by wired (such as coaxial cable, optical fiber, or digital subscriber line) or wireless (such as infrared, radio, or microwave) means.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media.
  • the available media may be magnetic media (e.g., floppy disk, hard disk, tape), optical media (e.g., DVD), or semiconductor media (e.g., solid state drive), etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Telephone Function (AREA)
  • Telephonic Communication Services (AREA)

Abstract

一种语音识别方法和电子设备。该方法可应用于手机、平板电脑等终端设备上。终端设备可接收用户设定自定义唤醒词,然后利用语音合成器合成各种场景下的内容为上述自定义唤醒词的语音样本。利用语音样本,终端设备可以优化当前使用的自定义唤醒词识别模型,使之成为可以在各种场景下识别到自定义唤醒词的识别模型,从而提升识别准确率,提升用户使用体验。

Description

一种语音识别方法和电子设备
本申请要求于2022年04月29日提交中国专利局、申请号为202210468803.4、申请名称为“一种语音识别方法和电子设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及终端领域,尤其涉及一种语音识别方法和电子设备。
背景技术
目前,用户可以根据自身需求在个人的手机等终端设备上设置个性化的唤醒词,即自定义唤醒词。与默认唤醒词相比,自定义唤醒词缺少了针对唤醒词的涵盖各种音量、噪声、情绪的训练样本,所以识别自定义唤醒词的可靠性远低于默认唤醒词。
发明内容
本申请提供了一种语音识别方法,实施该方法手机等终端设备可以利用语音合成器合成各种场景下的内容为自定义唤醒词的语音样本,然后利用上述语音样本,终端设备可以优化当前使用的自定义唤醒词识别模型,使之成为可以在各种场景下识别到自定义唤醒词的识别模型。
第一方面,本申请提供了一种语音识别方法,该方法包括:确定第一唤醒词,第一唤醒词是用户设定的;根据第一唤醒词和预设的控制参数合成语音样本,语音样本是语音内容包括第一唤醒词的音频数据,控制参数用于控制合成的语音样本中所表现出的说话方式和/或说话场景;利用合成的语音样本对第一语音识别模型进行训练得到第二语音识别模型;第一语音识别模型为训练前用于识别第一唤醒词的语音识别模型,第二语音识别模型为训练后用于识别第一唤醒词的语音识别模型;使用第二语音识别模型识别麦克风采集的音频数据;当从麦克风采集的音频数据中识别到第一唤醒词时,唤醒终端设备。
实施第一方面提供的方法,手机等终端设备可以接收用户输入的自定义唤醒词,并合成各种场景下内容为自定义唤醒词的语音样本,然后,终端设备可利用上述语音样本优化当前使用的自定义唤醒词识别模型,从而提升该模式的识别准确率,使得优化后的模型可以在任何背景环境下都能识别到用户说出的自定义唤醒词。
结合第一方面提供的实施例,在一些实施例中,该方法还包括:确定麦克风采集的音频数据中成功唤醒终端设备的音频数据为有效音频数据;利用有效音频数据和合成的语音样本对第二语音识别模型进行优化,得到第三语音识别模型;使用第三语音识别模型处理麦克风采集的音频数据。
实施上述实施例提供的方法,终端设备还可在实施自定义唤醒词检测中,将包括自定义唤醒词并成功唤醒终端设备的环境音频确定为有效音频数据。然后终端设备可使用有效音频设备和合成语音样本优化当前使用的语音识别模型,从而周期地更新语音识别模型,提升语音识别模型的识别效果,提升用户使用体验。
结合第一方面提供的实施例,在一些实施例中,该方法还包括:确定麦克风采集的音 频数据中成功唤醒终端设备的音频数据为有效音频数据;利用有效音频数据对第二语音识别模型进行优化,得到第三语音识别模型;使用第三语音识别模型处理麦克风采集的音频数据。
实施上述实施例提供的方法,终端设备还可在实施自定义唤醒词检测中,将包括自定义唤醒词并成功唤醒终端设备的环境音频确定为有效音频数据。然后终端设备可使用有效音频设备优化当前使用的语音识别模型,从而周期地更新语音识别模型,提升语音识别模型的识别效果,提升用户使用体验。
结合第一方面提供的实施例,在一些实施例中,在利用有效音频数据和合成的语音样本对第二语音识别模型进行优化之前,该方法还包括:确认有效音频数据的数量大于等于第一数量阈值,第一数量阈值为预设的。
实施上述实施例提供的方法,终端设备可以积累有效音频数据,在累积的有效音频数据的数量达到预设的第一数量阈值之后,终端设备再利用有效音频数据和合成的语音样本对第二语音识别模型进行优化,以避免每确定一条有效音频数据就立即更新当前使用的语音识别模型造成的计算资源浪费。
结合第一方面提供的实施例,在一些实施例中,在利用有效音频数据对第二语音识别模型进行优化之前,该方法还包括:确认有效音频数据的数量大于等于第二数量阈值,第二数量阈值为预设的。
实施上述实施例提供的方法,终端设备可以积累有效音频数据,在累积的有效音频数据的数量达到预设的第二数量阈值之后,终端设备再利用有效音频数据对第二语音识别模型进行优化,以避免每确定一条有效音频数据就立即更新当前使用的语音识别模型造成的计算资源浪费。
结合第一方面提供的实施例,在一些实施例中,在对第二语音识别模型进行优化之前,该方法还包括:确认当前时刻在预设的更新时间范围内。
实施上述实施例提供的方法,终端设备可以避免在用户正在使用终端设备时更新语音识别模型,从而避免超负荷导致系统卡顿或异常、影响用户使用体验。
结合第一方面提供的实施例,在一些实施例中,控制参数包括韵律特征;韵律特征用于控制合成的语音样本中说话人的说话方式,说话人的说话方式包括下一项或多项:说话人的说话时的情绪、停顿。
这样,终端设备在合成语音样本时,可通过韵律特征控制合成语音样本中说话人的说话方式说话情景,以模拟各种情绪状态下说话人说出自定义唤醒词的音频。
结合第一方面提供的实施例,在一些实施例中,根据第一唤醒词和预设的控制参数合成语音样本,具体包括:将第一唤醒词和预设的韵律特征输入语音合成器;利用语音合成器合成N条语音样本,N≥1;
结合第一方面提供的实施例，在一些实施例中，该方法还包括：依次对N条语音样本进行数据增强处理，得到M条语音样本，所述M≥N。
实施上述实施例提供的方法,终端设备可以通过数据增强处理将合成的多条语音样本进行进一步扩充,得到数量更多的语音样本。这些语音样本之间存在些微的速度差异、音量差异、音调差异等等,从而进一步丰富合成的语音样本,模拟更多不同场景下的说话人 说出的自定义唤醒词音频。
结合第一方面提供的实施例,在一些实施例中,控制参数还包括噪声参数,噪声参数用于控制合成的语音样本中说话人的说话场景,依次对N条语音样本进行数据增强处理,具体包括:通过噪声参数对N条语音样本进行数据加噪。
实施上述实施例提供的方法，终端设备可以通过数据加噪模拟说话人在不同噪声环境中说出自定义唤醒词的音频数据。
结合第一方面提供的实施例,在一些实施例中,数据增强处理包括数据加噪,数据加噪所使用的噪声包括以下一项或多项:人声噪声、风声噪声、建筑噪声、交通噪声;或者,数据加噪所使用的噪声包括以下一项或多项:居家噪声、办公室噪声、商场噪声、公园噪声。
这样,终端设备在合成语音样本时,可通过数据加噪进一步获得基于不同使用环境的说话人说出自定义唤醒词的音频,从而获得更丰富的训练样本,以提升语音识别模型的鲁棒性。
结合第一方面提供的实施例,在一些实施例中,该方法还包括:从有效音频数据中提取韵律特征;利用第一唤醒词、控制参数中的韵律特征和提取的韵律特征更新合成的语音样本。
实施上述实施例提供的方法，终端设备还可从确定的包括自定义唤醒词并成功唤醒终端的有效音频数据中提取说话人的韵律特征。然后，终端可将上述提取的说话人的韵律特征与语音合成器中预设的韵律特征参数结合，合成新的语音样本，从而使得合成语音样本更加丰富。这样，基于更丰富的语音样本，终端可以得到更优的唤醒词识别模型。
结合第一方面提供的实施例,在一些实施例中,第一语音识别模型的输入层与第二语音识别模型的输入层中包括的数据处理层的数量相同;第一语音识别模型的输入层与第二语音识别模型的输入层中对应的数据处理层的参数相同。
实施上述实施例提供的方法,终端设备在优化模型的过程中,可以保持在前的数据处理层的数量相同以及参数相同,从而节省优化过程中的算法成本、时间成本等,提升模型优化的效率。
第二方面,本申请提供了一种电子设备,该电子设备包括一个或多个处理器和一个或多个存储器;其中,一个或多个存储器与一个或多个处理器耦合,一个或多个存储器用于存储计算机程序代码,计算机程序代码包括计算机指令,当一个或多个处理器执行计算机指令时,使得电子设备执行如第一方面以及第一方面中任一可能的实现方式描述的方法。
第三方面,本申请提供一种计算机可读存储介质,包括指令,当上述指令在电子设备上运行时,使得上述电子设备执行如第一方面以及第一方面中任一可能的实现方式描述的方法。
可以理解地,上述第二方面提供的电子设备、第三方面提供的计算机存储介质均用于执行本申请所提供的方法。因此,其所能达到的有益效果可参考对应方法中的有益效果,此处不再赘述。
附图说明
图1是本申请实施例提供的一种语音识别方法的流程图;
图2是本申请实施例提供的语音合成器合成样本语音的示意图;
图3是本申请实施例提供的模型优化的示意图;
图4A是本申请实施例提供的另一种语音识别方法的流程图;
图4B是本申请实施例提供的另一种语音识别方法的流程图;
图5A是本申请实施例提供的另一种语音识别方法的流程图;
图5B是本申请实施例提供的另一种语音识别方法的流程图;
图6A-图6I是本申请实施例提供的一组用户界面示意图;
图7A-图7D是本申请实施例提供的另一组用户界面示意图;
图8是本申请实施例提供的终端设备的系统结构示意图;
图9是本申请实施例提供的终端设备的硬件结构示意图。
具体实施方式
本申请以下实施例中所使用的术语只是为了描述特定实施例的目的,而并非旨在作为对本申请的限制。
手机、平板电脑等终端设备(终端100)可通过预设的唤醒词进入语音控制模式。语音控制模式是指用户通过说话控制终端100执行一个或多个操作。例如,在语音控制模式下,终端100可检测到用户说的“播放音乐”的命令,响应于上述命令,终端100可打开音乐应用播放音乐。
上述用于触发进入语音控制模式的唤醒词一般都是开发人员设定的,即默认唤醒词。可选的,现在终端100也支持用户在使用终端100的过程中设定个性化的唤醒词,即自定义唤醒词。
例如,在用户初始打开终端100时,终端100所使用的唤醒词为默认唤醒词,例如“你好,YOYO”。用户可根据终端100上提供的设置接口,将上述默认唤醒词更换为自定义唤醒词,例如“小花小花”。然后,终端100可通过检测唤醒词“小花小花”确认是否唤醒终端100并进入语音控制模式。
不限于手机、平板电脑,终端100还可以是桌面型计算机、膝上型计算机、手持计算机、笔记本电脑、超级移动个人计算机(ultra-mobile personal computer,UMPC)、上网本,以及蜂窝电话、个人数字助理(personal digital assistant,PDA)、增强现实(augmented reality,AR)设备、虚拟现实(virtual reality,VR)设备、人工智能(artificial intelligence,AI)设备、可穿戴式设备、车载设备、智能家居设备和/或智慧城市设备,本申请实施例对终端100的具体类型不作特殊限制。
然而,用户设定的自定义唤醒词具有很大的随机性。因此,终端100预置的唤醒词识别模型无法预先针对唤醒词对模型进行优化,更缺少涵盖各种音量、噪声、情绪的训练样本。因此,该模型针对自定义唤醒词的识别准确率低,鲁棒性低(对不同复杂使用场景,用户的不同发音习惯的适应能力)。
例如,当用户说出的自定义唤醒词的音量较小而环境噪声较大时,或用户情绪较为激动语速较快时,终端100往往难以准确快速地识别到上述自定义唤醒词。终端100对自定义唤醒词的识别准确率低,鲁棒性低,使得用户使用体验降低。
不限于唤醒词,在语音控制场景中,命令词等其他用于语音控制的词句也面临上述问题。
在本申请提供的一种实施例中,在确定用户设定的自定义唤醒词之后,终端100可向云获取其他终端设备采集并上传的内容为上述自定义唤醒词的语音样本。上述语音样本可涵盖各种音量、噪声、情绪场景。例如,在设定自定义唤醒词“小花小花”后,终端100可向云获取内容为“小花小花”的语音样本。若其他终端设备在此之前使用过“小花小花”这一自定义唤醒词,那么云上可存储有其他终端设备采集并上传的内容为“小花小花”的语音样本。这时,终端100可向云获取上述语音样本。
然后,终端100可利用上述语音样本对当前使用的唤醒词识别模型进行优化,使得上述唤醒词识别模型能够很好地适用于自定义唤醒词识别。
然而,云如果能提供上述涵盖各种音量、噪声、情绪的自定义唤醒词的语音样本,那就意味着,云需要从它所覆盖的终端设备中获取各个用户设定的自定义唤醒词,并经常地从各个用户的终端设备中获取包含自定义唤醒词的音频数据。这不仅需要极大地运维成本,还存在严重的隐私问题。在一些情况中,云上也可能不包括内容为用户设定的自定义唤醒词的语音样本,例如“小花小花”的语音样本,这时,终端100也就无法从云上获取内容为上述自定义唤醒词的语音样本。
为了解决自定义唤醒词识别准确率较低,鲁棒性较差的问题,同时又要保证用户个人数据是安全的,本申请实施例提供了一种语音识别方法。该方法可应用于终端100上。
首先,终端100可预置有泛化的唤醒词识别模型,简称粗模型。该粗模型可用于识别任意设定的自定义唤醒词。上述粗模型可以是原来用于识别默认唤醒词的唤醒词识别模型,也可以是单独的一个语音识别模型。但是,上述粗模型的识别准确率较低,容易受用户的情绪、停顿和使用环境的影响,可靠性较低。
这时,终端100可利用用户设定的自定义唤醒词文本、韵律特征,合成大量的模拟说话人在不同情景中说出自定义唤醒词的语音样本(合成语音样本)。韵律特征用于反映说话人说话方式,包括但不限于说话人情绪、停顿等特征。
合成语音样本可用于扩充粗模型中的自定义唤醒词的训练集。基于扩充后的训练集,终端100可以对上述粗模型进行再训练,得到适用于在各种语境和环境中检测自定义唤醒词的唤醒词识别模型,记为细模型。
基于上述细模型,终端100可以在各种语境和环境中识别用户是否说出自定义唤醒词,从而提升识别自定义唤醒词的准确率,提升用户使用体验。
进一步的,在使用上述细模型识别自定义唤醒词时,终端100还可将内容为自定义唤醒词且成功唤醒终端100有效音频数据加入到训练集中,不断更新当前使用的细模型,从而不断提高细模型的鲁棒性。
其中,终端100还可从上述有效音频数据中提取韵律特征。然后,终端100可使用预设的韵律特征控制参数和上述提取的韵律特征,控制合成的语音样本的韵律效果。其中,韵律特征控制参数的一个具体取值即一个韵律特征。然后,终端100再使用上述合成的语 音样本和/或有效音频数据,更新当前使用的细模型,从而进一步提升自定义唤醒词识别准确率和鲁棒性,提升用户使用体验。
与家庭使用的智能电视,智能音响等终端设备不同的,手机、平板电脑的用户通常只有一个人。因此,这一类终端设备往往还具有声纹识别能力,即识别说话人是否为机主。因此,在一些示例中,终端100还可在自定义唤醒词的过程中,进行声纹验证。在确定说话人为机主的情况下,才唤醒自身进入语音控制模式。
这时,有效音频数据进一步限定为机主喊出自定义唤醒词并成功唤醒终端100的音频数据。这样,利用上述有效音频数据,或利用基于上述有效音频数据提取的韵律特征,优化当前使用的细模型,可以进一步提升识别自定义唤醒词的识别能力,避免他人误唤醒的情况,提升用户使用体验。
图1示例性示出了本申请实施例提供的一种语音识别方法的流程图。下面结合图1具体介绍终端100实施该方法的具体过程。
S101、终端100确定自定义唤醒词。
初始场景下,终端100所使用的唤醒词为默认唤醒词,例如“你好,YOYO”。
终端100可为用户提供设定自定义唤醒词的接口。当用户想使用个性化的唤醒词时,用户可通过上述接口设定自定义唤醒词。例如,用户可通过上述接口将上述默认唤醒词“你好,YOYO”更换为自定义唤醒词,例如“小花小花”。例如,上述接口可以为输入文本框。终端100可通过上述输入文本框接收用户的输入的自定义唤醒词。
具体的,终端100中预置有粗模型。粗模型可用于识别任意唤醒词。在接收到用户输入的自定义唤醒词之后,终端100可设定上述粗模型的目标识别对象为上述自定义唤醒词。然后,终端100可利用上述粗模型检测用户是否说出自定义唤醒词。
例如,终端100接收到用户输入的自定义唤醒词为“小花小花”。这时,终端100中粗模型的目标识别对象可被设定为“小花小花”。然后,粗模型可用于识别任意音频数据中是否包括语音内容“小花小花”。
但是,由于缺少对于自定义唤醒词的深度训练,粗模型准确率较低。因此,当用户的情绪、语速和环境噪声发生变化时,粗模型容易识别不到用户说出的自定义唤醒词,或将用户说出的其他词语误识别为上述自定义唤醒词,影响用户使用体验。
S102、终端100根据自定义唤醒词文本和预设的韵律特征合成语音样本。
使用大量的内容为自定义唤醒词的音频数据训练粗模型,可以实现对粗模型的优化,从而得到自定义唤醒词识别准确率更高,识别效果更好的唤醒词识别模型,即细模型。然而,由于用户设定的自定义唤醒词是随机的,终端100无法通过麦克风采集或下载的方式,获取大量的内容为自定义唤醒词的音频数据。
这时,终端100可以利用用户输入的自定义唤醒词的文本和语音合成器,合成大量的内容为自定义唤醒词的音频数据。例如,在接收到用户输入的自定义唤醒词为“小花小花”之后,终端100可将“小花小花”的文本输入到语音合成器。然后,语音合成器可生成N 条内容为“小花小花”的音频数据。终端100可以控制合成的音频数据的数量,即N的取值。例如,N可以为1200等等,这样,终端100就可以得到1200条内容为“小花小花”的音频数据。
上述利用语音合成器合成的包含自定义唤醒词的音频数据称为合成语音样本。图2示例性示出了语音合成器合成语音样本的示意图。
首先,在使用语音合成器合成语音之前,终端100需要设定语音合成器的参数。上述参数包括目标语音内容,韵律特征控制参数。
目标语音内容用于指示语音合成器合成什么内容的语音。在本申请实施例中,输入语音合成器的自定义唤醒词文本指示了语音合成器的目标语音内容。韵律特征控制参数用于设定各种韵律特征。韵律特征用于反映说话人的说话方式情境,包括但不限于说话人说话时的情绪、停顿、抑扬顿挫等特征。韵律特征控制参数设定的各种韵律特征使得语音合成器合成的N条语音样本是丰富多样的,可以表示说话人在各种情绪状态下说出的自定义唤醒词。
例如:合成语音样本1、合成语音样本2的语音内容均为“小花小花”,其中,合成语音样本1还包括开心、快(停顿少)的特点;合成语音样本2还包括哭泣、较慢(停顿多)的特点。这样,合成语音样本1可以表示说话人在开心、快速地说出自定义唤醒词时的音频数据。合成语音样本2可以表示说话人在哭泣、较慢地说出自定义唤醒词时的音频数据。
在设定目标语音内容,韵律特征控制参数之后,语音合成器可合成相应的语音样本。
其中,一组参数组合可以合成一条或多条语音样本。一般的,一组参数组合可以合成多条语音样本。例如,在一次语音合成过程中,语音合成器接收到的参数包括:目标语音内容=“小花小花”,韵律特征控制参数=“开心、较快”。基于上述参数,终端100可以合成20条模拟用户在开心、较快语速的场景下说出的内容为“小花小花”的语音样本。这20条合成语音样本之间各存在一定的差异。上述差异是合成过程中固有的。
终端100还可以一次性设定多组参数,进而快速合成覆盖更多场景的样本数据。例如,语音合成器可以一次性接收到的10组参数。这10组参数的具体内容这里不再赘述。假设,语音合成器可针对每一组参数合成20条语音样本,这样,语音合成器最终可以合成200条语音样本。
经过语音合成器的处理,终端100可以获取到大量的内容为自定义唤醒词的音频数据,来模拟各种场景下采集到的说话人说出的自定义唤醒词音频。
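下面给出一段仅作示意的Python代码草图，演示按多组韵律参数批量合成唤醒词语音样本的思路。其中synthesize(text, emotion, rate)接口及参数名均为本示例的假设，并非本申请限定的语音合成器实现。

```python
import itertools

def build_synthetic_samples(wake_word, synthesize, n_per_group=20):
    """按情绪×语速的参数组合批量合成内容为自定义唤醒词的语音样本（示意实现）。"""
    emotions = ["开心", "平静", "哭泣", "激动"]   # 韵律特征控制参数：情绪
    rates = ["较快", "适中", "较慢"]              # 韵律特征控制参数：语速/停顿
    samples = []
    for emotion, rate in itertools.product(emotions, rates):
        for _ in range(n_per_group):              # 每组参数合成多条样本
            samples.append(synthesize(text=wake_word, emotion=emotion, rate=rate))
    return samples
```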
进一步的,在上述合成的语音样本的基础上,语音合成器还可以对上述语音样本进行数据增强处理,以获得更多的语音样本。
上述数据增强技术包括但不限于时频遮掩(时域遮掩、频域遮掩)、速度增强、音量增强、音调增强、数据加噪等增强技术。其中,数据加噪是指对生成的语音样本附加噪声效果,使得加噪后的语音样本还模拟了说话人所处的使用环境。数据加噪所使用的噪声包括但不限于人声噪声、风声噪声、建筑噪声、交通噪声等等。或者,根据空间场景划分,数据加噪所使用的噪声还可是居家噪声、办公室噪声、商场噪声、公园噪声等等。本申请实施例对此不作限制。其中,各类噪声又可包括[I、II、III]等不同强度。例如I类人声噪声可表示人声噪声较小,III类人声噪声可表示人声噪声较大。
结合上述示例,在合成200条语音样本后,语音合成器可对上述合成语音样本中的任意一条合成语音样本进行一次或多次数据增强处理,从而将一条合成语音样本进一步扩充到多条合成语音样本。例如,语音合成器可对合成语音样本001分别进行5次数据增强处理,上述5次数据增强处理可以为上述介绍的时域遮掩、频域遮掩、速度增强、音量增强、音调增强以及数据加噪(人声、风声、建筑、交通等)中的任意组合。合成语音样本001是上述200条合成语音样本中的任意一条。这样,经过数据增强处理,语音合成器可根据合成语音样本001得到另外5条合成语音样本。于是,上述200条合成语音样本可进一步扩充到1200条。
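以下为一段仅作示意的数据增强Python代码草图（基于numpy，函数划分与参数均为示例假设），分别对应数据加噪、音量增强和速度增强三种处理。

```python
import numpy as np

def add_noise(wave, noise, snr_db):
    """按给定信噪比(dB)叠加噪声，模拟不同噪声环境下的唤醒词音频（示意实现）。"""
    noise = np.resize(noise, wave.shape)
    p_sig = np.mean(wave ** 2) + 1e-12
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_sig / (p_noise * 10 ** (snr_db / 10)))
    return wave + scale * noise

def change_volume(wave, gain):
    """音量增强：对幅度做简单缩放并限幅。"""
    return np.clip(wave * gain, -1.0, 1.0)

def change_speed(wave, factor):
    """速度增强：用线性插值重采样近似变速（示意实现，未保持音调）。"""
    idx = np.arange(0, len(wave) - 1, factor)
    return np.interp(idx, np.arange(len(wave)), wave)
```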
S103、终端100基于合成语音样本对粗模型进行优化,生成适用于自定义唤醒词的细模型。
在经过语音合成器处理后,终端100可获得大量的内容为自定义唤醒词的合成语音样本。这时,终端100可利用上述大量的合成语音样本对当前使用的粗模型进行训练,使其学习到更多内容为自定义唤醒词的音频数据的特征。训练后的模型即适用于识别自定义唤醒词的细模型。
具体的,图3示例性示出了对粗模型进行优化的示意图。
如图3所示,左侧矩形框31可表示粗模型的网络结构31。网络结构31可包括7个数据处理层(“1”~“7”)。上述数据处理层可以为卷积神经网络中的卷积层,本申请实施例对此不作限制。上述数据处理层的数量为示例性的,也可以包括更多或更少的处理层。
网络结构31中的数据处理层可分为输入层和输出层。网络结构中在前的数据处理层可称为输入层,在后的数据处理层可称为输出层。如图3所示,示例性的网络结构31中的前3层为输入层(“1”~“3”),后4层为输出层(“4”~“7”)。
在对粗模型进行优化得到细模型的过程中,终端100会将合成语音样本输入网络结构31,并调整网络结构31的数据处理层,包括调整数据处理层的数量,和/或,调整数据处理层的参数等等。在本申请实施例中,终端100保持原始粗模型的输入层不变动,即不增加输入层的数量,也不变更输入层的参数。终端100只调整输出层的配置(数据处理层的数量,和/或,数据处理层的参数),以使得更加适用于识别各种场景下的自定义唤醒词。这样,终端100可以优化时的计算成本,提高模型优化过程中的模型训练效率。
在使用合成语音样本训练粗模型,调整粗模型的输出层的配置之后,终端100可以得到适用于自定义唤醒词的细模型。右侧矩形框32可表示细模型的网络结构32。其中,网络结构32中的输入层(“1”~“3”)的配置与网络结构31中的输入层一致,但是,网络结构32中的输出层(“4'”~“6'”)的配置与网络结构31中的输出层不同。上述输出层不同包括输出层中的数据处理层的数量不同,和/或,数据处理层的参数不同。
实施图3所示的方法,终端100可以在预置的粗模型的基础上,通过合成的语音样本的对该粗模型进行优化,得到适用于自定义唤醒词的细模型,从而实现在各种场景下检测用户是否说出自定义唤醒词,为用户提供更好的唤醒词识别服务。
特别的,相比于利用合成语音样本直接训练得到一个识别自定义唤醒词的方法,基于预置的粗模型的训练方法为终端100节省了一定的训练成本,包括算法成本、时间成本等 等,使得终端100可以在执行更少的计算的前提下,得到适用于自定义唤醒词的语音识别模型。
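下面给出一段仅作示意的PyTorch代码草图，演示“冻结在前的输入层、只调整在后的输出层”这一优化思路。网络层数、特征维度与优化器设置均为示例假设，并非本申请限定的模型结构。

```python
import torch
import torch.nn as nn

class WakeWordNet(nn.Module):
    """示意性唤醒词识别网络：input_layers对应图3中的输入层，output_layers对应输出层。"""
    def __init__(self, n_mels=40, n_classes=2):
        super().__init__()
        self.input_layers = nn.Sequential(
            nn.Conv1d(n_mels, 64, 3, padding=1), nn.ReLU(),
            nn.Conv1d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv1d(64, 64, 3, padding=1), nn.ReLU(),
        )
        self.output_layers = nn.Sequential(
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, n_classes),
        )

    def forward(self, x):                      # x: (batch, n_mels, time)
        return self.output_layers(self.input_layers(x))

def fine_tune(coarse_model, loader, epochs=5):
    """用合成语音样本训练粗模型：输入层参数保持不变，仅更新输出层参数。"""
    for p in coarse_model.input_layers.parameters():
        p.requires_grad = False
    optimizer = torch.optim.Adam(coarse_model.output_layers.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for feats, labels in loader:
            optimizer.zero_grad()
            loss = loss_fn(coarse_model(feats), labels)
            loss.backward()
            optimizer.step()
    return coarse_model                        # 训练后即为“细模型”
```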
S104、终端100通过麦克风获取环境音频。
终端100的麦克风可实时地采集环境声音,生成环境音频。终端100可将采集到的环境音频输入到细模型中,识别上述环境音频中是否包括自定义唤醒词,即检测说话人是否说出自定义唤醒词。
S105、终端100判断是否识别到自定义唤醒词。
当识别到环境音频中包括自定义唤醒词时,终端100可唤醒自身,并进入语音控制模式。
例如,在灭屏状态或灭屏AOD(Always on Display)的状态下,在识别到自定义唤醒词后,终端100可点亮屏幕并显示语音助手图标,示意用户自身已被唤醒。在显示主界面或其他应用程序界面的状态下,在识别到自定义唤醒词后,终端100可显示语音助手图标等等。这样,用户可继续向终端100下发语音指令。终端100可根据识别到的语音指令执行相应地操作。
当输入的环境音频中不包括自定义唤醒词时,终端100会重新识别麦克风输入的新采集到的环境音频,直到识别到自定义唤醒词。
实施上述方法,终端100可以在用户设定自定义唤醒词后,自动生成内容为自定义唤醒词的语音样本。基于上述语音样本,终端100可对现有的唤醒词识别模型进行优化,从而得到准确率更高、环境适应性更强的唤醒词识别模型,以实现在各种场景下准确识别用户自定义唤醒词的功能。
在一些示例中,在基于合成语音样本得到的适用于自定义唤醒词的细模型之后,终端100还可在实施自定义唤醒词检测中,将真实采集到的内容为自定义唤醒词的音频数据输入到细模型中,进一步优化细模型,提升细模型的识别效率,提升用户使用体验。
如图4A所示,在执行S101~S105所示的步骤之后,终端100还可执行步骤S106、S107,将真实采集到的内容为自定义唤醒词的音频数据输入到细模型中,进一步优化细模型。
S106、终端100将唤醒终端100的有效音频数据加入到细模型中。
麦克风采集环境声音生成的环境音频中可能包括自定义唤醒词,也可能不包括自定义唤醒词。当环境音频包括自定义唤醒词时,细模型可识别到自定义唤醒词,于是,终端100可确认唤醒自身。上述包括自定义唤醒词并成功唤醒终端100的环境音频可称为有效音频数据。
相比于语音合成器合成的语音样本,有效音频数据是终端100真实采集到的说话人在某一场景下说出自定义唤醒词的音频数据。利用有效音频数据训练当前使用的基于合成语音样本生成的细模型,可以进一步提升该细模型的识别效果,使得终端100可以更准确更快速地检测用户是否说出自定义唤醒词。
因此,在每次确认识别到自定义唤醒词之后,终端100可将对应的有效音频数据加入到当前使用的细模型的训练集中。然后,细模型可使用语音合成器合成的语音样本和上述有效音频数据进行再训练,从而更新当前使用的细模型。
相比于更新前的细模型,更新后的细模型的识别准确率高。在各种环境中,终端100可以更加快速准确地识别到用户说出的自定义唤醒词。
S107、终端100确认有效语音样本数量是否足够且是否在更新时间内?
可以理解的,每加入一条有效音频数据到当前使用的细模型就更新该模型,是十分浪费终端100计算资源的。
因此,终端100可以监测新增的有效音频数据的数量。当新增的有效音频数据的数量满足数量阈值的要求时,终端100可利用上述新增的全部有效音频数据和合成语音样本更新当前使用的细模型。可选的,终端100也可全部使用有效音频数据进一步优化当前使用的细模型。
例如,上述数量阈值可以为100。当新增的有效音频数据的数量达到100条时,终端100可开始优化当前使用的细模型。终端100可利用上述100条新增的有效音频数据和前述合成的1200条合成语音样本优化当前使用的细模型,得到更新后的细模型,例如细模型2.0。然后,终端100可使用更新后的细模型(细模型2.0)检测自定义唤醒词。
当然,如果累计的有效音频的数量较大,终端100也可直接利用上述累计的有效音频数据优化当前使用的细模型,得到更新后的细模型。例如,上述数量阈值可以为1000。当新增的有效音频数据的数量达到1000条时,终端100可直接利用上述1000条新增的有效音频数据优化当前使用的细模型,而不再需要使用合成语音样本。
进一步的,在累计的有效音频数据的数量满足数量阈值的基础上,终端100还可检测当前时间是否符合更新时间的要求。这里,更新时间是指在预设的不会影响用户当前使用体验的空闲时间,例如凌晨1点~4点。上述更新时间还可以是用户指定的时间。
这样,终端100可以避免在用户正在使用该设备时更新细模型,从而避免超负荷导致系统卡顿或异常、影响用户使用体验。
在一些示例中,在开始更新前,终端100可以在屏幕上显示包含更新提示的弹窗。用户在看到上述弹窗之后,可确定识别自定义唤醒词的模型已更新,以后可以获取更好的自定义唤醒词识别服务。
当然,在一些示例中,在开始更新前,终端100可以在屏幕上显示包含更新提示和选择控件的弹窗。用户可以选择更新或不更新。进一步的,用户可以选择立即更新,或稍后更新,或设定一个时间进行更新,以避免在终端100繁忙的时候更新,影响用户使用体验。
在一些示例中,终端100还可从有效音频数据中提取说话人的韵律特征。然后,终端100可以结合上述提取的说话人的韵律特征和预设的韵律特征控制参数,并使用结合后的韵律特征控制合成语音样本。
参考图4B,在S106之后,终端100可从确定的包括自定义唤醒词并成功唤醒终端100的有效音频数据中提取说话人的韵律特征。然后,终端100可将上述提取的说话人的韵律 特征与语音合成器中预设的韵律特征控制参数结合,并使用上述提取的和预设的韵律特征合成新的语音样本。参考图2,这时,图2中的韵律特征既包括预设的韵律特征控制参数,还包括提取的韵律特征。这样,终端100可以得到更多的韵律特征,从而使得合成语音样本覆盖更多的不同情绪、不同停顿的说话方式。
然后,终端100可利用上述新的语音样本训练当前使用的细模型,进而实现进一步的细模型优化。这时,优化后的细模型的准确率更高,识别效果更好。
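以下给出从有效音频数据中提取简化韵律特征的示意性Python代码草图（仅用numpy计算帧能量与停顿比例作为韵律的粗略代理，帧长、门限等参数均为示例假设）。

```python
import numpy as np

def extract_prosody(wave, sr=16000, frame_ms=25, hop_ms=10, energy_floor=1e-4):
    """从一段有效音频中提取帧能量均值、停顿比例与有效发声时长（示意实现）。"""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n = max(1, 1 + (len(wave) - frame) // hop)
    energy = np.array([np.mean(wave[i * hop:i * hop + frame] ** 2) for i in range(n)])
    voiced = energy > energy_floor
    return {
        "mean_energy": float(energy.mean()),
        "pause_ratio": float(1.0 - voiced.mean()),         # 停顿占比，反映说话停顿
        "voiced_seconds": float(voiced.sum() * hop / sr),  # 有效发声时长，间接反映语速
    }
```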
在一些示例中,终端100在识别唤醒词等语音指令时还会进行声纹识别,即识别说话人是否为机主。当识别到唤醒词且确认该唤醒词是机主说出的时,终端100才会被唤醒。
图5A示例性示出了本申请实施例提供的另一种语音识别方法的流程图。
S201、终端100确定自定义唤醒词,同时进行用户注册,确定机主的声纹信息。
终端100在设定自定义唤醒词时,除了像S101中介绍的确定自定义唤醒词的文本之外,还可以获取机主的声纹信息。声纹信息是指描述说话人身份的音频信息。一个用户的声纹信息的唯一的,用于标记该用户。
示例性的,在接收到上述自定义唤醒词的文本数据之后,终端100可指示当前用户进行用户注册。在进行用户注册的过程中,终端100可确定该用户的声纹信息(机主声纹信息)。例如,终端100可指示用户重复3次自定义唤醒词。在用户重复自定义唤醒词时,终端100的麦克风可采集对应的音频数据,即注册语音数据。上述注册语音数据可用于提取机主的声纹信息。
在一些示例中,终端100还可从以往采集的默认唤醒词的音频数据中提取机主的声纹信息。
S202、终端100根据自定义唤醒词文本和预设的韵律特征控制参数合成语音样本。
S203、终端100基于合成语音样本对粗模型进行优化,生成适用于自定义唤醒词的细模型。
S204、终端100通过麦克风获取环境音频。上述S202~S204可参考图1中S102~S104的介绍,这里不再赘述。
S205、终端100判断是否识别到自定义唤醒词。
在本申请实施例中，在接收到麦克风采集的环境音频后，优化后得到的细模型可首先识别上述音频中是否包括自定义唤醒词。当上述音频中不包括自定义唤醒词时，终端100会重新将新采集到的环境音频输入到细模型中，继续识别，直到识别到自定义唤醒词。
S206、终端100判断说话人是否为机主。
在本申请实施例中,在识别到自定义唤醒词后,终端100还会进行声纹验证,确定采集到的环境音频中包括的自定义唤醒词是否是机主说出的,即确定说话人是否为机主。
在确认说话人是机主后,终端100可确认唤醒自身,然后进入语音控制模式。当确认说话人不是机主时,终端100不会唤醒自身。这时,终端100会继续采集到的当前的环境音频,继续识别新采集的音频是否包括自定义唤醒词,是否为机主说出的,直到识别到机主说出自定义唤醒词。
可以理解的,终端100执行S205、S206的顺序还可交换,即先确认说话人是否为机主, 再确认语音内容是否包括自定义唤醒词。在一些示例中,终端100识别麦克风输入的环境音频后,同时输出声纹识别结果和自定义唤醒词识别结果。这时,当声纹识别结果和自定义唤醒词识别结果分别满足上述要求时,终端100确认唤醒自身。存在任意一个不满足要求,则终端100不唤醒自身。
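下面是一段仅作示意的Python代码草图，演示“唤醒词识别与声纹验证同时满足才唤醒”的判定逻辑。其中kws_score_fn、embed_fn等接口与阈值均为示例假设，并非本申请限定的实现。

```python
import numpy as np

def cosine(a, b):
    """计算两个声纹向量的余弦相似度。"""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def enroll_owner(embed_fn, enroll_waves):
    """用户注册：对机主多次说出自定义唤醒词的音频取声纹向量均值，作为机主声纹信息。"""
    return np.mean([embed_fn(w) for w in enroll_waves], axis=0)

def should_wake(wave, kws_score_fn, embed_fn, owner_vec,
                kws_threshold=0.8, vp_threshold=0.7):
    """识别到自定义唤醒词且说话人为机主时才返回True，即才唤醒终端。"""
    has_wake_word = kws_score_fn(wave) >= kws_threshold
    is_owner = cosine(embed_fn(wave), owner_vec) >= vp_threshold
    return has_wake_word and is_owner
```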
S207、终端100将唤醒终端100的有效音频数据加入到细模型中。
这时,在S205、S206中,能够唤醒终端100的环境音频(包括自定义唤醒词且说话人为机主)可称为有效音频数据。在确定有效音频数据之后,终端100可将有效音频数据加入到细模型中,扩充细模型的语音样本,进而提升细模型的识别效果。
S208、终端100确认有效语音样本数量是否足够且是否在更新时间内?
参考S107,在加入有效音频数据之后,终端100可监测新增的有效音频数据的数量以及当前时间。当新增的有效音频数据的数量满足数量阈值的要求,且在更新时间内时,终端100可确定更新当前使用细模型。终端100可使用上述有效音频数据和/或合成语音样本更新当前使用细模型,得到准确率更高,识别效果更好的细模型。
结合图4B所示的方法,在确定有效音频数据之后,终端100也可从上述有效音频数据中提取韵律特征,结合预设的韵律特征控制参数合成新的语音样本,进而进一步更新当前使用的细模型,这里不再赘述。
在一些实施例中,在确定自定义唤醒词后,终端100也可不立即使用合成的语音样本更新粗模型。
参考图5B,在确定自定义唤醒词之后,终端100预置的粗模型可识别音频数据中是否为内容包括上述自定义唤醒词的音频数据。这时,终端100即可识别麦克风采集的环境音频是否为内容包括自定义唤醒词的音频数据,即检测是否有用户说出自定义唤醒词。当识别到自定义唤醒词且确认说话人为机主时,终端100可从确认唤醒自身,并确认上述包含自定义唤醒词且确认说话人为机主的音频为有效音频数据。
然后,终端100可从上述有效音频数据中提取韵律特征。利用上述提取的韵律特征和自定义唤醒词文本、预设的韵律特征控制参数合成语音样本。当合成的语音样本的数量足够,且在更新时间内时,终端100可将原始的粗模型更新为识别准确率更高、鲁棒性更好的细模型。
这样,终端100可以通过更少的计算操作得到一个适用于识别用户自定义唤醒词的识别模型,例如,减少了图1中S102所示的利用自定义唤醒词文本、预设的韵律特征控制参数合成语音样本的计算操作;减少了图1中S103所示的利用上述语音样本训练粗模型的计算操作等等。
在本申请实施例中:
在图1中,终端100接收的用户设定的自定义唤醒词(例如“小花小花”)可称为第一唤醒词。终端100在确定自定义唤醒词之后首先确定的识别自定义唤醒词的粗模型可称为第一语音识别模型,使用合成语音样本训练后得到的适应于识别自定义唤醒词的细模型可称为第二语音识别模型。在图4A中,利用有效音频数据和/或合成语音样本对当前使用的 细模型进行训练后得到的细模型,例如细模型2.0可称为第三语音识别模型。在图4B中,利用基于有效音频数据提取到的韵律特征更新合成语音样本,进而更新当前使用的细模型之后,得到的细模型(细模型2.0)也可称为第三语音识别模型。
在图2中,语音合成器中的韵律特征控制参数和噪声可称为控制参数。
对应使用有效音频数据和合成语音样本优化细模型的方法中,设定的有效音频数据数量阈值(例如100)可称为第一数量阈值。对应全部使用有效音频数据优化细模型的方法中,设定的有效音频数据数量阈值(例如1000)可称为第二数量阈值。
图6A-图6I、图7A-图7D示例性示出了终端100实施上述语音识别方法的一组用户界面。
首先,图6A-图6I示出了终端100为用户提供设定自定义唤醒词功接口的一组用户界面。
图6A示例性示出了终端100上的设置自定义唤醒词的用户界面61。
如图6A所示,该界面可包括多个设置选项,例如“应用”、“电池”等选项。在本申请实施例中,用户界面61中还包括“智慧助手”选项。该选项可用于设置语音助手、快捷动作等与快捷控制相关的设置项。
终端100可检测到作用于上述“智慧助手”选项的用户操作,例如点击操作。响应于上述操作,终端100可显示图6B所示的用户界面62。
用户界面62可包括多个用于设置语音助手、快捷动作等与快捷控制相关的设置项,例如“智慧语音”、“辅助视觉”、“智慧识屏”等设置项。其中,“智慧语音”可用于设置唤醒词、命令词等于语音控制相关的设置项。
终端100可检测到作用于上述“智慧语音”选项的用户操作,响应于上述操作,终端100可显示图6C所示的用户界面63。
用户界面63可包括“语音唤醒”和“智能服务”设置项。
“语音唤醒”可用于设置开启或关闭唤醒词识别功能,以及设置默认唤醒词或自定义唤醒词。“智能服务”可用于设置开启或关闭自学习更新功能。上述自学习更新功能是指:终端100根据用户使用唤醒词控制功能的实际情况,调整语音识别模型的结果和/或参数,以提升识别准确率的功能。
如用户界面63所示,此时,终端100的“语音唤醒”功能为关闭的,“智能服务”功能也是关闭的。首先,终端100可检测到作用于上述“语音唤醒”选项的用户操作,响应于上述操作,终端100可显示图6D所示的用户界面64。
用户界面64可包括控件641。
控件641可用于设置开启或关闭唤醒词识别功能。首先,控件641为关闭的(OFF),参考用户界面63中控件631。控件641关闭对应“语音唤醒”功能关闭。当检测到作用于控件641的用户操作时,控件641可变为开启的(ON)。这时,终端100启用“语音唤醒” 功能,即终端100开始识别用户是否说出唤醒词。
用户界面64还包括控件642,控件643。
控件642可用于设置默认唤醒词。控件643可用于设置自定义唤醒词。在开启“语音唤醒”功能时,终端100可首先选定默认唤醒词。用户可以通过作用于控件643上的操作,将默认唤醒词切换为自定义唤醒词。当检测到作用于控件643上的用户操作时,终端100可显示图6E所示用户界面65。
用户界面65可包括窗口651。窗口651可用于设置自定义唤醒词。
窗口651可包括输入框652。输入框652可用于接收用户输入的自定义唤醒词。窗口651还包括输入框652。当检测到作用于取消控件653上的用户操作时,终端100可取消使用自定义唤醒词并关闭窗口651。当检测到作用于确认控件654上的用户操作时,终端100可确定使用自定义唤醒词并关闭窗口651。
如图6E所示,终端100可接收到用户输入并确定使用自定义唤醒词“小花小花”的操作,响应于上述操作,终端100可显示图6F所示的用户界面66。
如用户界面66所示,此时,终端100可显示已选定自定义唤醒词,并在自定义唤醒词控件643中显示用户设定的自定义唤醒词的具体内容(“小花小花”)。然后,终端100可检测到作用于退出控件644上的用户操作,响应于上述操作,终端100可显示图6G所示的用户界面67。
此时,在用户界面67中,“语音唤醒”选项中可显示“已开启”,以提示用户已开启唤醒词识别功能。
然后,终端100还可检测到作用于用户界面67中“智能服务”选项的用户操作,响应于上述操作,终端100可显示图6H所示的用户界面68。在用户界面68中,控件631由关闭(OFF)状态变更为开启状态(ON)。这时,终端100可在后续识别自定义唤醒词的过程中,记录用户说出的自定义唤醒词,以及用户唤醒终端100之后的操作,进而改进语音识别模型,使得终端100可以更准确识别地自定义唤醒词。
在设定自定义唤醒词之后,终端100可在“智慧助手”设置界面显示用户设定的自定义唤醒词“小花小花”,参考图6I所示的用户界面69。这样,用户在每次打开上述界面时,可以清楚地了解到当前使用的唤醒词。
结合图4A中S106、S107的介绍,在终端100累计的成功唤醒终端100的有效音频数据满足预设数量后,终端100可使用上述有效音频数据更新当前使用的唤醒词识别模型。
在一些示例中,终端100可以自动更新,无需用户确认是否更新。在更新完之后,终端100可显示更新完成的通知,以提示用户享用更好的唤醒词识别服务。图7A示例性示出了终端100显示更新完成的通知的用户界面71。用户界面71中可包括通知711。通知711中可显示“已优化语音识别系统到2.0版本”,以提示用户。
在一些示例中,终端100可以询问用户是否更新。图7B示例性示出了终端100询问用 户是否更新的用户界面72。用户界面71中可包括通知721。通知721中可显示“检测到可更新到语音识别系统2.0”,以提示用户可以更新唤醒词识别模型。通知721还可包括取消控件722、更新控件723。当检测到作用于取消控件722的用户操作时,终端100可取消更新唤醒词识别模型。检测到作用于更新控件723的用户操作时,终端100可开始更新唤醒词识别模型。
进一步的,在一些示例中,在确定更新唤醒词识别模型后,终端100可以从用户处获取优选的更新时间。在检测到作用于更新控件723的用户操作后,如图7C所示的用户界面73,终端100可显示窗口731。窗口731可包括选项732和选项733。选项732可为用户提供设置自定义更新时间的功能。例如,终端100可在检测到作用于选项732上的操作之后,接收到用户设定的1小时的更新时间设置。然后,终端100可开始计时1个小时,并在计时1小时结束后开始更新唤醒词识别模型。选项733。可用于设定在空闲时间(夜晚,例如1:00-4:00)更新。
可选的,在更新唤醒词识别模型后,终端100可在“智慧助手”设置界面显示当前使用的自定义唤醒词识别模型的版本,参考图7D所示的用户界面74中示出的“小花小花V2.0最新版本”。这样,当用户打开“智慧助手”设置界面时,用户可以了解当前使用的唤醒词识别模型的版本,以及是否是最新的。当上述版本不是最新的是,用户可指示终端100更新到最新的,以获取更好唤醒词识别服务。
图8为本申请实施例提供的终端100的系统结构示意图。
分层架构将系统分成若干个层，每一层都有清晰的角色和分工。层与层之间通过软件接口通信。在一些实施例中，将系统分为五层，从上至下分别为应用程序层（应用层），应用程序框架层（框架层）、硬件抽象层、驱动层以及硬件层。
应用层可以包括一系列应用程序包,例如拨号应用、图库应用等等。在本申请实施例中,应用程序包还包括语音识别SDK(software development kit)。终端100的系统和终端100上安装的第三应用程序,可通过语音识别SDK获取唤醒词识别在内的语音识别功能。
框架层为应用层的应用程序提供应用编程接口(application programming interface,API)和编程框架。框架层包括一些预先定义的函数。在本申请实施例中,框架层可以包括麦克风服务接口和唤醒词识别服务接口。其中,唤醒词识别服务接口可为获取唤醒词识别服务的应用提供应用编程接口和编程框架。麦克风服务可用于为调用麦克风的应用提供应用编程接口和编程框架。
硬件抽象层为位于框架层以及驱动层之间的接口层,为操作系统提供虚拟硬件平台。本申请实施例中,硬件抽象层可以包括麦克风硬件抽象层以及唤醒词识别算法库。麦克风硬件抽象层可以提供麦克风1、麦克风2或更多的麦克风设备的虚拟硬件。唤醒词识别算法库可包括实现本申请实施例提供的唤醒词识别方法的运行代码和数据。
驱动层为硬件和软件之间的层。驱动层包括各种硬件的驱动。驱动层可以包括麦克风设备驱动、数字信号处理器驱动等。麦克风设备驱动用于驱动麦克风传感器采集声音信号,以及驱动音频信号处理器对声音信号进行预处理,得到音频数字信号。数 字信号处理器驱动用于驱动数字信号处理器处理音频数字信号。
硬件层包括传感器和音频信号处理器。其中,传感器包括麦克风1、麦克风2.传感器中包括的麦克风与麦克风硬件抽象层中包括的虚拟的麦克风一一对应。音频信号处理器可用于将麦克风采集的声音信号转化为音频数字信号。数字信号处理器可用于处理音频数字信号。
下面结合上述硬件结构以及系统结构,对本申请实施例中方法进行具体描述:
一般的,终端100在开机状态下,唤醒词唤醒功能是常开的。因此,在终端100开机时,语音识别SDK就会被启用。响应于启用语音识别SDK,语音识别SDK可调用唤醒词识别服务接口,获取唤醒词识别服务提供应用编程接口和编程框架。
一方面,唤醒词识别服务可调用框架层的麦克风服务,通过麦克风服务采集环境中的声音信号。其中,麦克风服务可通过调用麦克风硬件抽象层中的麦克风1,向硬件层的麦克风1传感器发送采集声音信号的指令。麦克风硬件抽象层将该指令发送到驱动层的麦克风设备驱动。麦克风设备驱动依据上述指令可以启动麦克风1,从而获取到环境中的声音信号,并通过音频信号处理器生成数字音频信号。
另一方面,唤醒词识别服务可初始化唤醒词识别算法。唤醒词识别算法可通过麦克风硬件抽象层获取音频信号处理器生成数字音频信号。然后,根据唤醒词识别算法中存储的数字音频信号处理方法,唤醒词识别算法可利用数字信号处理器对获取到的数字音频信号进行计算,从而确定是否检测到唤醒词(默认唤醒词/自定义唤醒词)。
可以理解的,结合前面图1、图4A、图4B、图5A以及图5B所示方法流程图,当唤醒词为默认唤醒词时,上述唤醒词算法库所使用的唤醒词识别模型为默认唤醒词的识别模型。当唤醒词为自定义唤醒词时,上述唤醒词算法库所使用的唤醒词识别模型为自定义唤醒词识别模型,即前述介绍的适用于识别自定义唤醒词的细模型。
最后,唤醒词识别算法可将识别结果传回唤醒词识别服务,进而传回应用层。当识别到唤醒词时,语音识别SDK可触发唤醒终端100,进入语音控制模式;反之,语音识别SDK不会触发唤醒终端100。
图9示出了终端100的硬件结构示意图。
终端100可以包括处理器110,外部存储器接口120,内部存储器121,通用串行总线(universal serial bus,USB)接口130,充电管理模块140,电源管理模块141,电池142,天线1,天线2,移动通信模块150,无线通信模块160,音频模块170,扬声器170A,受话器170B,麦克风170C,耳机接口170D,传感器模块180,按键190,马达191,指示器192,摄像头193,显示屏194,以及用户标识模块(subscriber identification module,SIM)卡接口195等。其中传感器模块180可以包括压力传感器180A,陀螺仪传感器180B,气压传感器180C,磁传感器180D,加速度传感器180E,距离传感器180F,接近光传感器180G,指纹传感器180H,温度传感器180J,触摸传感器180K,环境光传感器180L,骨传导传感器180M等。
可以理解的是,本发明实施例示意的结构并不构成对终端100的具体限定。在本申请另一些实施例中,终端100可以包括比图示更多或更少的部件,或者组合某些部件,或者 拆分某些部件,或者不同的部件布置。图示的部件可以以硬件,软件或软件和硬件的组合实现。
处理器110可以包括一个或多个处理单元,例如:处理器110可以包括应用处理器(application processor,AP),调制解调处理器,图形处理器(graphics processing unit,GPU),图像信号处理器(image signal processor,ISP),控制器,视频编解码器,数字信号处理器(digital signal processor,DSP),基带处理器,和/或神经网络处理器(neural-network processing unit,NPU)等。其中,不同的处理单元可以是独立的器件,也可以集成在一个或多个处理器中。
在本申请实施例中,处理器110包括应用处理器、音频信号处理器、数字信号处理器。其中,应用处理器可用于维持终端100上操作系统以及各类应用程序正常运行。音频信号处理器可用于将麦克风采集的声音信号转化为音频数字信号。数字信号处理器可用于处理音频数字信号,以实现本申请实施例提供的语音识别功能。
控制器可以根据指令操作码和时序信号,产生操作控制信号,完成取指令和执行指令的控制。
处理器110中还可以设置存储器,用于存储指令和数据。在一些实施例中,处理器110中的存储器为高速缓冲存储器。该存储器可以保存处理器110刚用过或循环使用的指令或数据。如果处理器110需要再次使用该指令或数据,可从所述存储器中直接调用。避免了重复存取,减少了处理器110的等待时间,因而提高了系统的效率。
在一些实施例中,处理器110可以包括一个或多个接口。接口可以包括集成电路(inter-integrated circuit,I2C)接口,集成电路内置音频(inter-integrated circuit sound,I2S)接口,脉冲编码调制(pulse code modulation,PCM)接口,通用异步收发传输器(universal asynchronous receiver/transmitter,UART)接口,移动产业处理器接口(mobile industry processor interface,MIPI),通用输入输出(general-purpose input/output,GPIO)接口,用户标识模块(subscriber identity module,SIM)接口,和/或通用串行总线(universal serial bus,USB)接口等。
可以理解的是,本发明实施例示意的各模块间的接口连接关系,只是示意性说明,并不构成对终端100的结构限定。在本申请另一些实施例中,终端100也可以采用上述实施例中不同的接口连接方式,或多种接口连接方式的组合。
充电管理模块140用于从充电器接收充电输入。充电管理模块140为电池142充电的同时,还可以通过电源管理模块141为电子设备供电。电源管理模块141用于连接电池142,充电管理模块140与处理器110。电源管理模块141接收电池142和/或充电管理模块140的输入,为处理器110,内部存储器121,显示屏194,摄像头193,和无线通信模块160等供电。
终端100的无线通信功能可以通过天线1,天线2,移动通信模块150,无线通信模块160,调制解调处理器以及基带处理器等实现。天线1和天线2用于发射和接收电磁波信号。移动通信模块150可以提供应用在终端100上的包括2G/3G/4G/5G等无线通信的解决方案。调制解调处理器可以包括调制器和解调器。其中,调制器用于将待发送的低频基带信号调制成中高频信号。解调器用于将接收的电磁波信号解调为低频基带信号。无线通信模块160可以提供应用在终端100上的包括无线局域网(wireless local area networks,WLAN)(如无线 保真(wireless fidelity,Wi-Fi)网络),蓝牙(bluetooth,BT),全球导航卫星系统(global navigation satellite system,GNSS),调频(frequency modulation,FM),近距离无线通信技术(near field communication,NFC),红外技术(infrared,IR)等无线通信的解决方案。
终端100通过GPU,显示屏194,以及应用处理器等实现显示功能。GPU为图像处理的微处理器,连接显示屏194和应用处理器。GPU用于执行数学和几何计算,用于图形渲染。处理器110可包括一个或多个GPU,其执行程序指令以生成或改变显示信息。
显示屏194用于显示图像,视频等。显示屏194包括显示面板。显示屏194包括显示面板。显示面板可以采用液晶显示屏(liquid crystal display,LCD)。显示面板还可以采用有机发光二极管(organic light-emitting diode,OLED),有源矩阵有机发光二极体或主动矩阵有机发光二极体(active-matrix organic light emitting diode,AMOLED),柔性发光二极管(flex light-emitting diode,FLED),miniled,microled,micro-oled,量子点发光二极管(quantum dot light emitting diodes,QLED)等制造。在一些实施例中,电子设备可以包括1个或N个显示屏194,N为大于1的正整数。
在本申请实施例中,响应于识别到唤醒词终端100点亮屏幕,以及终端100显示图6A-图6I、图7A-图7D所示的用户界面,依赖于GPU,显示屏194,以及应用处理器等提供的显示功能。
终端100可以通过ISP,摄像头193,视频编解码器,GPU,显示屏194以及应用处理器等实现拍摄功能。摄像头193用于捕获静态图像或视频。ISP用于处理摄像头193反馈的数据。视频编解码器用于对数字视频压缩或解压缩。终端100可以支持一种或多种视频编解码器。
NPU为神经网络(neural-network,NN)计算处理器,通过借鉴生物神经网络结构,例如借鉴人脑神经元之间传递模式,对输入信息快速处理,还可以不断的自学习。通过NPU可以实现终端100的智能认知等应用,例如:图像识别,人脸识别,语音识别,文本理解等。
在本申请实施例中,终端100利用使用过程中积累的唤醒词音频数据更新唤醒词识别模型,可通过NPU为神经网络计算处理器完成。
内部存储器121可以包括一个或多个随机存取存储器(random access memory,RAM)和一个或多个非易失性存储器(non-volatile memory,NVM)。
随机存取存储器可以包括静态随机存储器(static random-access memory,SRAM)、动态随机存储器(dynamic random access memory,DRAM)、同步动态随机存储器(synchronous dynamic random access memory,SDRAM)、双倍资料率同步动态随机存取存储器(double data rate synchronous dynamic random access memory,DDR SDRAM,例如第五代DDR SDRAM一般称为DDR5SDRAM)等。非易失性存储器可以包括磁盘存储器件、快闪存储器(flash memory)。
在本申请实施例中,实现本申请实施例所提供的语音识别方法的可执行代码可存放在终端100的NVM中,例如SD卡等。在终端100运行上述代码以提供唤醒词识别功能时,终端100可将上述代码加载到RAM中。
终端100在运行上述代码的过程中,可将麦克风采集并生成的音频信号数据存储在RAM或NVM的缓存中,其中,终端100确定为有效音频数据的音频可被终端100进一步存储在NVM中,以供后续优化唤醒词识别模型使用。
随机存取存储器可以由处理器110直接进行读写,可以用于存储操作系统或其他正在运行中的程序的可执行程序(例如机器指令),还可以用于存储用户及应用程序的数据等。
非易失性存储器也可以存储可执行程序和存储用户及应用程序的数据等,可以提前加载到随机存取存储器中,用于处理器110直接进行读写。
外部存储器接口120可以用于连接外部的非易失性存储器,实现扩展终端100的存储能力。外部的非易失性存储器通过外部存储器接口120与处理器110通信,实现数据存储功能。例如将音乐,视频等文件保存在外部的非易失性存储器中。
终端100可以通过音频模块170,扬声器170A,受话器170B,麦克风170C,耳机接口170D,以及应用处理器等实现音频功能。例如音乐播放,录音等。
音频模块170用于将数字音频信息转换成模拟音频信号输出,也用于将模拟音频输入转换为数字音频信号。音频模块170还可以用于对音频信号编码和解码。在一些实施例中,音频模块170可以设置于处理器110中,或将音频模块170的部分功能模块设置于处理器110中。
扬声器170A,也称“喇叭”,用于将音频电信号转换为声音信号。终端100可以通过扬声器170A收听音乐,或收听免提通话。
受话器170B,也称“听筒”,用于将音频电信号转换成声音信号。当终端100接听电话或语音信息时,可以通过将受话器170B靠近人耳接听语音。
麦克风170C,也称“话筒”,“传声器”,用于将声音信号转换为电信号。当拨打电话或发送语音信息时,用户可以通过人嘴靠近麦克风170C发声,将声音信号输入到麦克风170C。终端100可以设置至少一个麦克风170C。在另一些实施例中,终端100可以设置两个麦克风170C,除了采集声音信号,还可以实现降噪功能。在另一些实施例中,终端100还可以设置三个,四个或更多麦克风170C,实现采集声音信号,降噪,还可以识别声音来源,实现定向录音功能等。
在本申请实施例中,终端100可通过麦克风170C采集环境音频。基于麦克风170C采集并生成的音频信号,终端100可检测是否包含唤醒词,进而确定是否唤醒自身、进入语音控制模式。
耳机接口170D用于连接有线耳机。耳机接口170D可以是USB接口130,也可以是3.5mm的开放移动电子设备平台(open mobile terminal platform,OMTP)标准接口,美国蜂窝电信工业协会(cellular telecommunications industry association of the USA,CTIA)标准接口。
压力传感器180A用于感受压力信号,可以将压力信号转换成电信号。陀螺仪传感器180B可以用于确定终端100的运动姿态。气压传感器180C用于测量气压。磁传感器180D包括霍尔传感器。终端100可以利用磁传感器180D检测翻盖皮套的开合。加速度传感器180E可检测终端100在各个方向上(一般为三轴)加速度的大小。距离传感器180F用于测量距离。终端100可以通过红外或激光测量距离。接近光传感器180G可以包括例如发光二 极管(LED)和光检测器,例如光电二极管。终端100使用光电二极管检测来自附近物体的红外反射光,以确定终端100附近没有物体。环境光传感器180L用于感知环境光亮度。指纹传感器180H用于采集指纹。温度传感器180J用于检测温度。
触摸传感器180K,也称“触控器件”。触摸传感器180K可以设置于显示屏194,由触摸传感器180K与显示屏194组成触摸屏,也称“触控屏”。触摸传感器180K用于检测作用于其上或附近的触摸操作。触摸传感器可以将检测到的触摸操作传递给应用处理器,以确定触摸事件类型。可以通过显示屏194提供与触摸操作相关的视觉输出。在另一些实施例中,触摸传感器180K也可以设置于终端100的表面,与显示屏194所处的位置不同。
在本申请实施例中,终端100检测用户作用于终端100屏幕上的点击、滑动等操作,依赖于触摸传感器180K。
骨传导传感器180M可以获取振动信号。按键190包括开机键,音量键等。终端100可以接收按键输入,产生与终端100的用户设置以及功能控制有关的键信号输入。马达191可以产生振动提示。马达191可以用于来电振动提示,也可以用于触摸振动反馈。指示器192可以是指示灯,可以用于指示充电状态,电量变化,也可以用于指示消息,未接来电,通知等。SIM卡接口195用于连接SIM卡。
本申请的说明书和权利要求书及附图中的术语“用户界面(user interface,UI)”,是应用程序或操作系统与用户之间进行交互和信息交换的介质接口,它实现信息的内部形式与用户可以接受形式之间的转换。应用程序的用户界面是通过java、可扩展标记语言(extensible markup language,XML)等特定计算机语言编写的源代码,界面源代码在终端设备上经过解析,渲染,最终呈现为用户可以识别的内容,比如图片、文字、按钮等控件。控件(control)也称为部件(widget),是用户界面的基本元素,典型的控件有工具栏(toolbar)、菜单栏(menu bar)、文本框(text box)、按钮(button)、滚动条(scrollbar)、图片和文本。界面中的控件的属性和内容是通过标签或者节点来定义的,比如XML通过<Textview>、<ImgView>、<VideoView>等节点来规定界面所包含的控件。一个节点对应界面中一个控件或属性,节点经过解析和渲染之后呈现为用户可视的内容。此外,很多应用程序,比如混合应用(hybrid application)的界面中通常还包含有网页。网页,也称为页面,可以理解为内嵌在应用程序界面中的一个特殊的控件,网页是通过特定计算机语言编写的源代码,例如超文本标记语言(hyper text markup language,GTML),层叠样式表(cascading style sheets,CSS),java脚本(JavaScript,JS)等,网页源代码可以由浏览器或与浏览器功能类似的网页显示组件加载和显示为用户可识别的内容。网页所包含的具体内容也是通过网页源代码中的标签或者节点来定义的,比如GTML通过<p>、<img>、<video>、<canvas>来定义网页的元素和属性。
用户界面常用的表现形式是图形用户界面(graphic user interface,GUI),是指采用图形方式显示的与计算机操作相关的用户界面。它可以是在终端设备的显示屏中显示的一个图标、窗口、控件等界面元素,其中控件可以包括图标、按钮、菜单、选项卡、文本框、对话框、状态栏、导航栏、Widget等可视的界面元素。
在本申请的说明书和所附权利要求书中所使用的那样,单数表达形式“一个”、“一种”、 “所述”、“上述”、“该”和“这一”旨在也包括复数表达形式,除非其上下文中明确地有相反指示。还应当理解,本申请中使用的术语“和/或”是指并包含一个或多个所列出项目的任何或所有可能组合。上述实施例中所用,根据上下文,术语“当…时”可以被解释为意思是“如果…”或“在…后”或“响应于确定…”或“响应于检测到…”。类似地,根据上下文,短语“在确定…时”或“如果检测到(所陈述的条件或事件)”可以被解释为意思是“如果确定…”或“响应于确定…”或“在检测到(所陈述的条件或事件)时”或“响应于检测到(所陈述的条件或事件)”。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机程序指令时,全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如同轴电缆、光纤、数字用户线)或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质,(例如,软盘、硬盘、磁带)、光介质(例如DVD)、或者半导体介质(例如固态硬盘)等。
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,该流程可以由计算机程序来指令相关的硬件完成,该程序可存储于计算机可读取存储介质中,该程序在执行时,可包括如上述各方法实施例的流程。而前述的存储介质包括:ROM或随机存储记忆体RAM、磁碟或者光盘等各种可存储程序代码的介质。
以上所述,仅为本发明的具体实施方式,但本发明的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本发明揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本发明的保护范围之内。因此,本发明的保护范围应以所述权利要求的保护范围为准。

Claims (14)

  1. 一种语音识别方法,应用于终端设备,其特征在于,所述方法包括:
    确定第一唤醒词,所述第一唤醒词是用户设定的;
    根据所述第一唤醒词和预设的控制参数合成语音样本,所述语音样本是语音内容包括第一唤醒词的音频数据,所述控制参数用于控制合成的语音样本中所表现出的说话方式和/或说话场景;
    利用所述合成的语音样本对第一语音识别模型进行训练得到第二语音识别模型;所述第一语音识别模型为训练前用于识别所述第一唤醒词的语音识别模型,所述第二语音识别模型为训练后用于识别所述第一唤醒词的语音识别模型;
    使用所述第二语音识别模型识别麦克风采集的音频数据;
    当从所述麦克风采集的音频数据中识别到所述第一唤醒词时,唤醒所述终端设备。
  2. 根据权利要求1所述的方法,其特征在于,所述方法还包括:
    确定所述麦克风采集的音频数据中成功唤醒所述终端设备的音频数据为有效音频数据;
    利用所述有效音频数据和合成的语音样本对所述第二语音识别模型进行优化,得到第三语音识别模型;
    使用所述第三语音识别模型处理麦克风采集的音频数据。
  3. 根据权利要求1所述的方法,其特征在于,所述方法还包括:
    确定所述麦克风采集的音频数据中成功唤醒所述终端设备的音频数据为有效音频数据;
    利用所述有效音频数据对所述第二语音识别模型进行优化,得到第三语音识别模型;
    使用所述第三语音识别模型处理麦克风采集的音频数据。
  4. 根据权利要求3所述的方法,其特征在于,在利用所述有效音频数据和合成的语音样本对所述第二语音识别模型进行优化之前,所述方法还包括:
    确认所述有效音频数据的数量大于等于第一数量阈值,所述第一数量阈值为预设的。
  5. 根据权利要求4所述的方法,其特征在于,在利用所述有效音频数据对所述第二语音识别模型进行优化之前,所述方法还包括:
    确认所述有效音频数据的数量大于等于第二数量阈值,所述第二数量阈值为预设的。
  6. 根据权利要求4或5所述的方法,其特征在于,在对所述第二语音识别模型进行优化之前,所述方法还包括:
    确认当前时刻在预设的更新时间范围内。
  7. 根据权利要求2-6中任一项所述的方法,其特征在于,所述控制参数包括韵律特征;所述韵律特征用于控制合成的语音样本中说话人的说话方式,所述说话人的说话方式包括下一项或多项:说话人的说话时的情绪、停顿。
  8. 根据权利要求7所述的方法,其特征在于,所述根据所述第一唤醒词和预设的控制参数合成语音样本,具体包括:
    将所述第一唤醒词和预设的韵律特征输入语音合成器;
    利用所述语音合成器合成N条语音样本,所述N≥1。
  9. 根据权利要求8所述的方法，其特征在于，所述方法还包括：依次对所述N条语音样本进行数据增强处理，得到M条语音样本，所述M≥N。
  10. 根据权利要求9所述的方法,其特征在于,所述控制参数还包括噪声参数,所述噪声参数用于控制合成的语音样本中说话人的说话场景,所述依次对所述N条语音样本进行数据增强处理,具体包括:通过所述噪声参数对所述N条语音样本进行数据加噪。
  11. 根据权利要求7-10中任一项所述的方法,其特征在于,所述方法还包括:
    从所述有效音频数据中提取韵律特征;
    利用所述第一唤醒词、所述控制参数中的韵律特征和提取的韵律特征更新合成的语音样本。
  12. 根据权利要求1-11中任一项所述的方法,其特征在于,
    所述第一语音识别模型的输入层与所述第二语音识别模型的输入层中包括的数据处理层的数量相同;所述第一语音识别模型的输入层与所述第二语音识别模型的输入层中对应的数据处理层的参数相同。
  13. 一种电子设备,其特征在于,包括一个或多个处理器和一个或多个存储器;其中,所述一个或多个存储器与所述一个或多个处理器耦合,所述一个或多个存储器用于存储计算机程序代码,所述计算机程序代码包括计算机指令,当所述一个或多个处理器执行所述计算机指令时,使得执行如权利要求1-12任一项所述的方法。
  14. 一种计算机可读存储介质,包括指令,其特征在于,当所述指令在电子设备上运行时,使得执行如权利要求1-12任一项所述的方法。
PCT/CN2022/140339 2022-04-29 2022-12-20 一种语音识别方法和电子设备 WO2023207149A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210468803.4 2022-04-29
CN202210468803.4A CN117012189A (zh) Speech recognition method and electronic device

Publications (1)

Publication Number Publication Date
WO2023207149A1 true WO2023207149A1 (zh) 2023-11-02

Family

ID=88517213

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/140339 WO2023207149A1 (zh) Speech recognition method and electronic device

Country Status (2)

Country Link
CN (1) CN117012189A (zh)
WO (1) WO2023207149A1 (zh)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11282500B2 (en) * 2019-07-19 2022-03-22 Cisco Technology, Inc. Generating and training new wake words
CN111640426A (zh) * 2020-06-10 2020-09-08 北京百度网讯科技有限公司 Method and apparatus for outputting information
CN113012681B (zh) * 2021-02-18 2024-05-17 深圳前海微众银行股份有限公司 Wake-up speech synthesis method based on a wake-up speech model and wake-up method using the same
CN113299275A (zh) * 2021-05-21 2021-08-24 阿里巴巴新加坡控股有限公司 Method and system for implementing voice interaction, server, client and smart speaker

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110097876A (zh) * 2018-01-30 2019-08-06 阿里巴巴集团控股有限公司 Voice wake-up processing method and device to be woken up
US20200184966A1 (en) * 2018-12-10 2020-06-11 Amazon Technologies, Inc. Wakeword detection
CN111081217A (zh) * 2019-12-03 2020-04-28 珠海格力电器股份有限公司 Voice wake-up method and apparatus, electronic device and storage medium
CN111354343A (zh) * 2020-03-09 2020-06-30 北京声智科技有限公司 Method and apparatus for generating a voice wake-up model, and electronic device
CN114299933A (zh) * 2021-12-28 2022-04-08 北京声智科技有限公司 Speech recognition model training method, apparatus, device, storage medium and product
CN114220423A (zh) * 2021-12-31 2022-03-22 思必驰科技股份有限公司 Voice wake-up method, method for customizing a wake-up model, electronic device and storage medium

Also Published As

Publication number Publication date
CN117012189A (zh) 2023-11-07

Similar Documents

Publication Publication Date Title
RU2766255C1 (ru) Voice control method and electronic device
CN112397062A (zh) Voice interaction method and apparatus, terminal, and storage medium
WO2020239001A1 (zh) Humming recognition method and related device
CN113488042B (zh) Voice control method and electronic device
KR20190068133A (ko) Electronic device for executing an application using phoneme information contained in audio data, and operating method therefor
KR20200099380A (ko) Method for providing a speech recognition service and electronic device therefor
WO2022267468A1 (zh) Sound processing method and apparatus
WO2022143258A1 (zh) Voice interaction processing method and related apparatus
CN115881118A (zh) Voice interaction method and related electronic device
CN114968018A (zh) Card display method and terminal device
KR20210116897A (ko) Method for voice-based control of an external device, and electronic device therefor
WO2021190225A1 (zh) Voice interaction method and electronic device
WO2021238371A1 (zh) Method and apparatus for generating a virtual character
CN115083401A (zh) Voice control method and apparatus
WO2023207185A1 (zh) Voiceprint recognition method, graphical interface, and electronic device
CN114650330A (zh) Method for adding an operation sequence, electronic device, and system
CN114360546A (zh) Electronic device and wake-up method therefor
WO2023207149A1 (zh) Speech recognition method and electronic device
CN113380240B (zh) Voice interaction method and electronic device
CN115641867A (zh) Speech processing method and terminal device
CN114974213A (zh) Audio processing method, electronic device, and storage medium
CN114765026A (zh) Voice control method, apparatus, and system
CN115527547B (zh) Noise processing method and electronic device
CN115562535B (zh) Application control method and electronic device
US20230154463A1 (en) Method of reorganizing quick command based on utterance and electronic device therefor

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22939958

Country of ref document: EP

Kind code of ref document: A1