WO2024093515A1 - A voice interaction method and related electronic device

Info

Publication number: WO2024093515A1
Application number: PCT/CN2023/117410
Authority: WIPO (PCT)
Other languages: English (en), French (fr)
Inventors: 高飞, 吴彪, 王志超, 夏日升
Original Assignee: 荣耀终端有限公司
Application filed by 荣耀终端有限公司
Publication of WO2024093515A1


Classifications

    • G06F18/25 — Pattern recognition; analysing; fusion techniques
    • G06F9/4401 — Arrangements for program control; arrangements for executing specific programs; bootstrapping
    • G10L15/22 — Speech recognition; procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L17/06 — Speaker identification or verification techniques; decision making techniques; pattern matching strategies
    • G10L17/08 — Speaker identification or verification techniques; use of distortion metrics or a particular distance between probe pattern and reference templates
    • Y02D30/70 — Reducing energy consumption in wireless communication networks

Definitions

  • The present application relates to the field of voice interaction, and in particular to a voice interaction method and related electronic device.
  • A voice assistant is an intelligent application that helps users solve problems through intelligent dialogue and instant Q&A.
  • There are three different types of voice assistants: chat type, Q&A type, and command type.
  • Chat type assistants are used for chatting and companionship. They use AI technology to communicate with users and perceive user emotions.
  • Q&A type assistants are used for knowledge acquisition; they can acquire knowledge or answer questions through dialogue.
  • The more common application is the intelligent customer service of various platforms.
  • Command type assistants are used for device control; they can control electronic devices through dialogue to achieve certain operations.
  • The more common applications are smart speakers, IoT devices, etc. For example, voice control: "Turn on the air conditioner and adjust it to 25 degrees."
  • In this way, users can wake up the voice assistant without adding a specific wake-up word to the voice command, which makes the user's voice interaction with the electronic device more natural.
  • When users interact with electronic devices by voice, not using a specific wake-up word is also more in line with their habits.
  • The embodiments of the present application provide a voice interaction method and related electronic device, which solve the problem of voice interaction applications being mistakenly awakened.
  • In a first aspect, an embodiment of the present application provides a voice interaction method, which is applied to an electronic device, wherein the electronic device includes a voice interaction application, and the method includes: receiving a first voice signal; in the case of determining that the first voice signal is to be subjected to voice detection, obtaining voice signal data based on the first voice signal; processing the voice signal data through a voice detection model to obtain a first confidence level and voice data, the first confidence level being used to characterize the probability that the first voice signal is a voice command sent by a user to the electronic device; acquiring acceleration data of the electronic device, and obtaining posture information of the electronic device based on the acceleration data; processing the posture information through a posture detection model to obtain a second confidence level and target posture information, the second confidence level being used to characterize the probability that the electronic device is in a hand-held raised state; processing the target posture information and the voice data through an audio-posture detection fusion model to obtain a third confidence level, the third confidence level being used to characterize the probability that the first voice signal is a voice command and the electronic device is in a hand-held raised state; and judging whether to start the voice interaction application based on the first confidence level, the second confidence level and the third confidence level.
  • With the above method, after receiving a voice signal, if the electronic device determines that the voice signal needs to be detected, it processes the voice signal data of the voice signal through the voice detection model, processes the posture information through the posture detection model, and processes the high-order feature data output by the voice detection model and the posture detection model through the audio-posture detection fusion model.
  • The three models output three confidence levels respectively, and the electronic device then judges, based on these three confidence levels, whether the received voice signal is the target voice command for waking up the voice assistant: if so, the voice assistant is woken up; if not, it is not.
  • Since the first confidence level is calculated by the voice detection model, it can exclude application scenarios with only a hand-held raised state; since the second confidence level is calculated by the posture detection model, it can exclude application scenarios with only voice input.
  • The third confidence level integrates the high-dimensional features of the voice signal data and the posture information, and can characterize the real-time correlation between the voice input and the posture state of the electronic device. Therefore, using the first, second and third confidence levels to determine whether the first voice signal is the target voice command yields a more accurate judgment, which reduces the probability of the voice assistant being mistakenly awakened and improves the user experience.
  • In a possible implementation, judging whether to start the voice interaction application based on the first confidence level, the second confidence level and the third confidence level specifically includes: when the first confidence level is greater than or equal to the first confidence threshold, setting a first confidence flag to 1; when the first confidence level is less than the first confidence threshold, setting the first confidence flag to 0; when the second confidence level is greater than or equal to the second confidence threshold, setting a second confidence flag to 1; when the second confidence level is less than the second confidence threshold, setting the second confidence flag to 0; when the third confidence level is greater than or equal to the third confidence threshold, setting a third confidence flag to 1; when the third confidence level is less than the third confidence threshold, setting the third confidence flag to 0; performing a logical AND operation on the first confidence flag, the second confidence flag and the third confidence flag to obtain a judgment result; and judging whether to start the voice interaction application according to the judgment result (a code sketch of this scheme follows below).
  • With the above method, the electronic device can decide whether to send the first voice signal to the voice interaction application based on the judgment result, so as to avoid the voice interaction application being woken up by mistake, which would reduce the user's experience.
  • In a possible implementation, determining whether to start the voice interaction application based on the judgment result specifically includes: when the judgment result is 1, starting the voice interaction application; when the judgment result is 0, not starting the voice interaction application.
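  • As a minimal Python illustration of the flag-and-AND scheme above (the 50% default thresholds follow the "preferably 50%" example given later in the text; they are tuning values, not values fixed by the application):

```python
def decide_by_logical_and(c1: float, c2: float, c3: float,
                          t1: float = 0.5, t2: float = 0.5, t3: float = 0.5) -> bool:
    """Binarize each confidence against its threshold, then AND the three flags.

    c1: probability the first voice signal is a voice command aimed at the device
    c2: probability the device is in a hand-held raised state
    c3: probability the voice input and posture state match in real time
    """
    flag1 = 1 if c1 >= t1 else 0
    flag2 = 1 if c2 >= t2 else 0
    flag3 = 1 if c3 >= t3 else 0
    judgment = flag1 & flag2 & flag3  # logical AND operation on the three flags
    return judgment == 1              # 1 -> start the voice interaction application


# The assistant is only started when all three confidences clear their thresholds:
assert decide_by_logical_and(0.9, 0.8, 0.7) is True
assert decide_by_logical_and(0.9, 0.3, 0.7) is False
```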
  • In a possible implementation, the electronic device also includes a voiceprint detection module, and determining whether to start the voice interaction application based on the judgment result specifically includes: when the judgment result is 0, the voice interaction application is not started; when the judgment result is 1, the voiceprint detection module detects whether the first voice signal is the voice of the target user, the target user being the user of the electronic device; if yes, the voice interaction application is started; if no, the voice interaction application is not started.
  • With the above method, the voiceprint detection module sends the first voice signal to the voice assistant module only after determining that the first voice signal is a voice signal sent by the user of the device. In this way, only the user of the electronic device can wake up the voice assistant, which ensures the privacy and security of the user while ensuring that the voice assistant is not triggered by mistake.
  • In a possible implementation, determining whether to start the voice interaction application based on the first confidence level, the second confidence level, and the third confidence level specifically includes: calculating a first weight value of the first confidence level, a second weight value of the second confidence level, and a third weight value of the third confidence level; calculating a fused confidence level based on the first confidence level, the first weight value, the second confidence level, the second weight value, the third confidence level, and the third weight value; and determining whether to start the voice interaction application based on the fused confidence level.
  • In a possible implementation, determining whether to start the voice interaction application based on the fused confidence level specifically includes: if the fused confidence level is greater than or equal to a first start threshold, starting the voice interaction application; if the fused confidence level is less than the first start threshold, not starting the voice interaction application.
  • In a possible implementation, the electronic device includes a display screen. If the fused confidence level is less than the first start threshold and greater than or equal to a second start threshold, a prompt message is displayed on the display screen, and the prompt message is used to instruct the user to issue a voice command again; the second start threshold is less than the first start threshold (see the sketch below).
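  • A minimal sketch of this fused-confidence decision, assuming a simple weighted sum as the fusion rule (the application's own weight formulas are referenced later as formulas (1) and (2)); the 60% and 50% thresholds follow the examples in the text:

```python
def decide_by_fusion(c1: float, c2: float, c3: float,
                     w1: float, w2: float, w3: float,
                     first_start_threshold: float = 0.6,
                     second_start_threshold: float = 0.5) -> str:
    """Fuse the three confidences with their weight values and compare the
    result against the first and second start thresholds."""
    fused = w1 * c1 + w2 * c2 + w3 * c3          # fused confidence level
    if fused >= first_start_threshold:
        return "start_voice_interaction_app"     # treat as the target voice command
    if fused >= second_start_threshold:
        return "prompt_user_to_retry"            # e.g. show "please speak again" on the display
    return "do_nothing"
```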
  • In a possible implementation, the electronic device also includes a voiceprint detection module, and determining whether to start the voice interaction application based on the fused confidence level specifically includes: if the fused confidence level is less than the first start threshold, the voice interaction application is not started; if the fused confidence level is greater than or equal to the first start threshold, the voiceprint detection module detects whether the first voice signal is the voice of a target user, where the target user is a user of the electronic device; if yes, the voice interaction application is started; if no, the voice interaction application is not started.
  • With the above method, the voiceprint detection module sends the first voice signal to the voice assistant module only after determining that the first voice signal is a voice signal sent by the user of the device. In this way, only the user of the electronic device can wake up the voice assistant, which ensures the privacy and security of users while preventing the voice assistant from being accidentally triggered.
  • In a possible implementation, before obtaining voice signal data based on the first voice signal, the method also includes: obtaining the signal strength value of the first voice signal, the acceleration variance D1 of the electronic device on the x-axis, the acceleration variance D2 of the electronic device on the y-axis, and the acceleration variance D3 of the electronic device on the z-axis; and judging whether the first voice signal needs voice detection based on the signal strength value, D1, D2 and D3.
  • With the above method, after receiving a voice signal, the electronic device first determines through the wake-up-free first-level judgment module whether the voice signal needs voice detection. For a voice signal that does not need voice detection, the process is terminated and the voice signal is not processed further. By judging the voice signal with the wake-up-free first-level judgment module, most scenes not intended by the user are filtered out, which avoids the voice assistant in the electronic device being mistakenly woken up and saves the computing resources of the electronic device (a sketch of this first-level gate follows below).
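  • A sketch of this first-level gate under stated assumptions (mean signal power as the strength value, and an "any axis over threshold" motion rule; all threshold values are tuning parameters, not values fixed by the application):

```python
import numpy as np

def needs_voice_detection(samples: np.ndarray, acc_xyz: np.ndarray,
                          strength_threshold: float,
                          var_thresholds: tuple[float, float, float]) -> bool:
    """Wake-up-free first-level judgment: pass the signal to the second-level
    module only if it is strong enough AND the device appears to be moving.

    samples:  1-D array of audio samples of the first voice signal
    acc_xyz:  array of shape (n, 3) with accelerometer readings on x, y, z
    """
    strong = float(np.mean(samples.astype(np.float64) ** 2)) >= strength_threshold
    variances = acc_xyz.var(axis=0)                    # per-axis acceleration variance
    moving = bool(np.any(variances >= np.asarray(var_thresholds)))
    return strong and moving                           # logical AND of the two judgments
```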
  • In a possible implementation, the voice data includes first voice data and second voice data, the first voice data being high-order voice feature information output by the convolutional layer of the voice detection model, and the second voice data being high-order voice feature information output by the fully connected layer of the voice detection model;
  • the target posture information includes first target posture information and second target posture information, the first target posture information being high-order posture feature information output by the convolutional layer of the posture detection model, and the second target posture information being high-order posture feature information output by the fully connected layer of the posture detection model.
  • In another aspect, an embodiment of the present application provides an electronic device, which includes: one or more processors, a display screen and a memory; the memory is coupled to the one or more processors, the memory is used to store computer program code, the computer program code includes computer instructions, and the one or more processors call the computer instructions to enable the electronic device to execute: when it is determined that the first voice signal is to be subjected to voice detection, obtaining voice signal data based on the first voice signal; processing the voice signal data through a voice detection model to obtain a first confidence and voice data, the first confidence being used to characterize the probability that the first voice signal is a voice instruction sent to the electronic device by the user; obtaining acceleration data of the electronic device, and obtaining posture information of the electronic device based on the acceleration data; processing the posture information through a posture detection model to obtain a second confidence and target posture information, the second confidence being used to characterize the probability that the electronic device is in a hand-held raised state; processing the target posture information and the voice data through an audio-posture detection fusion model to obtain a third confidence; and determining whether to start the voice interaction application based on the first confidence, the second confidence and the third confidence.
  • the one or more processors call the computer instructions to cause the electronic device to execute: when the first confidence is greater than or equal to the first confidence threshold, set the first confidence flag to 1; when the first confidence is less than the first confidence threshold, set the first confidence flag to 0; when the second confidence is greater than or equal to the second confidence threshold, set the second confidence flag to 1; when the second confidence is less than the second confidence threshold, set the second confidence flag to 0; when the third confidence is greater than or equal to the third confidence threshold, set the third confidence flag to 1; when the third confidence is less than the third confidence threshold, set the third confidence flag to 0; perform a logical AND operation on the first confidence flag, the second confidence flag, and the third confidence flag to obtain a judgment result; and determine whether to start a voice interaction application based on the judgment result.
  • the one or more processors call the computer instruction to cause the electronic device to execute: when the judgment result is 1, start the voice interaction application; when the judgment result is 0, do not start the voice interaction application.
  • the one or more processors call the computer instructions to cause the electronic device to execute: when the judgment result is 0, the voice interaction application is not started; when the judgment result is 1, the first voice signal is detected by the voiceprint detection module to see whether it is the voice of the target user, and the target user is the user of the electronic device; if the judgment is yes, the voice interaction application is started; if the judgment is no, the voice interaction application is not started.
  • the one or more processors call the computer instructions to cause the electronic device to execute: calculate a first weight value of a first confidence level, a second weight value of a second confidence level, and a third weight value of a third confidence level; calculate a fused confidence level based on the first confidence level, the first weight value, the second confidence level, the second weight value, the third confidence level, and the third weight value; and determine whether to start a voice interaction application based on the fused confidence level.
  • the one or more processors call the computer instructions to cause the electronic device to execute: if the fused confidence level is greater than or equal to a first start-up threshold, start the voice interaction application; if the fused confidence level is less than the first start-up threshold, do not start the voice interaction application.
  • the one or more processors call the computer instructions to enable the electronic device to execute: if the fused confidence level is less than the first start threshold and greater than or equal to the second start threshold, control the display screen to display a prompt message, the prompt message being used to instruct the user to issue a voice command again; the second start threshold is less than the first start threshold.
  • the one or more processors call the computer instructions to cause the electronic device to execute: if the fused confidence level is less than a first start-up threshold, the voice interaction application is not started; if the fused confidence level is greater than or equal to the first start-up threshold, the first voice signal is detected by a voiceprint detection module to see if it is the voice of a target user, where the target user is the user of the electronic device; if the judgment is yes, the voice interaction application is started; if the judgment is no, the voice interaction application is not started.
  • the one or more processors call the computer instructions to cause the electronic device to execute: obtaining the signal strength value of the voice signal, the acceleration variance D1 of the electronic device on the x-axis, the acceleration variance D2 of the electronic device on the y-axis, and the acceleration variance D3 of the electronic device on the z-axis; and determining whether the first voice signal requires voice detection based on the signal strength value, D1, D2, and D3.
  • an embodiment of the present application provides an electronic device, comprising: a touch screen, a camera, one or more processors and one or more memories; the one or more processors are coupled to the touch screen, the camera, and the one or more memories, and the one or more memories are used to store computer program code, and the computer program code includes computer instructions.
  • the electronic device executes the method described in the first aspect or any possible implementation method of the first aspect.
  • an embodiment of the present application provides a chip system, which is applied to an electronic device, and the chip system includes one or more processors, which are used to call computer instructions to enable the electronic device to execute the method described in the first aspect or any possible implementation method of the first aspect.
  • an embodiment of the present application provides a computer program product comprising instructions, which, when executed on an electronic device, enables the electronic device to execute the method described in the first aspect or any possible implementation of the first aspect.
  • an embodiment of the present application provides a computer-readable storage medium, comprising instructions, which, when executed on an electronic device, causes the electronic device to execute the method described in the first aspect or any possible implementation of the first aspect.
  • FIG. 1A to FIG. 1G are a set of example scene diagrams of a voice interaction method provided in an embodiment of the present application.
  • FIG. 2 is a system framework diagram of a voice interaction method provided in an embodiment of the present application.
  • FIG. 3 is a flow chart of a voice interaction method provided in an embodiment of the present application.
  • FIG. 4 is an example diagram of a user interface provided in an embodiment of the present application.
  • FIG. 5A is a flow chart of another voice interaction method provided in an embodiment of the present application.
  • FIG. 5B is a structural diagram of a voiceprint detection model provided in an embodiment of the present application.
  • FIG. 6 is a schematic diagram of the hardware structure of an electronic device 100 provided in an embodiment of the present application.
  • FIG. 7 is a software structure block diagram of the electronic device 100 provided in an embodiment of the present application.
  • A unit can be, but is not limited to, a process running on a processor, a processor, an object, an executable file, an execution thread, or a program, and/or can be distributed between two or more computers.
  • These units can execute from various computer-readable media having various data structures stored thereon.
  • Units can communicate, for example, through local and/or remote processes based on signals having one or more data packets (e.g., data from a second unit interacting with another unit in a local system, in a distributed system, and/or across a network such as the Internet, which interacts with other systems via signals).
  • In the embodiments of the present application, a voice assistant is taken as an example of a voice interaction application.
  • As shown in FIG. 1A, when a user approaches the electronic device 100 and issues a voice command "voice assistant, open the music application" to the electronic device 100, in response to the user's voice command, the electronic device 100 opens the music application and displays the music interface shown in FIG. 1B.
  • As shown in FIG. 1C, when a user approaches the electronic device 100 and issues a voice command "voice assistant, query the meaning of ⁇" to the electronic device 100, in response to the user's voice command, the electronic device 100 queries the meaning of "⁇" on the Internet and displays the query result on the user interface shown in FIG. 1D.
  • There are two main ways for users to wake up the voice assistant. One is that, each time before waking up the voice assistant, the user needs to add a specific voice wake-up word to the voice command.
  • The electronic device will wake up the voice assistant only when it detects the presence of the voice wake-up word in the user's voice command. Otherwise, the electronic device does not wake up the voice assistant.
  • For electronic devices of different manufacturers, the wake-up words for waking up the voice assistant are different. For example, suppose the wake-up word for the electronic device of manufacturer 1 is "X ⁇". Then, to wake up the voice assistant in the electronic device of manufacturer 1, the user needs to add "X ⁇" in front of the voice command, for example, "X ⁇, please open the music application". This method of adding a wake-up word in front of the voice command often makes the voice interaction between the user and the electronic device unnatural and does not conform to user habits.
  • Another method is that the user can wake up the voice assistant without adding a specific wake-up word in the voice command; that is, the user directly sends a voice command to the electronic device to wake up the voice assistant and instruct the voice assistant to perform corresponding operations. For example, as shown in FIG. 1E, when the user approaches the electronic device 100 and sends a voice command "open the music application" to the electronic device 100, in response to the user's voice command, the electronic device 100 opens the music application and displays the music interface shown in FIG. 1F.
  • A "voice-free wake-up" control may be included in the user interface for turning on the voice assistant function.
  • When the electronic device 100 detects an input operation (e.g., a single click) on the "voice-free wake-up" control 101, in response to the operation, the electronic device 100 may turn on the "voice-free wake-up" function; that is, when the user sends a voice command to the electronic device 100, the voice assistant can be woken up without adding a specific wake-up word to the voice command.
  • A prompt message for prompting the user to approach the microphone may also be displayed on the user interface, as shown in FIG. 1G, for example: "Speak the command 2 to 5 cm from the microphone at the bottom of the mobile phone."
  • In this way, the user can wake up the voice assistant without adding a specific wake-up word to the voice command, which makes the user's voice interaction with the electronic device more natural.
  • When the user interacts with the electronic device by voice, it is also more in line with the user's habits not to use a specific wake-up word.
  • However, this can cause the voice assistant of the electronic device to be triggered by mistake. For example, suppose the user is looking for something and has put the mobile phone on the table. If the user asks someone else where the thing is, the electronic device may start the voice assistant to communicate with the user by voice because no wake-up word is required, causing the voice assistant to be triggered by mistake.
  • In other similar scenarios, the electronic device detects a voice signal sent by the user and may also wake up the voice assistant, thereby causing the voice assistant to be triggered by mistake. Frequent false triggering of the voice assistant brings inconvenience to the user, thereby reducing the user's experience.
  • An embodiment of the present application therefore proposes a voice interaction method, which includes: an electronic device obtains voice signal data and posture data; the voice signal data may include Mel-frequency cepstral coefficients of the voice signal received by multiple microphones of the electronic device and energy differences of the audio received by multiple microphones of the electronic device, and the posture data may include acceleration data in the x-axis direction, acceleration data in the y-axis direction, and acceleration data in the z-axis direction obtained by an acceleration sensor of the electronic device.
  • The electronic device uses the voice signal data as input to a voice detection model, which processes the voice signal data to obtain a first confidence level, and uses the posture data as input to a posture detection model.
  • The posture detection model processes the posture data and outputs a second confidence level.
  • The electronic device uses the first voice data output by the convolutional layer of the voice detection model and the second voice data output by the fully connected layer of the voice detection model as inputs to the audio-posture detection fusion model.
  • The electronic device uses the first target posture data output by the convolutional layer of the posture detection model and the second target posture data output by the fully connected layer of the posture detection model as inputs to the audio-posture detection fusion model.
  • The audio-posture detection fusion model processes the first voice data, the second voice data, the first target posture data and the second target posture data, and outputs a third confidence level.
  • The electronic device determines whether to wake up the voice assistant based on the first confidence level, the second confidence level and the third confidence level. The overall flow could be wired together as in the sketch below.
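  • The three model callables in this end-to-end sketch are hypothetical stand-ins for the trained networks described below; the min() comparison is equivalent to the logical-AND judgment with equal 50% thresholds:

```python
def decide_wakeup(voice_features, posture_features,
                  voice_model, posture_model, fusion_model) -> bool:
    """End-to-end sketch: three models, three confidences, one wake decision."""
    c1, conv_voice, fc_voice = voice_model(voice_features)      # first confidence + high-order voice features
    c2, conv_pose, fc_pose = posture_model(posture_features)    # second confidence + high-order posture features
    c3 = fusion_model(conv_voice, fc_voice, conv_pose, fc_pose) # third confidence
    return min(c1, c2, c3) >= 0.5  # all three must clear their (here equal) thresholds
```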
  • As shown in FIG. 2, the system architecture includes a wake-up-free judgment module and a voice assistant module.
  • The wake-up-free judgment module is located at the digital audio processor (DSP) layer, and includes a wake-up-free first-level judgment module and a wake-up-free second-level judgment module.
  • The voice assistant module is located at the application layer. After the wake-up-free judgment module receives the first voice signal, it first processes the first voice signal through the wake-up-free first-level judgment module to detect whether the first voice signal needs voice detection. If so, the first voice signal is sent to the wake-up-free second-level judgment module for voice detection. If it is detected that the first voice signal is a voice command sent to the electronic device, the wake-up-free judgment module sends the first voice signal to the voice assistant module, and the voice assistant module then performs the target operation according to the first voice signal.
  • FIG. 3 is a flow chart of a voice interaction method provided in an embodiment of the present application.
  • the electronic device receives external voice signals through a microphone, and the number of microphones possessed by the electronic device is N, where N is an integer greater than or equal to 2.
  • The electronic device includes a wake-up-free judgment module and a voice assistant module.
  • The wake-up-free judgment module includes a wake-up-free first-level judgment module and a wake-up-free second-level judgment module.
  • The wake-up-free second-level judgment module includes a voice detection model, a posture detection model, and an audio-posture detection fusion model.
  • The embodiment of the present application takes N = 2 as an example.
  • the specific process is as follows:
  • Step 301 An electronic device receives a first voice signal.
  • the first voice signal may be a voice signal sent by a user, or may be a voice signal sent by other sound sources.
  • An electronic device has one or more microphones, and the electronic device may receive external voice signals through the microphones.
  • Step 302 The electronic device sends the first voice signal to the wake-up-free judgment module.
  • Step 303 The wake-up-free judgment module processes the first voice signal through the wake-up-free primary judgment module to obtain a first judgment result.
  • the electronic device may send the first voice signal to the wake-up-free judgment module.
  • the wake-up-free judgment module may process the first voice signal through the wake-up-free first-level judgment module. Then, the wake-up-free first-level judgment module calculates the signal strength of the first voice signal based on the received first voice signal to determine the strength of the first voice signal. If the first voice signal is weak, it is determined that the first voice signal is not a voice command issued to the electronic device. After calculating the signal strength of the first voice signal, the wake-up-free first-level judgment module may output a first judgment result.
  • The first judgment result may be a first identifier or a second identifier.
  • When the signal strength of the first voice signal is greater than or equal to a first threshold, the first judgment result is the first identifier, and the first identifier is used to characterize that the strength of the first voice signal is strong.
  • When the signal strength of the first voice signal is less than the first threshold, the first judgment result is the second identifier, and the second identifier is used to characterize that the strength of the first voice signal is weak.
  • the first threshold value may be obtained based on historical values, empirical values, or experimental data, which is not limited in the present embodiment.
  • Step 304 The wake-up-free judgment module processes the acceleration data through the wake-up-free first-level judgment module to obtain a second judgment result.
  • the electronic device can send the acceleration data to the wake-up-free judgment module.
  • the wake-up-free judgment module can process the acceleration data through the wake-up-free first-level judgment module to obtain a second judgment result.
  • the acceleration data can be obtained by an acceleration sensor built into the electronic device, and the posture information can include the variance of the acceleration of the acceleration sensor on the x-axis, the variance of the acceleration on the y-axis, and the variance of the acceleration on the z-axis.
  • the electronic device determines whether the electronic device is in motion based on the variance of the acceleration corresponding to these three coordinate axes, thereby obtaining a second judgment result.
  • The second judgment result includes a third identifier or a fourth identifier; the third identifier is used to indicate that the electronic device is in motion, and the fourth identifier is used to indicate that the electronic device is in a stationary state.
  • the electronic device can determine whether the electronic device is in motion based on the variance of the acceleration corresponding to the above three coordinate axes, and the way to obtain the second judgment result can be: the electronic device can set variance thresholds for the three coordinate axes respectively, namely: the first variance threshold D1, the second variance threshold D2, and the third variance threshold D3.
  • D1 corresponds to the x-axis
  • D2 corresponds to the y-axis
  • D3 corresponds to the z-axis.
  • the first variance threshold, the second variance threshold, and the third variance threshold can be the same or different, and can be obtained based on historical values, experience values, or experimental data, and the embodiment of the present application is not limited.
  • If the variance of the acceleration corresponding to any one of the coordinate axes is greater than or equal to the corresponding variance threshold, it is judged that the electronic device is in motion, and the second judgment result includes the third identifier. For example, if the variance of the acceleration corresponding to the x-axis is greater than or equal to D1, it is judged that the electronic device is in motion. If the variances of the acceleration corresponding to all three coordinate axes are less than the corresponding variance thresholds, it is judged that the electronic device is not in motion.
  • In another implementation, if, among the variances of the accelerations corresponding to the three coordinate axes, at least two acceleration variances are greater than or equal to their corresponding variance thresholds, the electronic device is judged to be in motion. For example, if the variance of the acceleration corresponding to the x-axis is greater than or equal to D1 and the variance of the acceleration corresponding to the y-axis is greater than or equal to D2, the electronic device is judged to be in motion. If only one acceleration variance is greater than or equal to its corresponding variance threshold, or the variances of the accelerations corresponding to the three coordinate axes are all less than the corresponding variance thresholds, the electronic device is judged not to be in motion. Both variants are sketched below.
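  • Both variants can be expressed with a single parameter, as in this sketch (acc_xyz is an assumed array of raw accelerometer readings):

```python
import numpy as np

def is_in_motion(acc_xyz: np.ndarray, d1: float, d2: float, d3: float,
                 min_axes_over_threshold: int = 1) -> bool:
    """Compare per-axis acceleration variance with the thresholds D1, D2, D3.

    min_axes_over_threshold = 1 gives the 'any axis' rule above;
    min_axes_over_threshold = 2 gives the 'at least two axes' variant.
    """
    variances = acc_xyz.var(axis=0)              # (var_x, var_y, var_z)
    over = variances >= np.array([d1, d2, d3])
    return int(over.sum()) >= min_axes_over_threshold
```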
  • step 303 may be executed before step 304, may be executed after step 304, or may be executed simultaneously with step 304.
  • the embodiment of the present application does not limit the execution order of step 304 and step 303.
  • Step 305 The wake-up-free primary judgment module determines whether to perform voice detection on the first voice signal according to the first judgment result and the second judgment result.
  • Based on the first judgment result and the second judgment result, the electronic device can determine whether to perform voice detection on the first voice signal, that is, whether to detect whether the first voice signal is the target voice command to wake up the voice assistant of the electronic device. If it determines to perform voice detection, the electronic device executes step 306. If it determines not to perform voice detection on the first voice signal, the electronic device ends the process.
  • the method for the electronic device to determine whether to perform voice detection on the first voice signal may be: if the first judgment result includes the first identifier and the second judgment result includes the third identifier, the electronic device determines to perform voice detection on the first voice signal. Otherwise, the electronic device determines not to perform voice detection on the first voice signal.
  • the electronic device can perform a "logical AND" operation on the identifier in the first judgment result and the identifier in the second judgment result. If the operation result is 1, the electronic device determines to perform voice detection on the first voice signal. If the operation result is 0, the electronic device determines not to perform voice detection on the first voice signal.
  • In this way, the electronic device can filter out most scenes not intended by the user, for example, scenes where the sound source is far from the microphone of the electronic device (the strength of the voice signal received by the electronic device is weak), or scenes where the user chats while using the electronic device (the variance of the acceleration data of the accelerometer is small).
  • the electronic device performs voice detection on the voice signal it receives, so as to make a more accurate judgment on whether the voice signal is an instruction to wake up the voice assistant.
  • the electronic device does not perform voice detection on the voice signal it receives, and ends the process.
  • By first determining whether the first voice signal meets the conditions for voice detection, the electronic device can greatly save its computing resources, thereby improving its working performance.
  • Step 306 The wake-up-free primary judgment module sends the first voice signal to the wake-up-free secondary judgment module.
  • the wake-up-free first-level judgment module determines to perform voice detection on the first voice signal
  • the wake-up-free first-level judgment module sends the first voice signal to the wake-up-free second-level judgment module, so that the wake-up-free second-level judgment module performs voice detection on the first voice signal.
  • Step 307 The wake-up-free secondary judgment module obtains voice signal data of the first voice signal.
  • The wake-up-free secondary judgment module processes the first voice signal to obtain the voice signal data of the first voice signal.
  • The voice signal data may include the Mel cepstral coefficients of the first voice signal, and the energy difference M between the first voice signal received by the first microphone and the first voice signal received by the second microphone.
  • M is used to characterize the distance between the sound source (the sound source of the first voice signal) and the electronic device. The larger the M, the smaller the distance between the sound source and the electronic device; the smaller the M, the larger the distance between the sound source and the electronic device.
  • the electronic device can set an energy threshold H. When M is greater than or equal to H, it can be considered that the sound source is closer to the electronic device (for example, within 40 cm); when M is less than H, it can be considered that the sound source is far away from the electronic device (for example, beyond 40 cm).
  • The Mel cepstral coefficient is a voice signal feature that conforms to the auditory characteristics of the human ear and captures more detailed features of the voice signal at low frequencies. In addition, when the user speaks to the electronic device at close range, there will be a plosive "pop" at low frequencies. Therefore, using the Mel cepstral coefficients as input to the voice detection model can help the model extract the voice parameters of the first voice signal in the low-frequency domain. A sketch of this feature extraction follows below.
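  • A sketch of this feature extraction, assuming librosa merely as a convenient MFCC implementation and a decibel ratio of per-microphone energies as the definition of M (the text only says that a larger M means a closer sound source):

```python
import numpy as np
import librosa  # assumed here only as a convenient MFCC implementation

def extract_voice_signal_data(mic1: np.ndarray, mic2: np.ndarray, sr: int = 16000):
    """Step 307 sketch: MFCCs of the first voice signal plus the energy
    difference M between the two microphones."""
    y1 = mic1.astype(np.float32)
    y2 = mic2.astype(np.float32)
    mfcc = librosa.feature.mfcc(y=y1, sr=sr, n_mfcc=13)      # Mel cepstral coefficients
    e1 = float(np.sum(y1.astype(np.float64) ** 2)) + 1e-12   # energy at microphone 1
    e2 = float(np.sum(y2.astype(np.float64) ** 2)) + 1e-12   # energy at microphone 2
    m = 10.0 * np.log10(e1 / e2)                             # energy difference M in dB
    return mfcc, m
```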
  • Step 308 The wake-up-free secondary judgment module processes the voice signal data through a voice detection model to obtain a first confidence level, first voice data, and second voice data.
  • the wake-up-free secondary judgment module can process the voice signal data through a voice detection model to obtain a first confidence level, first voice data, and second voice data.
  • the voice detection model can be a trained convolutional neural network.
  • the convolutional neural network may include a convolutional layer and a fully connected layer.
  • the wake-up-free secondary judgment module processes the voice signal data through the voice detection model.
  • The convolutional layer in the voice detection model first processes the voice signal data to obtain and output the first voice data, which includes high-order feature information of the Mel cepstral coefficients and high-order feature information of M.
  • the fully connected layer of the voice detection model processes the voice signal data processed by the convolution layer to obtain the first confidence and the second voice data.
  • the second voice data includes the high-order feature information of the Mel cepstral coefficients and the high-order feature information of M
  • The first confidence is used to characterize the probability that the first voice signal is a voice command sent by the user to the electronic device. A model of this shape is sketched below.
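  • A PyTorch sketch of a model with this shape (the layer sizes and the input packing are illustrative assumptions, not the application's trained network):

```python
import torch
import torch.nn as nn

class VoiceDetectionModel(nn.Module):
    """Convolutional stage whose output is the 'first voice data', and a fully
    connected stage whose hidden activation is the 'second voice data'."""
    def __init__(self, in_channels: int = 14):  # e.g. 13 MFCC rows plus one row carrying M
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.fc_hidden = nn.Linear(32, 16)
        self.fc_out = nn.Linear(16, 1)

    def forward(self, x: torch.Tensor):
        first_voice_data = self.conv(x).squeeze(-1)                       # high-order features from the conv layer
        second_voice_data = torch.relu(self.fc_hidden(first_voice_data))  # high-order features from the FC layer
        first_confidence = torch.sigmoid(self.fc_out(second_voice_data))  # P(voice command aimed at the device)
        return first_confidence, first_voice_data, second_voice_data
```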
  • Step 309 The wake-up-free secondary judgment module processes the posture information through the posture detection model to obtain the second confidence level, the first target posture information, and the second target posture information.
  • When the wake-up-free secondary judgment module processes the posture information through the posture detection model, it can obtain acceleration data from the acceleration sensor; the acceleration data includes the acceleration data of the electronic device on the x-axis, the y-axis, and the z-axis. The posture information of the electronic device is then calculated according to the acceleration data on these three coordinate axes.
  • the posture information of the electronic device includes the absolute value of the acceleration data corresponding to the three coordinate axes of the x-axis, y-axis, and z-axis, and may also include the variance d1 of the acceleration data corresponding to the x-axis, the variance d2 of the acceleration data corresponding to the y-axis, and the variance d3 of the acceleration data corresponding to the z-axis.
  • It may also include the mean p1 of the acceleration data corresponding to the x-axis, the mean p2 of the acceleration data corresponding to the y-axis, and the mean p3 of the acceleration data corresponding to the z-axis, and may also include the difference between d1 and p1, the difference between d2 and p2, and the difference between d3 and p3. One way to assemble such a feature vector is sketched below.
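  • One way to assemble this posture information into a feature vector (the packing order and the use of per-axis absolute means are assumptions):

```python
import numpy as np

def extract_posture_info(acc_xyz: np.ndarray) -> np.ndarray:
    """acc_xyz: array of shape (n, 3) of accelerometer readings on x, y, z."""
    abs_mean = np.abs(acc_xyz).mean(axis=0)  # summary of per-axis absolute acceleration
    d = acc_xyz.var(axis=0)                  # variances d1, d2, d3
    p = acc_xyz.mean(axis=0)                 # means p1, p2, p3
    return np.concatenate([abs_mean, d, p, d - p]).astype(np.float32)  # 12-dim posture feature
```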
  • the wake-up-free secondary judgment module can detect the posture information through the posture detection model to determine whether the electronic device is currently in a hand-held raised state, and can also determine the amplitude of the shaking of the electronic device in the hand-held raised state and other data.
  • the hand-held raised state can be understood as the user holding the electronic device in his hand.
  • the electronic device can match the current application scenario in combination with the first confidence and posture information, and determine whether the first voice signal is a voice command to wake up the voice assistant according to the application scenario.
  • the electronic device can process the posture information through the posture detection model to obtain the second confidence, the first target posture information and the second target posture information.
  • The posture detection model can be a trained convolutional neural network model, which can include a convolutional layer and a fully connected layer. The absolute values of the acceleration data corresponding to the x-axis, y-axis, and z-axis, together with d1, d2, and d3, can represent whether the electronic device is in motion; p1, p2, and p3 can represent the amplitude of the movement of the electronic device; and the differences between d1 and p1, d2 and p2, and d3 and p3 can represent the motion state of the electronic device from other dimensions, such as the smoothness of the movement.
  • the posture detection model can use the above-mentioned posture data to comprehensively judge whether the electronic device is in a hand-held and lifted state based on multiple aspects such as whether the electronic device is moving, the amplitude of the movement, and the smoothness of the movement, thereby improving the accuracy of the posture detection model's judgment.
  • the convolution layer in the posture detection model can first process the posture information and output the first target posture information, which includes the high-order feature information of the posture information. Then, the fully connected layer of the posture detection model processes the posture information processed by the convolution layer to obtain the second confidence and the second target posture information.
  • the second target posture information includes the high-order feature information of the posture information, and the second confidence is used to characterize the probability that the electronic device is in a hand-held raised state.
  • step 308 may be executed before step 309, step 308 may also be executed after step 309, and step 308 may be executed simultaneously with step 309.
  • the embodiment of the present application does not limit the execution order of step 308 and step 309.
  • Step 310 The wake-up-free secondary judgment module processes the first voice data, the second voice data, the first target posture information, and the second target posture information through the audio-posture detection fusion model to obtain the third confidence level.
  • the audio-posture detection fusion model can be a trained convolutional neural network model, which is used to detect the probability that the first voice signal received by the electronic device is a voice command and the electronic device is currently in a hand-held raised state.
  • After processing these inputs, the audio-posture detection fusion model outputs a third confidence level. The third confidence level is used to characterize the probability that the first voice signal is a voice command and the electronic device is currently in a hand-held raised state, that is, to characterize the degree of matching between the posture state of the electronic device and the voice signal received by the electronic device. A sketch of such a fusion model follows below.
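  • A PyTorch sketch of such a fusion model; the input dimensions follow the sketches above and are assumptions:

```python
import torch
import torch.nn as nn

class AudioPostureFusionModel(nn.Module):
    """Concatenates the four high-order feature vectors and maps them to the
    third confidence: P(voice command AND hand-held raised state)."""
    def __init__(self, dims=(32, 16, 32, 16)):  # voice conv/FC + posture conv/FC feature sizes
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(sum(dims), 32),
            nn.ReLU(),
            nn.Linear(32, 1),
        )

    def forward(self, first_voice, second_voice, first_posture, second_posture):
        fused = torch.cat([first_voice, second_voice, first_posture, second_posture], dim=-1)
        return torch.sigmoid(self.fc(fused))  # third confidence
```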
  • Step 311 The wake-up-free secondary judgment module judges whether the first voice signal is a target voice command according to the first confidence level, the second confidence level and the third confidence level.
  • the target voice command is a command for waking up a voice assistant of the electronic device. If the electronic device determines that the first voice signal is the target voice command, step 312 is executed, otherwise, the process ends.
  • the electronic device determines whether the first voice signal is a target voice command according to the first confidence level, the second confidence level, and the third confidence level mainly in the following two methods:
  • The first method: determine the first confidence identifier based on the first confidence, determine the second confidence identifier based on the second confidence, and determine the third confidence identifier based on the third confidence.
  • When the first confidence is greater than or equal to the first confidence threshold, the first confidence identifier is 1; when the first confidence is less than the first confidence threshold, the first confidence identifier is 0.
  • When the second confidence is greater than or equal to the second confidence threshold, the second confidence identifier is 1; when the second confidence is less than the second confidence threshold, the second confidence identifier is 0.
  • When the third confidence is greater than or equal to the third confidence threshold, the third confidence identifier is 1; when the third confidence is less than the third confidence threshold, the third confidence identifier is 0.
  • the electronic device performs a "logical AND (&)" operation on the first confidence identifier, the second confidence identifier, and the third confidence identifier to obtain a second judgment result. If the second judgment result is 1, the electronic device determines that the first voice signal is a target voice instruction, and if the second judgment result is 0, the electronic device determines that the first voice signal is not a target voice instruction.
  • the first confidence threshold, the second confidence threshold and the third confidence threshold can be obtained from historical values, empirical values or experimental data, which are not limited in the present embodiment. Preferably, the first confidence threshold, the second confidence threshold and the third confidence threshold can be 50%.
  • The second method: the electronic device can determine the weight values of the first confidence level, the second confidence level, and the third confidence level through formulas. Then, based on the weight values of the three confidence levels, the three confidence levels are fused to obtain a fused confidence level, and it is then determined whether the first voice signal is the target voice command based on the fused confidence level.
  • The electronic device may calculate the weight value of the first confidence level by using formula (1), where fm is the first confidence level output by the voice detection model this time.
  • The electronic device may calculate the weight value of the second confidence level by using formula (2), where Lm is the second confidence level output by the posture detection model this time, k ranges over the previous Q second confidence levels adjacent to the one output this time, and abs is the absolute value function.
  • The fused confidence level K is then calculated from the three weighted confidence levels, where Rm is the third confidence level output by the audio-posture detection fusion model this time.
  • The electronic device then determines whether K is greater than or equal to the first start-up threshold. If K is greater than or equal to the first start-up threshold, the electronic device determines that the first voice signal is the target voice command; otherwise, the electronic device determines that the first voice signal is not the target voice command.
  • For example, the first start-up threshold can be 60%. A sketch of this fused-confidence decision follows below.
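  • Since formulas (1) and (2) are referenced here only by their symbol definitions, the following sketch substitutes an assumed stability-based weighting over the last Q outputs (consistent with the references to the previous Q confidences and the abs function), followed by the K ≥ 60% decision:

```python
from collections import deque

import numpy as np

class FusedConfidenceDecider:
    """Assumed stand-in for the weight formulas: each model's weight grows as
    its current output agrees with its last Q outputs, then the weighted sum
    gives the fused confidence K."""
    def __init__(self, q: int = 5, first_start_threshold: float = 0.6):
        self.histories = [deque(maxlen=q) for _ in range(3)]  # per-model output history
        self.first_start_threshold = first_start_threshold

    def decide(self, fm: float, lm: float, rm: float) -> bool:
        confs = (fm, lm, rm)  # outputs of the voice, posture, and fusion models
        weights = []
        for history, c in zip(self.histories, confs):
            # small deviation from the recent outputs -> larger weight
            deviation = np.mean([abs(c - h) for h in history]) if history else 0.0
            weights.append(1.0 / (1.0 + deviation))
            history.append(c)
        weights = np.asarray(weights) / np.sum(weights)   # normalized weight values
        k = float(np.dot(weights, confs))                 # fused confidence K
        return k >= self.first_start_threshold            # target voice command if K clears it
```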
  • Since the first confidence level is calculated by the voice detection model, it can exclude application scenarios with only a hand-held raised state; since the second confidence level is calculated by the posture detection model, it can exclude application scenarios with only voice input.
  • The third confidence level combines the high-dimensional features of the voice signal data and the posture information, and can characterize the real-time correlation between the voice input and the posture state of the electronic device. Therefore, using the first confidence level, the second confidence level, and the third confidence level to determine whether the first voice signal is the target voice command yields a more accurate judgment result.
  • In a possible implementation, when it is determined by the second method above that the first voice signal is not the target voice command, the electronic device can also determine whether to display prompt information based on the calculated fused confidence level. If K is less than the first start-up threshold and greater than or equal to the second start-up threshold (the second start-up threshold is less than the first start-up threshold), the electronic device can display a prompt interface as shown in FIG. 4 to prompt the user about problems that occurred when issuing the voice command (for example, the sound is too quiet). In this way, the user can know where the problem lies and make timely improvements, without the voice assistant being woken up.
  • The first start-up threshold and the second start-up threshold can be obtained based on historical values, empirical values, or experimental data, which is not limited in the embodiments of the present application.
  • For example, the second start-up threshold can be 50%.
  • Step 312 The wake-up-free secondary judgment module sends the first voice signal to the voice assistant module.
  • Step 313 The voice assistant module parses the first voice signal and performs a first operation according to the first voice signal.
  • the voice assistant module receives and parses the first voice signal, thereby obtaining the operation instruction, and performs the first operation according to the operation instruction.
  • the voice sent by the user to the electronic device is "Open the camera application, I want to take a photo"
  • the voice assistant module parses the first voice signal corresponding to the voice and can extract the instruction "Open the camera application". Therefore, the voice assistant module can start the camera application according to the instruction, and the operation of starting the camera application is the first operation.
  • after receiving a voice signal, the electronic device first determines, through the wake-up-free first-level judgment module, whether the voice signal needs voice detection. For a voice signal that does not need voice detection, the process is terminated and the voice signal is no longer processed. Judging the voice signal with the wake-up-free first-level judgment module filters out most scenarios that are not intended by the user, thereby avoiding false wake-ups of the voice assistant in the electronic device and saving the computing resources of the electronic device.
  • the electronic device processes the voice signal data of the voice signal through the voice detection model, processes the posture information through the posture detection model, and processes the high-order feature data output by the posture detection model and the voice detection model through the audio-posture detection fusion model.
  • the three models output three confidence levels respectively, and the electronic device then judges, based on the three confidence levels, whether the received voice signal is the target voice command for waking up the voice assistant. If so, the voice assistant is woken up; if not, it is not.
  • because the first confidence level is calculated by the voice detection model, it can exclude application scenarios with only a hand-held lifted state;
  • because the second confidence level is calculated by the posture detection model, it can exclude application scenarios with only voice input;
  • the third confidence level, calculated by the audio-posture detection fusion model, integrates the high-dimensional features of the voice signal data and the posture information and can characterize the real-time correlation between the voice input and the posture state of the electronic device. Therefore, using the first, second, and third confidence levels to judge whether the first voice signal is the target voice command yields a more accurate result, which can reduce the probability of the voice assistant being falsely woken up and improve the user experience.
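For comparison, the first judgment method (binarize each confidence against its own threshold, then AND the flags — cf. claim 2) might look like the following sketch; the shared 50% threshold is the "preferred" value mentioned in the description.

```python
T1 = T2 = T3 = 0.50  # per-confidence thresholds; 50% is the "preferred" value

def and_gate_decision(c1: float, c2: float, c3: float) -> bool:
    flag1 = 1 if c1 >= T1 else 0  # first confidence flag
    flag2 = 1 if c2 >= T2 else 0  # second confidence flag
    flag3 = 1 if c3 >= T3 else 0  # third confidence flag
    return (flag1 & flag2 & flag3) == 1  # logical AND of the three flags
```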
  • FIG. 5A is a flow chart of another voice interaction method provided in the embodiment of the present application.
  • the specific process is as follows:
  • Step 501 An electronic device receives a first voice signal.
  • Step 502 The electronic device sends the first voice signal to the wake-up-free determination module.
  • Step 503 The wake-up-free judgment module processes the first voice signal through the wake-up-free primary judgment module to obtain a first judgment result.
  • Step 504 The wake-up-free judgment module processes the acceleration data through the wake-up-free primary judgment module to obtain a second judgment result.
  • Step 505 The wake-up-free primary judgment module determines whether to perform voice detection on the first voice signal according to the first judgment result and the second judgment result.
  • Step 506 The wake-up-free primary judgment module sends the first voice signal to the wake-up-free secondary judgment module.
  • Step 507 The wake-up-free secondary judgment module obtains voice signal data of the first voice signal.
  • Step 508 The wake-up-free secondary judgment module processes the voice signal data through a voice detection model to obtain a first confidence level, first voice data, and second voice data.
  • Step 509 The wake-up-free secondary judgment module processes the posture information through the posture detection model to obtain the second confidence level, the first target posture information, and the second target posture information.
  • Step 510 The wake-up-free secondary judgment module processes the first voice data, the second voice data, the first target posture information, and the second target posture information through an audio-posture detection fusion model to obtain a third confidence level.
  • Step 511 The wake-up-free secondary judgment module judges whether the first voice signal is a target voice command according to the first confidence level, the second confidence level and the third confidence level.
  • If yes, step 512 is executed; if no, the process ends.
  • for steps 501 to 511, reference may be made to steps 301 to 311 in the embodiment of FIG. 3, which will not be described in detail herein.
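A minimal sketch of the first-level gate in steps 503-505 follows. The description allows several variance rules (any one axis, any two axes, or all three exceeding their thresholds); the "any one axis" variant is shown, and all threshold values are placeholders, not values from the patent.

```python
import statistics

SIGNAL_THRESHOLD = 0.10           # first threshold on signal strength (placeholder)
VAR_THRESHOLDS = (0.5, 0.5, 0.5)  # D1, D2, D3 for the x, y, z axes (placeholders)

def strong_enough(samples) -> bool:
    """First judgment result: is the received voice signal strong enough?"""
    rms = (sum(s * s for s in samples) / len(samples)) ** 0.5
    return rms >= SIGNAL_THRESHOLD

def device_moving(ax, ay, az) -> bool:
    """Second judgment result: does any axis variance exceed its threshold?"""
    variances = (statistics.pvariance(ax),
                 statistics.pvariance(ay),
                 statistics.pvariance(az))
    return any(v >= t for v, t in zip(variances, VAR_THRESHOLDS))

def needs_voice_detection(samples, ax, ay, az) -> bool:
    # logical AND of the two first-level judgment results
    return strong_enough(samples) and device_moving(ax, ay, az)
```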
  • Step 512 The wake-up-free secondary judgment module sends the first voice signal to the voiceprint verification module.
  • Step 513 The voiceprint verification module verifies whether the first voice signal is a voice signal emitted by the user of the electronic device.
  • the voiceprint verification module can be a trained neural network model.
  • the user can input the registration voice according to the prompt of the electronic device, for example, saying to the electronic device "I look really good today" or "Play today's news".
  • the electronic device can extract voice feature information (for example, the frequency of the voice signal, the loudness of the sound, the pitch and timbre of the sound, etc.) according to the registration voice input by the user, and use the extracted voice feature information as the input of the acoustic model.
  • the acoustic model processes the voice feature information, outputs the user's voiceprint feature information, and uses the voiceprint feature information as the input of the back-end judgment module.
  • the back-end judgment module processes the voiceprint feature information and outputs a difference function.
  • the difference function is used to measure the difference between the voiceprint feature information output by the acoustic model and the user's real voiceprint feature information. The larger the difference function, the greater the difference, and the smaller the difference function, the smaller the difference.
  • the electronic device adjusts the network structure or parameters of the acoustic model according to the difference function, so that the voiceprint feature information output by the acoustic model is infinitely close to the user's voiceprint feature information.
  • the voiceprint feature information is used to characterize the elements of the user's voice, and can include the pitch and timbre of the user's voice, and can also include the loudness of the user's voice.
  • when the voiceprint verification module receives the first voice signal (the input voice), it can extract the voice feature information from the first voice signal and use the voice feature information as the input of the acoustic model.
  • the acoustic model processes the voice feature information, outputs the voiceprint feature information corresponding to the first voice signal, and uses the voiceprint feature information as the input of the back-end judgment module.
  • the back-end judgment module judges whether the voiceprint feature information is consistent with the user's voiceprint feature information. If they are consistent, step 514 is executed; if they are inconsistent, the process ends.
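The back-end consistency check can be pictured as comparing the voiceprint feature vector of the input voice against the one learned at registration. The sketch below assumes a cosine-similarity comparison with a fixed margin; the text itself only says the module checks whether the two sets of voiceprint feature information are "consistent".

```python
import math

SIMILARITY_MARGIN = 0.8  # assumed decision margin, not a value from the text

def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) + 1e-12
    nb = math.sqrt(sum(y * y for y in b)) + 1e-12
    return dot / (na * nb)

def is_device_owner(input_embedding, enrolled_embedding) -> bool:
    """Compare the acoustic model's voiceprint features for the input voice
    against the features learned from the registration voice."""
    return cosine(input_embedding, enrolled_embedding) >= SIMILARITY_MARGIN
```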
  • Step 514 The voiceprint verification module sends the first voice signal to the voice assistant module.
  • Step 515 The voice assistant module parses the first voice signal and performs a first operation according to the first voice signal.
  • for step 515, reference may be made to step 313 in the embodiment of FIG. 3, which will not be described in detail here.
  • FIG. 6 is a schematic diagram of the hardware structure of the electronic device 100 provided in the embodiment of the present application.
  • the electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, a button 190, a motor 191, an indicator 192, a camera 193, a display screen 194, and a subscriber identification module (SIM) card interface 195.
  • the sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, fingerprint sensor 180H, temperature sensor 180J, touch sensor 180K, ambient light sensor 180L, bone conduction sensor 180M, etc.
  • the structure illustrated in the embodiment of the present invention does not constitute a specific limitation on the electronic device 100.
  • the electronic device 100 may include more or fewer components than those shown in FIG6, or combine certain components, or separate certain components, or arrange the components differently.
  • the components shown in FIG6 may be implemented in hardware, software, or a combination of software and hardware.
  • the processor 110 may include one or more processing units, for example, the processor 110 may include an application processor (AP), a modem processor, a graphics processor (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc.
  • Different processing units may be independent devices or integrated in one or more processors.
  • the wireless communication function of the electronic device 100 can be implemented through the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modem processor and the baseband processor.
  • Antenna 1 and antenna 2 are used to transmit and receive electromagnetic wave signals.
  • Each antenna in electronic device 100 can be used to cover a single or multiple communication frequency bands. Different antennas can also be reused to improve the utilization of antennas.
  • antenna 1 can be reused as a diversity antenna for a wireless local area network.
  • the antenna can be used in combination with a tuning switch.
  • the mobile communication module 150 can provide solutions for wireless communications including 2G/3G/4G/5G, etc., applied to the electronic device 100.
  • the mobile communication module 150 may include at least one filter, a switch, a power amplifier, a low noise amplifier (LNA), etc.
  • the mobile communication module 150 may receive electromagnetic waves from the antenna 1, and perform filtering, amplification, and other processing on the received electromagnetic waves, and transmit them to the modulation and demodulation processor for demodulation.
  • the mobile communication module 150 may also amplify the signal modulated by the modulation and demodulation processor, and convert it into electromagnetic waves for radiation through the antenna 1.
  • at least some of the functional modules of the mobile communication module 150 may be arranged in the processor 110.
  • at least some of the functional modules of the mobile communication module 150 may be arranged in the same device as at least some of the modules of the processor 110.
  • the wireless communication module 160 can provide wireless communication solutions including wireless local area networks (WLAN) (such as Wi-Fi network), Bluetooth (BT), BLE broadcast, global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR) and the like applied to the electronic device 100.
  • the wireless communication module 160 can be one or more devices integrating at least one communication processing module.
  • the wireless communication module 160 receives electromagnetic waves via the antenna 2, modulates the electromagnetic wave signal and performs filtering, and sends the processed signal to the processor 110.
  • the wireless communication module 160 can also receive the signal to be sent from the processor 110, modulate the signal, amplify it, and convert it into electromagnetic waves for radiation through the antenna 2.
  • the electronic device 100 implements the display function through a GPU, a display screen 194, and an application processor.
  • the GPU is a microprocessor for image processing, which connects the display screen 194 and the application processor.
  • the GPU is used to perform mathematical and geometric calculations for graphics rendering.
  • the processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
  • the display screen 194 is used to display images, videos, etc.
  • the display screen 194 includes a display panel.
  • the display panel can be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini-LED, a Micro-LED, a Micro-OLED, a quantum dot light-emitting diode (QLED), etc.
  • the electronic device 100 may include 1 or N display screens 194, where N is a positive integer greater than 1.
  • the electronic device 100 can realize the shooting function through ISP, camera 193, video codec, GPU, display screen 194 and application processor.
  • the ISP is used to process the data fed back by the camera 193. For example, when taking a photo, the shutter is opened, and the light is transmitted to the camera photosensitive element through the lens. The light signal is converted into an electrical signal, and the camera photosensitive element transmits the electrical signal to the ISP for processing and converts it into an image visible to the naked eye.
  • the ISP can also perform algorithm optimization on the noise, brightness, and skin color of the image. The ISP can also optimize the exposure, color temperature and other parameters of the shooting scene. In some embodiments, the ISP can be set in the camera 193.
  • the digital signal processor is used to process digital signals, and can process not only digital image signals but also other digital signals. For example, when the electronic device 100 is selecting a frequency point, the digital signal processor is used to perform Fourier transform on the frequency point energy.
  • the NPU is a neural-network (NN) computing processor; by drawing on the structure of biological neural networks, for example the transfer pattern between neurons in the human brain, it processes input information quickly and can also learn continuously.
  • through the NPU, applications such as intelligent cognition of the electronic device 100 can be realized, for example image recognition, face recognition, voice recognition, and text understanding.
  • the electronic device 100 can implement audio functions such as music playing and recording through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the headphone jack 170D, and the application processor.
  • the audio module 170 is used to convert digital audio information into analog audio signal output, and is also used to convert analog audio input into digital audio signals.
  • the audio module 170 can also be used to encode and decode audio signals.
  • the audio module 170 can be arranged in the processor 110, or some functional modules of the audio module 170 can be arranged in the processor 110.
  • the speaker 170A, also called a "horn", is used to convert an audio electrical signal into a sound signal.
  • the electronic device 100 can listen to music or listen to a hands-free call through the speaker 170A.
  • the receiver 170B, also called an "earpiece", is used to convert audio electrical signals into sound signals.
  • the voice can be received by placing the receiver 170B close to the human ear.
  • the microphone 170C, also called a "mic" or "mouthpiece", is used to convert sound signals into electrical signals. When making a call or sending a voice message, the user can speak with the mouth close to the microphone 170C to input the sound signal into the microphone 170C.
  • the electronic device 100 can be provided with at least one microphone 170C. In other embodiments, the electronic device 100 can be provided with two microphones 170C, which can collect sound signals and also implement noise reduction. In still other embodiments, the electronic device 100 can be provided with three, four, or more microphones 170C to collect sound signals, reduce noise, identify sound sources, implement directional recording, and so on.
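The two-microphone setup is what enables the inter-microphone energy difference M used earlier in the method (M characterizes the distance between the sound source and the device; M ≥ H means the source is close, e.g. within about 40 cm). A sketch, with the dB formulation and the value of H assumed:

```python
import math

H_DB = 6.0  # assumed energy threshold H, standing in for the ~40 cm boundary

def energy(frame) -> float:
    return sum(s * s for s in frame) / len(frame)

def near_field(mic1_frame, mic2_frame) -> bool:
    """M = energy difference between the two microphones; M >= H means the
    sound source is close to the device (e.g. within about 40 cm)."""
    m = 10.0 * math.log10(energy(mic1_frame) / max(energy(mic2_frame), 1e-12))
    return m >= H_DB
```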
  • the pressure sensor 180A is used to sense the pressure signal and can convert the pressure signal into an electrical signal.
  • the pressure sensor 180A can be disposed on the display screen 194 .
  • the air pressure sensor 180C is used to measure air pressure.
  • the electronic device 100 calculates the altitude through the air pressure value measured by the air pressure sensor 180C to assist in positioning and navigation.
  • the magnetic sensor 180D includes a Hall sensor, and the electronic device 100 can detect the opening and closing of the flip leather case by using the magnetic sensor 180D.
  • the acceleration sensor 180E can detect the magnitude of the acceleration of the electronic device 100 in all directions (generally three axes). When the electronic device 100 is stationary, the magnitude and direction of gravity can be detected. It can also be used to identify the posture of the electronic device and is applied to applications such as horizontal and vertical screen switching and pedometers.
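These accelerometer readings are also the raw material for the posture features listed in the method (per-axis absolute values, variances d1-d3, means p1-p3, and the difference of each variance and mean). A sketch of the feature assembly, with window handling assumed:

```python
import statistics

def pose_features(ax, ay, az):
    """Assemble the posture features named in the method for one window of
    three-axis accelerometer samples."""
    feats = []
    for axis in (ax, ay, az):
        d = statistics.pvariance(axis)      # d1 / d2 / d3
        p = statistics.fmean(axis)          # p1 / p2 / p3
        feats.extend(abs(s) for s in axis)  # per-axis absolute values
        feats.extend((d, p, d - p))         # variance, mean, and their difference
    return feats
```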
  • the fingerprint sensor 180H is used to collect fingerprints.
  • the electronic device 100 can use the collected fingerprint characteristics to implement fingerprint unlocking, access application locks, fingerprint photography, fingerprint call answering, etc.
  • the touch sensor 180K is also called a "touch panel".
  • the touch sensor 180K can be set on the display screen 194, and the touch sensor 180K and the display screen 194 form a touch screen, also called a "touchscreen".
  • the touch sensor 180K is used to detect touch operations acting on or near it.
  • the touch sensor can pass the detected touch operation to the application processor to determine the type of touch event.
  • Visual output related to the touch operation can be provided through the display screen 194.
  • the touch sensor 180K can also be set on the surface of the electronic device 100, which is different from the position of the display screen 194.
  • the bone conduction sensor 180M can obtain a vibration signal. In some embodiments, the bone conduction sensor 180M can obtain a vibration signal of a vibrating bone block of a human vocal part.
  • the software system of the electronic device 100 can adopt a layered architecture, an event-driven architecture, a micro-core architecture, a microservice architecture, or a cloud architecture.
  • the embodiment of the present invention takes the Android system of the layered architecture as an example to exemplify the software structure of the electronic device 100.
  • FIG. 7 is a software structure block diagram of the electronic device 100 of the embodiment of the present application.
  • the layered architecture divides the software into several layers, and each layer has a clear role and division of labor.
  • the layers communicate with each other through software interfaces.
  • the Android system is divided into five layers, which are, from top to bottom, the application layer, the application framework layer, the hardware abstraction layer (HAL layer), the kernel layer, and the digital signal processing layer.
  • the application layer may include a series of application packages. As shown in FIG7 , the application package may include applications such as camera, gallery, calendar, call, map, navigation, WLAN, Bluetooth, voice assistant module, video, etc.
  • the voice assistant module is used to parse the user's voice commands and perform relevant operations according to the user's voice commands, thereby realizing voice interaction between the electronic device and the user.
  • the application framework layer provides application programming interface (API) and programming framework for the applications in the application layer.
  • the application framework layer includes some predefined functions. As shown in Figure 7, the application framework layer may include a window manager, a content provider, a view system, a phone manager, a resource manager, a notification manager, etc.
  • the window manager is used to manage window programs.
  • the window manager can obtain the display screen size, determine whether there is a status bar, lock the screen, capture the screen, etc.
  • Content providers are used to store and retrieve data and make it accessible to applications.
  • the data may include videos, images, audio, calls made and received, browsing history and bookmarks, phone books, etc.
  • the view system includes visual controls, such as controls for displaying text, controls for displaying images, etc.
  • the view system can be used to build applications.
  • a display interface can be composed of one or more views.
  • a display interface including a text notification icon can include a view for displaying text and a view for displaying images.
  • the phone manager is used to provide communication functions of the electronic device 100, such as management of call status (including connecting, hanging up, etc.).
  • the resource manager provides various resources for applications, such as localized strings, icons, images, layout files, video files, and so on.
  • the notification manager enables applications to display notification information in the status bar. It can be used to convey notification-type messages and can disappear automatically after a short stay without user interaction. For example, the notification manager is used to notify download completion, message reminders, etc.
  • the notification manager can also be a notification that appears in the system top status bar in the form of a chart or scroll bar text, such as notifications of applications running in the background, or a notification that appears on the screen in the form of a dialog window. For example, a text message is displayed in the status bar, a prompt sound is emitted, an electronic device vibrates, an indicator light flashes, etc.
  • the hardware abstraction layer includes a voiceprint verification module, which is used to determine whether a received voice signal is a voice signal sent by a user.
  • the kernel layer is the layer between hardware and software.
  • the kernel layer contains at least display driver, camera driver, audio driver, and sensor driver.
  • the digital signal processing layer includes a wake-up-free judgment module, which is used to judge whether the received voice signal is a voice signal to wake up the voice assistant in the electronic device.
  • the computer program product includes one or more computer instructions.
  • the computer can be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device.
  • the computer instructions can be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium.
  • the computer instructions can be transmitted from one website, computer, server or data center to another website, computer, server or data center by wired (e.g., coaxial cable, optical fiber, digital subscriber line) or wireless (e.g., infrared, wireless, microwave, etc.) means.
  • the computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device such as a server or data center that includes one or more available media integrated.
  • the available medium can be a magnetic medium (e.g., a floppy disk, a hard disk, a tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid-state drive (SSD)), etc.
  • the aforementioned storage medium includes: a ROM, a random access memory (RAM), a magnetic disk, an optical disk, or other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Security & Cryptography (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Telephone Function (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The present application provides a voice interaction method and a related electronic device. The method includes: receiving a first voice signal; obtaining voice signal data based on the first voice signal when it is determined that voice detection is to be performed on the first voice signal; processing the voice signal data through a voice detection model to obtain a first confidence level; obtaining acceleration data of the electronic device and deriving posture information of the electronic device from the acceleration data; processing the posture information through a posture detection model to obtain a second confidence level; processing the target posture information and the voice data through an audio-posture detection fusion model to obtain a third confidence level; and judging, based on the first confidence level, the second confidence level, and the third confidence level, whether to start a voice interaction application. The method can prevent the voice interaction application of the electronic device from being falsely woken up.

Description

一种语音交互方法及相关电子设备
本申请要求于2022年11月04日提交中国专利局、申请号为202211376580.5、发明名称为“一种语音交互方法及相关电子设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及语音交互领域,尤其涉及一种语音交互方法及相关电子设备。
背景技术
随着智能电子设备技术的不断发展,在许多电子设备上具有语音助手的功能,以实现用户与电子设备之间的交互。语音助手是一款智能型的应用,通过智能对话与即时问答的智能交互,实现帮助用户解决问题。一般,语音助手有三种不同的助手类型:闲聊型、问答型、指令型。闲聊型助手用于实现闲聊陪伴的目的,通过AI的技术来与用户进行对话,感知用户情绪。问答型助手用于知识获取,通过对话的方式来获取知识,或者解决疑问,比较常见的应用则是各个平台的智能客服。指令型助手用于设备控制,通过对话的方式来控制电子设备,实现某种操作,比较常见的应用有智能音响、IOT设备等,比如,语音控制:“打开空调,然后调成25度”。
对于一些不需要唤醒词唤醒语音助手的应用场景,用户可以不用在语音指令中添加特定的唤醒词即可唤醒语音助手,这使得用户在与电子设备进行语音交互的过程中,可以更加自然。此外,用户在与电子设备进行语音交互时,不使用特定的唤醒词,也更加符合用户的习惯。
因此,如何在不需要唤醒词就可以与语音助手进行语音交互的情况下,降低语音助手被误唤醒的概率,是技术人员日益关注的问题。
发明内容
本申请实施例提供了一种语音交互方法及相关电子设备方法,该方法解决了语音交互应用被误唤醒的问题。
第一方面，本申请实施例提供了一种语音交互方法，应用于电子设备，该电子设备包括语音交互应用，方法包括：接收第一语音信号；在确定第一语音信号要进行语音检测的情况下，基于第一语音信号得到语音信号数据；将语音信号数据通过语音检测模型处理，得到第一置信度和语音数据，第一置信度用于表征第一语音信号为用户发送给电子设备的语音指令的概率；获取电子设备的加速度数据，并基于加速度数据得到电子设备的位姿信息；将位姿信息通过位姿检测模型进行处理，得到第二置信度和目标位姿信息，第二置信度用于表征电子设备处于手持抬起状态的概率；将目标位姿信息和语音数据通过音频-位姿检测融合模型进行处理，得到第三置信度，第三置信度用于表征电子设备处于手持抬起状态且第一语音信号为用户发送给电子设备的语音指令的概率；基于第一置信度、第二置信度和第三置信度判断是否启动语音交互应用。
在上述实施例中,电子设备在接收到一段语音信号后,若判断语音信号需要进行语音检测,电子设备将语音信号的语音信号数据通过语音检测模型进行处理,将位姿信息通过位姿检测模块进行处理,将位姿检测模块和语音检测模型输出的高阶特征数据通过音频-位姿监测模型进行处理,这三个模型分别输出三个置信度,再基于这三个置信度判断其接收的语音信号是否为唤醒语音助手的目标语音指令。若是,则唤醒语音助手,若不是,则不唤醒语音助手。由于第一置信度是由语音检测模型计算得到的,第二置信度是由位姿检测模型计算得到的,第三置信度是由音频-位姿检测融合模型计算得到的。通过第一置信度可以排除仅有手持抬起状态的应用场景,通过第二置信度可以排除仅有语音输入的应用场景,第三置信度融合了语音信息数据和位姿信息的高维特征,可以表征电子设备语音输入和位姿状态的实时相关性。因此,通过上述第一置信度、第二置信度以及第三置信度判断第一语音信号是否为目标语音指令,得到的判断结果更加准确,可以降低语音助手被误唤醒的概率,提高了用户体验。
结合第一方面,在一种可能实现的方式中,基于所述第一置信度、所述第二置信度和所述第三置信度判断是否启动所述语音交互应用,具体包括:在第一置信度大于或等于第一置信阈值的情况下,将第一置信标识设置为1;在第一置信度小于第一置信阈值的情况下,将第一置信标识设置为0;在第二置信度大于或等于第二置信阈值的情况下,将第二置信标识设置为1;在第二置信度小于第二置信阈值的情况下,将第二置信标识设置为0;在第三置信度大于或等于第三置信阈值的情况下,将第三置信标识设置为1;在第三置信度小于第三置信阈值的情况下,将第三置信标识设置为0;将第一置信标识、第二置信标识以及第三置信标识进行逻辑与运算,得到判决结果;根据判决结果判断是否启动语音交互应用。
这样,电子设备可以根据判决结果决定是否将第一语音信号发送给语音交互应用,避免语音交互应用误唤醒,降低用户的使用体验。
结合第一方面,在一种可能实现的方式中,根据判决结果判断是否启动语音交互应用,具体包括:在判决结果为1的情况下,启动语音交互应用;在判决结果为0的情况下,不启动语音交互应用。
结合第一方面,在一种可能实现的方式中,该电子设备还包括声纹检测模块,根据判决结果判断是否启动语音交互应用,具体包括:在判决结果为0的情况下,不启动语音交互应用;在判决结果为1的情况下,将第一语音信号通过声纹检测模块检测是否为目标用户的声音,目标用户为电子设备的用户;若判断为是,启动语音交互应用;若判断为否,不启动语音交互应用。
这样，声纹验证模块判断第一语音信号为用户本人发出的语音信号后，才将第一语音信号发送给语音助手模块。通过这种方法，只有电子设备的用户才能唤醒语音助手，在降低语音助手被误触发概率的前提下，保障了用户的隐私性和安全性。
结合第一方面,在一种可能实现的方式中,基于第一置信度、第二置信度和第三置信度判断是否启动语音交互应用,具体包括:计算第一置信度的第一权重值、第二置信度的第二权重值、第三置信度的第三权重值;基于第一置信度、第一权重值、第二置信度、第二权重值、第三置信度、第三权重值,计算得到融合后的置信度;基于融合后的置信度判断是否启动语音交互应用。
结合第一方面,在一种可能实现的方式中,计算第一置信度的第一权重值、第二置信度的第二权重值、第三置信度的第三权重值,具体包括:根据公式计算第一权重值,W1为第一权重值,abs为绝对值函数,fm为语音检测模型本次输出的第一置信度,k为与本次输出的第一置信度最相邻的前Q个第一置信度的编号;根据公式计算第二权重值,W2为第二权重值,Lm为位姿检测模型本次输出的第二置信度,k为与本次输出的第二置信度最相邻的前Q个第二置信度的编号;根据公式W3=1-W1-W2计算第三权重值,W3为第三权重值。
结合第一方面,在一种可能实现的方式中,基于第一置信度、第一权重值、第二置信度、第二权重值、第三置信度、第三权重值,计算得到融合后的置信度,具体包括:根据公式K=fm·W1+Lm·W2+Rm·W3计算融合后的置信度;其中,K为融合后的置信度,Rm为第三置信度。
结合第一方面,在一种可能实现的方式中,基于融合后的置信度判断是否启动语音交互应用,具体包括:若融合后的置信度大于或等于第一启动阈值,启动语音交互应用;若融合后的置信度小于第一启动阈值,不启动语音交互应用。
结合第一方面,在一种可能实现的方式中,该电子设备包括显示屏,若融合后的置信度小于第一启动阈值,且大于或等于第二启动阈值,在显示屏上显示提示信息,提示信息用于指示用户再次发出语音指令;第二启动阈值小于第一启动阈值。
结合第一方面,在一种可能实现的方式中,该电子设备还包括声纹检测模块,基于融合后的置信度判断是否启动语音交互应用,具体包括:若融合后的置信度小于第一启动阈值,不启动语音交互应用;若融合后的置信度大于或等于第一启动阈值,将第一语音信号通过声纹检测模块检测是否为目标用户的声音,目标用户为电子设备的用户;若判断为是,启动语音交互应用;若判断为否,不启动语音交互应用。
这样，声纹验证模块判断第一语音信号为用户本人发出的语音信号后，才将第一语音信号发送给语音助手模块。通过这种方法，只有电子设备的用户才能唤醒语音助手，在降低语音助手被误触发概率的前提下，保障了用户的隐私性和安全性。
结合第一方面,在一种可能实现的方式中,基于第一语音信号得到语音信号数据之前,还包括:获取语音信号的信号强度值、电子设备在x轴上的加速度方差D1、电子设备在y轴上的加速度方差D2、电子设备在z轴上的加速度方差D3;基于信号强度值、D1、D2以及D3判断第一语音信号是否需要进行语音检测。
这样，电子设备在接收到一段语音信号后，首先通过免唤醒一级判断模块判断该语音信号是否需要进行语音检测，对于不需要进行语音检测的语音信号就结束流程，可以不再对该语音信号进行处理，通过免唤醒一级判断模块对语音信号进行判断，过滤掉了大部分非用户意图的场景，从而避免了电子设备中的语音助手被误唤醒，也节约了电子设备的计算资源。
结合第一方面,在一种可能实现的方式中,语音数据包括第一语音数据和第二语音数据,第一语音数据为语音检测模型的卷积层输出的高阶语音特征信息,第二语音数据为语音检测模型的全连接层输出的高阶语音特征信息;目标位姿信息包括第一目标位姿信息和第二目标位姿信息,第一目标位姿信息为位姿检测模型的卷积层输出的高阶语音特征信息,第二目标位姿信息为位姿检测模型的全连接层输出的高阶语音特征信息。
第二方面,本申请实施例提供了一种电子设备,该电子设备包括:一个或多个处理器、显示屏和存储器;该存储器与该一个或多个处理器耦合,该存储器用于存储计算机程序代码,该计算机程序代码包括计算机指令,该一个或多个处理器调用该计算机指令以使得该电子设备执行:在确定第一语音信号要进行语音检测的情况下,基于第一语音信号得到语音信号数据;将语音信号数据通过语音检测模型处理,得到第一置信度和语音数据,第一置信度用于表征第一语音信号为用户发送给电子设备的语音指令的概率;获取电子设备的加速度数据,并基于加速度数据得到电子设备的位姿信息;将位姿信息通过位姿检测模型进行处理,得到第二置信度和目标位姿信息,第二置信度用于表征电子设备处于手持抬起状态的概率;将目标位姿信息和语音数据通过音频-位姿检测融合模型进行处理,得到第三置信度,第三置信度用于表征电子设备处于手持抬起状态且第一语音信号为用户发送给电子设备的语音指令的概率;基于第一置信度、第二置信度和第三置信度判断是否启动语音交互应用。
结合第二方面,在一种可能实现的方式中,该一个或多个处理器调用该计算机指令以使得该电子设备执行:在第一置信度大于或等于第一置信阈值的情况下,将第一置信标识设置为1;在第一置信度小于第一置信阈值的情况下,将第一置信标识设置为0;在第二置信度大于或等于第二置信阈值的情况下,将第二置信标识设置为1;在第二置信度小于第二置信阈值的情况下,将第二置信标识设置为0;在第三置信度大于或等于第三置信阈值的情况下,将第三置信标识设置为1;在第三置信度小于第三置信阈值的情况下,将第三置信标识设置为0;将第一置信标识、第二置信标识以及第三置信标识进行逻辑与运算, 得到判决结果;根据判决结果判断是否启动语音交互应用。
结合第二方面,在一种可能实现的方式中,该一个或多个处理器调用该计算机指令以使得该电子设备执行:在判决结果为1的情况下,启动语音交互应用;在判决结果为0的情况下,不启动语音交互应用。
结合第二方面,在一种可能实现的方式中,该一个或多个处理器调用该计算机指令以使得该电子设备执行:在判决结果为0的情况下,不启动语音交互应用;在判决结果为1的情况下,将第一语音信号通过声纹检测模块检测是否为目标用户的声音,目标用户为电子设备的用户;若判断为是,启动语音交互应用;若判断为否,不启动语音交互应用。
结合第二方面,在一种可能实现的方式中,该一个或多个处理器调用该计算机指令以使得该电子设备执行:计算第一置信度的第一权重值、第二置信度的第二权重值、第三置信度的第三权重值;基于第一置信度、第一权重值、第二置信度、第二权重值、第三置信度、第三权重值,计算得到融合后的置信度;基于融合后的置信度判断是否启动语音交互应用。
结合第二方面,在一种可能实现的方式中,该一个或多个处理器调用该计算机指令以使得该电子设备执行:根据公式计算第一权重值,W1为第一权重值,abs为绝对值函数,fm为语音检测模型本次输出的第一置信度,k为与本次输出的第一置信度最相邻的前Q个第一置信度的编号;根据公式计算第二权重值,W2为第二权重值,Lm为位姿检测模型本次输出的第二置信度,k为与本次输出的第二置信度最相邻的前Q个第二置信度的编号;根据公式W3=1-W1-W2计算第三权重值,W3为第三权重值。
结合第二方面,在一种可能实现的方式中,该一个或多个处理器调用该计算机指令以使得该电子设备执行:根据公式K=fm·W1+Lm·W2+Rm·W3计算融合后的置信度;其中,K为融合后的置信度,Rm为第三置信度。
结合第二方面,在一种可能实现的方式中,该一个或多个处理器调用该计算机指令以使得该电子设备执行:若融合后的置信度大于或等于第一启动阈值,启动语音交互应用;若融合后的置信度小于第一启动阈值,不启动语音交互应用。
结合第二方面，在一种可能实现的方式中，该一个或多个处理器调用该计算机指令以使得该电子设备执行：若融合后的置信度小于第一启动阈值，且大于或等于第二启动阈值，控制显示屏显示提示信息，提示信息用于指示用户再次发出语音指令；第二启动阈值小于第一启动阈值。
结合第二方面,在一种可能实现的方式中,该一个或多个处理器调用该计算机指令以使得该电子设备执行:若融合后的置信度小于第一启动阈值,不启动语音交互应用;若融合后的置信度大于或等于第一启动阈值,将第一语音信号通过声纹检测模块检测是否为目标用户的声音,目标用户为电子设备的用户;若判断为是,启动语音交互应用;若判断为否,不启动语音交互应用。
结合第二方面,在一种可能实现的方式中,该一个或多个处理器调用该计算机指令以使得该电子设备执行:获取语音信号的信号强度值、电子设备在x轴上的加速度方差D1、电子设备在y轴上的加速度方差D2、电子设备在z轴上的加速度方差D3;基于信号强度值、D1、D2以及D3判断第一语音信号是否需要进行语音检测。
第三方面,本申请实施例提供了一种电子设备,包括:触控屏、摄像头、一个或多个处理器和一个或多个存储器;所述一个或多个处理器与所述触控屏、所述摄像头、所述一个或多个存储器耦合,所述一个或多个存储器用于存储计算机程序代码,计算机程序代码包括计算机指令,当所述一个或多个处理器执行所述计算机指令时,使得所述电子设备执行如第一方面或第一方面的任意一种可能实现的方式所述的方法。
第四方面,本申请实施例提供了一种芯片系统,该芯片系统应用于电子设备,该芯片系统包括一个或多个处理器,该处理器用于调用计算机指令以使得该电子设备执行如第一方面或第一方面的任意一种可能实现的方式所述的方法。
第五方面,本申请实施例提供了一种包含指令的计算机程序产品,当该计算机程序产品在电子设备上运行时,使得该电子设备执行如第一方面或第一方面的任意一种可能实现的方式所述的方法。
第六方面,本申请实施例提供了一种计算机可读存储介质,包括指令,当该指令在电子设备上运行时,使得该电子设备执行如第一方面或第一方面的任意一种可能实现的方式所述的方法。
附图说明
图1A-图1G是本申请实施例提供的一组语音交互方法的场景示例图;
图2是本申请实施例提供的一种语音交互方法的系统框架图;
图3是本申请实施例提供的一种语音交互方法的流程图;
图4是本申请实施例提供的一种用户界面示例图;
图5A是本申请实施例提供的另一种语音交互方法流程图;
图5B是本申请实施例提供的一种声纹检测模型结构图;
图6是本申请实施例提供的电子设备100的硬件结构示意图;
图7是本申请实施例提供的电子设备100的软件结构框图。
具体实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述。显然,所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例。在本文中提及“实施例”意味着,结合实施例描述的特定特征、结构或者特性可以包含在本实施例申请的至少一个实施例中。在说明书中的各个位置出现该短语并不一定均是相同的实施例,也不是与其它实施例互斥的独立的或是备选的实施例。本领域技术人员可以显式地和隐式地理解的是,本文所描述的实施例可以与其它实施例相结合。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
本申请的说明书和权利要求书及所述附图中术语“第一”、“第二”、“第三”等是区别于不同的对象,而不是用于描述特定顺序。此外,术语“包括”和“具有”以及它们的任何变形,意图在于覆盖不排他的包含。例如,包含了一系列步骤或单元,或者可选地,还包括没有列出的步骤或单元,或者可选地还包括这些过程、方法、产品或设备固有的其它步骤或单元。
附图中仅示出了与本申请相关的部分而非全部内容。在更加详细地讨论示例性实施例之前,应当提到的是,一些示例性实施例被描述成作为流程图描绘的处理或方法。虽然流程图将各项操作(或步骤)描述成顺序的处理,但是其中的许多操作可以并行地、并发地或者同时实施。此外,各项操作的顺序可以被重新安排。当其操作完成时所述处理可以被终止,但是还可以具有未包括在附图中的附加步骤。所述处理可以对应于方法、函数、规程、子例程、子程序等等。
在本说明书中使用的术语“部件”、“模块”、“系统”、“单元”等用于表示计算机相关的实体、硬件、固件、硬件和软件的组合、软件或执行中的软件。例如,单元可以是但不限于在处理器上运行的进程、处理器、对象、可执行文件、执行线程、程序和/或分布在两个或多个计算机之间。此外,这些单元可从在上面存储有各种数据结构的各种计算机可读介质执行。单元可例如根据具有一个或多个数据分组(例如来自与本地系统、分布式系统和/或网络间的另一单元交互的第二单元数据。例如,通过信号与其它系统交互的互联网)的信号通过本地和/或远程进程来通信。
在本申请实施例中,以语音交互应用为语音助手为例,进行说明。
随着智能电子设备技术的不断发展,在许多电子设备上具有语音助手的功能,以实现用户与电子设备之间的交互。语音助手是一款智能型的应用,通过智能对话与即时问答的智能交互,实现帮助用户解决问题。一般,语音助手有三种不同的助手类型:闲聊型、问答型、指令型。闲聊型助手用于实现闲聊陪伴的目的,通过AI的技术来与用户进行对话,感知用户情绪。问答型助手用于知识获取,通过对话的方式来获取知识,或者解决疑问,比较常见的应用则是各个平台的智能客服。指令型助手用于设备控制,通过对话的方式来控制电子设备,实现某种操作,比较常见的应用有智能音响、IOT设备等,比如,语音控制:“打开空调,然后调成25度”。
如图1A所示,当用户靠近电子设备100,并对电子设备100发出语音指令“语音助手,打开音乐应用”,响应用户的语音指令,电子设备100打开音乐应用,并显示如图1B所示的音乐界面。或者,如图1C所示,当用户靠近电子设备100,并对电子设备100发出语音指令“语音助手,查询曲高和寡的意思”,响应用户的语音指令,电子设备100在网络上查询“曲高和寡”的意思,并将查询结果显示在如图1D所示的用户界面上。
用户唤醒语音助手的方式主要有两种:一种是用户每次在唤醒语音助手之前,都需要在语音指令中加入特定的语音唤醒词,电子设备在检测到用户的语音指令中存在语音唤醒词,电子设备才会唤醒语音助手。否则,电子设备不唤醒语音助手。对于不同厂商的电子设备而言,唤醒语音助手的唤醒词是不同的。例如,若厂商1的电子设备的唤醒词是“X爱同学”。那么,当要唤醒厂商1的电子设备中的语音助手时,需要在语音指令前面加上“X爱同学”。例如,X爱同学,请打开音乐应用。这种通过在语音指令前面加唤醒词的方法常常使得用户与电子设备之间的语音交互不自然,不符合用户习惯。
另外一种是用户不需要在语音指令中添加特定唤醒词即可唤醒语音助手,即:用户直接向电子设备发送语音指令来唤醒语音助手,并指示语音助手进行相应的操作。示例性的,如图1E所示,当用户靠近电子设备100,并对电子设备100发出语音指令“打开音乐应用”,响应用户的语音指令,电子设备100打开音乐应用,并显示如图1F所示的音乐界面。
在一种可能实现的方式中,在开启语音助手功能的用户界面中可以包括“免语音唤醒”控件。如图1G所示,当电子设备100检测到针对“免语音唤醒”控件101的输入操作(例如,单击),响应该操作,电子设备100可以开启“免语音唤醒”功能,即:用户向电子设备100发送语音指令时,不用在语音指令中添加特定的唤醒词,即可唤醒语音助手。可选地,在电子设备100检测到针对“免语音唤醒”控件101的输入操作之后,还可以在如图1G的用户界面上显示用于提示用户靠近麦克风的提示信息。例如,靠近手机底部麦克风2~5厘米处说出指令。
对于上述第二种语音助手的唤醒方法,用户可以不用在语音指令中添加特定的唤醒词即可唤醒语音助手,这使得用户在与电子设备进行语音交互的过程中,可以更加自然。此外,用户在与电子设备进行语音交互时,不使用特定的唤醒词,也更加符合用户的习惯。但是,由于没有特定的唤醒词来唤醒语音助手,这就使得在电子设备的语音助手有被误触发的现象。例如,用户寻找东西,将手机放在桌子上,若此时用户询问其他人东西放在哪里,由于没有唤醒词,电子设备可能启动语音助手与用户进行语音交流,这就发生了语音助手的误触发。或者,当用户在开会场景下,用户将手机放在桌子上,用户进行发言时,电子设备检测到用户发出的语音信号,也可能唤醒语音助手,从而造成语音助手的误触发。语音助手的频繁误触发,会给用户带来不便,从而降低用户的使用体验。
因此,为了解决上述问题,本申请实施例提出了一种语音交互的方法,该方法包括:电子设备获取语音信号数据和位姿数据,语音信号数据可以包括电子设备多个麦克风接收的语音信号的梅尔倒谱系数、电子设备多个麦克风接收的音频的能量差,位姿数据可以包括电子设备的加速度传感器获取的在x轴方向的加速度数据、在y轴方向的加速度数据、在z轴方向的加速度数据。电子设备将语音信号数据作为语音检测模型的输入,语音检测模型将语音信号进行处理得到第一置信度,电子设备将位姿数据作为位姿检测模型的输入, 位姿检测模型将位姿数据进行处理,输出第二置信度。电子设备将语音检测模型中的卷积层输出的第一语音数据以及语音检测模型中的全连接层输出的第二语音数据作为语音-位姿检测模型的输入,电子设备将位姿检测模型中的卷积层输出的第一目标位姿数据以及位姿检测模型中的全连接层输出的第二目标位姿数据作为语音-位姿模型的输入。语音-位姿模型基于第一语音数据、第二语音数据、第一目标位姿数据以及第二位姿数据进行处理,输出第三置信度。电子设备基于第一置信度、第二置信度以及第三置信度确定是否唤醒语音助手。
下面,结合图2对本申请实施例提供的一种语音交互方法的系统框架进行介绍。如图2所示,在系统架构中包括免唤醒判断模块和语音助手模块。免唤醒判断模块位于数字音频处理器层(DSP层),免唤醒判断模块包括免唤醒一级判断模块和免唤醒二级判断模块。语音助手模块位于应用程序层。免唤醒判断模块接收到第一语音信号后,首先将第一语音信号通过免唤醒一级判断模块进行处理,检测第一语音信号是否需要进行语音检测。若需要,将第一语音信号发送给免唤醒二级判断模块进行语音检测。若检测到第一语音信号为发送给电子设备的语音指令,免唤醒判断模块将第一语音信号发送给语音助手模块,然后,再由语音助手模块根据第一语音信号进行目标操作。
下面,对本申请实施例提供的一种语音交互方法的流程进行介绍。请参见图3,图3是本申请实施例提供的一种语音交互方法的流程图,在图3中,电子设备通过麦克风接收外界的语音信号,电子设备具有的麦克风的数量为N,N为大于或等于2的整数。在图3所示的电子设备中,包括免唤醒判断模块和语音助手模块。其中,免唤醒判断模块包括免唤醒一级判断模块和免唤醒二级判断模块,免唤醒二级判断模块包括语音检测模型、位姿检测模型以及语音-位姿检测模型。为了便于叙述,本申请实施例以N为2进行举例说明。具体流程如下:
步骤301:电子设备接收第一语音信号。
具体地,第一语音信号可以为用户发出的语音信号,也可以为其它音源发出的语音信号。电子设备存在一个或多个麦克风,电子设备可以通过麦克风接收外界的语音信号。
步骤302:电子设备将第一语音信号发送给免唤醒判断模块。
步骤303:免唤醒判断模块将第一语音信号通过免唤醒一级判断模块进行处理,得到第一判断结果。
具体地,电子设备在接收到第一语音信号后,可以将第一语音信号发送给免唤醒判断模块。免唤醒判断模块在接收到第一语音信号后,可以将第一语音信号通过免唤醒一级判断模块进行处理。然后,免唤醒一级判断模块基于接收的第一语音信号,计算第一语音信号的信号强度,以判断第一语音信号的强弱。若第一语音信号弱,则确定第一语音信号不是对电子设备发出的语音指令。免唤醒一级判断模块在计算出第一语音信号的信号强度后,可以输出第一判断结果。其中,第一判断结果可以为第一标识或第二标识,当第一语音信号的信号强度大于或等于第一阈值时,第一判断结果为第一标识,第一标识用于表征第一语音信号的强度强。当第一语音信号的信号强度小于第一阈值时,第一判断结果为第二标 识,第二标识用于表征第一语音信号的强度弱。其中,第一阈值可以基于历史值得到,也可以基于经验值得到,还可以基于实验数据得到,本申请实施例不做限制。
步骤304:免唤醒判断模块将加速度数据通过免唤醒一级判断模块进行处理,得到第二判断结果。
具体地，电子设备在接收到第一语音信号后，可以将加速度数据发送给免唤醒判断模块。免唤醒判断模块在接收到加速度数据后，可以将加速度数据通过免唤醒一级判断模块进行处理，得到第二判断结果。其中，加速度数据可以通过内置在电子设备中的加速度传感器得到，位姿信息可以包括加速度传感器在x轴上的加速度的方差，在y轴上的加速度的方差，在z轴上的加速度的方差。然后，电子设备基于这三个坐标轴对应的加速度的方差判断电子设备是否处于运动过程，从而得到第二判断结果。其中，第二判断结果包括第三标识和第四标识，第三标识用于指示电子设备处于运动状态，第四标识用于指示电子设备处于静止状态。
示例性的，电子设备可以基于上述三个坐标轴对应的加速度的方差判断电子设备是否处于运动过程，得到第二判断结果的方式可以为：电子设备可以为这三个坐标轴分别设置方差阈值，分别为：第一方差阈值D1、第二方差阈值D2、第三方差阈值D3。D1与x轴对应，D2与y轴对应，D3与z轴对应。第一方差阈值、第二方差阈值以及第三方差阈值可以相同，也可以不相同，可以根据历史值得到，也可以根据经验值得到，还可以根据实验数据得到，本申请实施例不做限制。若在这三个坐标轴对应的加速度的方差中，只要存在一个加速度的方差大于或等于对应的方差阈值，则判断电子设备处于运动状态，第二判断结果包括第三标识。例如，若x轴对应的加速度的方差大于或等于D1，则判断电子设备处于运动状态。若这三个坐标轴对应的加速度的方差均小于对应的方差阈值，则判断电子设备不处于运动状态。
在一种可能实现的方式中,若在这三个坐标轴对应的加速度的方差中,只要存在两个加速度的方差大于或等于对应的方差阈值,则判断电子设备处于运动状态。例如,若x轴对应的加速度的方差大于或等于D1且y轴对应的加速度的方差大于或等于D2,则判断电子设备处于运动状态。若在这三个坐标轴对应的加速度的方差中,仅存在一个加速度的方差大于或等于对应的方差阈值,或者这三个坐标轴对应的加速度的方差均小于对应的方差阈值,则判断电子设备不处于运动状态。
在一种可能实现的方式中,在这三个坐标轴对应的三个加速度的方差中,全部的方差均大于或等于对应的方差阈值,则判断电子设备处于运动状态。反之,则判断电子设备不处于运动状态。
应当理解的是,步骤303可以在步骤304之前执行,也可以在步骤304之后执行,还可以和步骤304同时执行,本申请实施例对于步骤304和步骤303的执行顺序不做限制。
步骤305：免唤醒一级判断模块根据第一判断结果和第二判断结果判断是否对第一语音信号进行语音检测。
具体地,免唤醒一级判断模块在计算出第一判断结果和第二判断结果后,电子设备可以根据第一判断结果和第二判断结果判断是否要对第一语音信号进行语音检测,即:检测第一语音信号是否为唤醒电子设备语音助手的目标语音指令。若判断出要对第一语音信号 进行语音检测,电子设备执行步骤306,若判断出不对第一语音信号进行语音检测,电子设备结束流程。
电子设备判断是否对第一语音信号进行语音检测的方法可以为:若在第一判断结果中包括第一标识且在第二判断结果中包括第三标识的情况下,电子设备确定对第一语音信号进行语音检测。反之,电子设备确定不对第一语音信号进行语音检测。
示例性的,假设第一标识和第三标识为1,第二标识和第四标识为0,电子设备可以将第一判断结果中的标识和第二判断结果中的标识进行“逻辑与”运算,若运算结果为1,则电子设备确定对第一语音信号进行语音检测,若运算结果为0,则电子设备确定不对第一语音信号进行语音检测。
电子设备基于第一语音信号的信号强度以及加速度传感器的加速度方差，可以过滤掉大部分不是用户意图的场景。例如，过滤掉距离电子设备麦克风较远的场景（电子设备接收的语音信号强度弱），或者用户边玩电子设备边聊天（加速度传感器的加速度数据的方差较小）的场景等。对于属于用户意图的场景，电子设备对其接收的语音信号进行语音检测，以便对该语音信号是否为唤醒语音助手的指令，进行更加精确地判断。对于不属于用户意图的场景，电子设备不对其接收的语音信号进行语音检测，结束流程。由于对语音信号进行语音检测会消耗大量的计算资源。因此，在对接收的语音信号进行语音检测之前，电子设备判断第一语音信号是否满足语音检测的条件，可以大大节约电子设备的计算资源，从而提升电子设备的工作性能。
步骤306:免唤醒一级判断模块将第一语音信号发送给免唤醒二级判断模块。
具体地,免唤醒一级判断模块在确定对第一语音信号进行语音检测后,免唤醒一级判断模块将第一语音信号发送给免唤醒二级判断模块,以便免唤醒二级判断模块对第一语音信号进行语音检测。
步骤307:免唤醒二级模块获取第一语音信号的语音信号数据。
具体地，免唤醒二级判断模块在接收到免唤醒一级判断模块发送的第一语音信号后，会对第一语音信号进行处理，从而得到第一语音信号的语音信号数据。
其中,语音信号数据可以包括第一语音信号的梅尔倒谱系数、第一麦克风接收的语音信号与第二麦克风接收第一语音信号的能量差M。M用于表征音源(第一语音信号的声源)与电子设备的距离。M越大,代表音源与电子设备的距离越小;M越小,代表音源与电子设备的距离越大。电子设备可以设置能量阈值H,当M大于或等于H时,可以认为音源离电子设备较近(例如,40cm以内),当M小于H时,可以认为音源离电子设备较远(例如,40cm以外)。梅尔倒谱系数为符合人耳听觉特性的语音信号特征,更多捕捉语音信号在低频的细节特征,此外,用户近距离在对电子设备说话时,在低频会有Pop音。因此,梅尔倒谱系数作为语音检测模型的输入,可以利于语音检测模型提取第一语音信号在低频频域的语音参数。
步骤308:免唤醒二级判断模块将所述语音信号数据通过语音检测模型进行处理,得到第一置信度、第一语音数据和第二语音数据。
具体地,免唤醒二级判断模块可以将语音信号数据通过语音检测模型进行处理,得到第一置信度、第一语音数据和第二语音数据。语音检测模型可以为训练好的卷积神经网络, 在该卷积神经网络中可以包括卷积层,还可以包括全连接层。
免唤醒二级判断模块将语音信号数据通过语音检测模型进行处理,语音检测模型中的卷积层先对语音信号进行处理,得到并输出第一语音数据,第一语音数据包括梅尔倒谱系数的高阶特征信息以及M的高阶特征信息。然后,语音检测模型的全连接层对卷积层处理后的语音信号数据进行处理,得到第一置信度和第二语音数据。其中,第二语音数据包括梅尔倒谱系数的高阶特征信息以及M的高阶特征信息,第一置信度用于表征第一语音信号为用户对电子设备发送的语音指令的概率。
步骤309:免唤醒二级判断模块将位姿信息通过位姿检测模型进行处理,得到第二置信度、第一目标位姿信息、第二目标位姿信息。
可选地，免唤醒二级判断模块将位姿信息通过位姿检测模型进行处理之前，可以从加速度传感器获取加速度数据，该加速度数据包括电子设备在x轴上的加速度数据、在y轴上的加速度数据，在z轴上的加速度数据。然后，根据电子设备在这三个坐标轴上的加速度数据，计算得到电子设备的位姿信息。其中，电子设备的位姿信息包括x轴、y轴、z轴这三个坐标轴对应的加速度数据的绝对值，也可以包括x轴对应的加速度数据的方差d1、y轴对应的加速度数据的方差d2、z轴对应的加速度数据的方差d3，也可以包括x轴对应的加速度数据的均值p1、y轴对应的加速度数据的均值p2、z轴对应的加速度数据的均值p3，还可以包括d1与p1的差分值、d2与p2的差分值、d3与p3的差分值。
免唤醒二级判断模块在得到位姿信息后,可以将该位姿信息通过位姿检测模型进行检测,从而判断电子设备当前是否处于手持抬起状态,还可以判断电子设备在手持抬起状态下晃动的幅度等数据。其中,手持抬起状态可以理解为用户将电子设备拿在手上。电子设备可以结合第一置信度和位姿信息匹配当前的应用场景,并根据应用场景确定第一语音信号是否为唤醒语音助手的语音指令。电子设备可以将位姿信息通过位姿检测模型进行处理,得到第二置信度、第一目标位姿信息和第二目标位姿信息。
位姿检测模型可以为训练好的卷积神经网络模型，在该卷积神经网络模型中可以包括卷积层，也可以包括全连接层。由于在x轴、y轴、z轴这三个坐标轴对应的加速度数据的绝对值、d1、d2以及d3可以表征电子设备是否处于运动状态，p1、p2以及p3可以表征电子设备运动的幅度，d1与p1的差分值、d2与p2的差分值、d3与p3的差分值可以从运动的平稳度等其它维度来表征电子设备的运动状态。因此，位姿检测模型通过上述位姿数据，可以基于电子设备是否运动、运动幅度以及运动的平稳度等多个方面综合判断电子设备是否处于手持抬起状态，提高位姿检测模型判断的准确率。
位姿检测模型中的卷积层可以先对位姿信息进行处理，并输出第一目标位姿信息，第一目标位姿信息包括位姿信息的高阶特征信息。然后，位姿检测模型的全连接层对卷积层处理后的位姿信息进行处理，得到第二置信度和第二目标位姿信息。其中，第二目标位姿信息包括位姿信息的高阶特征信息，第二置信度用于表征电子设备处于手持抬起状态的概率。
应当理解的是,步骤308可以在步骤309之前执行,步骤308也可以在步骤309之后执行,步骤308可以和步骤309同时执行,本申请实施例对步骤308和步骤309的执行顺序不做限制。
步骤310:免唤醒二级判断模块将第一音频数据、第二音频数据、第一目标位姿信息、 第二目标位姿信息通过音频-位姿检测融合模型进行处理,得到第三置信度。
具体地,音频-位姿检测融合模型可以为训练好的卷积神经网络模型,该神经网络模型用于检测电子设备接收的第一语音信号为语音指令且电子设备当前处于手持抬起状态的概率。电子设备将第一音频数据、第二音频数据、第一目标位姿信息、第二目标位姿信息通过音频-位姿检测融合模型进行处理后,得到第三置信度。第三置信度用于表征第一语音信号为语音指令且电子设备当前处于手持抬起状态的概率,即:表征电子设备的位姿状态和电子设备的接收的语音信号的匹配程度。第三置信度越高,第一语音信号为语音指令且电子设备当前处于手持抬起状态的概率就越高,即电子设备存在语音输入与电子设备处于手持抬起状态的实时相关性就越高。
步骤311:免唤醒二级判断模块根据第一置信度、第二置信度以及第三置信度判断第一语音信号是否为目标语音指令。
具体地,目标语音指令为唤醒电子设备的语音助手的指令。若电子设备判断第一语音信号为目标语音指令,执行步骤312,否则,结束流程。
电子设备根据第一置信度、第二置信度以及第三置信度判断第一语音信号是否为目标语音指令主要有以下两种方法:
第一种方法：基于第一置信度确定第一置信标识，基于第二置信度确定第二置信标识，基于第三置信度确定第三置信标识。当第一置信度大于或等于第一置信阈值时，第一置信标识为1，当第一置信度小于第一置信阈值时，第一置信标识为0。当第二置信度大于或等于第二置信阈值时，第二置信标识为1，当第二置信度小于第二置信阈值时，第二置信标识为0。当第三置信度大于或等于第三置信阈值时，第三置信标识为1，当第三置信度小于第三置信阈值时，第三置信标识为0。然后，电子设备将第一置信标识、第二置信标识以及第三置信标识进行“逻辑与（&）”运算，得到第二判决结果。若第二判决结果为1，则电子设备判断第一语音信号为目标语音指令，若第二判决结果为0，则电子设备判断第一语音信号不为目标语音指令。其中，第一置信阈值、第二置信阈值以及第三置信阈值可以由历史值得到，也可以由经验值得到，还可以由实验数据得到，本申请实施例不做限制。优选的，第一置信阈值、第二置信阈值以及第三置信阈值可以为50%。
第二种方法:电子设备可以通过公式来确定第一置信度、第二置信度以及第三置信度的权重值。然后,基于这三个置信度的权重值,对这三个置信度进行融合计算,得到融合后的置信度,再基于融合后的置信度判断第一语音信号是否为目标语音指令。
示例性的,电子设备可以通过公式(1)计算第一置信度的权重值,公式(1)如下所示:
其中,fm为语音检测模型本次输出的第一置信度,k为与本次语音检测模型输出的第一置信度相邻的前Q个第一置信度的编号。例如,当k=1时,fk为语音检测模型上一次输出的第一置信度;k=2时,fk为语音检测模型上上次输出的第一置信度……以此类推。abs为绝对值函数。
电子设备可以通过公式(2)计算第二置信度的权重值,公式(2)如下所示:
其中,Lm为位姿检测模型本次输出的第二置信度,k为与本次位姿检测模型输出的第二置信度相邻的前Q个第二置信度的编号。例如,当k=1时,fk为位姿检测模型上一次输出的第二置信度;k=2时,fk为位姿检测模型上上次输出的第二置信度……以此类推。abs为绝对值函数。
电子设备可以通过公式(3)计算第三置信度的权重值,公式(3)如下所示:
W3=1-W1-W2   (3)
然后,电子设备可以根据公式(4)计算融合后的置信度K,公式(4)如下所示:
K=fm·W1+Lm·W2+Rm·W3   (4)
其中,K为融合后的置信度,Rm为音频-位姿检测融合模型本次输出的第三置信度。在计算出K之后,电子设备判断K是否大于或等于第一启动阈值,若大于第一启动阈值,电子设备判断第一语音信号为目标语音指令;反之,电子设备判断第一语音信号不为目标语音指令。优选的,第一启动阈值可以为60%。
由于第一置信度是由语音检测模型计算得到的,第二置信度是由位姿检测模型计算得到的,第三置信度是由音频-位姿检测融合模型计算得到的。通过第一置信度可以排除仅有手持抬起状态的应用场景,通过第二置信度可以排除仅有语音输入的应用场景,第三置信度融合了语音信息数据和位姿信息的高维特征,可以表征电子设备语音输入和位姿状态的实时相关性。因此,通过上述第一置信度、第二置信度以及第三置信度判断第一语音信号是否为目标语音指令,得到的判断结果更加准确。
在一种可能实现的方式中，在通过上述第二种方法判断第一语音信号不为目标语音指令的情况下，电子设备还可以基于计算的、融合后的置信度判断是否显示提示信息。若K小于第一启动阈值且大于或等于第二启动阈值（第二启动阈值小于第一启动阈值），电子设备可以显示如图4所示提示界面，用于提示用户发送语音时出现的问题（例如，声音太小）。这样，以便用户在未唤醒语音助手的情况下，知道问题在哪儿，并及时改进。其中，第一启动阈值和第二启动阈值可以基于历史值得到，也可以基于经验值得到，还可以基于实验数据得到，本申请实施例不做限制。优选的，第二启动阈值可以为50%。
步骤312:免唤醒二级判断模块将第一语音信号发送给语音助手模块。
步骤313:语音助手模块解析第一语音信号,并根据第一语音信号进行第一操作。
具体地,免唤醒二级判断模块再将第一语音信号发送给语音助手模块后,语音助手模块接收并解析第一语音信号,从而获取操作指令,并根据操作指令进行第一操作。
示例性的,用户向电子设备发送的语音为“打开相机应用,我要拍照”,语音助手模块解析该语音对应的第一语音信号,可以提取出“打开相机应用”的指令。因此,语音助手模块可以根据该指令启动相机应用,语音助手模块启动相机应用的操作就是第一操作。
在本申请实施例中，电子设备在接收到一段语音信号后，首先通过免唤醒一级判断模块判断该语音信号是否需要进行语音检测，对于不需要进行语音检测的语音信号就结束流程，可以不再对该语音信号进行处理，通过免唤醒一级判断模块对语音信号进行判断，过滤掉了大部分非用户意图的场景，从而避免了电子设备中的语音助手被误唤醒，也节约了电子设备的计算资源。若判断语音信号需要进行语音检测，电子设备将语音信号的语音信号数据通过语音检测模型进行处理，将位姿信息通过位姿检测模块进行处理，将位姿检测模块和语音检测模型输出的高阶特征数据通过音频-位姿检测融合模型进行处理，这三个模型分别输出三个置信度，再基于这三个置信度判断其接收的语音信号是否为唤醒语音助手的目标语音指令。若是，则唤醒语音助手，若不是，则不唤醒语音助手。由于第一置信度是由语音检测模型计算得到的，第二置信度是由位姿检测模型计算得到的，第三置信度是由音频-位姿检测融合模型计算得到的。通过第一置信度可以排除仅有手持抬起状态的应用场景，通过第二置信度可以排除仅有语音输入的应用场景，第三置信度融合了语音信息数据和位姿信息的高维特征，可以表征电子设备语音输入和位姿状态的实时相关性。因此，通过上述第一置信度、第二置信度以及第三置信度判断第一语音信号是否为目标语音指令，得到的判断结果更加准确，可以降低语音助手被误唤醒的概率，提高了用户体验。
在上述图3实施例中，对本申请实施例提供的一种语音交互方法的流程进行了介绍。下面，结合附图，介绍本申请实施例提供的另一种语音交互方法。在该方法中，免唤醒判断模块在确定第一语音信号为目标语音指令后，免唤醒判断模块将第一语音信号发送给声纹验证模块。在声纹验证模块判断第一语音信号为用户本人发出的语音信号后，才将第一语音信号发送给语音助手模块。通过这种方法，只有电子设备的用户才能唤醒语音助手，在降低语音助手被误触发概率的前提下，保障了用户的隐私性和安全性。
下面,结合图5A,对本申请实施例提出的另一种语音交互方法进行介绍。请参见图5A,图5A是本申请实施例提供的另一种语音交互方法流程图,具体流程如下:
步骤501:电子设备接收第一语音信号。
步骤502:电子设备将第一语音信号发送给免唤醒判断模块。
步骤503:免唤醒判断模块将第一语音信号通过免唤醒一级判断模块进行处理,得到第一判断结果。
步骤504：免唤醒判断模块将加速度数据通过免唤醒一级判断模块进行处理，得到第二判断结果。
步骤505：免唤醒一级判断模块根据第一判断结果和第二判断结果判断是否对第一语音信号进行语音检测。
步骤506:免唤醒一级判断模块将第一语音信号发送给免唤醒二级判断模块。
步骤507：免唤醒二级判断模块获取第一语音信号的语音信号数据。
步骤508:免唤醒二级判断模块将所述语音信号数据通过语音检测模型进行处理,得到第一置信度、第一语音数据和第二语音数据。
步骤509:免唤醒二级判断模块将位姿信息通过位姿检测模型进行处理,得到第二置信度、第一目标位姿信息、第二目标位姿信息。
步骤510:免唤醒二级判断模块将第一音频数据、第二音频数据、第一目标位姿信息、第二目标位姿信息通过音频-位姿检测融合模型进行处理,得到第三置信度。
步骤511:免唤醒二级判断模块根据第一置信度、第二置信度以及第三置信度判断第一语音信号是否为目标语音指令。
若为是,执行步骤512,若为否,结束流程。
步骤501-步骤511可以参见上述图3实施例中的步骤301-步骤311,在此不再赘述。
步骤512:免唤醒二级判断模块将第一语音信号发送给声纹验证模块。
步骤513:声纹验证模块验证第一语音信号是否为所述电子设备的用户发出的语音信号。
具体地，声纹验证模块可以为一个已训练好的神经网络模型。如图5B所示，用户可以按照电子设备的提示输入注册语音，例如，对电子设备说出“我今天真好看”、“播放今天的新闻”等。电子设备可以根据用户输入的注册语音提取语音特征信息（例如，语音信号的频率、声音的响度、声音的音调、音色等），并将提取出的语音特征信息作为声学模型的输入。声学模型对语音特征信息进行处理，输出用户的声纹特征信息，并将声纹特征信息作为后端判决模块的输入，后端判决模块对声纹特征信息进行处理，输出差异函数。所述差异函数用于衡量声学模型输出的声纹特征信息与用户真实的声纹特征信息的差异程度，差异函数越大，差异程度越大，差异函数越小，差异程度越小。然后，电子设备根据差异函数调整声学模型的网络结构或参数，从而使得声学模型输出的声纹特征信息无限接近用户的声纹特征信息。其中，声纹特征信息用于表征用户声音的要素，可以包括用户声音的音调、音色，还可以包括用户声音的响度等。
当声纹验证模块接收到第一语音信号（输入语音）后，声纹验证模块可以提取第一语音信号中的语音特征信息，并将语音特征信息作为声学模型的输入。声学模型对语音特征信息进行处理，输出第一语音信号对应的声纹特征信息，并将该声纹特征信息作为后端判决模块的输入，后端判决模块判断声纹特征信息是否与用户的声纹特征信息一致，若一致，执行步骤514，若不一致，结束流程。
步骤514:声纹验证模块将第一语音信号发送给语音助手模块。
步骤515:语音助手模块解析第一语音信号,并根据第一语音信号进行第一操作。
步骤515可以参见上述图3实施例中的步骤313,在此不再赘述。
需要说明的是，对于上述方法实施例，为了简单描述，故将其都表述为一系列的动作组合，但本领域技术人员应当知悉，本发明并不受所描述的动作顺序的限制。其次，本领域技术人员也应当知悉，说明书中所述的实施例均属优选实施例，所涉及的动作并不一定是本发明所必须的。
下面对电子设备100的结构进行介绍。请参阅图6,图6是本申请实施例提供的电子设备100的硬件结构示意图。
电子设备100可以包括处理器110,外部存储器接口120,内部存储器121,通用串行总线(universal serial bus,USB)接口130,充电管理模块140,电源管理模块141,电池142,天线1,天线2,移动通信模块150,无线通信模块160,音频模块170,扬声器170A,受话器170B,麦克风170C,耳机接口170D,传感器模块180,按键190,马达191,指示器192,摄像头193,显示屏194,以及用户标识模块(subscriber identification module,SIM)卡接口195等。其中传感器模块180可以包括压力传感器180A,陀螺仪传感器180B,气压传感器180C,磁传感器180D,加速度传感器180E,距离传感器180F,接近光传感器 180G,指纹传感器180H,温度传感器180J,触摸传感器180K,环境光传感器180L,骨传导传感器180M等。
可以理解的是,本发明实施例示意的结构并不构成对电子设备100的具体限定。在本申请另一些实施例中,电子设备100可以包括比图6示更多或更少的部件,或者组合某些部件,或者拆分某些部件,或者不同的部件布置。图6示的部件可以以硬件,软件或软件和硬件的组合实现。
处理器110可以包括一个或多个处理单元,例如:处理器110可以包括应用处理器(application processor,AP),调制解调处理器,图形处理器(graphics processing unit,GPU),图像信号处理器(image signal processor,ISP),控制器,存储器,视频编解码器,数字信号处理器(digital signal processor,DSP),基带处理器,和/或神经网络处理器(neural-network processing unit,NPU)等。其中,不同的处理单元可以是独立的器件,也可以集成在一个或多个处理器中。
电子设备100的无线通信功能可以通过天线1,天线2,移动通信模块150,无线通信模块160,调制解调处理器以及基带处理器等实现。
天线1和天线2用于发射和接收电磁波信号。电子设备100中的每个天线可用于覆盖单个或多个通信频带。不同的天线还可以复用,以提高天线的利用率。例如:可以将天线1复用为无线局域网的分集天线。在另外一些实施例中,天线可以和调谐开关结合使用。
移动通信模块150可以提供应用在电子设备100上的包括2G/3G/4G/5G等无线通信的解决方案。移动通信模块150可以包括至少一个滤波器,开关,功率放大器,低噪声放大器(low noise amplifier,LNA)等。移动通信模块150可以由天线1接收电磁波,并对接收的电磁波进行滤波,放大等处理,传送至调制解调处理器进行解调。移动通信模块150还可以对经调制解调处理器调制后的信号放大,经天线1转为电磁波辐射出去。在一些实施例中,移动通信模块150的至少部分功能模块可以被设置于处理器110中。在一些实施例中,移动通信模块150的至少部分功能模块可以与处理器110的至少部分模块被设置在同一个器件中。
无线通信模块160可以提供应用在电子设备100上的包括无线局域网(wireless local area networks,WLAN)(如Wi-Fi网络),蓝牙(BlueTooth,BT),BLE广播,全球导航卫星系统(global navigation satellite system,GNSS),调频(frequency modulation,FM),近距离无线通信技术(near field communication,NFC),红外技术(infrared,IR)等无线通信的解决方案。无线通信模块160可以是集成至少一个通信处理模块的一个或多个器件。无线通信模块160经由天线2接收电磁波,将电磁波信号调频以及滤波处理,将处理后的信号发送到处理器110。无线通信模块160还可以从处理器110接收待发送的信号,对其进行调频,放大,经天线2转为电磁波辐射出去。
电子设备100通过GPU,显示屏194,以及应用处理器等实现显示功能。GPU为图像处理的微处理器,连接显示屏194和应用处理器。GPU用于执行数学和几何计算,用于图形渲染。处理器110可包括一个或多个GPU,其执行程序指令以生成或改变显示信息。
显示屏194用于显示图像，视频等。显示屏194包括显示面板。显示面板可以采用液晶显示屏（liquid crystal display，LCD），有机发光二极管（organic light-emitting diode，OLED），有源矩阵有机发光二极体或主动矩阵有机发光二极体（active-matrix organic light emitting diode，AMOLED），柔性发光二极管（flex light-emitting diode，FLED），Miniled，MicroLed，Micro-oLed，量子点发光二极管（quantum dot light emitting diodes，QLED）等。在一些实施例中，电子设备100可以包括1个或N个显示屏194，N为大于1的正整数。
电子设备100可以通过ISP,摄像头193,视频编解码器,GPU,显示屏194以及应用处理器等实现拍摄功能。
ISP用于处理摄像头193反馈的数据。例如,拍照时,打开快门,光线通过镜头被传递到摄像头感光元件上,光信号转换为电信号,摄像头感光元件将所述电信号传递给ISP处理,转化为肉眼可见的图像。ISP还可以对图像的噪点,亮度,肤色进行算法优化。ISP还可以对拍摄场景的曝光,色温等参数优化。在一些实施例中,ISP可以设置在摄像头193中。
数字信号处理器用于处理数字信号,除了可以处理数字图像信号,还可以处理其他数字信号。例如,当电子设备100在频点选择时,数字信号处理器用于对频点能量进行傅里叶变换等。
NPU为神经网络(neural-network,NN)计算处理器,通过借鉴生物神经网络结构,例如借鉴人脑神经元之间传递模式,对输入信息快速处理,还可以不断的自学习。通过NPU可以实现电子设备100的智能认知等应用,例如:图像识别,人脸识别,语音识别,文本理解等。
电子设备100可以通过音频模块170,扬声器170A,受话器170B,麦克风170C,耳机接口170D,以及应用处理器等实现音频功能。例如音乐播放,录音等。
音频模块170用于将数字音频信息转换成模拟音频信号输出,也用于将模拟音频输入转换为数字音频信号。音频模块170还可以用于对音频信号编码和解码。在一些实施例中,音频模块170可以设置于处理器110中,或将音频模块170的部分功能模块设置于处理器110中。
扬声器170A,也称“喇叭”,用于将音频电信号转换为声音信号。电子设备100可以通过扬声器170A收听音乐,或收听免提通话。
受话器170B,也称“听筒”,用于将音频电信号转换成声音信号。当电子设备100接听电话或语音信息时,可以通过将受话器170B靠近人耳接听语音。
麦克风170C,也称“话筒”,“传声器”,用于将声音信号转换为电信号。当拨打电话或发送语音信息时,用户可以通过人嘴靠近麦克风170C发声,将声音信号输入到麦克风170C。电子设备100可以设置至少一个麦克风170C。在另一些实施例中,电子设备100可以设置两个麦克风170C,除了采集声音信号,还可以实现降噪功能。在另一些实施例中,电子设备100还可以设置三个,四个或更多麦克风170C,实现采集声音信号、降噪、还可以识别声音来源,实现定向录音功能等。
压力传感器180A用于感受压力信号,可以将压力信号转换成电信号。在一些实施例中,压力传感器180A可以设置于显示屏194。
气压传感器180C用于测量气压。在一些实施例中,电子设备100通过气压传感器180C测得的气压值计算海拔高度,辅助定位和导航。
磁传感器180D包括霍尔传感器。电子设备100可以利用磁传感器180D检测翻盖皮套的开合。
加速度传感器180E可检测电子设备100在各个方向上(一般为三轴)加速度的大小。当电子设备100静止时可检测出重力的大小及方向。还可以用于识别电子设备姿态,应用于横竖屏切换,计步器等应用。
指纹传感器180H用于采集指纹。电子设备100可以利用采集的指纹特性实现指纹解锁,访问应用锁,指纹拍照,指纹接听来电等。
触摸传感器180K,也称“触控面板”。触摸传感器180K可以设置于显示屏194,由触摸传感器180K与显示屏194组成触摸屏,也称“触控屏”。触摸传感器180K用于检测作用于其上或附近的触摸操作。触摸传感器可以将检测到的触摸操作传递给应用处理器,以确定触摸事件类型。可以通过显示屏194提供与触摸操作相关的视觉输出。在另一些实施例中,触摸传感器180K也可以设置于电子设备100的表面,与显示屏194所处的位置不同。
骨传导传感器180M可以获取振动信号。在一些实施例中,骨传导传感器180M可以获取人体声部振动骨块的振动信号。
电子设备100的软件系统可以采用分层架构,事件驱动架构,微核架构,微服务架构,或云架构。本发明实施例以分层架构的Android系统为例,示例性说明电子设备100的软件结构。图7是本申请实施例的电子设备100的软件结构框图。分层架构将软件分成若干个层,每一层都有清晰的角色和分工。层与层之间通过软件接口通信。在一些实施例中,将Android系统分为四层,从上至下分别为应用程序层,应用程序框架层,硬件抽象层(HAL层),内核层、以及数字信号处理层。
应用程序层可以包括一系列应用程序包。如图7所示,应用程序包可以包括相机,图库,日历,通话,地图,导航,WLAN,蓝牙,语音助手模块,视频等应用程序。
语音助手模块用于解析用户的语音指令，并根据用户的语音指令进行相关操作，从而实现电子设备与用户之间的语音交互。
应用程序框架层为应用程序层的应用程序提供应用编程接口(application programming interface,API)和编程框架。应用程序框架层包括一些预先定义的函数。如图7所示,应用程序框架层可以包括窗口管理器,内容提供器,视图系统,电话管理器,资源管理器,通知管理器等。
窗口管理器用于管理窗口程序。窗口管理器可以获取显示屏大小,判断是否有状态栏,锁定屏幕,截取屏幕等。
内容提供器用来存放和获取数据,并使这些数据可以被应用程序访问。所述数据可以包括视频,图像,音频,拨打和接听的电话,浏览历史和书签,电话簿等。
视图系统包括可视控件,例如显示文字的控件,显示图片的控件等。视图系统可用于构建应用程序。显示界面可以由一个或多个视图组成的。例如,包括短信通知图标的显示界面,可以包括显示文字的视图以及显示图片的视图。
电话管理器用于提供电子设备100的通信功能。例如通话状态的管理(包括接通,挂断等)。
资源管理器为应用程序提供各种资源,比如本地化字符串,图标,图片,布局文件,视频文件等等。
通知管理器使应用程序可以在状态栏中显示通知信息,可以用于传达告知类型的消息,可以短暂停留后自动消失,无需用户交互。比如通知管理器被用于告知下载完成,消息提醒等。通知管理器还可以是以图表或者滚动条文本形式出现在系统顶部状态栏的通知,例如后台运行的应用程序的通知,还可以是以对话窗口形式出现在屏幕上的通知。例如在状态栏提示文本信息,发出提示音,电子设备振动,指示灯闪烁等。
硬件抽象层包括声纹验证模块,声纹验证模块用于判断接收的语音信号是否为用户发出的语音信号。
内核层是硬件和软件之间的层。内核层至少包含显示驱动,摄像头驱动,音频驱动,传感器驱动。
数字信号处理层包括免唤醒判断模块，免唤醒判断模块用于判断接收的语音信号是否为要唤醒电子设备中语音助手的语音信号。
需要说明的是,对于上述方法实施例,为了简单描述,故将其都表述为一系列的动作组合,但本领域技术人员应当知悉,本发明并不受所描述的动作顺序的限制。其次,本领域技术人员也应当知悉,说明书中所述的实施例均属优选实施例,所涉及的动作并不一定是本发明所必须的。本申请的实施方式可以任意进行组合,以实现不同的技术效果。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机程序指令时,全部或部分地产生按照本申请所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如同轴电缆、光纤、数字用户线)或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质,(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质(例如固态硬盘Solid State Disk)等。
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,可以由计算机程序来指令相关的硬件完成,该程序可存储于计算机可读取存储介质中,该程序在执行时,可包括如上述各方法实施例的流程。而前述的存储介质包括:ROM或随机存储记忆体RAM、磁碟或者光盘等各种可存储程序代码的介质。
总之,以上所述仅为本发明技术方案的实施例,并非用于限定本发明的保护范围。凡根据本发明的揭露,所作的任何修改、等同替换、改进等,均应包含在本发明的保护范围之内。

Claims (14)

  1. 一种语音交互方法,其特征在于,应用于电子设备,所述电子设备包括语音交互应用,所述方法包括:
    接收第一语音信号;
    在确定所述第一语音信号要进行语音检测的情况下,基于所述第一语音信号得到语音信号数据;
    将所述语音信号数据通过语音检测模型处理,得到第一置信度和语音数据,所述第一置信度用于表征所述第一语音信号为用户发送给所述电子设备的语音指令的概率;
    获取所述电子设备的加速度数据,并基于所述加速度数据得到所述电子设备的位姿信息;
    将所述位姿信息通过位姿检测模型进行处理,得到第二置信度和目标位姿信息,所述第二置信度用于表征所述电子设备处于手持抬起状态的概率;
    将所述目标位姿信息和所述语音数据通过音频-位姿检测融合模型进行处理,得到第三置信度,所述第三置信度用于表征所述电子设备处于手持抬起状态且所述第一语音信号为用户发送给所述电子设备的语音指令的概率;
    基于所述第一置信度、所述第二置信度和所述第三置信度判断是否启动所述语音交互应用。
  2. 如权利要求1所述的方法,其特征在于,所述基于所述第一置信度、所述第二置信度和所述第三置信度判断是否启动所述语音交互应用,具体包括:
    在所述第一置信度大于或等于第一置信阈值的情况下,将第一置信标识设置为1;
    在所述第一置信度小于第一置信阈值的情况下,将所述第一置信标识设置为0;
    在所述第二置信度大于或等于第二置信阈值的情况下,将第二置信标识设置为1;
    在所述第二置信度小于第二置信阈值的情况下,将所述第二置信标识设置为0;
    在所述第三置信度大于或等于第三置信阈值的情况下,将第三置信标识设置为1;
    在所述第三置信度小于第三置信阈值的情况下,将所述第三置信标识设置为0;
    将所述第一置信标识、第二置信标识以及第三置信标识进行逻辑与运算,得到判决结果;
    根据所述判决结果判断是否启动所述语音交互应用。
  3. 如权利要求2所述的方法,其特征在于,所述根据所述判决结果判断是否启动所述语音交互应用,具体包括:
    在所述判决结果为1的情况下,启动所述语音交互应用;
    在所述判决结果为0的情况下,不启动所述语音交互应用。
  4. 如权利要求2所述的方法,其特征在于,所述电子设备还包括声纹检测模块,所述根据所述判决结果判断是否启动所述语音交互应用,具体包括:
    在所述判决结果为0的情况下,不启动所述语音交互应用;
    在所述判决结果为1的情况下,将所述第一语音信号通过声纹检测模块检测是否为目标用户的声音,所述目标用户为所述电子设备的用户;
    若判断为是,启动所述语音交互应用;
    若判断为否,不启动所述语音交互应用。
  5. 如权利要求1所述的方法,其特征在于,所述基于所述第一置信度、所述第二置信度和所述第三置信度判断是否启动所述语音交互应用,具体包括:
    计算所述第一置信度的第一权重值、所述第二置信度的第二权重值、所述第三置信度的第三权重值;
    基于所述第一置信度、所述第一权重值、所述第二置信度、所述第二权重值、所述第三置信度、所述第三权重值,计算得到融合后的置信度;
    基于所述融合后的置信度判断是否启动所述语音交互应用。
  6. 如权利要求5所述的方法,其特征在于,所述计算所述第一置信度的第一权重值、所述第二置信度的第二权重值、所述第三置信度的第三权重值,具体包括:
    根据公式计算所述第一权重值,所述W1为所述第一权重值,所述abs为绝对值函数,所述fm为所述语音检测模型本次输出的第一置信度,所述k为与本次输出的第一置信度最相邻的前Q个第一置信度的编号;
    根据公式计算所述第二权重值,所述W2为所述第二权重值,所述Lm为所述位姿检测模型本次输出的第二置信度,所述k为与本次输出的第二置信度最相邻的前Q个第二置信度的编号;
    根据公式W3=1-W1-W2计算所述第三权重值,所述W3为所述第三权重值。
  7. 如权利要求6所述的方法,其特征在于,所述基于所述第一置信度、所述第一权重值、所述第二置信度、所述第二权重值、所述第三置信度、所述第三权重值,计算得到融合后的置信度,具体包括:
    根据公式K=fm·W1+Lm·W2+Rm·W3计算所述融合后的置信度;
    其中,所述K为所述融合后的置信度,所述Rm为所述第三置信度。
  8. 如权利要求5-7任一项所述的方法,其特征在于,所述基于所述融合后的置信度判断是否启动所述语音交互应用,具体包括:
    若所述融合后的置信度大于或等于第一启动阈值,启动所述语音交互应用;
    若所述融合后的置信度小于第一启动阈值,不启动所述语音交互应用。
9. 如权利要求8所述的方法，其特征在于，所述电子设备包括显示屏，若所述融合后的置信度小于第一启动阈值，且大于或等于第二启动阈值，在所述显示屏上显示提示信息，所述提示信息用于指示用户再次发出语音指令；所述第二启动阈值小于所述第一启动阈值。
  10. 如权利要求5-7任一项所述的方法,其特征在于,所述电子设备还包括声纹检测模块,所述基于所述融合后的置信度判断是否启动所述语音交互应用,具体包括:
    若所述融合后的置信度小于第一启动阈值,不启动所述语音交互应用;
    若所述融合后的置信度大于或等于第一启动阈值,将所述第一语音信号通过声纹检测模块检测是否为目标用户的声音,所述目标用户为所述电子设备的用户;
    若判断为是,启动所述语音交互应用;
    若判断为否,不启动所述语音交互应用。
  11. 如权利要求1-10任一项所述的方法,其特征在于,所述基于所述第一语音信号得到语音信号数据之前,还包括:
    获取所述语音信号的信号强度值、所述电子设备在x轴上的加速度方差D1、所述电子设备在y轴上的加速度方差D2、所述电子设备在z轴上的加速度方差D3;
    基于所述信号强度值、所述D1、所述D2以及所述D3判断所述第一语音信号是否需要进行语音检测。
  12. 如权利要求1-11任一项所述的方法,其特征在于,所述语音数据包括第一语音数据和第二语音数据,所述第一语音数据为所述语音检测模型的卷积层输出的高阶语音特征信息,所述第二语音数据为所述语音检测模型的全连接层输出的高阶语音特征信息;
    所述目标位姿信息包括第一目标位姿信息和第二目标位姿信息,所述第一目标位姿信息为所述位姿检测模型的卷积层输出的高阶语音特征信息,所述第二目标位姿信息为所述位姿检测模型的全连接层输出的高阶语音特征信息。
  13. 一种电子设备,其特征在于,包括:存储器、处理器和触控屏;其中:
    所述触控屏用于显示内容;
    所述存储器,用于存储计算机程序,所述计算机程序包括程序指令;
    所述处理器用于调用所述程序指令,使得所述电子设备执行如权利要求1-12任一项所述的方法。
  14. 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质存储有计算机程序,该计算机程序被处理器执行时,实现如权利要求1-12任意一项所述的方法。
PCT/CN2023/117410 2022-11-04 2023-09-07 一种语音交互方法及相关电子设备 WO2024093515A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211376580.5 2022-11-04
CN202211376580.5A CN115881118B (zh) 2022-11-04 2022-11-04 一种语音交互方法及相关电子设备

Publications (1)

Publication Number Publication Date
WO2024093515A1 true WO2024093515A1 (zh) 2024-05-10

Family

ID=85759434

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/117410 WO2024093515A1 (zh) 2022-11-04 2023-09-07 一种语音交互方法及相关电子设备

Country Status (2)

Country Link
CN (1) CN115881118B (zh)
WO (1) WO2024093515A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115881118B (zh) * 2022-11-04 2023-12-22 荣耀终端有限公司 一种语音交互方法及相关电子设备
CN117711395A (zh) * 2023-06-30 2024-03-15 荣耀终端有限公司 语音交互方法及电子设备

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106358061A (zh) * 2016-11-11 2017-01-25 四川长虹电器股份有限公司 电视语音遥控系统及方法
US20180204569A1 (en) * 2017-01-17 2018-07-19 Ford Global Technologies, Llc Voice Assistant Tracking And Activation
CN109376669A (zh) * 2018-10-30 2019-02-22 南昌努比亚技术有限公司 智能助手的控制方法、移动终端及计算机可读存储介质
US10515623B1 (en) * 2016-12-23 2019-12-24 Amazon Technologies, Inc. Non-speech input to speech processing system
CN111048089A (zh) * 2019-12-26 2020-04-21 广东思派康电子科技有限公司 提高智能穿戴设备语音唤醒成功率的方法、电子设备、计算机可读存储介质
CN111933112A (zh) * 2020-09-21 2020-11-13 北京声智科技有限公司 唤醒语音确定方法、装置、设备及介质
CN112863508A (zh) * 2020-12-31 2021-05-28 思必驰科技股份有限公司 免唤醒交互方法和装置
CN115223561A (zh) * 2022-07-28 2022-10-21 创维集团智能科技有限公司 手持设备的语音唤醒控制方法及相关设备
CN115881118A (zh) * 2022-11-04 2023-03-31 荣耀终端有限公司 一种语音交互方法及相关电子设备

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1988010413A1 (en) * 1987-06-09 1988-12-29 Central Institute For The Deaf Speech processing apparatus and methods
WO2008069519A1 (en) * 2006-12-04 2008-06-12 Electronics And Telecommunications Research Institute Gesture/speech integrated recognition system and method
US20140358552A1 (en) * 2013-05-31 2014-12-04 Cirrus Logic, Inc. Low-power voice gate for device wake-up
CN104660792A (zh) * 2013-11-21 2015-05-27 腾讯科技(深圳)有限公司 唤醒应用的方法及装置
US9779725B2 (en) * 2014-12-11 2017-10-03 Mediatek Inc. Voice wakeup detecting device and method
CA3000244A1 (en) * 2018-04-04 2019-10-04 Op-Hygiene Ip Gmbh Fluid pump with whistle
CN108712566B (zh) * 2018-04-27 2020-10-30 维沃移动通信有限公司 一种语音助手唤醒方法及移动终端
KR20210011146A (ko) * 2019-07-22 2021-02-01 이동욱 비음성 웨이크업 신호에 기반한 서비스 제공 장치 및 그 방법
CN111651041B (zh) * 2020-05-27 2024-03-12 上海龙旗科技股份有限公司 移动设备的抬起唤醒方法及系统
CN113823288A (zh) * 2020-06-16 2021-12-21 华为技术有限公司 一种语音唤醒的方法、电子设备、可穿戴设备和系统
CN113377206A (zh) * 2021-07-05 2021-09-10 安徽淘云科技股份有限公司 词典笔抬起唤醒方法、装置和设备
CN113689857B (zh) * 2021-08-20 2024-04-26 北京小米移动软件有限公司 语音协同唤醒方法、装置、电子设备及存储介质

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106358061A (zh) * 2016-11-11 2017-01-25 四川长虹电器股份有限公司 电视语音遥控系统及方法
US10515623B1 (en) * 2016-12-23 2019-12-24 Amazon Technologies, Inc. Non-speech input to speech processing system
US20180204569A1 (en) * 2017-01-17 2018-07-19 Ford Global Technologies, Llc Voice Assistant Tracking And Activation
CN109376669A (zh) * 2018-10-30 2019-02-22 南昌努比亚技术有限公司 智能助手的控制方法、移动终端及计算机可读存储介质
CN111048089A (zh) * 2019-12-26 2020-04-21 广东思派康电子科技有限公司 提高智能穿戴设备语音唤醒成功率的方法、电子设备、计算机可读存储介质
CN111933112A (zh) * 2020-09-21 2020-11-13 北京声智科技有限公司 唤醒语音确定方法、装置、设备及介质
CN112863508A (zh) * 2020-12-31 2021-05-28 思必驰科技股份有限公司 免唤醒交互方法和装置
CN115223561A (zh) * 2022-07-28 2022-10-21 创维集团智能科技有限公司 手持设备的语音唤醒控制方法及相关设备
CN115881118A (zh) * 2022-11-04 2023-03-31 荣耀终端有限公司 一种语音交互方法及相关电子设备

Also Published As

Publication number Publication date
CN115881118B (zh) 2023-12-22
CN115881118A (zh) 2023-03-31

Similar Documents

Publication Publication Date Title
RU2766255C1 (ru) Способ голосового управления и электронное устройство
US20220223150A1 (en) Voice wakeup method and device
WO2024093515A1 (zh) 一种语音交互方法及相关电子设备
EP4064284A1 (en) Voice detection method, prediction model training method, apparatus, device, and medium
CN110503959B (zh) 语音识别数据分发方法、装置、计算机设备及存储介质
CN111819533B (zh) 一种触发电子设备执行功能的方法及电子设备
US11537360B2 (en) System for processing user utterance and control method of same
WO2021052139A1 (zh) 手势输入方法及电子设备
CN114173204A (zh) 一种提示消息的方法、电子设备和系统
CN114173000B (zh) 一种回复消息的方法、电子设备和系统、存储介质
WO2021169370A1 (zh) 服务元素的跨设备分配方法、终端设备及存储介质
CN111681655A (zh) 语音控制方法、装置、电子设备及存储介质
WO2023273321A1 (zh) 一种语音控制方法及电子设备
WO2022161077A1 (zh) 语音控制方法和电子设备
CN115333941A (zh) 获取应用运行情况的方法及相关设备
CN113742460A (zh) 生成虚拟角色的方法及装置
WO2023130931A1 (zh) 服务异常提醒方法、电子设备及存储介质
WO2023071940A1 (zh) 跨设备的导航任务的同步方法、装置、设备及存储介质
CN113380240B (zh) 语音交互方法和电子设备
WO2022007757A1 (zh) 跨设备声纹注册方法、电子设备及存储介质
CN111028846B (zh) 免唤醒词注册的方法和装置
CN115083401A (zh) 语音控制方法及装置
CN116524919A (zh) 设备唤醒方法、相关装置及通信系统
CN111681654A (zh) 语音控制方法、装置、电子设备及存储介质
CN113867851A (zh) 电子设备操作引导信息录制方法、获取方法和终端设备