WO2020244402A1 - Electronic device, method and medium for voice-interaction wake-up based on microphone signals - Google Patents

Electronic device, method and medium for voice-interaction wake-up based on microphone signals

Info

Publication number
WO2020244402A1
WO2020244402A1 (PCT/CN2020/092067; priority application CN2020092067W)
Authority
WO
WIPO (PCT)
Prior art keywords
user
electronic device
voice
microphone
speaking
Prior art date
Application number
PCT/CN2020/092067
Other languages
English (en)
French (fr)
Inventor
史元春 (Yuanchun Shi)
喻纯 (Chun Yu)
Original Assignee
清华大学 (Tsinghua University)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 清华大学 (Tsinghua University)
Publication of WO2020244402A1

Links

Images

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025: Phonemes, fenemes or fenones being the recognition units
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223: Execution procedure of a spoken command
    • G10L2015/225: Feedback of the input speech
    • G10L15/26: Speech to text systems
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/27: characterised by the analysis technique
    • G10L25/30: using neural networks
    • G10L25/48: specially adapted for particular use
    • G10L25/51: for comparison or discrimination
    • G10L25/78: Detection of presence or absence of voice signals

Definitions

  • This application generally relates to the field of voice input, and more specifically, to smart electronic devices and voice input triggering methods.
  • Physical-button triggering: voice input is activated after the user presses (or holds down) one or more physical buttons on the mobile device.
  • Interface-element triggering: the device needs a screen; the trigger element occupies screen content; software-UI constraints may make triggering cumbersome; and accidental triggering is easy.
  • Wake-word detection: a specific word (such as a product nickname) serves as the wake word, and the device activates voice input after detecting it.
  • an electronic device equipped with multiple microphones.
  • the electronic device has a memory and a central processing unit.
  • the memory stores computer-executable instructions that can be executed by the central processing unit.
  • the operations are as follows: analyze the sound signals collected by the multiple microphones; determine whether the user is speaking to the electronic device at close range; in response to determining that the user is speaking to the electronic device at close range, process the sound signals collected by the microphones as the user's voice input.
  • a plurality of microphones constitute a microphone array system.
  • judging whether the user is speaking to the electronic device at close range includes: calculating the position of the user's mouth relative to the microphone array from the time differences between the sound signal's arrivals at the individual microphones; when the distance between the user's mouth and the electronic device is less than a threshold, determining that the user is speaking to the electronic device at close range.
  • the distance threshold is 10 cm.
  • processing the sound signal as the user's voice input includes: handling the user's voice input differently according to the distance between the speaker's mouth and the electronic device.
  • judging whether the user is speaking to the electronic device at close range includes: judging whether the sound signal collected by at least one microphone contains a voice signal of the user speaking; in response to determining that it does, extracting the voice signal from the sound signals collected by the microphones; judging whether the amplitude difference between the voice signals extracted from different microphones' sound signals exceeds a predetermined threshold; and, in response to determining that the amplitude difference exceeds the predetermined threshold, confirming that the user is speaking to the electronic device at close range.
  • the electronic device is also operable to: define the microphone with the largest voice-signal amplitude among the multiple microphones as the responding microphone; and handle the user's voice input differently according to which microphone responds.
  • judging whether the user is speaking to the electronic device at close range includes: using a pre-trained machine learning model to process the sound signals of the multiple microphones.
  • the user's speech includes: speech at normal volume; speech at low volume; speech produced without vibrating the vocal cords.
  • the electronic device is further operable to: in response to determining that the user is speaking to the electronic device at close range, determine which of the following ways the user is speaking in: at normal volume, at low volume, or without vibrating the vocal cords; and process the sound signal differently according to the result.
  • the different processing is activating different applications to process voice input.
  • the features used in the judgment include volume, spectral features, energy distribution, and the like.
  • processing the sound signal as the user's voice input includes one or more of the following: storing the sound signal on a storage medium of the electronic device; sending the sound signal over the Internet; recognizing the speech in the sound signal as text and storing it on a storage medium of the electronic device; recognizing the speech in the sound signal as text and sending it over the Internet; recognizing the speech in the sound signal as text, understanding the user's voice command, and performing the corresponding operation.
  • the electronic device is also operable to identify a specific user through voiceprint analysis, and only process the voice signal containing the voice of the specific user.
  • the electronic device is one of a smart phone, a smart watch, a smart ring, and a tablet computer.
  • a voice input triggering method executed by an electronic device equipped with multiple microphones.
  • the electronic device has a memory and a central processing unit.
  • the memory stores computer-executable instructions which, when executed by the central processing unit, can execute the voice input triggering method.
  • the voice input triggering method includes: analyzing the sound signals collected by the multiple microphones; judging whether the user is speaking to the electronic device at close range; and, in response to determining that the user is speaking to the electronic device at close range, processing the sound signals collected by the microphones as the user's voice input.
  • a computer-readable medium having computer-executable instructions stored thereon.
  • the computer-executable instructions, when executed by a computer, can execute a voice-interaction wake-up method.
  • the voice-interaction wake-up method includes: analyzing the sound signals collected by multiple microphones; determining whether the user is speaking to the electronic device at close range; and, in response to determining that the user is speaking to the electronic device at close range, processing the sound signal collected by the microphones as the user's voice input.
  • an electronic device equipped with a microphone.
  • the electronic device has a memory and a central processing unit.
  • the memory stores computer-executable instructions.
  • the computer-executable instructions, when executed by the central processing unit, perform the following operations: analyze the sound signal collected by the microphone; recognize whether the sound signal contains human speech and whether it contains the wind noise produced by the airflow of that speech hitting the microphone; and, in response to determining that the sound signal contains human speech and the wind noise produced by the airflow of the user's speech hitting the microphone, process the sound signal as the user's voice input.
  • the voice spoken by the user includes: the voice of the user speaking at a normal volume, the voice of the user speaking at a low volume, and the voice of the user speaking in a non-voicing manner.
  • the electronic device is further operable to: in response to determining that the user is speaking to the electronic device at close range, determine which of the following ways the user is speaking in: at normal volume, at low volume, or without vibrating the vocal cords; and process the sound signal differently according to the result.
  • the different processing is activating different applications to process voice input.
  • the features used in the judgment include volume, spectral features, energy distribution, and the like.
  • processing the sound signal as the user's voice input includes one or more of the following: storing the sound signal on a storage medium of the electronic device; sending the sound signal over the Internet; recognizing the speech in the sound signal as text and storing it on a storage medium of the electronic device; recognizing the speech in the sound signal as text and sending it over the Internet; recognizing the speech in the sound signal as text, understanding the user's voice command, and performing the corresponding operation.
  • the electronic device is also operable to identify a specific user through voiceprint analysis, and only process the voice signal containing the voice of the specific user.
  • the electronic device is one of a smart phone, a smart watch, and a smart ring.
  • the electronic device is further operable to use a neural network model to determine whether the sound signal contains the user's speech and the wind noise produced by the airflow of that speech hitting the microphone.
  • the electronic device is further operable such that recognizing whether the sound signal contains human speech and the wind noise of that speech hitting the microphone includes: recognizing whether the sound signal contains the user's speech; in response to determining that it does, recognizing the phonemes in the speech and representing the voice signal as a phoneme sequence; for each phoneme in the sequence, determining whether it is an exhaled phoneme, that is, one during whose articulation airflow leaves the mouth; splitting the sound signal into a sequence of segments of fixed window length; using frequency features to identify whether each segment contains wind noise; and comparing the exhaled phonemes in the phoneme sequence with the segments identified as wind noise, and the non-exhaled phonemes with the wind-noise segments; when the overlap between exhaled phonemes and wind-noise segments is above a threshold and the overlap between non-exhaled phonemes and wind-noise segments is below a threshold, judging that the sound signal contains wind noise produced by the airflow of the user's speech hitting the microphone.
  • recognizing whether the sound signal contains human speech and the wind noise of that speech hitting the microphone includes: recognizing wind-noise acoustic features in the sound signal; in response to determining that the sound signal contains wind noise, recognizing whether it contains a voice signal; in response to determining that it does, identifying the corresponding phoneme sequence; computing the wind-noise feature strength at each moment; obtaining each phoneme's exhalation strength from a predefined data model; and analyzing the consistency between the wind-noise features and the phoneme sequence with a Gaussian-mixture Bayesian model; when the agreement is above a threshold, judging that the sound signal contains wind noise produced by the airflow of the user's speech hitting the microphone.
  • an electronic device equipped with a microphone has a memory and a central processing unit.
  • the memory stores computer executable instructions.
  • the computer-executable instructions, when executed by the central processing unit, perform the following operations: determine whether the sound signal collected by the microphone contains a voice signal; in response to confirming that it does, determine whether the user is whispering, i.e. speaking at lower than normal volume; and, in response to determining that the user is whispering, process the sound signal as voice input without any wake-up operation.
  • whispering includes two modes: whispering without vocal-cord vibration and low-voiced speech with vocal-cord vibration.
  • the electronic device is also operable to: in response to determining that the user is whispering, judge whether the user is whispering without vocal-cord vibration or speaking in a low voice with vocal-cord vibration; and process the sound signal differently according to the result.
  • the different processing is to activate different applications to respond to voice input.
  • the signal characteristics used to determine whether the user is speaking in a low voice include volume, frequency spectrum characteristics, and energy distribution.
  • the signal features used to judge whether the user is whispering without vocal-cord vibration or speaking in a low voice with vocal-cord vibration include volume, spectral features, and energy distribution.
  • the judging whether the user is speaking in a low voice includes: using a machine learning model to process a sound signal collected by a microphone to determine whether the user is speaking in a low voice.
  • the machine learning model is a convolutional neural network model or a recurrent neural network model.
  • judging whether the user is whispering without vocal-cord vibration or speaking in a low voice with vocal-cord vibration includes: using a machine learning model to process the sound signal collected by the microphone.
  • the machine learning model is a convolutional neural network model or a recurrent neural network model.
  • processing the sound signal as the user's voice input includes one or more of the following: storing the sound signal on a storage medium of the electronic device; sending the sound signal over the Internet; recognizing the speech in the sound signal as text and storing it on a storage medium of the electronic device; recognizing the speech in the sound signal as text and sending it over the Internet; recognizing the speech in the sound signal as text, understanding the user's voice command, and performing the corresponding operation.
  • the specific user is identified through voiceprint analysis, and only the voice signal containing the voice of the specific user is processed.
  • the electronic device is a smart phone, a smart watch, a smart ring, etc.
  • a smart electronic device equipped with a microphone interacts with the user on the basis of voice input as follows: process the sound signal captured by the microphone to determine whether it contains a voice signal; in response to confirming that it does, further determine from the sound signal collected by the microphone whether the distance between the smart electronic device and the user's mouth is less than a predetermined threshold; and, in response to determining that it is, process the sound signal collected by the microphone as voice input.
  • the predetermined threshold is 3 cm.
  • the predetermined threshold is 1 cm.
  • a proximity light sensor at the microphone of the electronic device is used to judge whether an object is approaching the electronic device.
  • a distance sensor at the microphone of the electronic device directly measures the distance between the electronic device and the user's mouth.
  • whether the distance between the smart electronic device and the user's mouth is less than the predetermined threshold is judged from the characteristics of the sound signal collected by the microphone.
  • the voice signal includes one or a combination of the following: the sound of the user speaking at normal volume; the sound of the user whispering; the sound the user produces when speaking without vibrating the vocal cords.
  • the electronic device is further operable to: in response to determining that the user is speaking to the electronic device at close range, determine which of the following ways the user is speaking in: at normal volume, at low volume, or without vibrating the vocal cords; and process the sound signal differently according to the result.
  • the different processing is activating different applications to process voice input.
  • the features used in the judgment include volume, spectral features, energy distribution, and the like.
  • the features used when judging whether the distance between the smart electronic device and the user's mouth is less than the predetermined threshold include time-domain and frequency-domain features of the sound signal, including volume and spectral energy.
  • judging whether the distance between the smart electronic device and the user's mouth is less than the predetermined threshold includes: extracting a voice signal from the sound signal collected by the microphone with a filter; judging whether the energy of the voice signal exceeds a threshold; and, in response to the voice-signal strength exceeding the threshold, judging that the distance between the electronic device and the user's mouth is less than the predetermined threshold.
  • judging whether the distance between the smart electronic device and the user's mouth is less than the predetermined threshold includes: using a deep neural network model to process the data collected by the microphone.
  • judging whether the distance between the smart electronic device and the user's mouth is less than the predetermined threshold includes: recording the user's voice signal when the user is not making voice input; comparing the voice signal currently collected by the microphone with that recorded signal; and, if the volume of the currently collected voice signal exceeds the volume of the no-input signal by a threshold, judging that the distance between the smart electronic device and the user's mouth is less than the predetermined threshold.
  • processing the sound signal as the user's voice input includes one or more of the following: storing the sound signal on a storage medium of the electronic device; sending the sound signal over the Internet; recognizing the speech in the sound signal as text and storing it on a storage medium of the electronic device; recognizing the speech in the sound signal as text and sending it over the Internet; recognizing the speech in the sound signal as text, understanding the user's voice command, and performing the corresponding operation.
  • the electronic device also recognizes a specific user through voiceprint analysis, and only processes the voice signal containing the voice of the specific user.
  • the electronic device is a smart phone, a smart watch, a smart ring, etc.
  • the mobile devices here include, but are not limited to, mobile phones, head-mounted displays, watches, and smaller smart wearable devices such as smart rings and watches.
  • efficiency is higher: the device can be used with one hand, with no switching between user interfaces or applications and no button to hold down; the user simply raises a hand to the mouth.
  • the sound-pickup quality is high.
  • the device's microphone is at the user's mouth, so the received voice input signal is clear and little affected by environmental sounds.
  • Fig. 1 is a schematic flowchart of a voice input interaction method according to an embodiment of the present application.
  • Fig. 2 shows an overall flow chart of a method for triggering a voice input using a difference in sound signals received by multiple microphones for an electronic device equipped with multiple microphones according to another embodiment of the present application.
  • Fig. 3 shows an overall flowchart of a voice input trigger method based on whispered speech recognition for an electronic device with a built-in microphone according to an embodiment of the present application.
  • Fig. 4 is an overall flowchart of the voice input triggering method based on distance judgment from the microphone's sound signal.
  • Fig. 5 is a schematic front view of the upper microphone of the mobile phone close to the mouth in the triggering posture according to an embodiment of the present application.
  • Fig. 6 is a schematic side view of the upper microphone of the mobile phone close to the mouth in the triggering posture according to the embodiment of the present application.
  • Fig. 7 is a schematic diagram of placing the lower microphone of the mobile phone close to the mouth in the triggering posture according to an embodiment of the present application.
  • Fig. 8 is a schematic diagram of placing the smart watch microphone close to the mouth in the triggering posture according to an embodiment of the present application.
  • the present disclosure concerns voice input triggering for smart electronic devices: whether to trigger the voice input application is decided from intrinsic features of the sound captured by the configured microphone, without traditional physical-button triggering, interface-element triggering, or wake-word detection, making the interaction more natural; bringing the device in front of the mouth triggers voice input, matching user habits and intuition.
  • 1. voice input triggering based on the wind-noise characteristics of human speech, specifically, directly starting voice input and treating the received sound signal as voice input by recognizing the speech and wind-noise sounds of a person speaking; 2. voice input triggering based on the differences between sound signals received by multiple microphones; 3. voice input triggering based on whisper recognition; 4. voice input triggering based on distance judgment from the microphone's sound signal.
  • when the user speaks into the microphone at close range, even if the voice is very quiet or the vocal cords are not engaged, the sound signal collected by the microphone contains two components: the sound produced by the vibration of the vocal cords and the oral cavity, and the wind noise produced by the airflow of the speech hitting the microphone.
  • the voice input application of the electronic device can be triggered based on this characteristic.
  • Fig. 1 shows a schematic flowchart of a voice input interaction method 100 according to an embodiment of the present application.
  • in step S101, the sound signal collected by the microphone is analyzed to recognize whether it contains human speech and whether it contains the wind noise produced by the airflow of that speech hitting the microphone.
  • in step S102, in response to determining that the sound signal contains human speech and the wind noise produced by the airflow of the user's speech hitting the microphone, the sound signal is processed as the user's voice input.
  • the voice input interaction method of this embodiment is particularly suitable for voice input without engaging the vocal cords when privacy requirements are high.
  • the user's speech may include: speech at normal volume, speech at low volume, and speech produced without vibrating the vocal cords.
  • the different speaking modes above can be recognized, and different feedback can be produced according to the recognition result.
  • for example, normal speech controls the phone's voice assistant, whispering controls WeChat, and unvoiced speech takes voice-transcription notes.
  • processing the sound signal as the user's voice input includes one or more of the operations listed above.
  • the method also includes identifying a specific user through voiceprint analysis and processing only sound signals containing that user's voice.
  • the electronic device is one of a smartphone, a smart watch, and a smart ring.
  • a neural network model is used to determine whether the sound signal contains the user's speech and the wind noise produced by the airflow of that speech hitting the microphone; this is only an example, and other machine learning algorithms can be used.
  • in one example, recognizing whether the sound signal contains human speech and the wind noise of that speech hitting the microphone includes: recognizing whether the signal contains the user's speech; representing the speech as a phoneme sequence; marking each phoneme as exhaled or non-exhaled; splitting the signal into fixed-length segments and detecting wind noise in each; and comparing exhaled phonemes against wind-noise segments and non-exhaled phonemes against the wind-noise segments.
  • in another example, it includes: detecting wind-noise features and a voice signal; identifying the phoneme sequence; computing the wind-noise strength at each moment and each phoneme's exhalation strength from a predefined data model; and analyzing their consistency with a Gaussian-mixture Bayesian model.
  • Fig. 2 shows an overall flow chart of a method for triggering a voice input using a difference in sound signals received by multiple microphones for an electronic device equipped with multiple microphones according to another embodiment of the present application.
  • the electronic device is, for example, a mobile phone with multiple built-in microphones.
  • the electronic device has a memory and a central processing unit.
  • the memory stores computer-executable instructions.
  • the computer-executable instructions, when executed by the central processing unit, can execute the voice input triggering method of this embodiment.
  • step S201 the sound signals collected by multiple microphones are analyzed.
  • the multiple microphones include at least three microphones forming a microphone array system, and the spatial position of the sound source relative to the smart device can be estimated from the time differences with which the sound signal reaches the individual microphones.
  • the sound signal here includes, for example, the amplitude and frequency of the sound signal.
  • step S202 based on the sound signals collected by multiple microphones, it is determined whether the user is speaking to the electronic device at a close range.
  • determining whether the user is speaking to the electronic device at close range includes:
  • calculating the position of the user's mouth relative to the microphone array from the time differences between the sound signal's arrivals at the individual microphones; when the distance between the user's mouth and the electronic device is less than a threshold, determining that the user is speaking to the electronic device at close range.
  • the distance threshold is 10 cm.
  • step S203 in response to determining that the user is speaking to the electronic device at a close distance, the sound signal collected by the microphone is processed as the user's voice input.
  • processing the sound signal as the user's voice input includes:
  • handling the user's voice input differently according to the distance between the speaker's mouth and the electronic device; for example, at a distance of 0-3 cm the voice assistant is activated to respond to the user's voice input, and at 3-10 cm the WeChat application is activated to respond and send the voice message to a friend.
  • determining whether the user is speaking to the electronic device at close range includes:
  • handling the user's voice input differently according to which microphone responds; for example, when the responding microphone is at the bottom of the smartphone, the voice assistant is activated, and when it is at the top, the recorder function is activated and the user's speech is recorded to the storage device.
  • determining whether the user is speaking to the electronic device at close range includes: using a pre-trained machine learning model to process the sound signals of the multiple microphones. In general, training sample data are prepared and used to train the chosen model; in actual application (sometimes called testing), the sound signals captured by the multiple microphones are fed to the model as test samples, and the output indicates whether the user is speaking to the electronic device at close range.
  • the machine learning model is, for example, deep learning neural network, support vector machine, decision tree, etc.
  • the user's speech includes: speech at normal volume, speech at low volume, and speech produced without vibrating the vocal cords.
  • processing the sound signal as the user's voice input includes one or more of the following: storing the sound signal on a storage medium of the electronic device; sending the sound signal over the Internet; recognizing the speech in the sound signal as text and storing it on a storage medium of the electronic device; recognizing the speech in the sound signal as text and sending it over the Internet; recognizing the speech in the sound signal as text, understanding the user's voice command, and performing the corresponding operation.
  • it also includes identifying a specific user through voiceprint analysis, and processing only the voice signal containing the voice of the specific user.
  • the electronic device is a smart phone, smart watch, smart ring, tablet computer, etc.
  • This embodiment uses the differences between the sound signals of different built-in microphones to recognize whether the user is speaking to the electronic device at close range and then decides whether to start voice input; it has the advantages of reliable recognition and a simple computation.
  • Whispering refers to speaking at a volume lower than normal speech (such as a normal conversation with others).
  • without vocal-cord vibration, the sound produced mainly comprises air passing through the throat and mouth and the sounds of the tongue and teeth inside the mouth.
  • with vocal-cord vibration, the sound produced includes, in addition to the sounds of the vibration-free whisper mode, the sound produced by the vibrating vocal cords.
  • compared with normal-volume speech, however, the vocal cords vibrate less during low-voiced speech with vocal-cord vibration, and the vocal-cord component is correspondingly weaker.
  • in the whisper mode, the vocal cords do not vibrate.
  • the sound of vibration-free whispering and the sound produced by vocal-cord vibration occupy different frequency ranges and can be distinguished.
  • low-voiced speech with vocal-cord vibration and normal-volume speech with vocal-cord vibration can be distinguished by a volume threshold.
  • the specific threshold can be set in advance or by the user.
  • Example method: filter the sound signal collected by the microphone and extract two components, V1, the sound produced by vocal-cord vibration, and V2, the sound of air passing through the throat and mouth and of the tongue and teeth inside the mouth.
  • when the energy ratio of V1 to V2 is below a threshold, the user is judged to be whispering.
  • in general, whispering can be detected only when the user is close to the microphone, for example closer than 30 cm.
  • defining close-range whispering as voice input is an interaction that is easy to learn, understand and operate; it removes the need for explicit wake-up operations such as pressing a wake-up button or uttering a wake word, and in the vast majority of practical situations it is not triggered by accident.
  • Fig. 3 shows an overall flow chart of a voice input trigger method based on whispered speech recognition for an electronic device equipped with a microphone according to an embodiment of the present application.
  • An electronic device equipped with a microphone has a memory and a central processing unit, and computer executable instructions are stored on the memory. When the computer executable instructions are executed by the central processing unit, the voice input trigger method according to the embodiments of the present application can be executed.
  • step S301 it is determined whether the sound signal collected by the microphone contains a voice signal.
  • in step S302, in response to confirming that the sound signal collected by the microphone contains a voice signal, it is determined whether the user is whispering, i.e. speaking at lower than normal volume.
  • step S303 in response to determining that the user is speaking in a low voice, the sound signal is processed as a voice input without any wake-up operation.
  • Whispering can include two ways of whispering without vocal cords and whispering with vocal cords.
  • the voice input triggering method may further include: in response to determining that the user is whispering, judging whether the user is whispering without vocal-cord vibration or speaking in a low voice with vocal-cord vibration, and processing the sound signal differently according to the result.
  • the different processing hands the voice input to different applications.
  • for example, normal speech controls the phone's voice assistant, whispering controls WeChat, and unvoiced speech takes voice-transcription notes.
  • the signal characteristics used to determine whether the user is speaking in a low voice may include volume, frequency spectrum characteristics, energy distribution, and so on.
  • the signal features used to judge whether the user is whispering without vocal-cord vibration or speaking in a low voice with vocal-cord vibration include volume, spectral features, energy distribution, and the like.
  • determining whether the user is speaking in a low voice may include: using a machine learning model to process the sound signal collected by the microphone to determine whether the user is speaking in a low voice.
  • the machine learning model may be a convolutional neural network model or a recurrent neural network model.
  • judging whether the user is whispering without vocal-cord vibration or speaking in a low voice with vocal-cord vibration includes: using a machine learning model to process the sound signal collected by the microphone.
  • the processing of using a sound signal as the user's voice input includes one or more of the following:
  • the voice input triggering method may also include: identifying a specific user through voiceprint analysis, and processing only the voice signal containing the voice of the specific user.
  • the electronic device may be a smart phone, smart watch, smart ring, etc.
  • voice input triggering based on distance judgment from the microphone's sound signal.
  • the following describes the overall flowchart of this method with reference to Fig. 4.
  • step 401 the sound signal captured by the microphone is processed to determine whether there is a voice signal in the sound signal.
  • in step 402, in response to confirming that a voice signal exists in the sound signal, it is further determined from the sound signal collected by the microphone whether the distance between the smart electronic device and the user's mouth is less than a predetermined threshold.
  • step 403 in response to determining that the distance between the electronic device and the user's mouth is less than a predetermined threshold, the sound signal collected by the microphone is processed as a voice input.
  • the predetermined threshold is 10 cm.
  • the voice signal may include one or a combination of the following: the sound of the user speaking at normal volume; the sound of the user whispering; the sound the user produces when speaking without vibrating the vocal cords.
  • the features used when judging whether the distance between the smart electronic device and the user's mouth is less than the predetermined threshold include time-domain and frequency-domain features of the sound signal, including volume and spectral energy.
  • judging whether the distance between the smart electronic device and the user's mouth is less than the predetermined threshold includes: using a deep neural network model to process the data collected by the microphone.
  • judging whether the distance between the smart electronic device and the user's mouth is less than the predetermined threshold includes: recording the user's voice signal when the user is not making voice input, comparing the voice signal currently collected by the microphone with that recorded signal, and, if the volume of the currently collected voice signal exceeds the volume of the no-input signal by a threshold, judging that the distance between the smart electronic device and the user's mouth is less than the predetermined threshold.
  • the processing of using the sound signal as the user's voice input includes one or more of the following: storing the sound signal in a storage medium on the electronic device; sending the sound signal through the Internet; The voice signal is recognized as text and stored on the storage medium of the electronic device; the voice signal in the voice signal is recognized as text and sent out through the Internet; the voice signal in the voice signal is recognized as text, and the user’s voice command is understood, Take the appropriate action.
  • the voice input triggering further includes identifying a specific user through voiceprint analysis, and processing only the voice signal containing the voice of the specific user.
  • the electronic device is a smart phone, smart watch, smart ring, etc.
  • Figs. 5 to 8 show several cases in which the user places the microphone of the smart electronic portable device close to the mouth; the speech the user then produces is taken as voice input.
  • Figs. 5 and 6 show the case of a microphone at the top of the mobile phone.
  • Fig. 7 shows the case of a microphone at the bottom of the mobile phone, similar to the top-microphone case; the two postures are not mutually exclusive, and if the phone has microphones at both ends either posture can implement the interaction scheme.
  • Fig. 8 shows the case where the device is a smart watch, similar to the phone cases.
  • the above description of the trigger gesture is exemplary, not exhaustive, and is not limited to the various devices and microphones disclosed.
  • close-range features peculiar to the voice, such as microphone pops, near-field wind noise, blowing sounds, energy, spectral features and time-domain features, can be used to judge whether the device is within a given distance of the user's mouth.
  • a multi-microphone array can receive sound and trigger voice input by comparing and analyzing the differences between the signals received by the different microphones, separating the near-field voice signal from the environment, recognizing and detecting whether the sound signal includes speech, using multi-microphone sound-source localization to judge whether the distance between the user's mouth and the device is below a predetermined threshold, and using voiceprint recognition to judge whether the voice input comes from a serviceable user; these factors are combined to decide whether to treat the signal as voice input.
  • when the smart electronic portable device, by analyzing the voice signal, detects that the sound originates near itself, i.e. the mobile device is close to the user's mouth, it takes the sound signal as voice input and, depending on the task and context, uses natural language processing to understand the user's voice input and complete the corresponding task.
  • the microphone is not limited to the foregoing examples, but may include one or a combination of the following items: a built-in single microphone in the device; a built-in dual microphone in the device; a built-in multi-microphone array in the device; an external wireless microphone; and an external wired microphone.
  • the smart electronic portable device can be a mobile phone used with a binaural Bluetooth headset, a wired headset with a microphone, or another microphone sensor.
  • the smart electronic portable device can be a watch or a smaller smart wearable such as a smart ring.
  • the smart electronic portable device is a head-mounted smart display device equipped with a microphone or a multi-microphone group.
  • a feedback output may be made, and the feedback output includes one of vibration, voice, and image, or a combination thereof.

Abstract

An electronic device, method and medium for voice-interaction wake-up based on microphone signals. The electronic device has a memory and a central processing unit; the memory stores computer-executable instructions which, when executed by the central processing unit, perform the following operations: analyzing the sound signals collected by multiple microphones (S201); determining whether the user is speaking to the electronic device at close range (S202); and, in response to determining that the user is speaking to the electronic device at close range, processing the sound signals collected by the microphones as the user's voice input (S203). The interaction method suits voice input while the user carries a smart electronic device; the operation is natural and simple, it simplifies the steps of voice input, and it reduces the burden and difficulty of interaction, making interaction more natural.

Description

Electronic device, method and medium for voice-interaction wake-up based on microphone signals
This application claims priority to Chinese patent application No. 201910475972.9, filed with the China Patent Office on June 3, 2019 and entitled "Electronic device, method and medium for voice-interaction wake-up based on microphone signals", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates generally to the field of voice input and, more specifically, to smart electronic devices and voice input triggering methods.
Background
With the development of computer technology, speech recognition algorithms have matured, and voice input is becoming increasingly important owing to its naturalness and effectiveness as an interaction style. Users can interact with mobile devices (phones, watches, etc.) by voice to enter commands, query information, chat by voice, and perform many other tasks.
As for when to trigger voice input, existing solutions all have shortcomings:
1. Physical-button triggering
Voice input is activated after pressing (or holding down) one or more physical buttons on the mobile device.
Drawbacks: a physical button is required; accidental triggering is easy; the user must press a button.
2. Interface-element triggering
Tapping (or holding) an interface element (such as an icon) on the mobile device's screen activates voice input.
Drawbacks: the device must have a screen; the trigger element occupies screen content; software-UI constraints may make triggering cumbersome; accidental triggering is easy.
3. Wake-word (speech) detection
A specific word (such as a product nickname) serves as the wake word, and the device activates voice input after detecting it.
Drawbacks: poor privacy and social acceptability; low interaction efficiency.
Summary
In view of the above, the present application is proposed:
According to one aspect of the present application, an electronic device configured with multiple microphones is provided. The electronic device has a memory and a central processing unit; the memory stores computer-executable instructions which, when executed by the central processing unit, perform the following operations: analyzing the sound signals collected by the multiple microphones; determining whether the user is speaking to the electronic device at close range; and, in response to determining that the user is speaking to the electronic device at close range, processing the sound signals collected by the microphones as the user's voice input.
Preferably, the multiple microphones constitute a microphone array system.
Preferably, determining whether the user is speaking to the electronic device at close range includes: calculating the position of the user's mouth relative to the microphone array from the time differences between the sound signal's arrivals at the individual microphones of the array; and, when the distance between the user's mouth and the electronic device is less than a threshold, determining that the user is speaking to the electronic device at close range.
Preferably, the distance threshold is 10 cm.
Preferably, processing the sound signal as the user's voice input includes: handling the user's voice input differently according to the distance between the speaker's mouth and the electronic device.
Preferably, determining whether the user is speaking to the electronic device at close range includes: determining whether the sound signal collected by at least one microphone contains a voice signal of the user speaking; in response to determining that it does, extracting the voice signal from the sound signals collected by the microphones; determining whether the amplitude difference between the voice signals extracted from different microphones' sound signals exceeds a predetermined threshold; and, in response to determining that the amplitude difference exceeds the predetermined threshold, confirming that the user is speaking to the electronic device at close range.
Preferably, the electronic device is further operable to: define the microphone with the largest voice-signal amplitude among the multiple microphones as the responding microphone; and handle the user's voice input differently according to which microphone responds.
Preferably, determining whether the user is speaking to the electronic device at close range includes: using a pre-trained machine learning model to process the sound signals of the multiple microphones and determine whether the user is speaking to the electronic device at close range.
Preferably, the user's speech includes: speech at normal volume; speech at low volume; speech produced without vibrating the vocal cords.
Preferably, the electronic device is further operable to: in response to determining that the user is speaking to the electronic device at close range, determine which of the following ways the user is speaking in: at normal volume, at low volume, or without vibrating the vocal cords; and process the sound signal differently according to the result.
Preferably, the different processing is activating different applications to handle the voice input.
Preferably, the features used in the judgment include volume, spectral features, energy distribution, and the like.
Preferably, processing the sound signal as the user's voice input includes one or more of the following: storing the sound signal on a storage medium of the electronic device; sending the sound signal over the Internet; recognizing the speech in the sound signal as text and storing it on a storage medium of the electronic device; recognizing the speech in the sound signal as text and sending it over the Internet; recognizing the speech in the sound signal as text, understanding the user's voice command, and performing the corresponding operation.
Preferably, the electronic device is further operable to identify a specific user through voiceprint analysis and to process only sound signals containing that user's voice.
Preferably, the electronic device is one of a smartphone, a smart watch, a smart ring, and a tablet computer.
According to another aspect of the present application, a voice input triggering method executed by an electronic device configured with multiple microphones is provided. The electronic device has a memory and a central processing unit; the memory stores computer-executable instructions which, when executed by the central processing unit, can execute the voice input triggering method. The method includes: analyzing the sound signals collected by the multiple microphones; determining whether the user is speaking to the electronic device at close range; and, in response to determining that the user is speaking to the electronic device at close range, processing the sound signals collected by the microphones as the user's voice input.
According to another aspect of the present application, a computer-readable medium is provided, storing computer-executable instructions which, when executed by a computer, can execute a voice-interaction wake-up method. The method includes: analyzing the sound signals collected by multiple microphones; determining whether the user is speaking to the electronic device at close range; and, in response to determining that the user is speaking to the electronic device at close range, processing the sound signal collected by the microphones as the user's voice input.
According to another aspect of the present application, an electronic device configured with a microphone is provided. The electronic device has a memory and a central processing unit; the memory stores computer-executable instructions which, when executed by the central processing unit, perform the following operations: analyzing the sound signal collected by the microphone; recognizing whether the sound signal contains human speech and whether it contains the wind noise produced by the airflow of that speech hitting the microphone; and, in response to determining that the sound signal contains human speech and the wind noise produced by the airflow of the user's speech hitting the microphone, processing the sound signal as the user's voice input.
Preferably, the user's speech includes: speech at normal volume, speech at low volume, and speech produced without vibrating the vocal cords.
Preferably, the electronic device is further operable to: in response to determining that the user is speaking to the electronic device at close range, determine which of the following ways the user is speaking in: at normal volume, at low volume, or without vibrating the vocal cords; and process the sound signal differently according to the result.
Preferably, the different processing is activating different applications to handle the voice input.
Preferably, the features used in the judgment include volume, spectral features, energy distribution, and the like.
Preferably, processing the sound signal as the user's voice input includes one or more of the following: storing the sound signal on a storage medium of the electronic device; sending the sound signal over the Internet; recognizing the speech in the sound signal as text and storing it on a storage medium of the electronic device; recognizing the speech in the sound signal as text and sending it over the Internet; recognizing the speech in the sound signal as text, understanding the user's voice command, and performing the corresponding operation.
Preferably, the electronic device is further operable to identify a specific user through voiceprint analysis and to process only sound signals containing that user's voice.
Preferably, the electronic device is one of a smartphone, a smart watch, and a smart ring.
Preferably, the electronic device is further operable to use a neural network model to determine whether the sound signal contains the user's speech and the wind noise produced by the airflow of that speech hitting the microphone.
Preferably, the electronic device is further operable such that recognizing whether the sound signal contains human speech and the wind noise of that speech hitting the microphone includes: recognizing whether the sound signal contains the user's speech; in response to determining that it does, recognizing the phonemes in the speech and representing the voice signal as a phoneme sequence; for each phoneme in the sequence, determining whether it is an exhaled phoneme, that is, one during whose articulation airflow leaves the mouth; splitting the sound signal into a sequence of segments of fixed window length; using frequency features to identify whether each segment contains wind noise; and comparing the exhaled phonemes in the phoneme sequence with the segments identified as wind noise, and the non-exhaled phonemes with the wind-noise segments; when the overlap between exhaled phonemes and wind-noise segments is above a threshold and the overlap between non-exhaled phonemes and wind-noise segments is below a threshold, judging that the sound signal contains wind noise produced by the airflow of the user's speech hitting the microphone.
Preferably, recognizing whether the sound signal contains human speech and the wind noise of that speech hitting the microphone includes: recognizing wind-noise acoustic features in the sound signal; in response to determining that the sound signal contains wind noise, recognizing whether it contains a voice signal; in response to determining that it does, identifying the phoneme sequence corresponding to the voice signal; for the wind-noise features in the sound signal, computing the wind-noise feature strength at each moment; for each phoneme in the phoneme sequence, obtaining its exhalation strength from a predefined data model; and analyzing the consistency between the wind-noise features and the phoneme sequence with a Gaussian-mixture Bayesian model; when the agreement is above a threshold, judging that the sound signal contains wind noise produced by the airflow of the user's speech hitting the microphone.
According to another aspect of the present application, an electronic device configured with a microphone is provided. The electronic device has a memory and a central processing unit; the memory stores computer-executable instructions which, when executed by the central processing unit, perform the following operations: determining whether the sound signal collected by the microphone contains a voice signal; in response to confirming that it does, determining whether the user is whispering, i.e. speaking at lower than normal volume; and, in response to determining that the user is whispering, processing the sound signal as voice input without any wake-up operation.
Preferably, whispering includes two modes: whispering without vocal-cord vibration and low-voiced speech with vocal-cord vibration.
Preferably, the electronic device is further operable to: in response to determining that the user is whispering, judge whether the user is whispering without vocal-cord vibration or speaking in a low voice with vocal-cord vibration; and process the sound signal differently according to the result.
Preferably, the different processing is activating different applications to respond to the voice input.
Preferably, the signal features used to determine whether the user is whispering include volume, spectral features, and energy distribution.
Preferably, the signal features used to judge whether the user is whispering without vocal-cord vibration or speaking in a low voice with vocal-cord vibration include volume, spectral features, and energy distribution.
Preferably, determining whether the user is whispering includes: using a machine learning model to process the sound signal collected by the microphone and determine whether the user is whispering.
Preferably, the machine learning model is a convolutional neural network model or a recurrent neural network model.
Preferably, judging whether the user is whispering without vocal-cord vibration or speaking in a low voice with vocal-cord vibration includes: using a machine learning model to process the sound signal collected by the microphone.
Preferably, the machine learning model is a convolutional neural network model or a recurrent neural network model.
Preferably, processing the sound signal as the user's voice input includes one or more of the following: storing the sound signal on a storage medium of the electronic device; sending the sound signal over the Internet; recognizing the speech in the sound signal as text and storing it on a storage medium of the electronic device; recognizing the speech in the sound signal as text and sending it over the Internet; recognizing the speech in the sound signal as text, understanding the user's voice command, and performing the corresponding operation.
Preferably, a specific user is identified through voiceprint analysis, and only sound signals containing that user's voice are processed.
Preferably, the electronic device is a smartphone, a smart watch, a smart ring, or the like.
According to another aspect of the present application, a smart electronic device configured with a microphone is provided. The smart electronic portable device interacts with the user on the basis of voice input as follows: processing the sound signal captured by the microphone to determine whether it contains a voice signal; in response to confirming that it does, further determining from the sound signal collected by the microphone whether the distance between the smart electronic device and the user's mouth is less than a predetermined threshold; and, in response to determining that the distance between the electronic device and the user's mouth is less than the predetermined threshold, processing the sound signal collected by the microphone as voice input.
Preferably, the predetermined threshold is 3 cm.
Preferably, the predetermined threshold is 1 cm.
Preferably, there is a proximity light sensor at the microphone of the electronic device, and it is used to judge whether an object is approaching the electronic device.
Preferably, there is a distance sensor at the microphone of the electronic device, and the distance between the electronic device and the user's mouth is measured directly by the distance sensor.
Preferably, whether the distance between the smart electronic device and the user's mouth is less than the predetermined threshold is judged from the characteristics of the sound signal collected by the microphone.
Preferably, the voice signal includes one or a combination of the following: the sound of the user speaking at normal volume; the sound of the user whispering; the sound the user produces when speaking without vibrating the vocal cords.
Preferably, the electronic device is further operable to: in response to determining that the user is speaking to the electronic device at close range, determine which of the following ways the user is speaking in: at normal volume, at low volume, or without vibrating the vocal cords; and process the sound signal differently according to the result.
Preferably, the different processing is activating different applications to handle the voice input.
Preferably, the features used in the judgment include volume, spectral features, energy distribution, and the like.
Preferably, the features used when judging whether the distance between the smart electronic device and the user's mouth is less than the predetermined threshold include time-domain and frequency-domain features of the sound signal, including volume and spectral energy.
Preferably, judging whether the distance between the smart electronic device and the user's mouth is less than the predetermined threshold includes: extracting a voice signal from the sound signal collected by the microphone with a filter; judging whether the energy of the voice signal exceeds a threshold; and, in response to the voice-signal strength exceeding the threshold, judging that the distance between the electronic device and the user's mouth is less than the predetermined threshold.
Preferably, judging whether the distance between the smart electronic device and the user's mouth is less than the predetermined threshold includes: using a deep neural network model to process the data collected by the microphone.
Preferably, judging whether the distance between the smart electronic device and the user's mouth is less than the predetermined threshold includes: recording the user's voice signal when the user is not making voice input; comparing the voice signal currently collected by the microphone with that recorded signal; and, if the volume of the currently collected voice signal exceeds the volume of the no-input signal by a threshold, judging that the distance between the smart electronic device and the user's mouth is less than the predetermined threshold.
Preferably, processing the sound signal as the user's voice input includes one or more of the following: storing the sound signal on a storage medium of the electronic device; sending the sound signal over the Internet; recognizing the speech in the sound signal as text and storing it on a storage medium of the electronic device; recognizing the speech in the sound signal as text and sending it over the Internet; recognizing the speech in the sound signal as text, understanding the user's voice command, and performing the corresponding operation.
Preferably, the electronic device also identifies a specific user through voiceprint analysis and processes only sound signals containing that user's voice.
Preferably, the electronic device is a smartphone, a smart watch, a smart ring, or the like.
The mobile devices here include, but are not limited to, mobile phones, head-mounted displays, watches, and smaller smart wearables such as smart rings and wristwatches.
Advantages of this approach:
1. More natural interaction. Bringing the device in front of the mouth triggers voice input, matching user habits and intuition.
2. Higher efficiency. The device can be used with one hand; there is no switching between user interfaces or applications and no button to hold down; the user simply raises a hand to the mouth.
3. High sound-pickup quality. The device's microphone is at the user's mouth, so the received voice input signal is clear and little affected by environmental sounds.
4. Good privacy and social acceptability. With the device in front of the mouth, the user can complete high-quality voice input with a relatively quiet voice, causing little disturbance to others; postures such as covering the mouth further protect privacy.
Brief Description of the Drawings
The above and/or other objects, features and advantages of the present application will become clearer and easier to understand from the following detailed description of its embodiments in conjunction with the accompanying drawings, in which:
Fig. 1 is a schematic flowchart of a voice input interaction method according to an embodiment of the present application.
Fig. 2 is an overall flowchart of a voice input triggering method, for an electronic device configured with multiple microphones, that uses the differences between the sound signals the microphones receive, according to another embodiment of the present application.
Fig. 3 is an overall flowchart of a voice input triggering method based on whisper recognition, for an electronic device with a built-in microphone, according to an embodiment of the present application.
Fig. 4 is an overall flowchart of a voice input triggering method based on distance judgment from the microphone's sound signal.
Fig. 5 is a schematic front view of the trigger posture of holding the phone's top microphone close to the mouth, according to an embodiment of the present application.
Fig. 6 is a schematic side view of the trigger posture of holding the phone's top microphone close to the mouth, according to an embodiment of the present application.
Fig. 7 is a schematic view of the trigger posture of holding the phone's bottom microphone close to the mouth, according to an embodiment of the present application.
Fig. 8 is a schematic view of the trigger posture of holding a smart watch's microphone close to the mouth, according to an embodiment of the present application.
Detailed Description
To help those skilled in the art better understand the present application, it is described in further detail below with reference to the drawings and specific embodiments.
The present disclosure concerns voice input triggering for smart electronic devices: whether to trigger the voice input application is decided from intrinsic features of the sound captured by the configured microphone, without traditional physical-button triggering, interface-element triggering, or wake-word detection, making the interaction more natural. Bringing the device in front of the mouth triggers voice input, matching user habits and intuition.
The disclosure proceeds along the following lines: 1. voice input triggering based on the wind-noise characteristics of human speech, specifically, directly starting voice input and treating the received sound signal as voice input by recognizing the speech and wind-noise sounds of a person speaking; 2. voice input triggering based on the differences between sound signals received by multiple microphones; 3. voice input triggering based on whisper recognition; 4. voice input triggering based on distance judgment from the microphone's sound signal.
1. Voice input triggering based on the wind-noise characteristics of human speech
When the user speaks into the microphone at close range, even if the voice is very quiet or the vocal cords are not engaged, the sound signal collected by the microphone contains two components: the sound produced by the vibration of the vocal cords and the oral cavity, and the wind noise produced by the airflow of the speech hitting the microphone. The electronic device's voice input application can be triggered on the basis of this characteristic.
Fig. 1 shows a schematic flowchart of a voice input interaction method 100 according to an embodiment of the present application.
In step S101, the sound signal collected by the microphone is analyzed to recognize whether it contains human speech and whether it contains the wind noise produced by the airflow of that speech hitting the microphone.
In step S102, in response to determining that the sound signal contains human speech and the wind noise produced by the airflow of the user's speech hitting the microphone, the sound signal is processed as the user's voice input.
The voice input interaction method of this embodiment is particularly suitable for voice input without engaging the vocal cords when privacy requirements are high.
Here the user's speech may include: speech at normal volume, speech at low volume, and speech produced without vibrating the vocal cords.
In one example, these different speaking modes can be recognized and different feedback produced according to the recognition result; for instance, normal speech controls the phone's voice assistant, whispering controls WeChat, and unvoiced speech takes voice-transcription notes.
As examples, processing the sound signal as the user's voice input includes one or more of the following:
storing the sound signal on a storage medium of the electronic device;
sending the sound signal over the Internet;
recognizing the speech in the sound signal as text and storing it on a storage medium of the electronic device;
recognizing the speech in the sound signal as text and sending it over the Internet;
recognizing the speech in the sound signal as text, understanding the user's voice command, and performing the corresponding operation.
In one example, the method further includes identifying a specific user through voiceprint analysis and processing only sound signals containing that user's voice.
In one example, the electronic device is one of a smartphone, a smart watch, and a smart ring.
In one example, a neural network model is used to determine whether the sound signal contains the user's speech and the wind noise produced by the airflow of that speech hitting the microphone. This is only an example; other machine learning algorithms can be used.
In one example, recognizing whether the sound signal contains human speech and whether it contains the wind noise produced by the airflow of that speech hitting the microphone includes:
recognizing whether the sound signal contains the user's speech;
in response to determining that it does, recognizing the phonemes in the speech and representing the voice signal as a phoneme sequence;
for each phoneme in the sequence, determining whether it is an exhaled phoneme, that is, one during whose articulation airflow leaves the mouth;
splitting the sound signal into a sequence of segments of fixed window length;
using frequency features to identify whether each segment contains wind noise;
comparing the exhaled phonemes in the phoneme sequence with the segments identified as wind noise, and likewise the non-exhaled phonemes with the wind-noise segments; when the overlap between exhaled phonemes and wind-noise segments is above a threshold and the overlap between non-exhaled phonemes and wind-noise segments is below a threshold, judging that the sound signal contains wind noise produced by the airflow of the user's speech hitting the microphone (a sketch of this comparison follows the list).
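The comparison step can be illustrated with a short sketch. This is a minimal illustration, not the application's implementation: it assumes an upstream phoneme recognizer that yields time-stamped phonemes already marked as exhaled or not, a per-window wind-noise detector, and freely chosen overlap thresholds.

```python
# Sketch of the exhaled-phoneme / wind-noise overlap check described above.
# Assumed upstream outputs:
#   phonemes: list of (start_s, end_s, is_exhaled) tuples
#   wind:     list of bools, one per fixed-length analysis window
def overlap_ratio(phonemes, wind, win_s, want_exhaled, want_wind):
    """Fraction of windows covered by phonemes of the given type whose
    wind-noise label equals want_wind."""
    hits = total = 0
    for start, end, exhaled in phonemes:
        if exhaled != want_exhaled:
            continue
        first, last = int(start / win_s), int(end / win_s)
        for i in range(first, min(last + 1, len(wind))):
            total += 1
            hits += wind[i] == want_wind
    return hits / total if total else 0.0

def speech_wind_noise_consistent(phonemes, wind, win_s=0.02,
                                 hi=0.6, lo=0.3):  # thresholds are assumptions
    # Exhaled phonemes should coincide with wind-noise windows, while
    # non-exhaled phonemes should mostly fall on wind-free windows.
    return (overlap_ratio(phonemes, wind, win_s, True, True) > hi and
            overlap_ratio(phonemes, wind, win_s, False, True) < lo)
```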
In one example, recognizing whether the sound signal contains human speech and whether it contains the wind noise produced by the airflow of that speech hitting the microphone includes:
recognizing wind-noise acoustic features in the sound signal;
in response to determining that the sound signal contains wind noise, recognizing whether the sound signal contains a voice signal;
in response to determining that the sound signal contains a voice signal, identifying the phoneme sequence corresponding to the voice signal;
for the wind-noise features in the sound signal, computing the wind-noise feature strength at each moment;
for each phoneme in the phoneme sequence, obtaining its exhalation strength from a predefined data model;
analyzing the consistency between the wind-noise features and the phoneme sequence with a Gaussian-mixture Bayesian model; when the agreement is above a threshold, judging that the sound signal contains wind noise produced by the airflow of the user's speech hitting the microphone (a simplified sketch follows).
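As a rough illustration of the consistency analysis, the sketch below substitutes a plain correlation between the per-frame wind-noise strength and a per-frame exhalation-strength track expanded from the phoneme sequence. The application itself specifies a Gaussian-mixture Bayesian model; the exhalation-strength table, frame length, and threshold here are assumptions.

```python
import numpy as np

def exhale_strength_track(phonemes, exhale_db, n_frames, frame_s):
    """phonemes: (start_s, end_s, label); exhale_db: label -> strength in
    [0, 1] (the 'predefined data model' of the text; values assumed)."""
    track = np.zeros(n_frames)
    for start, end, label in phonemes:
        a = int(start / frame_s)
        b = min(int(end / frame_s) + 1, n_frames)
        track[a:b] = exhale_db.get(label, 0.0)
    return track

def wind_speech_consistent(wind_strength, phonemes, exhale_db,
                           frame_s=0.02, thresh=0.5):
    wind = np.asarray(wind_strength, dtype=float)
    exhale = exhale_strength_track(phonemes, exhale_db, len(wind), frame_s)
    if wind.std() == 0 or exhale.std() == 0:
        return False
    # Agreement score: high correlation means wind-noise bursts line up
    # with exhaled phonemes, as expected for close-range speech.
    return float(np.corrcoef(wind, exhale)[0, 1]) > thresh
```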
2. Voice input triggering based on differences between the sound signals received by multiple microphones
Fig. 2 shows an overall flowchart of a voice input triggering method, for an electronic device configured with multiple microphones, that uses the differences between the sound signals the microphones receive, according to another embodiment of the present application.
The electronic device is, for example, a mobile phone with multiple built-in microphones. It has a memory and a central processing unit; the memory stores computer-executable instructions which, when executed by the central processing unit, can execute the voice input triggering method of this embodiment.
As shown in Fig. 2, in step S201, the sound signals collected by the multiple microphones are analyzed.
In one example, the multiple microphones include at least three microphones forming a microphone array system, and the spatial position of the sound source relative to the smart device can be estimated from the time differences with which the sound signal reaches the individual microphones.
The sound signal here includes, for example, its amplitude and frequency.
In step S202, based on the sound signals collected by the multiple microphones, it is determined whether the user is speaking to the electronic device at close range.
In one example, determining whether the user is speaking to the electronic device at close range includes:
calculating the position of the user's mouth relative to the microphone array from the time differences between the sound signal's arrivals at the individual microphones of the array, and
determining that the user is speaking to the electronic device at close range when the distance between the user's mouth and the electronic device is less than a threshold (see the delay-estimation sketch below).
In one example, the distance threshold is 10 cm.
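The localization step rests on estimating arrival-time differences between microphone pairs. The sketch below shows one standard estimator, GCC-PHAT; this particular estimator is an assumption, as the application does not name one. With three or more microphones, the pairwise delays constrain the mouth position.

```python
import numpy as np

def gcc_phat(sig_a, sig_b, fs, max_tau=None):
    """Time difference of arrival between two microphone signals, in seconds
    (negative when sig_b lags sig_a)."""
    n = len(sig_a) + len(sig_b)
    A, B = np.fft.rfft(sig_a, n), np.fft.rfft(sig_b, n)
    R = A * np.conj(B)
    R /= np.abs(R) + 1e-12              # PHAT weighting: keep phase only
    cc = np.fft.irfft(R, n)
    max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

# Example: a source close to microphone A arrives there first.
fs = 16000
src = np.random.randn(fs)
delay = 5  # ~0.3 ms, roughly a 10 cm path difference at 343 m/s
mic_a = src
mic_b = np.concatenate((np.zeros(delay), src))[:len(src)]
print(gcc_phat(mic_a, mic_b, fs))  # approximately -delay / fs
```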
In step S203, in response to determining that the user is speaking to the electronic device at close range, the sound signals collected by the microphones are processed as the user's voice input.
In one example, processing the sound signal as the user's voice input includes:
handling the user's voice input differently according to the distance between the speaker's mouth and the electronic device. For example, at a distance of 0-3 cm the voice assistant is activated to respond to the user's voice input; at 3-10 cm the WeChat application is activated to respond and send the voice message to a friend.
In one example, determining whether the user is speaking to the electronic device at close range includes:
determining whether the sound signal collected by at least one microphone contains a voice signal of the user speaking,
in response to determining that it does, extracting the voice signal from the sound signals collected by the microphones,
determining whether the amplitude difference between the voice signals extracted from different microphones' sound signals exceeds a predetermined threshold, and
in response to determining that the amplitude difference exceeds the predetermined threshold, confirming that the user is speaking to the electronic device at close range (a sketch of this comparison follows).
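A minimal sketch of the amplitude-difference criterion, assuming the per-microphone signals are already time-aligned and band-limited to the speech band; the 10 dB spread threshold is an assumption:

```python
import numpy as np

def close_talk(mic_signals, ratio_db=10.0):
    """mic_signals: list of 1-D numpy arrays, one per microphone.
    Returns (is_close_range, index of the responding microphone)."""
    rms = [np.sqrt(np.mean(s.astype(float) ** 2)) + 1e-12 for s in mic_signals]
    # A mouth a few centimetres from one microphone makes its level much
    # higher than the others'; a large spread therefore implies close range.
    spread_db = 20 * np.log10(max(rms) / min(rms))
    responding = int(np.argmax(rms))  # the "responding microphone"
    return spread_db > ratio_db, responding
```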
The example above may further include:
defining the microphone with the largest voice-signal amplitude among the multiple microphones as the responding microphone, and
handling the user's voice input differently according to which microphone responds. For example, when the responding microphone is at the bottom of the smartphone, the voice assistant on the smartphone is activated; when it is at the top, the recorder function is activated and the user's speech is recorded to the storage device.
In one example, determining whether the user is speaking to the electronic device at close range includes: using a pre-trained machine learning model to process the sound signals of the multiple microphones and determine whether the user is speaking to the electronic device at close range. In general, training sample data are prepared and used to train the chosen machine learning model; in actual application (sometimes called testing), the sound signals captured by the multiple microphones are fed to the model as test samples, and its output indicates whether the user is speaking to the electronic device at close range. As examples, the machine learning model may be a deep learning neural network, a support vector machine, a decision tree, and so on. A sketch of such a train-and-classify pipeline follows.
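The sketch below illustrates the pipeline with a support vector machine from scikit-learn (one of the model families named above, and the library is an assumption); the two features per microphone are illustrative choices, not the application's feature set.

```python
import numpy as np
from sklearn.svm import SVC  # any classifier would do; SVM is one option named

def features(mics, fs=16000):
    """Per-microphone RMS level and spectral centroid, concatenated."""
    out = []
    for s in mics:
        spec = np.abs(np.fft.rfft(s))
        freqs = np.fft.rfftfreq(len(s), 1 / fs)
        out.append(np.sqrt(np.mean(s ** 2)))                          # level
        out.append(float(np.sum(freqs * spec) / (np.sum(spec) + 1e-12)))
    return out

def train(clips, labels):
    """clips: list of multi-mic recordings; labels: 1 = close-range speech."""
    X = np.array([features(m) for m in clips])
    return SVC().fit(X, labels)

def is_close_talk(model, mics):
    return bool(model.predict([features(mics)])[0])
```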
In one example, the user's speech includes: speech at normal volume, speech at low volume, and speech produced without vibrating the vocal cords.
In one example, processing the sound signal as the user's voice input includes one or more of the following: storing the sound signal on a storage medium of the electronic device; sending the sound signal over the Internet; recognizing the speech in the sound signal as text and storing it on a storage medium of the electronic device; recognizing the speech in the sound signal as text and sending it over the Internet; recognizing the speech in the sound signal as text, understanding the user's voice command, and performing the corresponding operation.
In one example, the method further includes identifying a specific user through voiceprint analysis and processing only sound signals containing that user's voice.
As examples, the electronic device is a smartphone, a smart watch, a smart ring, a tablet computer, or the like.
This embodiment uses the differences between the sound signals of different built-in microphones to recognize whether the user is speaking to the electronic device at close range and then decides whether to start voice input; it has the advantages of reliable recognition and a simple computation.
三、基于低声说话方式识别的语音输入触发
低声说话是指说话音量小于正常说话(比如与他人正常交谈)音量的 方式。低声说话包括两种方式。一种是声带不震动的低声说话(俗称悄悄话),另一种是声带发生震动的低声说话。在声带不震动的低声说话方式下,产生的声音主要包含空气通过喉部、嘴部发出的声音以及嘴内舌头牙齿发出的声音。在声带震动的低声说话方式下,发出的声音除了包含声带不震动的低声说话方式下产生的声音,还包括声带震动产生的声音。但相比于正常音量的说话方式,声带震动的低声说话过程中,声带震动程度较小,产生的声带震动声音较小。声带不震动低声说话产生的声音和声带震动产生的声音的频率范围不同,可以区分。声带震动低声说话和声带震动的正常音量说话可以通过音量阈值来区分,具体的阈值可以提前设定,也可以由用户来设定。
Example method: filter the speech signal captured by the microphone to extract two components, the component V1 produced by vocal-cord vibration and the component V2 produced by air passing through the throat and mouth and by the tongue and teeth inside the mouth. When the energy ratio of V1 to V2 is below a certain threshold, the user is judged to be whispering.
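A minimal sketch of this filtering step, assuming a 16 kHz sample rate, a roughly 80-300 Hz band for the vocal-cord component V1, a 1-7 kHz band for the airflow and articulation component V2, and a ratio threshold of 0.3; none of these values come from the application:

```python
import numpy as np
from scipy.signal import butter, sosfilt

FS = 16_000  # assumed sample rate

def band_energy(x, lo_hz, hi_hz, fs=FS):
    """Energy of x within [lo_hz, hi_hz], via a 4th-order Butterworth bandpass."""
    sos = butter(4, [lo_hz, hi_hz], btype="bandpass", fs=fs, output="sos")
    y = sosfilt(sos, x)
    return float(np.sum(y * y))

def is_whisper(x, ratio_threshold=0.3):
    v1 = band_energy(x, 80, 300)      # vocal-cord vibration band (assumed)
    v2 = band_energy(x, 1000, 7000)   # throat/mouth airflow band (assumed)
    return v1 / (v2 + 1e-12) < ratio_threshold
```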
In general, whispering can only be detected when the user is fairly close to the microphone, for example within 30 centimeters. Defining close-range whispering as voice input gives the user an interaction that is easy to learn, easy to understand, and convenient to operate, and it removes the need for an explicit wake-up action such as pressing a dedicated wake button or uttering a wake word. Moreover, in the vast majority of real usage situations this approach is not falsely triggered.
Fig. 3 shows the overall flowchart of a voice input triggering method based on whisper recognition, performed by an electronic device equipped with a microphone according to an embodiment of the present application. The device has a memory and a central processing unit; the memory stores computer-executable instructions which, when executed by the central processing unit, carry out the voice input triggering method of this embodiment.
As shown in Fig. 3, in step S301, it is judged whether the sound signal captured by the microphone contains a speech signal.
In step S302, in response to confirming that the sound signal contains a speech signal, it is judged whether the user is whispering, i.e. speaking at lower than normal volume.
In step S303, in response to determining that the user is whispering, the sound signal is processed as voice input without any wake-up action.
Whispering may include both unvoiced whispering (without vocal-cord phonation) and voiced whispering.
In one example, the voice input triggering method further includes: in response to determining that the user is whispering, judging whether the whisper is unvoiced or voiced, and processing the sound signal differently according to the result.
As an example, the different processing consists of handing the voice input to different applications: normal speech controls the phone's voice assistant, whispering controls WeChat, and unvoiced speech produces transcribed voice notes.
As examples, the signal features used to judge whether the user is whispering may include volume, spectral features, and energy distribution.
As examples, the signal features used to judge whether a whisper is unvoiced or voiced include volume, spectral features, and energy distribution.
As an example, judging whether the user is whispering may include: processing the microphone's sound signal with a machine learning model to judge whether the user is whispering.
As examples, the machine learning model may be a convolutional neural network or a recurrent neural network.
As an example, judging whether the whisper is unvoiced or voiced includes: processing the microphone's sound signal with a machine learning model to make that judgment.
As examples, processing the sound signal as the user's voice input includes one or more of the following:
storing the sound signal on a storage medium of the electronic device;
sending the sound signal out over the Internet;
recognizing the speech in the sound signal as text and storing the text on a storage medium of the electronic device;
recognizing the speech in the sound signal as text and sending the text out over the Internet;
recognizing the speech in the sound signal as text, interpreting the user's voice command, and performing the corresponding operation.
As an example, the voice input triggering method may further include: identifying a specific user by voiceprint analysis and processing only sound signals that contain that user's speech.
As examples, the electronic device may be a smartphone, smart watch, smart ring, or the like.
For whispering modes and their detection, see as examples the following references:
Zhang, Chi, and John H. L. Hansen. "Analysis and classification of speech mode: whispered through shouted." Eighth Annual Conference of the International Speech Communication Association, 2007.
Meenakshi, G. Nisha, and Prasanta Kumar Ghosh. "Robust whisper activity detection using long-term log energy variation of sub-band signal." IEEE Signal Processing Letters 22.11 (2015): 1859-1863.
4. Voice input triggering based on a distance judgment from the microphone's sound signal
The overall flowchart of a voice input triggering method based on a distance judgment from the microphone's sound signal is described below with reference to Fig. 4.
As shown in Fig. 4, in step 401, the sound signal captured by the microphone is processed to judge whether it contains a speech signal.
In step 402, in response to confirming that a speech signal is present, it is further judged from the microphone's sound signal whether the distance between the smart electronic device and the user's mouth is below a predetermined threshold.
In step 403, in response to determining that the distance between the device and the user's mouth is below the predetermined threshold, the sound signal captured by the microphone is processed as voice input.
In one example, the predetermined threshold is 10 centimeters.
The speech signal may include one or a combination of the following: sound produced by the user speaking at normal volume; sound produced by the user whispering; sound produced by the user speaking without vocal-cord phonation.
In one example, the features used to judge whether the distance between the smart electronic device and the user's mouth is below the predetermined threshold include time-domain and frequency-domain features of the sound signal, including volume and spectral energy.
In one example, this judgment includes: processing the data captured by the microphone with a deep neural network model to judge whether the distance between the device and the user's mouth is below the predetermined threshold.
In one example, this judgment includes: recording the user's speech signal when no voice input is being made, comparing the speech signal currently captured by the microphone with that recording, and, if the current signal's volume exceeds the non-input signal's volume by a certain threshold, judging that the distance between the device and the user's mouth is below the predetermined threshold (a sketch follows).
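A minimal sketch of this baseline comparison; the 12 dB margin, the smoothing factor, and the running-average baseline are illustrative assumptions:

```python
import numpy as np

class ProximityByLoudness:
    def __init__(self, margin_db=12.0, alpha=0.05):
        self.baseline_db = None       # level of the user's non-input speech
        self.margin_db = margin_db
        self.alpha = alpha            # smoothing factor for the baseline

    @staticmethod
    def level_db(x):
        return 20 * np.log10(np.sqrt(np.mean(np.square(x))) + 1e-12)

    def update_baseline(self, x):
        """Feed speech captured while the user is NOT doing voice input."""
        lvl = self.level_db(x)
        self.baseline_db = (lvl if self.baseline_db is None
                            else (1 - self.alpha) * self.baseline_db + self.alpha * lvl)

    def is_close(self, x):
        """True when current speech is much louder than the recorded baseline."""
        return (self.baseline_db is not None
                and self.level_db(x) - self.baseline_db > self.margin_db)
```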
In one example, processing the sound signal as the user's voice input includes one or more of the following: storing the sound signal on a storage medium of the electronic device; sending the sound signal out over the Internet; recognizing the speech in the sound signal as text and storing it on a storage medium of the electronic device; recognizing the speech as text and sending it out over the Internet; recognizing the speech as text, interpreting the user's voice command, and performing the corresponding operation.
In one example, voice input triggering further includes identifying a specific user by voiceprint analysis and processing only sound signals that contain that user's speech.
In one example, the electronic device is a smartphone, smart watch, smart ring, or the like.
Figs. 5 to 8 show examples in which the user places the microphone of a smart portable electronic device close to the mouth, whereupon the user's speech is taken as voice input. Figs. 5 and 6 show phones with a microphone at the top; in these cases, a user who intends voice interaction can move the phone's microphone to within 0-10 centimeters of the mouth and simply speak, and the speech serves as voice input. Fig. 7 shows a phone with a microphone at the bottom, analogous to the top-microphone case; the two postures are not mutually exclusive, and if the phone has microphones at both top and bottom, either posture supports the interaction scheme. Fig. 8 shows the corresponding case of a smart watch, analogous to the phone cases above. These descriptions of trigger postures are illustrative, not exhaustive, and are not limited to the devices and microphone arrangements disclosed.
As a specific embodiment using a single microphone to receive sound and trigger voice input, the sound received by the single microphone is first analyzed to judge whether it is speech; near-field characteristics of the speech, such as plosive pops at the microphone, near-field wind noise, breath sounds, energy, spectral features, and time-domain features, are analyzed to judge whether the distance between the device and the user's mouth is below a given threshold; and voiceprint recognition judges whether the voice input comes from a user entitled to the service. These judgments are combined to decide whether to treat the microphone signal as voice input (a sketch of the combination follows).
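A minimal sketch of how these single-microphone judgments can be combined; the three predicates stand in for the detectors named above (speech detection, near-field feature analysis, voiceprint verification), and their names and the 0.5 cutoff are assumptions:

```python
def should_treat_as_voice_input(frame, detect_speech, near_field_score, verify_speaker):
    """frame: one buffer of microphone samples; the three callables are detectors."""
    if not detect_speech(frame):          # no speech at all: never trigger
        return False
    if near_field_score(frame) < 0.5:     # mouth not judged close enough (assumed cutoff)
        return False
    return verify_speaker(frame)          # serve only the enrolled user
```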
As a specific embodiment using two microphones to receive sound and trigger voice input, differences between the two microphones' input signals, such as energy and spectral features, are analyzed to judge whether the sound source is close to one of the microphones; the inter-microphone differences are used to suppress environmental noise and separate the speech into the corresponding single channel; the single-microphone feature analysis described above then judges whether the distance between the device and the user's mouth is below the given threshold; and voiceprint recognition judges whether the voice input comes from a user entitled to the service. These judgments are combined to decide whether to treat the signal as voice input.
As a specific embodiment using a multi-microphone array to receive sound and trigger voice input, the differences between the signals received by the different microphones are compared and analyzed; the near-field speech signal is separated from the environment to recognize and detect whether the sound contains speech; sound-source localization with the microphone array judges whether the distance between the user's mouth and the device is below the predetermined threshold; and voiceprint recognition judges whether the voice input comes from a user entitled to the service. These judgments are combined to decide whether to treat the signal as voice input.
In one example, when the smart portable electronic device, by analyzing the speech signal, detects that the sound originates near the device itself, i.e. the mobile device is close to the user's mouth, the device takes the sound signal as voice input and, depending on the task and context, applies natural language processing to understand the user's voice input and complete the corresponding task.
The microphone is not limited to the foregoing examples and may include one or a combination of the following: a single built-in microphone; two built-in microphones; a built-in multi-microphone array; an external wireless microphone; and an external wired microphone.
As noted above, the smart portable electronic device may be a mobile phone equipped with binaural Bluetooth earphones, a wired headset with a microphone, or another microphone sensor.
The smart portable electronic device may also be a smart wearable device such as a smart watch, smart ring, or wristwatch.
The smart portable electronic device may further be a head-mounted smart display device equipped with a microphone or a multi-microphone group.
In one example, after the electronic device activates the voice input application, it may produce feedback output, including one or a combination of vibration, speech, and images.
The solutions of the various embodiments of the present application can provide one or more of the following advantages:
1. More natural interaction. Placing the device in front of the mouth is itself the trigger for voice input, matching users' habits and intuition.
2. Higher efficiency of use. The device can be used with one hand; there is no need to switch between user interfaces or applications, or to hold down a button: simply raising the hand to the mouth is enough.
3. High recording quality. The device's recorder is right at the user's mouth, so the captured voice input is clear and little affected by environmental sound.
4. Good privacy and social acceptability. With the device at the mouth, the user can produce high-quality voice input with a relatively quiet voice, disturbing others little; postures such as covering the mouth are also possible, giving good privacy protection.
The embodiments of the present application have been described above; the description is illustrative, not exhaustive, and is not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the embodiments described. Accordingly, the scope of protection of the present application shall be determined by the claims.

Claims (17)

  1. An electronic device configured with multiple microphones, the electronic device having a memory and a central processing unit, the memory storing computer-executable instructions which, when executed by the central processing unit, perform the following operations:
    analyzing sound signals captured by the multiple microphones,
    judging whether a user is speaking to the electronic device at close range, and
    in response to determining that the user is speaking to the electronic device at close range, processing the sound signals captured by the microphones as the user's voice input.
  2. The electronic device according to claim 1, wherein the multiple microphones form a microphone array system.
  3. The electronic device according to claim 2, wherein judging whether the user is speaking to the electronic device at close range comprises:
    computing the position of the user's mouth relative to the microphone array from the time differences between the sound signals arriving at the array's microphones, and
    determining that the user is speaking to the electronic device at close range when the distance between the user's mouth and the electronic device is below a certain threshold.
  4. The electronic device according to claim 3, wherein the distance threshold is 10 centimeters.
  5. The electronic device according to claim 3, wherein processing the sound signal as the user's voice input comprises:
    handling the user's voice input differently according to the distance between the speaker's mouth and the electronic device.
  6. The electronic device according to claim 1, wherein judging whether the user is speaking to the electronic device at close range comprises:
    judging whether the sound signal captured by at least one microphone contains the user's speech signal,
    in response to determining that at least one microphone's sound signal contains the user's speech signal, extracting the speech signal from the microphones' sound signals,
    judging whether the amplitude difference between the speech signals extracted from different microphones exceeds a predetermined threshold, and
    in response to determining that the amplitude difference exceeds the predetermined threshold, confirming that the user is speaking to the electronic device at close range.
  7. The electronic device according to claim 6, further comprising:
    defining the microphone with the largest speech-signal amplitude among the multiple microphones as the responding microphone, and
    handling the user's voice input differently according to which microphone is the responding microphone.
  8. The electronic device according to claim 1, wherein judging whether the user is speaking to the electronic device at close range comprises:
    processing the sound signals of the multiple microphones with a previously trained machine learning model to judge whether the user is speaking to the electronic device at close range.
  9. The electronic device according to claim 1, wherein the user's speech comprises:
    speech at normal volume,
    speech at low volume, and
    speech produced without vocal-cord phonation.
  10. The electronic device according to claim 1, further comprising:
    in response to determining that the user is speaking to the electronic device at close range,
    judging in which of the following modes the user is speaking:
    at normal volume,
    at low volume, or
    without vocal-cord phonation; and
    processing the sound signal differently according to the result of the judgment.
  11. The electronic device according to claim 10, wherein the different processing consists of activating different applications to process the voice input.
  12. The electronic device according to claim 10, wherein the features used for the judgment include volume, spectral features, and energy distribution.
  13. The electronic device according to claim 1, wherein processing the sound signal as the user's voice input comprises one or more of the following:
    storing the sound signal on a storage medium of the electronic device;
    sending the sound signal out over the Internet;
    recognizing the speech in the sound signal as text and storing the text on a storage medium of the electronic device;
    recognizing the speech in the sound signal as text and sending the text out over the Internet;
    recognizing the speech in the sound signal as text, interpreting the user's voice command, and performing the corresponding operation.
  14. The electronic device according to claim 1, further comprising identifying a specific user by voiceprint analysis, wherein only sound signals containing that specific user's speech are processed.
  15. The electronic device according to claim 1, wherein the electronic device is one of a smartphone, a smart watch, a smart ring, and a tablet computer.
  16. A voice interaction wake-up method performed by an electronic device configured with multiple microphones, the electronic device having a memory and a central processing unit, the memory storing computer-executable instructions which, when executed by the central processing unit, perform the voice interaction wake-up method, the method comprising:
    analyzing sound signals captured by the multiple microphones,
    judging whether a user is speaking to the electronic device at close range, and
    in response to determining that the user is speaking to the electronic device at close range, processing the sound signals captured by the microphones as the user's voice input.
  17. A computer-readable medium storing computer-executable instructions which, when executed by a computer, perform a voice interaction wake-up method, the method comprising:
    analyzing sound signals captured by multiple microphones of an electronic device,
    judging whether a user is speaking to the electronic device at close range, and
    in response to determining that the user is speaking to the electronic device at close range, processing the sound signals captured by the microphones as the user's voice input.
PCT/CN2020/092067 2019-06-03 2020-05-25 基于麦克风信号的语音交互唤醒电子设备、方法和介质 WO2020244402A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910475972.9 2019-06-03
CN201910475972.9A CN110428806B (zh) 2019-06-03 2019-06-03 基于麦克风信号的语音交互唤醒电子设备、方法和介质

Publications (1)

Publication Number Publication Date
WO2020244402A1 true WO2020244402A1 (zh) 2020-12-10

Family

ID=68408446

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/092067 WO2020244402A1 (zh) 2019-06-03 2020-05-25 基于麦克风信号的语音交互唤醒电子设备、方法和介质

Country Status (2)

Country Link
CN (1) CN110428806B (zh)
WO (1) WO2020244402A1 (zh)


Also Published As

Publication number Publication date
CN110428806B (zh) 2023-02-24
CN110428806A (zh) 2019-11-08


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20819549

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20819549

Country of ref document: EP

Kind code of ref document: A1