WO2020244355A1 - Microphone signal-based voice interaction wake-up electronic device, method, and medium - Google Patents

Microphone signal-based voice interaction wake-up electronic device, method, and medium Download PDF

Info

Publication number
WO2020244355A1
Authority
WO
WIPO (PCT)
Prior art keywords
electronic device
user
voice
microphone
signal
Prior art date
Application number
PCT/CN2020/089551
Other languages
French (fr)
Chinese (zh)
Inventor
Shi Yuanchun (史元春)
Yu Chun (喻纯)
Original Assignee
Tsinghua University (清华大学)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Publication of WO2020244355A1 publication Critical patent/WO2020244355A1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/26 Speech to text systems
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L 25/30 Speech or voice analysis techniques using neural networks
    • G10L 25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L 25/51 Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L 25/78 Detection of presence or absence of voice signals
    • G10L 2015/025 Phonemes, fenemes or fenones being the recognition units
    • G10L 2015/223 Execution procedure of a spoken command
    • G10L 2015/225 Feedback of the input speech
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 30/00 Reducing energy consumption in communication networks
    • Y02D 30/70 Reducing energy consumption in wireless communication networks

Definitions

  • This application relates to the field of voice input, and more specifically, to smart electronic devices and voice input triggering methods.
  • physical button triggering: after pressing (or holding down) one or more physical buttons of the mobile device, voice input is activated.
  • the device needs to have a screen; the trigger element occupies screen content; limitations of the software UI may make the trigger method cumbersome; and it is easy to trigger by mistake.
  • the device activates voice input after detecting the corresponding wake-up word.
  • the wake-up word is a specific word, such as a product nickname.
  • a smart electronic device with a built-in microphone interacts with a user based on voice input as follows: the sound signal captured by the microphone is processed to determine whether it contains a voice signal; after confirming that a voice signal is present, it is further determined, based on the sound signal collected by the microphone, whether the distance between the smart electronic device and the user's mouth is less than a predetermined threshold; and, in response to determining that the distance is less than the predetermined threshold, the sound signal collected by the microphone is processed as voice input.
  • the predetermined threshold is 3 cm.
  • the predetermined threshold is 1 cm.
  • a proximity light sensor determines whether an object is approaching the electronic device.
  • a distance sensor is arranged at the microphone of the electronic device, and the distance between the electronic device and the user's mouth is measured directly by that sensor.
  • whether the distance between the smart electronic device and the user's mouth is less than the predetermined threshold is determined from the characteristics of the sound signal collected by the microphone.
  • the voice signal includes one or a combination of the following: the sound of the user speaking at normal volume; the sound of the user speaking in a low voice; the sound of the user speaking without vocal-cord vibration.
  • the electronic device is further operable: in response to determining that the user is speaking to the electronic device at close range, to determine in which of the following ways the user is speaking: at normal volume; in a low voice; or unvoiced, without vocal-cord vibration; and to process the voice signal differently according to the result of this judgment.
  • the different processing is activating different applications to process voice input.
  • the characteristics used in the judgment include volume, frequency spectrum characteristics, energy distribution, and the like.
  • the features used when judging whether the distance between the smart electronic device and the user's mouth is less than a predetermined threshold include time-domain and frequency-domain features of the sound signal, such as volume and spectral energy.
  • judging whether the distance between the smart electronic device and the user's mouth is less than a predetermined threshold includes: extracting the voice signal from the sound signal collected by the microphone through a filter; judging whether the energy of the voice signal exceeds a certain threshold; and, in response to the voice-signal energy exceeding that threshold, determining that the distance between the electronic device and the user's mouth is less than the predetermined threshold.
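The filter-plus-energy check above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the 300-3400 Hz speech band and the -30 dBFS energy threshold are assumed values chosen for the example.

```python
import numpy as np

def is_close_talk(signal, sr=16000, band=(300, 3400), energy_db_threshold=-30.0):
    """Band-limit the captured frame to the speech band (a crude
    stand-in for the patent's filter), then compare the in-band RMS
    level in dBFS against a threshold: a mouth within a few
    centimetres of the microphone produces a much hotter signal."""
    signal = np.asarray(signal, dtype=float)
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    mask = (freqs >= band[0]) & (freqs <= band[1])
    # Parseval's theorem: in-band energy from the one-sided spectrum
    band_energy = np.sum(np.abs(spectrum[mask]) ** 2) / len(signal) ** 2
    rms = np.sqrt(2.0 * band_energy)  # factor 2: one-sided spectrum
    db = 20.0 * np.log10(max(rms, 1e-12))
    return bool(db > energy_db_threshold)
```

In practice the threshold would need calibration per device and microphone gain.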
  • the judging whether the distance between the smart electronic device and the user's mouth is less than a predetermined threshold includes: using a deep neural network model to process data collected by a microphone to determine whether the distance between the smart electronic device and the user's mouth is less than a predetermined threshold.
  • judging whether the distance between the smart electronic device and the user's mouth is less than a predetermined threshold includes: recording the user's voice signal while the user is not making voice input; comparing the voice signal currently collected by the microphone with that baseline; and, if the volume of the currently collected voice signal exceeds the baseline volume by a certain threshold, determining that the distance between the smart electronic device and the user's mouth is less than the predetermined threshold.
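The baseline comparison reduces to a volume ratio test, sketched below; the 4x (roughly 12 dB) ratio threshold is an assumed illustrative value, not taken from the patent.

```python
import numpy as np

def rms(signal):
    """Root-mean-square amplitude of a frame."""
    return float(np.sqrt(np.mean(np.asarray(signal, dtype=float) ** 2)))

def exceeds_baseline(current, baseline, ratio_threshold=4.0):
    """Compare the current frame's volume with a baseline recorded
    while the user was NOT making voice input. A large ratio suggests
    the mouth has moved close to the microphone."""
    baseline_rms = max(rms(baseline), 1e-12)
    return rms(current) / baseline_rms > ratio_threshold
```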
  • processing the sound signal as the user's voice input includes one or more of the following: storing the sound signal on a storage medium of the electronic device; sending the sound signal over the Internet; recognizing the voice signal in the sound signal as text and storing it on the storage medium of the electronic device; recognizing the voice signal as text and sending it over the Internet; recognizing the voice signal as text, understanding the user's voice command, and performing the corresponding operation.
  • the electronic device also recognizes a specific user through voiceprint analysis, and only processes the voice signal containing the voice of the specific user.
  • the electronic device is a smart phone, a smart watch, a smart ring, etc.
  • the mobile devices here include, but are not limited to, mobile phones, head-mounted displays, watches, and smaller smart wearable devices such as smart rings.
  • a voice input triggering method executed by a smart electronic device equipped with a microphone includes the following operations for interacting with a user based on voice input: processing the sound signal captured by the microphone to determine whether it contains a voice signal; in response to confirming that a voice signal is present, further determining, based on the sound signal collected by the microphone, whether the distance between the smart electronic device and the user's mouth is less than a predetermined threshold; and, in response to determining that the distance is less than the predetermined threshold, processing the sound signal collected by the microphone as voice input.
  • a computer-readable medium stores computer-executable instructions which, when executed by a computer, perform a voice interaction wake-up method.
  • the voice interaction wake-up method includes: processing the sound signal captured by the microphone to determine whether it contains a voice signal; in response to confirming that a voice signal is present, further determining, based on the sound signal collected by the microphone, whether the distance between the smart electronic device and the user's mouth is less than a predetermined threshold; and, in response to determining that the distance is less than the predetermined threshold, processing the sound signal collected by the microphone as voice input.
  • an electronic device equipped with a microphone.
  • the electronic device has a memory and a central processing unit.
  • the memory stores computer executable instructions.
  • when executed by the central processing unit, the computer-executable instructions perform the following operations: analyze the sound signal collected by the microphone to identify whether it contains human speech and whether it contains wind noise generated by the airflow of that speech hitting the microphone; and, in response to determining that the sound signal contains both the speech and the speech-generated wind noise, process the sound signal as the user's voice input.
  • the voice spoken by the user includes: speech at normal volume, speech at low volume, and unvoiced speech produced without vocal-cord vibration.
  • the electronic device is further operable: in response to determining that the user is speaking to the electronic device at close range, to determine in which of the following ways the user is speaking: at normal volume; in a low voice; or unvoiced, without vocal-cord vibration; and to process the sound signal differently according to the result of this judgment.
  • the different processing is activating different applications to process voice input.
  • the characteristics used in the judgment include volume, frequency spectrum characteristics, energy distribution, and the like.
  • processing the sound signal as the user's voice input includes one or more of the following: storing the sound signal on a storage medium of the electronic device; sending the sound signal over the Internet; recognizing the voice signal in the sound signal as text and storing it on the storage medium of the electronic device; recognizing the voice signal as text and sending it over the Internet; recognizing the voice signal as text, understanding the user's voice command, and performing the corresponding operation.
  • the electronic device is also operable to identify a specific user through voiceprint analysis, and only process the voice signal containing the voice of the specific user.
  • the electronic device is one of a smart phone, a smart watch, and a smart ring.
  • the electronic device is further operable to use a neural network model to determine whether the sound signal contains the user's speech and the wind noise generated by the airflow of that speech hitting the microphone.
  • recognizing whether the sound signal contains human speech and speech-generated wind noise includes: recognizing whether the sound signal contains the user's speech; in response to determining that it does, recognizing the phonemes in the speech and representing the voice signal as a phoneme sequence; for each phoneme in the sequence, determining whether it is an exhaled phoneme, i.e. whether airflow leaves the mouth when the user utters it; cutting the sound signal into a sequence of segments of fixed window length; using frequency-domain features to identify whether each segment contains wind noise; and comparing the exhaled phonemes in the phoneme sequence against the segments identified as wind noise, while at the same time comparing the non-exhaled phonemes against the wind-noise segments.
  • when the two are consistent, it is determined that the sound signal contains wind noise generated by the airflow of the user's speech hitting the microphone.
  • recognizing whether the sound signal contains human speech and speech-generated wind noise may also include: recognizing the acoustic characteristics of wind noise in the sound signal; in response to determining that the sound signal contains wind noise, recognizing whether it contains a voice signal; in response to determining that it contains a voice signal, identifying the corresponding phoneme sequence; computing the wind-noise feature intensity at each moment; obtaining, for each phoneme, its exhalation intensity from a predefined data model; and analyzing the consistency of the wind-noise features with the phoneme sequence using a Gaussian mixture Bayesian model. When the degree of agreement is higher than a certain threshold, it is determined that the sound signal contains wind noise generated by the airflow of the user's speech hitting the microphone.
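A toy version of this consistency check might look like the following. For simplicity it replaces the patent's Gaussian mixture Bayesian model with a plain correlation between the expected per-phoneme exhalation profile and the measured per-segment wind-noise intensity; the phoneme table and threshold are illustrative assumptions, not the patent's data model.

```python
import numpy as np

# Hypothetical per-phoneme exhalation strengths: aspirated consonants
# push more air toward the microphone. Values are illustrative only.
EXHALATION = {"p": 1.0, "t": 0.9, "k": 0.8, "h": 0.7,
              "a": 0.2, "i": 0.1, "m": 0.0, "n": 0.0}

def speech_wind_consistency(phonemes, wind_intensity, threshold=0.5):
    """Correlate the expected exhalation profile of the recognized
    phoneme sequence with the measured per-segment wind-noise
    intensity. High correlation suggests the wind noise was produced
    by the user's own speech hitting the microphone."""
    expected = np.array([EXHALATION.get(p, 0.0) for p in phonemes])
    wind = np.asarray(wind_intensity, dtype=float)
    if expected.std() == 0 or wind.std() == 0:
        return False
    corr = float(np.corrcoef(expected, wind)[0, 1])
    return bool(corr > threshold)
```

Wind noise that tracks the exhaled phonemes passes the check; ambient wind, which is uncorrelated with the phoneme sequence, does not.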
  • an electronic device with a plurality of built-in microphones.
  • the electronic device has a memory and a central processing unit.
  • the memory stores computer-executable instructions.
  • when executed by the central processing unit, the computer-executable instructions perform the following operations: analyze the sound signals collected by the multiple microphones; determine whether the user is speaking to the electronic device at close range; and, in response to determining that the user is speaking at close range, process the sound signal collected by the microphones as the user's voice input.
  • a plurality of microphones constitute a microphone array system.
  • judging whether the user is speaking to the electronic device at close range includes: calculating the position of the user's mouth relative to the microphone array from the time differences with which the sound signal arrives at each microphone on the array; and, when the distance between the mouth and the device is less than a certain threshold, determining that the user is speaking to the electronic device at close range.
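The time-difference calculation underlying this localization can be sketched with a simple cross-correlation TDOA estimate between two channels; a real implementation would use all microphone pairs and, for example, GCC-PHAT weighting, neither of which is specified by the patent.

```python
import numpy as np

def tdoa_samples(sig_a, sig_b):
    """Estimate the arrival-time difference t_A - t_B (in samples)
    between two microphone channels via cross-correlation; a negative
    value means the sound reached microphone A first. The sample lag
    converts to a path-length difference as lag / sr * 343 m/s."""
    corr = np.correlate(np.asarray(sig_a, dtype=float),
                        np.asarray(sig_b, dtype=float), mode="full")
    return int(np.argmax(corr)) - (len(sig_b) - 1)
```

With three or more microphones, the pairwise delays constrain the source position; a mouth only centimetres away produces delay patterns inconsistent with any far-field direction, which is what makes near-field detection possible.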
  • the distance threshold is 10 cm.
  • the processing the voice signal as the user's voice input includes: performing different processing on the user's voice input according to the distance between the speaker's mouth and the electronic device.
  • judging whether the user is speaking to the electronic device at close range includes: judging whether the sound signal collected by at least one microphone contains the voice of the user speaking; in response to determining that it does, extracting the voice signal from each microphone's sound signal; judging whether the amplitude difference between the voice signals extracted from different microphones exceeds a predetermined threshold; and, in response to determining that the amplitude difference exceeds the threshold, confirming that the user is speaking to the electronic device at close range.
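The amplitude-difference test can be sketched as follows; the 6 dB inter-channel level threshold is an assumed illustrative value (near-field sources produce much larger level differences between microphones than far-field ones).

```python
import numpy as np

def close_talk_by_level_difference(mic_signals, db_threshold=6.0):
    """If the mouth is centimetres from one microphone, that channel
    is much louder than the others due to near-field level fall-off.
    A spread above `db_threshold` dB between the loudest and quietest
    channel indicates close-range speech."""
    levels = []
    for sig in mic_signals:
        rms = np.sqrt(np.mean(np.asarray(sig, dtype=float) ** 2))
        levels.append(20.0 * np.log10(max(rms, 1e-12)))
    return bool(max(levels) - min(levels) > db_threshold)
```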
  • the electronic device is further operable to: define the microphone with the largest voice-signal amplitude among the multiple microphones as the response microphone, and process the user's voice input differently according to which microphone responds.
  • the judging whether the user is speaking into the electronic device at close range includes: using a machine learning model trained in advance to process the sound signals of multiple microphones to judge whether the user is speaking into the electronic device at close range.
  • the voice spoken by the user includes: the voice of the user speaking at normal volume; the voice of the user speaking at low volume; and the voice of the user speaking in an unvoiced manner.
  • the electronic device is further operable: in response to determining that the user is speaking to the electronic device at close range, to determine in which of the following ways the user is speaking: at normal volume; in a low voice; or unvoiced, without vocal-cord vibration; and to process the voice signal differently according to the result of this judgment.
  • the different processing is activating different applications to process voice input.
  • the characteristics used in the judgment include volume, spectral characteristics, energy distribution, and the like.
  • processing the sound signal as the user's voice input includes one or more of the following: storing the sound signal on a storage medium of the electronic device; sending the sound signal over the Internet; recognizing the voice signal in the sound signal as text and storing it on the storage medium of the electronic device; recognizing the voice signal as text and sending it over the Internet; recognizing the voice signal as text, understanding the user's voice command, and performing the corresponding operation.
  • the electronic device is also operable to identify a specific user through voiceprint analysis, and only process the voice signal containing the voice of the specific user.
  • the electronic device is one of a smart phone, a smart watch, a smart ring, and a tablet computer.
  • an electronic device with a built-in microphone has a memory and a central processing unit.
  • the memory stores computer executable instructions.
  • when executed by the central processing unit, the computer-executable instructions perform the following operations: determine whether the sound signal collected by the microphone contains a voice signal; in response to confirming that it does, determine whether the user is speaking in a low voice, i.e. at lower than normal volume; and, in response to determining that the user is speaking in a low voice, process the sound signal as voice input without any wake-up operation.
  • whispered speech includes two modes: whispering without vocal-cord vibration and whispering with vocal-cord vibration.
  • the electronic device is further operable: in response to determining that the user is speaking in a low voice, to judge whether the user is whispering without vocal-cord vibration or with vocal-cord vibration, and to process the sound signal differently according to the result.
  • the different processing is to activate different applications to respond to voice input.
  • the signal characteristics used to determine whether the user is speaking in a low voice include volume, frequency spectrum characteristics, and energy distribution.
  • the signal characteristics used when judging whether the user is whispering without vocal-cord vibration or with vocal-cord vibration include volume, spectral characteristics, and energy distribution.
  • the judging whether the user is speaking in a low voice includes: using a machine learning model to process a sound signal collected by a microphone to determine whether the user is speaking in a low voice.
  • the machine learning model is a convolutional neural network model or a recurrent neural network model.
  • judging whether the user is whispering without vocal-cord vibration or with vocal-cord vibration includes: using a machine learning model to process the sound signal collected by the microphone to make this determination.
  • the machine learning model is a convolutional neural network model or a recurrent neural network model.
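As a lightweight stand-in for the neural-network classifier above, voiced versus unvoiced speech can be approximated with an autocorrelation voicing check: voiced (vocal-cord) speech shows a strong periodic peak at the pitch lag, while unvoiced whisper does not. The pitch range and peak threshold below are illustrative assumptions, not values from the patent.

```python
import numpy as np

def is_voiced(frame, sr=16000, f0_range=(75, 300), peak_threshold=0.3):
    """Return True when the frame shows a strong periodic
    autocorrelation peak within the human pitch range, indicating
    vocal-cord vibration."""
    frame = np.asarray(frame, dtype=float)
    frame = frame - frame.mean()
    # One-sided autocorrelation, normalized by lag-0 energy
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    if ac[0] <= 0:
        return False
    ac = ac / ac[0]
    lo = int(sr / f0_range[1])   # smallest pitch period in samples
    hi = int(sr / f0_range[0])   # largest pitch period in samples
    return bool(np.max(ac[lo:hi]) > peak_threshold)
```

A production system would classify whole utterances with a trained model, as the text says; this per-frame heuristic only illustrates the acoustic cue such a model can exploit.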
  • processing the sound signal as the user's voice input includes one or more of the following: storing the sound signal on a storage medium of the electronic device; sending the sound signal over the Internet; recognizing the voice signal in the sound signal as text and storing it on the storage medium of the electronic device; recognizing the voice signal as text and sending it over the Internet; recognizing the voice signal as text, understanding the user's voice command, and performing the corresponding operation.
  • the specific user is identified through voiceprint analysis, and only the voice signal containing the voice of the specific user is processed.
  • the electronic device is a smart phone, a smart watch, a smart ring, etc.
  • use efficiency is higher: the device can be used with one hand; there is no need to switch between user interfaces or applications and no need to hold down a button; the user simply raises the device to the mouth.
  • the sound pickup quality is high.
  • the device's microphone is at the user's mouth, so the received voice input signal is clear and less affected by environmental sounds.
  • Fig. 1 is a schematic flowchart of a voice input interaction method according to an embodiment of the present application
  • FIG. 2 shows an overall flowchart of a method for triggering a voice input using differences in sound signals received by multiple microphones using an electronic device equipped with multiple microphones according to another embodiment of the present application;
  • FIG. 3 shows an overall flowchart of a voice input trigger method based on whispered speech recognition for an electronic device with a built-in microphone according to an embodiment of the present application
  • Figure 4 depicts the overall flow chart of the voice input trigger method based on the distance judgment of the sound signal of the microphone
  • FIG. 5 is a front schematic diagram of placing the upper microphone of the mobile phone close to the mouth in the triggering posture according to an embodiment of the present application
  • FIG. 6 is a schematic side view of the upper microphone of the mobile phone close to the mouth in the triggering posture according to an embodiment of the present application
  • Fig. 7 is a schematic diagram of placing the lower end microphone of the mobile phone close to the mouth in the triggering posture according to an embodiment of the present application
  • Fig. 8 is a schematic diagram of placing the smart watch microphone close to the mouth in the triggering posture according to an embodiment of the present application.
  • the present disclosure is directed to voice input triggering for smart electronic devices, determining whether to trigger the voice input application based on inherent characteristics of the sound captured by the built-in microphone. No physical button, interface element, or wake-word detection is needed, so the interaction is more natural: placing the device in front of the mouth triggers voice input, which matches user habits and expectations.
  • Four trigger approaches are described: 1. voice input triggering based on the wind-noise characteristics of human speech, i.e. recognizing both the speech and the accompanying wind noise and directly treating the received sound signal as voice input; 2. voice input triggering based on differences between the sound signals received by multiple microphones; 3. voice input triggering based on whispered-speech recognition; 4. voice input triggering based on distance judgment from the microphone's sound signal.
  • when the user speaks close to the microphone, the sound signal collected by the microphone contains two components: the sound produced by the vibration of the vocal cords and oral cavity, and the wind noise generated by the airflow of the speech hitting the microphone.
  • the voice input application of the electronic device can be triggered based on this characteristic.
  • Fig. 1 shows a schematic flowchart of a voice input interaction method 100 according to an embodiment of the present application.
  • in step S101, the sound signal collected by the microphone is analyzed to identify whether it contains human speech and whether it contains wind noise generated by the airflow of that speech hitting the microphone;
  • in step S102, in response to determining that the sound signal contains both the speech and the speech-generated wind noise, the sound signal is processed as the user's voice input.
  • the voice input interaction method of this embodiment is particularly suitable for unvoiced (vocal-cord-free) input when privacy requirements are high.
  • the voice spoken by the user may include: the voice of the user speaking at a normal volume, the voice of the user speaking at a low volume, and the voice of the user speaking in a non-voicing manner.
  • the above-mentioned different ways of speaking can be recognized, and different feedbacks can be generated according to the recognition results.
  • speaking normally controls the phone's voice assistant;
  • whispering controls WeChat;
  • speaking without voicing creates voice transcription notes.
  • the processing of using a sound signal as the user's voice input includes one or more of the following:
  • it also includes identifying a specific user through voiceprint analysis, and processing only the voice signal containing the voice of the specific user.
  • the electronic device is one of a smart phone, a smart watch, and a smart ring.
  • a neural network model is used to determine whether the sound signal contains the user's speech and the wind noise generated by the airflow hitting the microphone; this is just one example, and other machine learning algorithms can be used.
  • recognizing whether the sound signal contains human speech and whether it contains speech-generated wind noise includes:
  • in response to determining that the sound signal contains the user's speech, recognizing the phonemes in the speech and representing the voice signal as a phoneme sequence;
  • recognizing whether the sound signal contains human speech and whether it contains speech-generated wind noise includes:
  • Fig. 2 shows an overall flow chart of a method for triggering a voice input using a difference in sound signals received by multiple microphones for an electronic device equipped with multiple microphones according to another embodiment of the present application.
  • An electronic device, such as a mobile phone, with multiple built-in microphones.
  • the electronic device has a memory and a central processing unit.
  • the memory stores computer executable instructions.
  • the voice input triggering method of this embodiment can be executed.
  • step S201 the sound signals collected by multiple microphones are analyzed.
  • the multiple microphones include at least three microphones forming a microphone array system; the spatial position of the sound source relative to the smart device can be estimated from the time differences with which the sound signal reaches each microphone.
  • the sound signal here includes, for example, the amplitude and frequency of the sound signal.
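As an illustration of the time-difference principle above: a minimal sketch of time-difference-of-arrival (TDOA) estimation between two synchronized microphone channels via cross-correlation. The function name, sample rate, and synthetic signals are illustrative assumptions, not part of the original disclosure.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s in air at roughly room temperature

def estimate_tdoa(sig_a, sig_b, sample_rate):
    """Estimate the time difference of arrival (seconds) of one source
    between two synchronized microphone channels via cross-correlation.
    A positive result means the source reached mic A before mic B."""
    corr = np.correlate(sig_b, sig_a, mode="full")
    lag = np.argmax(corr) - (len(sig_a) - 1)  # lag in samples
    return lag / sample_rate

# Synthetic check: the same noise burst, delayed by 10 samples on channel B.
rng = np.random.default_rng(0)
fs = 16000
burst = rng.standard_normal(512)
chan_a = np.concatenate([burst, np.zeros(64)])
chan_b = np.concatenate([np.zeros(10), burst, np.zeros(54)])

tdoa = estimate_tdoa(chan_a, chan_b, fs)
extra_path_m = tdoa * SPEED_OF_SOUND  # path-length difference implied by the delay
```

With three or more microphones, the pairwise TDOAs constrain the source position; a real array would intersect the resulting hyperbolae rather than use a single pair.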
  • in step S202, based on the sound signals collected by the multiple microphones, it is determined whether the user is speaking to the electronic device at close range.
  • determining whether the user is speaking into the electronic device at close range includes:
  • when the distance between the user's mouth and the electronic device is less than a certain threshold, it is determined that the user is speaking to the electronic device at close range.
  • the distance threshold is 10 cm.
  • in step S203, in response to determining that the user is speaking to the electronic device at close range, the sound signal collected by the microphone is processed as the user's voice input.
  • processing the sound signal as the user's voice input includes:
  • depending on the distance, the user's voice input is processed differently. For example, when the distance is 0-3 cm, the voice assistant is activated to respond to the user's voice input; when the distance is 3-10 cm, the WeChat application is activated to respond to the user's voice input and send voice messages to friends;
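The distance-band dispatch in the example above can be sketched as a simple router; the bands (0-3 cm, 3-10 cm) and handler names follow the example, while the function itself is hypothetical.

```python
def dispatch_by_distance(distance_cm):
    """Route close-range speech to a handler based on the estimated
    mouth-to-microphone distance (bands from the example above).
    Returns the name of the activated handler, or None outside 0-10 cm."""
    if 0 <= distance_cm < 3:
        return "voice_assistant"       # respond to the user's voice input
    if 3 <= distance_cm <= 10:
        return "wechat_voice_message"  # send a voice message to a friend
    return None  # not close-range speech; do not treat as voice input
```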
  • determining whether the user is speaking into the electronic device at close range includes:
  • depending on which microphone responds, the user's voice input is processed differently. For example, when the responding microphone is the one at the bottom of the smartphone, the voice assistant on the smartphone is activated; when the responding microphone is the one at the top of the smartphone, the voice recorder function is activated to record the user's voice to the storage device;
  • determining whether the user is speaking to the electronic device at close range includes: using a machine learning model trained in advance to process the sound signals from the multiple microphones. In general, training sample data are prepared and used to train the selected machine learning model. In actual application (sometimes called testing), the sound signals captured by the multiple microphones are input to the machine learning model as test samples, and the output obtained indicates whether the user is speaking to the electronic device at close range.
  • the machine learning model is, for example, a deep learning neural network, a support vector machine, or a decision tree.
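A toy sketch of the train-then-test flow described above, using per-channel energy features and a nearest-centroid classifier as a stand-in for the SVM / decision tree / neural network options; all data here are synthetic and the feature choice is an assumption.

```python
import numpy as np

def channel_features(frames):
    """Features for one frame set from a 3-microphone array:
    per-channel RMS energy plus inter-channel energy ratios.
    `frames` has shape (n_mics, n_samples)."""
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    ratios = rms / (rms.sum() + 1e-12)
    return np.concatenate([rms, ratios])

class NearestCentroid:
    """Tiny stand-in for the 'machine learning model' in the text;
    a real system might use an SVM or a neural network instead."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self
    def predict(self, X):
        d = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        return self.classes_[np.argmin(d, axis=1)]

# Synthetic training data: close-range speech is much louder on one mic.
rng = np.random.default_rng(1)
def make_frame(close):
    base = rng.standard_normal((3, 256)) * 0.01    # ambient noise on 3 mics
    if close:
        base[0] += rng.standard_normal(256) * 0.5  # near-field boost on mic 0
    else:
        base += rng.standard_normal(256) * 0.05    # far speech hits all mics alike
    return channel_features(base)

X = np.array([make_frame(close=i % 2 == 0) for i in range(40)])
y = np.array([1 if i % 2 == 0 else 0 for i in range(40)])  # 1 = close-range

model = NearestCentroid().fit(X, y)
pred_close = model.predict(make_frame(close=True)[None, :])[0]
pred_far = model.predict(make_frame(close=False)[None, :])[0]
```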
  • the voice spoken by the user includes: the user speaking at a normal volume, the user speaking at a low volume, and the user speaking silently (without vocal cord vibration).
  • the processing of the sound signal as the user's voice input includes one or more of the following: storing the sound signal in a storage medium on the electronic device; sending the sound signal over the Internet; recognizing the speech in the sound signal as text and storing it on the storage medium of the electronic device; recognizing the speech in the sound signal as text and sending it over the Internet; recognizing the speech in the sound signal as text, understanding the user's voice command, and taking the appropriate action.
  • it also includes identifying a specific user through voiceprint analysis, and processing only the voice signal containing the voice of the specific user.
  • the electronic device is a smart phone, smart watch, smart ring, tablet computer, or the like.
  • This embodiment uses the differences between the sound signals of different built-in microphones to identify whether the user is speaking to the electronic device at close range, and then decides whether to start voice input; this has the advantages of reliable recognition and a simple calculation method.
  • Whispering refers to speaking at a volume lower than that of normal speech (such as a normal conversation with others).
  • in the whispering mode without vocal cord vibration, the sound produced mainly includes the sound of air passing through the throat and mouth and over the tongue and teeth.
  • in the whispering mode with vocal cord vibration, the sound produced includes not only the components produced without vocal cord vibration, but also the sound produced by vocal cord vibration.
  • during low-voice speech with vocal cord vibration, the vocal cords vibrate less, and the vocal-cord vibration sound produced is quieter.
  • in the whispering mode without vocal cord vibration, the vocal cords do not vibrate.
  • the sound produced by whispering and the sound produced by vocal cord vibration occupy different frequency ranges and can be distinguished.
  • Low-voice speech with vocal cord vibration and normal-volume speech with vocal cord vibration can be distinguished by a volume threshold.
  • the specific threshold can be set in advance or by the user.
  • Example method: filter the sound signal collected by the microphone and extract two signal components: V1, the sound generated by vocal cord vibration, and V2, the sound generated by air passing through the throat and mouth and over the tongue and teeth.
  • When the energy ratio of V1 to V2 is less than a certain threshold, it is determined that the user is speaking in a low voice.
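A sketch of the V1/V2 energy-ratio test above, assuming the vocal-cord component concentrates below a low-frequency cutoff (here ~300 Hz) and the breath/articulation component above it; the cutoff and the ratio threshold are illustrative values, not from the original.

```python
import numpy as np

def band_energy(signal, fs, lo, hi):
    """Energy of `signal` in the [lo, hi) Hz band, via the FFT magnitude."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    return spectrum[(freqs >= lo) & (freqs < hi)].sum()

def is_whisper(signal, fs, cutoff_hz=300.0, ratio_threshold=0.5):
    """Decide 'low-voice speech' when the vocal-cord band energy V1 is
    small relative to the breath/articulation band energy V2."""
    v1 = band_energy(signal, fs, 50.0, cutoff_hz)      # vocal-cord component
    v2 = band_energy(signal, fs, cutoff_hz, fs / 2.0)  # throat/mouth airflow
    return v1 / (v2 + 1e-12) < ratio_threshold

# Synthetic check: a strong 120 Hz "pitch" vs. broadband whisper turbulence.
fs = 16000
t = np.arange(fs) / fs
rng = np.random.default_rng(2)
voiced = np.sin(2 * np.pi * 120 * t) + 0.05 * rng.standard_normal(fs)
whisper = rng.standard_normal(fs) * 0.1

voiced_flag = is_whisper(voiced, fs)
whisper_flag = is_whisper(whisper, fs)
```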
  • whispering can only be detected when the user is close to the microphone, for example when the distance is less than 30 cm.
  • Defining close-range whispering as voice input is an interaction method that is easy to learn and understand and convenient for users to operate. It avoids explicit wake-up operations, such as pressing a specific wake-up button or speaking a voice wake word, and it will not be triggered by mistake in most practical use cases.
  • Fig. 3 shows an overall flow chart of a voice input trigger method based on whispered speech recognition for an electronic device equipped with a microphone according to an embodiment of the present application.
  • An electronic device equipped with a microphone has a memory and a central processing unit, and computer executable instructions are stored on the memory. When the computer executable instructions are executed by the central processing unit, the voice input trigger method according to the embodiments of the present application can be executed.
  • in step S301, it is determined whether the sound signal collected by the microphone contains a voice signal.
  • in step S302, in response to confirming that the sound signal collected by the microphone contains a voice signal, it is determined whether the user is speaking in a low voice, that is, at a lower-than-normal volume.
  • in step S303, in response to determining that the user is speaking in a low voice, the sound signal is processed as voice input without any wake-up operation.
  • Whispering can include two ways of whispering without vocal cords and whispering with vocal cords.
  • the voice input triggering method may further include: in response to determining that the user is speaking in a low voice, judging whether the user is whispering without vocal cord vibration or speaking in a low voice with vocal cord vibration, and processing the sound signal differently according to the judgment result.
  • the different processing hands the voice input to different applications. For example:
  • speaking normally controls the mobile phone's voice assistant;
  • speaking in a low voice controls WeChat;
  • speaking without vocal cord vibration produces voice transcription notes.
  • the signal characteristics used to determine whether the user is speaking in a low voice may include volume, spectral characteristics, energy distribution, and so on.
  • the signal characteristics used to determine whether the user is whispering without vocal cord vibration or speaking in a low voice with vocal cord vibration include volume, spectral characteristics, energy distribution, and the like.
  • determining whether the user is speaking in a low voice may include: using a machine learning model to process the sound signal collected by the microphone to determine whether the user is speaking in a low voice.
  • the machine learning model may be a convolutional neural network model or a recurrent neural network model.
  • judging whether the user is whispering without vocal cord vibration or speaking in a low voice with vocal cord vibration includes: using a machine learning model to process the sound signals collected by the microphone to make the distinction.
  • the processing of using a sound signal as the user's voice input includes one or more of the following:
  • the voice input triggering method may also include: identifying a specific user through voiceprint analysis, and processing only the voice signal containing the voice of the specific user.
  • the electronic device may be a smart phone, smart watch, smart ring, etc.
  • Voice input triggering based on distance judgment from the microphone sound signal
  • The following describes the overall flowchart of the voice input triggering method based on distance judgment from the microphone sound signal with reference to Fig. 4.
  • in step 401, the sound signal captured by the microphone is processed to determine whether a voice signal is present in it.
  • in step 402, in response to confirming that a voice signal is present in the sound signal, it is further determined, based on the sound signal collected by the microphone, whether the distance between the smart electronic device and the user's mouth is less than a predetermined threshold.
  • in step 403, in response to determining that the distance between the electronic device and the user's mouth is less than the predetermined threshold, the sound signal collected by the microphone is processed as voice input.
  • the predetermined threshold is 10 cm.
  • the voice signal may include one or a combination of the following items: the sound produced by the user speaking at a normal volume; the sound produced by the user speaking in a low voice; the sound produced by the user's vocal cord without speaking.
  • the features used when judging whether the distance between the smart electronic device and the user's mouth is less than a predetermined threshold include time-domain and frequency-domain features of the sound signal, including volume and spectral energy.
  • the judging whether the distance between the smart electronic device and the user's mouth is less than a predetermined threshold includes: using a deep neural network model to process data collected by a microphone to determine whether the distance between the smart electronic device and the user's mouth is less than a predetermined threshold.
  • judging whether the distance between the smart electronic device and the user's mouth is less than a predetermined threshold includes: recording the user's voice signal when the user is not making a voice input; comparing the voice signal currently collected by the microphone with the voice signal recorded when the user was not making a voice input; and, if the volume of the currently collected voice signal exceeds the volume of the no-input voice signal by a certain threshold, determining that the distance between the smart electronic device and the user's mouth is less than the predetermined threshold.
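The baseline-comparison step above can be sketched as follows; the RMS/decibel feature and the 12 dB margin are illustrative assumptions.

```python
import numpy as np

def rms_db(signal):
    """RMS level of a signal in decibels (relative, not calibrated)."""
    rms = np.sqrt(np.mean(np.square(signal)))
    return 20.0 * np.log10(rms + 1e-12)

def is_close_to_mouth(current, baseline, margin_db=12.0):
    """Close-range decision from the step above: the device is judged to be
    near the mouth when the current pickup is louder than the baseline
    (the user's voice recorded while NOT making a voice input) by more
    than `margin_db`. The 12 dB margin is an illustrative value."""
    return rms_db(current) - rms_db(baseline) > margin_db

# Synthetic check: the same voice, far from vs. right at the microphone.
rng = np.random.default_rng(3)
baseline = rng.standard_normal(8000) * 0.02  # distant speech + room noise
near = rng.standard_normal(8000) * 0.5       # mic held at the mouth

near_flag = is_close_to_mouth(near, baseline)
far_flag = is_close_to_mouth(baseline * 1.1, baseline)
```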
  • the processing of the sound signal as the user's voice input includes one or more of the following: storing the sound signal in a storage medium on the electronic device; sending the sound signal over the Internet; recognizing the speech in the sound signal as text and storing it on the storage medium of the electronic device; recognizing the speech in the sound signal as text and sending it over the Internet; recognizing the speech in the sound signal as text, understanding the user's voice command, and taking the appropriate action.
  • the voice input triggering further includes identifying a specific user through voiceprint analysis, and processing only the voice signal containing the voice of the specific user.
  • the electronic device is a smart phone, smart watch, smart ring, etc.
  • Figures 5 to 8 show several cases in which the user places the microphone of the smart electronic portable device close to the mouth, so that the user's voice is used as voice input.
  • Figures 5 and 6 show cases in which there is a microphone at the top of the mobile phone.
  • Fig. 7 shows the case in which there is a microphone at the lower end of the mobile phone. As with the microphone at the upper end, the two postures are not mutually exclusive: if there are microphones at both the upper and lower ends of the mobile phone, either posture can implement the interaction scheme.
  • Fig. 8 shows the situation when the corresponding device is a smart watch, which is similar to the situation when the device is a mobile phone.
  • the above description of the trigger gesture is exemplary, not exhaustive, and is not limited to the various devices and microphones disclosed.
  • the characteristics unique to short-distance speech, such as microphone popping sounds, near-field wind noise, blowing sounds, energy, spectral characteristics, and time-domain characteristics.
  • a multi-microphone array receives and triggers voice input by comparing and analyzing the differences between the voice input signals received by different microphones: separating the near-field voice signal from the environment; detecting whether the sound signal includes speech; using multi-microphone-array sound source localization to determine whether the distance between the user's mouth and the device is less than a predetermined threshold; and using voiceprint recognition to determine whether the voice input source belongs to a serviceable user. These points are combined to determine whether to use the signal as voice input.
  • the smart electronic portable device analyzes the sound signal and detects that the pronunciation position is near itself, that is, the mobile device is located close to the user's mouth.
  • the smart electronic portable device uses the sound signal as voice input and, depending on the task and context, applies natural language processing technology to understand the user's voice input and complete the corresponding task.
  • the microphone is not limited to the foregoing examples, but may include one or a combination of the following items: a built-in single microphone in the device; a built-in dual microphone in the device; a built-in multi-microphone array in the device; an external wireless microphone; and an external wired microphone.
  • the smart electronic portable device can be a mobile phone equipped with binaural Bluetooth earphones, a wired headset with a microphone, or other microphone sensors.
  • the smart electronic portable device can be a smart wearable device such as a smart ring or a watch.
  • the smart electronic portable device is a head-mounted smart display device equipped with a microphone or a multi-microphone group.
  • a feedback output may be made, and the feedback output includes one of vibration, voice, and image, or a combination thereof.
  • usage efficiency is higher: the device can be used with one hand, and there is no need to switch between different user interfaces/applications or hold down a button; the user simply lifts a hand to the mouth to use it.
  • the sound pickup quality is high.
  • the device's microphone is at the user's mouth, so the voice input signal received is clear and less affected by environmental sounds.

Abstract

A smart electronic device having a built-in microphone; the smart electronic portable device performs voice input-based interaction with a user by means of the following operations: processing a sound signal captured by a microphone to determine whether a voice signal is present in the sound signal; in response to confirming that a voice signal is present in the sound signal, further determining on the basis of the sound signal collected by the microphone whether the distance between the smart electronic device and the mouth of the user is less than a predetermined threshold; and in response to determining that the distance between the electronic device and the mouth of the user is less than the predetermined threshold, using the sound signal collected by the microphone as a voice input and processing same. The described interaction method is suitable for a user performing voice input while carrying a smart electronic device, operation being natural and simple; in addition, the step of voice input is simplified, and the burden and difficulty of interaction are reduced so that interaction is more natural.

Description

Microphone signal-based voice interaction wake-up electronic device, method, and medium

This application claims priority to the Chinese patent application filed with the Chinese Patent Office on June 3, 2019, with application number 201910475949.X and entitled "Microphone signal-based voice interaction wake-up electronic device, method, and medium", the entire content of which is incorporated herein by reference.

Technical Field

This application relates to the field of voice input and, more specifically, to smart electronic devices and voice input triggering methods.

Background

With the development of computer technology, speech recognition algorithms are becoming increasingly mature, and voice input is becoming more and more important because of its high naturalness and effectiveness as an interaction method. Users can interact with mobile devices (mobile phones, watches, etc.) by voice to complete tasks such as command input, information query, and voice chat.

As for when to trigger voice input, existing solutions all have some shortcomings:
1. Physical button trigger

After pressing (or holding down) one or more physical buttons of the mobile device, voice input is activated.

The disadvantages of this solution are: a physical button is required; it is easy to trigger by mistake; the user must press a button.

2. Interface element trigger

Tapping (or holding) an interface element (such as an icon) on the screen of the mobile device activates voice input.

The disadvantages of this solution are: the device must have a screen; the trigger element occupies screen space; software UI limitations may make the trigger method cumbersome; it is easy to trigger by mistake.

3. Wake word (voice) detection

Using a specific word (such as a product nickname) as the wake word, the device activates voice input after detecting the corresponding wake word.

The disadvantages of this scheme are: poor privacy and social acceptability; low interaction efficiency.
Summary

In view of the above situation, this application is proposed:

According to one aspect of the present application, there is provided a smart electronic device with a built-in microphone that interacts with a user based on voice input as follows: processing the sound signal captured by the microphone to determine whether a voice signal is present in the sound signal; in response to confirming that a voice signal is present in the sound signal, further determining, based on the sound signal collected by the microphone, whether the distance between the smart electronic device and the user's mouth is less than a predetermined threshold; and, in response to determining that the distance between the electronic device and the user's mouth is less than the predetermined threshold, processing the sound signal collected by the microphone as voice input.

Preferably, the predetermined threshold is 3 cm.

Preferably, the predetermined threshold is 1 cm.

Preferably, there is also a proximity light sensor at the microphone of the electronic device, which determines whether an object is approaching the electronic device.

Preferably, there is also a distance sensor at the microphone of the electronic device, which directly measures the distance between the electronic device and the user's mouth.

Preferably, the characteristics of the sound signal collected by the microphone are used to determine whether the distance between the smart electronic device and the user's mouth is less than the predetermined threshold.

Preferably, the voice signal includes one or a combination of the following: the sound of the user speaking at a normal volume; the sound of the user speaking in a low voice; the sound of the user speaking without vocal cord vibration.

Preferably, the electronic device is further operable to: in response to determining that the user is speaking to the electronic device at close range, judge which of the following ways the user is vocalizing in: speaking at a normal volume; speaking at a low volume; speaking without vocal cord vibration; and, depending on the judgment result, process the sound signal differently.

Preferably, the different processing is activating different applications to process the voice input.

Preferably, the characteristics used in the judgment include volume, spectral characteristics, energy distribution, and the like.
Preferably, the features used when judging whether the distance between the smart electronic device and the user's mouth is less than the predetermined threshold include time-domain and frequency-domain features of the sound signal, including volume and spectral energy.

Preferably, judging whether the distance between the smart electronic device and the user's mouth is less than the predetermined threshold includes: extracting a voice signal from the sound signal collected by the microphone through a filter; judging whether the energy of the voice signal exceeds a certain threshold; and, in response to the voice signal strength exceeding that threshold, determining that the distance between the electronic device and the user's mouth is less than the predetermined threshold.

Preferably, judging whether the distance between the smart electronic device and the user's mouth is less than the predetermined threshold includes: using a deep neural network model to process the data collected by the microphone.

Preferably, judging whether the distance between the smart electronic device and the user's mouth is less than the predetermined threshold includes: recording the user's voice signal when the user is not making a voice input; comparing the voice signal currently collected by the microphone with the voice signal recorded when the user was not making a voice input; and, if the volume of the currently collected voice signal exceeds the volume of the no-input voice signal by a certain threshold, determining that the distance between the smart electronic device and the user's mouth is less than the predetermined threshold.

Preferably, the processing of the sound signal as the user's voice input includes one or more of the following: storing the sound signal in a storage medium on the electronic device; sending the sound signal over the Internet; recognizing the speech in the sound signal as text and storing it on the storage medium of the electronic device; recognizing the speech in the sound signal as text and sending it over the Internet; recognizing the speech in the sound signal as text, understanding the user's voice command, and taking the appropriate action.

Preferably, the electronic device also recognizes a specific user through voiceprint analysis and only processes sound signals containing that user's voice.

Preferably, the electronic device is a smart phone, smart watch, smart ring, or the like.

The mobile devices here include, but are not limited to, mobile phones, head-mounted displays, watches, and smaller smart wearable devices such as smart rings and wristwatches.

According to another aspect of the present application, there is provided a voice input triggering method executed by a smart electronic device equipped with a microphone, including the following operations for voice-input-based interaction between the smart electronic device and the user: processing the sound signal captured by the microphone to determine whether a voice signal is present in the sound signal; in response to confirming that a voice signal is present, further determining, based on the sound signal collected by the microphone, whether the distance between the smart electronic device and the user's mouth is less than a predetermined threshold; and, in response to determining that the distance is less than the predetermined threshold, processing the sound signal collected by the microphone as voice input.

According to another aspect of the present application, there is provided a computer-readable medium on which computer-executable instructions are stored; when executed by a computer, the instructions can perform a voice interaction wake-up method including: processing the sound signal captured by the microphone to determine whether a voice signal is present in the sound signal; in response to confirming that a voice signal is present, further determining, based on the sound signal collected by the microphone, whether the distance between the smart electronic device and the user's mouth is less than a predetermined threshold; and, in response to determining that the distance is less than the predetermined threshold, processing the sound signal collected by the microphone as voice input.
According to one aspect of the present application, there is provided an electronic device equipped with a microphone. The electronic device has a memory and a central processing unit, and the memory stores computer-executable instructions that, when executed by the central processing unit, can perform the following operations: analyzing the sound signal collected by the microphone; identifying whether the sound signal contains human speech and whether it contains wind noise generated by speech airflow striking the microphone; and, in response to determining that the sound signal contains human speech and wind noise generated by the user's speech airflow striking the microphone, processing the sound signal as the user's voice input.

Preferably, the voice spoken by the user includes: the user speaking at a normal volume, the user speaking at a low volume, and the user speaking without vocal cord vibration.

Preferably, the electronic device is further operable to: in response to determining that the user is speaking to the electronic device at close range, judge which of the following ways the user is vocalizing in: speaking at a normal volume; speaking at a low volume; speaking without vocal cord vibration; and, depending on the judgment result, process the sound signal differently.

Preferably, the different processing is activating different applications to process the voice input.

Preferably, the characteristics used in the judgment include volume, spectral characteristics, energy distribution, and the like.

Preferably, the processing of the sound signal as the user's voice input includes one or more of the following: storing the sound signal in a storage medium on the electronic device; sending the sound signal over the Internet; recognizing the speech in the sound signal as text and storing it on the storage medium of the electronic device; recognizing the speech in the sound signal as text and sending it over the Internet; recognizing the speech in the sound signal as text, understanding the user's voice command, and taking the appropriate action.

Preferably, the electronic device is also operable to recognize a specific user through voiceprint analysis and only process sound signals containing that user's voice.

Preferably, the electronic device is one of a smart phone, a smart watch, and a smart ring.

Preferably, the electronic device is further operable to use a neural network model to determine whether the sound signal contains the user's speech and the wind noise generated by the speech airflow striking the microphone.

Preferably, the electronic device is further operable such that recognizing whether the sound signal contains human speech and whether it contains wind noise generated by speech airflow striking the microphone includes: recognizing whether the sound signal contains the user's speech; in response to determining that it does, recognizing the phonemes in the speech and representing the speech signal as a phoneme sequence; for each phoneme in the sequence, determining whether it is an exhaled phoneme, that is, whether airflow leaves the mouth when the user utters it; dividing the sound signal into a sequence of sound segments of fixed window length; using frequency features to identify whether each segment contains wind noise; comparing the exhaled phonemes in the phoneme sequence with the segments identified as wind noise, and likewise comparing the non-exhaled phonemes with the wind noise segments; and, when the overlap between the exhaled phonemes and the wind-noise segments is above a certain threshold and the overlap between the non-exhaled phonemes and the non-wind-noise segments is below a certain threshold, determining that the sound signal contains wind noise generated by the user's speech airflow striking the microphone.
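A simplified sketch of the phoneme/wind-noise alignment test described above, assuming upstream phoneme recognition and per-window wind-noise detection have already produced boolean masks; the overlap thresholds stand in for the patent's "certain thresholds", and the exact comparison logic is a simplified reading, not the original implementation.

```python
def speech_airflow_matches(exhaled_mask, wind_mask, min_hit=0.7, max_miss=0.3):
    """Alignment test: `exhaled_mask[i]` is truthy when window i lies on an
    exhaled (airflow-producing) phoneme; `wind_mask[i]` is truthy when
    window i contains wind noise. Close-talking speech should show wind
    noise mostly during exhaled phonemes; ambient wind shows it everywhere."""
    assert len(exhaled_mask) == len(wind_mask)
    exhaled = [i for i, e in enumerate(exhaled_mask) if e]
    non_exhaled = [i for i, e in enumerate(exhaled_mask) if not e]
    # Fraction of exhaled-phoneme windows that coincide with wind noise.
    hit = sum(1 for i in exhaled if wind_mask[i]) / max(len(exhaled), 1)
    # Fraction of non-exhaled windows that nevertheless show wind noise.
    miss = sum(1 for i in non_exhaled if wind_mask[i]) / max(len(non_exhaled), 1)
    return hit >= min_hit and miss <= max_miss

# User speaking into the mic: wind noise lines up with exhaled phonemes.
close_talk = speech_airflow_matches(
    exhaled_mask=[1, 1, 0, 1, 0, 0, 1, 1],
    wind_mask=   [1, 1, 0, 1, 0, 0, 1, 0],
)
# Ambient wind: noise everywhere, uncorrelated with the phoneme sequence.
ambient_wind = speech_airflow_matches(
    exhaled_mask=[1, 1, 0, 1, 0, 0, 1, 1],
    wind_mask=   [1, 1, 1, 1, 1, 1, 1, 1],
)
```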
Preferably, recognizing whether the sound signal contains human speech and whether it contains wind noise produced by the airflow of speaking striking the microphone includes: detecting acoustic features of wind noise in the sound signal; in response to determining that the sound signal contains wind noise, determining whether it contains a speech signal; in response to determining that it contains a speech signal, recognizing the phoneme sequence corresponding to the speech signal; computing, from the wind-noise features of the sound signal, the wind-noise intensity at each moment; obtaining, for each phoneme in the sequence, its exhalation intensity from a predefined data model; and analyzing the consistency between the wind-noise features and the phoneme sequence with a Gaussian-mixture Bayesian model; when the degree of agreement exceeds a threshold, it is determined that the sound signal contains wind noise produced by the airflow of the user's speech striking the microphone.
According to another aspect of the present application, there is provided an electronic device with multiple built-in microphones, the electronic device having a memory and a central processing unit, the memory storing computer-executable instructions that, when executed by the central processing unit, perform the following operations: analyzing the sound signals collected by the multiple microphones; determining whether the user is speaking to the electronic device at close range; and, in response to determining that the user is speaking to the electronic device at close range, processing the sound signal collected by the microphones as the user's voice input.
Preferably, the multiple microphones form a microphone array system.
Preferably, determining whether the user is speaking to the electronic device at close range includes: computing the position of the user's mouth relative to the microphone array from the time differences between the sound signals arriving at the individual microphones of the array; and determining that the user is speaking to the electronic device at close range when the distance from the user's mouth to the device is below a threshold.
Preferably, the distance threshold is 10 centimeters.
Preferably, processing the sound signal as the user's voice input includes processing the voice input differently depending on the distance between the speaker's mouth and the electronic device.
Preferably, determining whether the user is speaking to the electronic device at close range includes: determining whether the sound signal collected by at least one microphone contains the user's speech signal; in response to determining that it does, extracting the speech signal from the sound signal collected by each microphone; determining whether the amplitude difference between the speech signals extracted from different microphones' sound signals exceeds a predetermined threshold; and, in response to determining that the amplitude difference exceeds the predetermined threshold, confirming that the user is speaking to the electronic device at close range.
Preferably, the electronic device is further operable to: designate, among the multiple microphones, the one whose speech signal has the largest amplitude as the responding microphone; and process the user's voice input differently depending on which microphone is the responding microphone.
Preferably, determining whether the user is speaking to the electronic device at close range includes processing the sound signals of the multiple microphones with a machine learning model trained in advance.
Preferably, the user's speech includes: speech at normal volume; speech at low volume; and speech produced without voicing the vocal cords.
Preferably, the electronic device is further operable to: in response to determining that the user is speaking to the electronic device at close range, determine which of the following ways the user is speaking in: at normal volume, at low volume, or without voicing the vocal cords; and process the sound signal differently depending on the result of this determination.
Preferably, the different processing is activating different applications to handle the voice input.
Preferably, the features used for this determination include volume, spectral features, energy distribution, and the like.
Preferably, processing the sound signal as the user's voice input includes one or more of the following: storing the sound signal on a storage medium of the electronic device; sending the sound signal over the Internet; recognizing the speech in the sound signal as text and storing the text on a storage medium of the electronic device; recognizing the speech in the sound signal as text and sending the text over the Internet; and recognizing the speech in the sound signal as text, interpreting the user's voice command, and performing the corresponding operation.
Preferably, the electronic device is further operable to identify a specific user through voiceprint analysis and to process only sound signals that contain that user's speech.
Preferably, the electronic device is one of a smartphone, a smart watch, a smart ring, and a tablet computer.
According to another aspect of the present application, there is provided an electronic device with a built-in microphone, the electronic device having a memory and a central processing unit, the memory storing computer-executable instructions that, when executed by the central processing unit, perform the following operations: determining whether the sound signal collected by the microphone contains a speech signal; in response to confirming that it does, determining whether the user is whispering, i.e., speaking at lower than normal volume; and, in response to determining that the user is whispering, processing the sound signal as voice input without any wake-up operation.
Preferably, whispering includes two modes: unvoiced whispering, in which the vocal cords do not vibrate, and voiced whispering, in which they do.
Preferably, the electronic device further operates to: in response to determining that the user is whispering, determine whether the whispering is unvoiced or voiced; and process the sound signal differently depending on the result of this determination.
Preferably, the different processing is activating different applications to respond to the voice input.
Preferably, the signal features used to determine whether the user is whispering include volume, spectral features, and energy distribution.
Preferably, the signal features used to determine whether the user's whispering is unvoiced or voiced include volume, spectral features, and energy distribution.
Preferably, determining whether the user is whispering includes processing the sound signal collected by the microphone with a machine learning model.
Preferably, the machine learning model is a convolutional neural network model or a recurrent neural network model.
Preferably, determining whether the user's whispering is unvoiced or voiced includes processing the sound signal collected by the microphone with a machine learning model.
Preferably, the machine learning model is a convolutional neural network model or a recurrent neural network model.
Preferably, processing the sound signal as the user's voice input includes one or more of the following: storing the sound signal on a storage medium of the electronic device; sending the sound signal over the Internet; recognizing the speech in the sound signal as text and storing the text on a storage medium of the electronic device; recognizing the speech in the sound signal as text and sending the text over the Internet; and recognizing the speech in the sound signal as text, interpreting the user's voice command, and performing the corresponding operation.
Preferably, a specific user is identified through voiceprint analysis, and only sound signals containing that user's speech are processed.
Preferably, the electronic device is a smartphone, a smart watch, a smart ring, or the like.
Advantages of this scheme:
1. More natural interaction. Bringing the device in front of the mouth triggers voice input, which matches user habits and expectations.
2. Higher efficiency. The device can be used with one hand: there is no need to switch between user interfaces or applications, nor to hold down a button; the user simply raises the hand to the mouth.
3. High recording quality. The microphone is right at the user's mouth, so the captured speech signal is clear and little affected by ambient sound.
4. Good privacy and social acceptability. With the device in front of the mouth, the user can complete high-quality voice input with a relatively quiet voice, causing little disturbance to others; postures such as covering the mouth further improve privacy protection.
Description of the Drawings
Fig. 1 is a schematic flowchart of a voice input interaction method according to an embodiment of the present application;
Fig. 2 is an overall flowchart of a voice input trigger method, according to another embodiment of the present application, in which an electronic device equipped with multiple microphones uses the differences between the sound signals received by the microphones;
Fig. 3 is an overall flowchart of a voice input trigger method based on whisper recognition for an electronic device with a built-in microphone, according to an embodiment of the present application;
Fig. 4 is an overall flowchart of a voice input trigger method based on distance estimation from the microphone's sound signal;
Fig. 5 is a schematic front view of a triggering posture in which the microphone at the top of a mobile phone is held close to the mouth, according to an embodiment of the present application;
Fig. 6 is a schematic side view of a triggering posture in which the microphone at the top of a mobile phone is held close to the mouth, according to an embodiment of the present application;
Fig. 7 is a schematic view of a triggering posture in which the microphone at the bottom of a mobile phone is held close to the mouth, according to an embodiment of the present application;
Fig. 8 is a schematic view of a triggering posture in which the microphone of a smart watch is held close to the mouth, according to an embodiment of the present application.
Detailed Description
To help those skilled in the art better understand the present application, the application is described in further detail below with reference to the drawings and specific embodiments.
The present disclosure concerns voice input triggering for smart electronic devices: whether to launch the voice input application is decided from intrinsic characteristics of the sound captured by the device's microphone(s), with no need for the traditional physical-button trigger, interface-element trigger, or wake-word detection, so the interaction is more natural. Bringing the device in front of the mouth triggers voice input, which matches user habits and expectations.
The disclosure proceeds in the following parts: 1. voice input triggering based on the wind-noise characteristics of human speech, specifically, recognizing both the speech and the wind noise of a person speaking in order to start voice input directly and treat the received sound signal as voice input; 2. voice input triggering based on the differences between the sound signals received by multiple microphones; 3. voice input triggering based on whisper recognition; 4. voice input triggering based on distance estimation from the microphone's sound signal.
1. Voice input triggering based on the wind-noise characteristics of human speech
When the user speaks into the microphone at close range, even quietly or without voicing the vocal cords, the sound signal collected by the microphone contains two components: the sound produced by the vocal cords or by vibration in the mouth, and the wind noise produced by the airflow of speaking striking the microphone. This characteristic can be used to trigger the electronic device's voice input application.
Fig. 1 shows a schematic flowchart of a voice input interaction method 100 according to an embodiment of the present application.
In step S101, the sound signal collected by the microphone is analyzed to identify whether it contains human speech and whether it contains wind noise produced by the airflow of speaking striking the microphone.
In step S102, in response to determining that the sound signal contains both human speech and wind noise produced by the airflow of the user's speech striking the microphone, the sound signal is processed as the user's voice input.
The voice input interaction method of this embodiment is particularly suited to situations with high privacy requirements, where voice input is performed without voicing the vocal cords.
Here the user's speech may include: speech at normal volume, speech at low volume, and speech produced without voicing the vocal cords.
In one example, these different speaking modes can be distinguished and mapped to different responses: for instance, normal speech controls the phone's voice assistant, whispering controls WeChat, and unvoiced speech creates voice-transcription notes.
As an example, processing the sound signal as the user's voice input includes one or more of the following:
storing the sound signal on a storage medium of the electronic device;
sending the sound signal over the Internet;
recognizing the speech in the sound signal as text and storing the text on a storage medium of the electronic device;
recognizing the speech in the sound signal as text and sending the text over the Internet;
recognizing the speech in the sound signal as text, interpreting the user's voice command, and performing the corresponding operation.
In one example, the method further includes identifying a specific user through voiceprint analysis and processing only sound signals that contain that user's speech.
In one example, the electronic device is one of a smartphone, a smart watch, and a smart ring.
In one example, a neural network model is used to determine whether the sound signal contains the user's speech and the wind noise produced by the airflow of speaking striking the microphone. This is only an example; other machine learning algorithms may be used.
In one example, recognizing whether the sound signal contains human speech and whether it contains wind noise produced by the airflow of speaking striking the microphone includes:
recognizing whether the sound signal contains the user's speech;
in response to determining that it does, recognizing the phonemes in the speech and representing the speech signal as a phoneme sequence;
for each phoneme in the sequence, determining whether it is an exhalation phoneme, i.e., a phoneme for which airflow leaves the mouth as the user utters it;
segmenting the sound signal into a sequence of clips of fixed window length;
using frequency features to identify whether each clip contains wind noise;
comparing the exhalation phonemes in the phoneme sequence with the clips identified as wind noise, and likewise the non-exhalation phonemes with the wind-noise clips; when the overlap between exhalation phonemes and wind-noise clips exceeds a threshold while the overlap between non-exhalation phonemes and wind-noise clips stays below a threshold, determining that the sound signal contains wind noise produced by the airflow of the user's speech striking the microphone.
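The alignment step above can be sketched as follows. This is a minimal illustration that assumes the phoneme recognizer and the per-clip wind-noise detector already exist and produce time-aligned boolean sequences over the same fixed windows; the function name and the 0.7/0.3 thresholds are illustrative, not taken from the disclosure.

```python
def airflow_matches_speech(phoneme_is_exhaled, clip_has_wind_noise,
                           exhale_thresh=0.7, non_exhale_thresh=0.3):
    """Decide whether the detected wind noise is produced by the user's own
    speech airflow, by aligning exhalation phonemes with wind-noise clips.
    Both inputs are time-aligned booleans over the same fixed windows."""
    pairs = list(zip(phoneme_is_exhaled, clip_has_wind_noise))
    exhaled = [wind for exh, wind in pairs if exh]
    non_exhaled = [wind for exh, wind in pairs if not exh]
    if not exhaled:
        return False  # no exhalation phonemes, so nothing to corroborate
    # exhalation phonemes should mostly coincide with wind-noise clips ...
    exhale_overlap = sum(exhaled) / len(exhaled)
    # ... while non-exhalation phonemes should rarely do so
    non_exhale_overlap = sum(non_exhaled) / len(non_exhaled) if non_exhaled else 0.0
    return exhale_overlap > exhale_thresh and non_exhale_overlap < non_exhale_thresh
```

Requiring both a high overlap on exhalation phonemes and a low overlap elsewhere rejects ambient wind, which gusts independently of what is being said.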
In one example, recognizing whether the sound signal contains human speech and whether it contains wind noise produced by the airflow of speaking striking the microphone includes:
detecting acoustic features of wind noise in the sound signal;
in response to determining that the sound signal contains wind noise, determining whether it contains a speech signal;
in response to determining that it contains a speech signal, recognizing the phoneme sequence corresponding to the speech signal;
computing, from the wind-noise features of the sound signal, the wind-noise intensity at each moment;
obtaining, for each phoneme in the sequence, its exhalation intensity from a predefined data model;
analyzing the consistency between the wind-noise features and the phoneme sequence with a Gaussian-mixture Bayesian model; when the degree of agreement exceeds a threshold, determining that the sound signal contains wind noise produced by the airflow of the user's speech striking the microphone.
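As a rough stand-in for the Gaussian-mixture Bayesian analysis, the sketch below scores the agreement between the observed wind-noise intensity and the per-phoneme exhalation intensity with a single fixed-width Gaussian likelihood per frame; `sigma` and the decision threshold are assumed values, and a trained mixture model would replace the single Gaussian.

```python
import math

def consistency_score(wind_intensity, exhale_intensity, sigma=0.2):
    """Average per-frame log-likelihood of the observed wind-noise intensity
    under a Gaussian centered on the exhalation intensity predicted for the
    phoneme active in that frame (sigma is an assumed, fixed spread)."""
    ll = 0.0
    for w, e in zip(wind_intensity, exhale_intensity):
        ll += -((w - e) ** 2) / (2 * sigma ** 2) - math.log(sigma * math.sqrt(2 * math.pi))
    return ll / len(wind_intensity)

def speech_airflow_detected(wind_intensity, exhale_intensity, threshold=-1.0):
    """High agreement between the two sequences is taken as evidence that
    the wind noise comes from the user's own speech airflow."""
    return consistency_score(wind_intensity, exhale_intensity) > threshold
```

When the wind-noise peaks land on the phonemes predicted to expel air, the score stays high; ambient wind arriving during non-exhalation phonemes drives it down.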
2. Voice input triggering based on the differences between sound signals received by multiple microphones
Fig. 2 shows the overall flowchart of a voice input trigger method, according to another embodiment of the present application, in which an electronic device equipped with multiple microphones uses the differences between the sound signals the microphones receive.
The electronic device, for example a mobile phone with multiple built-in microphones, has a memory and a central processing unit; the memory stores computer-executable instructions that, when executed by the central processing unit, can carry out the voice input trigger method of this embodiment.
As shown in Fig. 2, in step S201 the sound signals collected by the multiple microphones are analyzed.
In one example, the multiple microphones comprise at least three microphones and form a microphone array system; the spatial position of the sound source relative to the smart device can be estimated from the time differences with which a sound signal reaches the individual microphones.
The sound signal here includes, for example, its amplitude, frequency, and so on.
In step S202, based on the sound signals collected by the multiple microphones, it is determined whether the user is speaking to the electronic device at close range.
In one example, determining whether the user is speaking to the electronic device at close range includes:
computing the position of the user's mouth relative to the microphone array from the time differences between the sound signals arriving at the individual microphones of the array, and
determining that the user is speaking to the electronic device at close range when the distance from the user's mouth to the device is below a threshold.
In one example, the distance threshold is 10 cm.
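One way to realize the TDOA-based localization described above is a brute-force search over candidate mouth positions, keeping the candidate whose predicted time differences best match the measured ones. The grid extent, the 1 cm step, the microphone layout in the test, and the use of microphone 0 as the device reference point are all assumptions for illustration; production systems typically use closed-form or iterative multilateration instead.

```python
import itertools
import math

SPEED_OF_SOUND = 343.0  # m/s

def locate_source(mic_positions, tdoas):
    """Brute-force search for the sound-source position that best explains
    the measured time differences of arrival (TDOAs, in seconds, of mics
    1..n relative to mic 0). Coordinates are in meters, with the device
    near the origin; the search volume and 1 cm step are illustrative."""
    xy = [i / 100.0 for i in range(-20, 21)]
    zz = [i / 100.0 for i in range(0, 21)]  # search only in front of the device
    best, best_err = None, float("inf")
    for x, y, z in itertools.product(xy, xy, zz):
        d0 = math.dist((x, y, z), mic_positions[0])
        err = 0.0
        for mic, tau in zip(mic_positions[1:], tdoas):
            predicted = (math.dist((x, y, z), mic) - d0) / SPEED_OF_SOUND
            err += (predicted - tau) ** 2
        if err < best_err:
            best, best_err = (x, y, z), err
    return best

def is_close_talk(mic_positions, tdoas, threshold_m=0.10):
    """Apply the 10 cm criterion, using mic 0 as the device reference."""
    return math.dist(locate_source(mic_positions, tdoas), mic_positions[0]) < threshold_m
```

With four non-collinear microphones the three TDOAs generically pin down a single position in the searched half-space.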
In step S203, in response to determining that the user is speaking to the electronic device at close range, the sound signal collected by the microphones is processed as the user's voice input.
In one example, processing the sound signal as the user's voice input includes:
processing the voice input differently depending on the distance between the speaker's mouth and the electronic device. For example, at a distance of 0-3 cm the voice assistant is activated to respond to the voice input; at 3-10 cm the WeChat application is activated to respond to the voice input and send a voice message to a friend.
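The distance-dependent dispatch could look like the following sketch; the 3 cm and 10 cm bands come from the example above, while the handler names are illustrative placeholders.

```python
def dispatch_by_distance(distance_m):
    """Choose how to handle the voice input from the mouth-to-device
    distance (bands follow the 0-3 cm / 3-10 cm example)."""
    if distance_m <= 0.03:
        return "voice_assistant"  # illustrative handler name
    if distance_m <= 0.10:
        return "messaging_app"    # illustrative handler name
    return None  # beyond close-talk range: not treated as voice input
```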
In one example, determining whether the user is speaking to the electronic device at close range includes:
determining whether the sound signal collected by at least one microphone contains the user's speech signal,
in response to determining that it does, extracting the speech signal from the sound signal collected by each microphone,
determining whether the amplitude difference between the speech signals extracted from the sound signals collected by different microphones exceeds a predetermined threshold, and
in response to determining that the amplitude difference exceeds the predetermined threshold, confirming that the user is speaking to the electronic device at close range.
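A minimal sketch of the amplitude-difference test: a distant source reaches every microphone at nearly the same level, while a mouth a few centimeters from one microphone makes that channel markedly louder. A ratio of RMS amplitudes is used here instead of an absolute difference so the decision is insensitive to overall loudness; that choice and the threshold value are assumptions, not taken from the disclosure.

```python
import math

def rms(samples):
    """Root-mean-square amplitude of one microphone channel."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def is_close_talk(channels, ratio_thresh=2.0):
    """Flag close talk when the loudest channel's RMS exceeds the quietest
    channel's RMS by more than the given factor."""
    levels = [rms(ch) for ch in channels]
    loudest, quietest = max(levels), min(levels)
    if quietest < 1e-12:
        return loudest > 1e-12  # one channel silent, another active
    return loudest / quietest > ratio_thresh
```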
The example above may further include:
designating the microphone whose speech signal has the largest amplitude as the responding microphone, and
processing the user's voice input differently depending on which microphone responds. For example, when the responding microphone is the one at the bottom of the smartphone, the voice assistant on the smartphone is activated; when it is the one at the top, the recorder function is activated to record the user's speech to a storage device.
In one example, determining whether the user is speaking to the electronic device at close range includes processing the sound signals of the multiple microphones with a machine learning model trained in advance. In general, training sample data is prepared and used to train the chosen machine learning model; in actual use (sometimes called testing), the sound signals captured by the microphones are fed to the model as a test sample, and its output indicates whether the user is speaking to the electronic device at close range. Example machine learning models include deep neural networks, support vector machines, and decision trees.
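The disclosure leaves the model choice open; as one concrete stand-in for the models mentioned, the sketch below trains a nearest-centroid classifier on per-microphone RMS amplitudes. Both the feature choice and the classifier are illustrative assumptions, far simpler than a deployed model would be.

```python
import math

def features(channels):
    """Per-microphone RMS amplitude as a crude feature vector."""
    return [math.sqrt(sum(s * s for s in ch) / len(ch)) for ch in channels]

class NearestCentroid:
    """Tiny stand-in for the trained model: classify a feature vector by
    the nearest class centroid seen during training."""
    def fit(self, X, y):
        self.centroids = {}
        for label in set(y):
            rows = [x for x, l in zip(X, y) if l == label]
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict(self, x):
        return min(self.centroids, key=lambda label: math.dist(x, self.centroids[label]))
```

Training samples would be gathered by recording both close-range speech (one channel much louder) and ordinary distant speech.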
In one example, the user's speech includes: speech at normal volume, speech at low volume, and speech produced without voicing the vocal cords.
In one example, processing the sound signal as the user's voice input includes one or more of the following: storing the sound signal on a storage medium of the electronic device; sending the sound signal over the Internet; recognizing the speech in the sound signal as text and storing the text on a storage medium of the electronic device; recognizing the speech in the sound signal as text and sending the text over the Internet; and recognizing the speech in the sound signal as text, interpreting the user's voice command, and performing the corresponding operation.
In one example, a specific user is further identified through voiceprint analysis, and only sound signals containing that user's speech are processed.
As examples, the electronic device is a smartphone, a smart watch, a smart ring, a tablet computer, or the like.
This embodiment uses the differences between the sound signals of the built-in microphones to recognize whether the user is speaking to the electronic device at close range, and on that basis decides whether to start voice input; it has the advantages of reliable recognition and a simple computation.
3. Voice input triggering based on whisper recognition
Whispering means speaking at a volume lower than that of normal speech (for example, a normal conversation with another person). There are two modes of whispering. One is unvoiced whispering, in which the vocal cords do not vibrate (commonly called a whisper); the other is voiced whispering, in which the vocal cords do vibrate. In unvoiced whispering, the sound produced consists mainly of air passing through the throat and mouth and of sounds made by the tongue and teeth inside the mouth. In voiced whispering, the sound includes, in addition to the components of unvoiced whispering, the sound produced by vocal-cord vibration; compared with normal-volume speech, however, the vocal cords vibrate less and the vocal-cord sound is quieter. The sound of unvoiced whispering and the sound of vocal-cord vibration occupy different frequency ranges and can therefore be distinguished. Voiced whispering and normal-volume voiced speech can be distinguished by a volume threshold, which may be preset or set by the user.
Example method: filter the voice signal collected by the microphone to extract two components, the component V1 generated by vocal-fold vibration and the component V2 produced by air passing through the throat and mouth and by the tongue and teeth inside the mouth. When the energy ratio V1/V2 is below a certain threshold, the user is judged to be whispering.
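A minimal sketch of this V1/V2 energy-ratio test. The 300 Hz split frequency, the 60 Hz low cutoff, and the 0.5 ratio threshold are all illustrative assumptions; the text leaves the exact filter bands and threshold open.

```python
import numpy as np

def band_energy(signal, sr, lo, hi):
    """Energy of the signal within the [lo, hi) frequency band (Hz)."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    mask = (freqs >= lo) & (freqs < hi)
    return float(np.sum(spectrum[mask]))

def is_whisper(signal, sr, split_hz=300.0, ratio_threshold=0.5):
    """Classify a frame as whispered speech when the low-band (vocal-fold)
    energy V1 is small relative to the high-band (breath/articulation)
    energy V2 -- the V1/V2 ratio test described above."""
    v1 = band_energy(signal, sr, 60.0, split_hz)      # vocal-fold component
    v2 = band_energy(signal, sr, split_hz, sr / 2.0)  # breath / articulation
    return v1 / (v2 + 1e-12) < ratio_threshold

# Synthetic check: a 120 Hz "voiced" tone vs. broadband "breath" noise.
sr = 16000
t = np.arange(sr) / sr
voiced = np.sin(2 * np.pi * 120 * t)
breath = np.random.default_rng(0).standard_normal(sr) * 0.1
print(is_whisper(voiced + breath, sr))  # strong vocal-fold band: not a whisper
print(is_whisper(breath, sr))           # energy mostly above 300 Hz: whisper
```

In a deployed system the two components would come from proper band-pass filters over streaming frames rather than a one-shot FFT; the ratio test itself is unchanged.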
In general, whispering can only be detected when the user is fairly close to the microphone, for example within 30 cm. Defining close-range whispering as the trigger for voice input gives the user an interaction technique that is easy to learn, easy to understand, and convenient to perform: it removes the need for an explicit wake-up action, such as pressing a dedicated wake-up button or uttering a wake word, and in the vast majority of real-world situations it is not triggered accidentally.
Fig. 3 shows an overall flowchart of a voice input triggering method based on whispered-speech recognition for an electronic device equipped with a microphone, according to an embodiment of the present application. The device has a memory and a central processing unit; the memory stores computer-executable instructions which, when executed by the central processing unit, carry out the voice input triggering method of the embodiment.
As shown in Fig. 3, in step S301 it is determined whether the sound signal collected by the microphone contains a voice signal.
In step S302, in response to confirming that the sound signal contains a voice signal, it is determined whether the user is whispering, i.e. speaking at a volume lower than normal.
In step S303, in response to determining that the user is whispering, the sound signal is processed as voice input without any wake-up operation.
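Steps S301 to S303 form a simple guard chain, sketched below. The three callables are placeholders for the speech detector, whisper detector, and input handler described in the text, not APIs defined by the patent.

```python
def voice_input_trigger(sound_frame, detect_speech, detect_whisper, process_input):
    """S301: is speech present?  S302: is it whispered?  S303: if both hold,
    forward the frame as voice input -- no explicit wake-up action needed."""
    if not detect_speech(sound_frame):   # S301
        return False
    if not detect_whisper(sound_frame):  # S302
        return False
    process_input(sound_frame)           # S303
    return True

# Toy demonstration with stub detectors.
captured = []
triggered = voice_input_trigger(
    "frame-1",
    detect_speech=lambda f: True,
    detect_whisper=lambda f: True,
    process_input=captured.append,
)
print(triggered, captured)
```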
Whispering can take two forms: whispering without vocal-fold phonation and whispering with vocal-fold phonation.
In one example, the voice input triggering method may further include: in response to determining that the user is whispering, judging whether the whisper is unvoiced (no vocal-fold phonation) or voiced, and processing the sound signal differently according to the result.
As an example, the different processing may consist of handing the voice input to different applications: normal speech controls the phone's voice assistant, voiced whispering controls WeChat, and unvoiced speech produces transcribed voice notes.
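The mode-to-application routing above can be sketched as a lookup table. The target names are illustrative stand-ins for the assistant, chat, and note-taking applications in the example, not identifiers from the patent.

```python
from enum import Enum, auto

class SpeechMode(Enum):
    NORMAL = auto()            # normal-volume speech
    WHISPER_VOICED = auto()    # whisper with vocal-fold vibration
    WHISPER_UNVOICED = auto()  # whisper without vocal-fold vibration

# Hypothetical routing table mirroring the example in the text.
ROUTES = {
    SpeechMode.NORMAL: "voice_assistant",
    SpeechMode.WHISPER_VOICED: "messaging_app",
    SpeechMode.WHISPER_UNVOICED: "transcription_notes",
}

def route(mode):
    """Pick the application that should receive the voice input."""
    return ROUTES[mode]

print(route(SpeechMode.WHISPER_UNVOICED))
```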
As an example, the signal features used to judge whether the user is whispering may include volume, spectral features, and energy distribution.
As an example, the signal features used to judge whether the user is producing an unvoiced or a voiced whisper likewise include volume, spectral features, and energy distribution.
As an example, judging whether the user is whispering may include processing the sound signal collected by the microphone with a machine learning model.
As an example, the machine learning model may be a convolutional neural network or a recurrent neural network.
As an example, judging whether the user is producing an unvoiced or a voiced whisper may likewise be done by processing the sound signal collected by the microphone with a machine learning model.
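A toy illustration of the convolutional-network option: one 1-D convolution, ReLU, global average pooling, and a sigmoid output. All parameters here are made-up placeholders; a real classifier would be trained on labelled whispered and non-whispered recordings, which the patent does not specify.

```python
import numpy as np

def conv1d(x, kernel):
    """Valid-mode 1-D convolution (a single CNN filter)."""
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel) for i in range(len(x) - k + 1)])

def tiny_cnn_score(frame, kernel, weight, bias):
    """Conv -> ReLU -> global average pool -> sigmoid: a probability-like
    whisper score in [0, 1].  Parameters are untrained placeholders."""
    h = np.maximum(conv1d(frame, kernel), 0.0)               # conv + ReLU
    pooled = h.mean()                                        # global avg pool
    return 1.0 / (1.0 + np.exp(-(weight * pooled + bias)))   # sigmoid

frame = np.sin(np.linspace(0, 8 * np.pi, 64))
score = tiny_cnn_score(frame, kernel=np.array([1.0, -1.0]), weight=2.0, bias=-0.1)
print(round(float(score), 3))
```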
As an example, processing the sound signal as the user's voice input includes one or more of the following:
storing the sound signal on a storage medium of the electronic device;
sending the sound signal out over the Internet;
recognizing the voice signal within the sound signal as text and storing the text on a storage medium of the electronic device;
recognizing the voice signal within the sound signal as text and sending the text out over the Internet;
recognizing the voice signal within the sound signal as text, interpreting the user's voice command, and performing the corresponding operation.
As an example, the voice input triggering method may further include identifying a specific user through voiceprint analysis and processing only sound signals that contain that user's voice.
As an example, the electronic device may be a smartphone, a smart watch, a smart ring, or the like.
For whispered-speech modes and detection methods, see, for example, the following references:
Zhang, Chi, and John H. L. Hansen. "Analysis and classification of speech mode: whispered through shouted." Eighth Annual Conference of the International Speech Communication Association, 2007.
Meenakshi, G. Nisha, and Prasanta Kumar Ghosh. "Robust whisper activity detection using long-term log energy variation of sub-band signal." IEEE Signal Processing Letters 22.11 (2015): 1859-1863.
4. Voice input triggering based on distance judged from the microphone's sound signal
The overall flowchart of this method is described below with reference to Fig. 4.
As shown in Fig. 4, in step 401 the sound signal captured by the microphone is processed to determine whether it contains a voice signal.
In step 402, in response to confirming that a voice signal is present, it is further determined, based on the sound signal collected by the microphone, whether the distance between the smart electronic device and the user's mouth is less than a predetermined threshold.
In step 403, in response to determining that this distance is less than the predetermined threshold, the sound signal collected by the microphone is processed as voice input.
In one example, the predetermined threshold is 10 cm.
The voice signal may include one or a combination of the following: sound produced by the user speaking at normal volume; sound produced by the user whispering; and sound produced by the user speaking without vocal-fold phonation.
In one example, the features used to judge whether the distance between the smart electronic device and the user's mouth is less than the predetermined threshold include time-domain and frequency-domain features of the sound signal, such as volume and spectral energy.
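The volume and spectral-energy features can be computed per frame as below. The band edges are assumed for illustration; the patent names the feature types but not specific bands.

```python
import numpy as np

def frame_features(frame, sr):
    """Time-domain RMS volume plus per-band spectral energy for one frame."""
    rms = float(np.sqrt(np.mean(frame ** 2)))            # volume (time domain)
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    bands = [(0, 500), (500, 2000), (2000, sr // 2)]     # illustrative bands
    band_energy = [float(spectrum[(freqs >= lo) & (freqs < hi)].sum())
                   for lo, hi in bands]
    return {"rms": rms, "band_energy": band_energy}

sr = 16000
frame = np.sin(2 * np.pi * 440 * np.arange(1024) / sr)   # a 440 Hz tone
feats = frame_features(frame, sr)
print(sorted(feats))
```

A distance classifier would consume these numbers (for instance, close-range speech shows higher RMS and a characteristic low-band boost) rather than the raw waveform.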
In one example, judging whether the distance between the smart electronic device and the user's mouth is less than the predetermined threshold includes processing the data collected by the microphone with a deep neural network model.
In one example, judging whether the distance between the smart electronic device and the user's mouth is less than the predetermined threshold includes: recording the user's voice signal when no voice input is being made; comparing the voice signal currently collected by the microphone with that baseline; and, if the volume of the current voice signal exceeds the baseline volume by a certain threshold, concluding that the distance between the smart electronic device and the user's mouth is less than the predetermined threshold.
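The baseline-comparison rule reduces to a one-line check. The 4x loudness ratio is an assumed placeholder; the text leaves the exact margin open.

```python
def near_mouth(current_rms, baseline_rms, ratio=4.0):
    """Proximity heuristic: speech captured at the mouth is much louder than
    the baseline recorded while the user was NOT doing voice input.
    `ratio` is an assumed margin, not a value from the patent."""
    return current_rms > baseline_rms * ratio

print(near_mouth(0.9, 0.1))   # far louder than the far-field baseline
print(near_mouth(0.15, 0.1))  # comparable to baseline: not near the mouth
```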
In one example, processing the sound signal as the user's voice input includes one or more of the following: storing the sound signal on a storage medium of the electronic device; sending the sound signal out over the Internet; recognizing the voice signal within the sound signal as text and storing the text on a storage medium of the device; recognizing the voice signal as text and sending it out over the Internet; or recognizing the voice signal as text, interpreting the user's voice command, and performing the corresponding operation.
In one example, voice input triggering further includes identifying a specific user through voiceprint analysis and processing only sound signals that contain that user's voice.
In one example, the electronic device is a smartphone, a smart watch, a smart ring, or the like.
Figs. 5 to 8 show several cases in which the user holds the microphone of a smart portable electronic device close to the mouth; the speech the user then produces is taken as voice input. Figs. 5 and 6 show a phone with a microphone at its top: when the user intends voice interaction, moving the phone's microphone to within 0-10 cm of the mouth and simply speaking provides the voice input. Fig. 7 shows a phone with a microphone at its bottom, which works analogously; the two postures are not mutually exclusive, and if the phone has microphones at both ends either posture can drive the interaction. Fig. 8 shows the corresponding situation for a smart watch, analogous to the phone cases. These trigger postures are illustrative rather than exhaustive and are not limited to the devices and microphone placements disclosed.
As a concrete example using a single microphone to receive sound and trigger voice input: first analyze the sound received by the microphone to determine whether it is speech; then, from features peculiar to close-range speech, such as plosive pops at the microphone, near-field wind noise, breath sounds, energy, spectral features, and time-domain features, judge whether the distance between the device and the user's mouth is below a given threshold; additionally, use voiceprint recognition to judge whether the voice input comes from a serviceable user. These factors are combined to decide whether to treat the microphone signal as voice input.
As a concrete example using two microphones: analyze the differences between the two input signals, such as energy and spectral features, to judge whether the sound source is close to one of the microphones; use the inter-microphone signal difference to suppress environmental noise and separate the speech into the corresponding channel; then apply the single-microphone feature analysis described above to judge that the distance between the device and the user's mouth is below the given threshold; and use voiceprint recognition to judge whether the input comes from a serviceable user. These factors are combined to decide whether to treat the signal as voice input.
As a concrete example using a multi-microphone array: compare and analyze the differences between the signals received by the different microphones to separate the near-field voice signal from the environment; detect whether the signal contains speech; use sound-source localization over the array to judge whether the distance between the user's mouth and the device is below the predetermined threshold; and use voiceprint recognition to judge whether the input comes from a serviceable user. These factors are combined to decide whether to treat the signal as voice input.
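The dual-microphone proximity cue from the two-microphone example above can be sketched as an energy comparison: a mouth close to one microphone makes that channel markedly louder, while a distant source reaches both with similar energy. The 6 dB margin is an assumed value, not taken from the patent.

```python
import numpy as np

def close_to_one_mic(sig_a, sig_b, margin_db=6.0):
    """Return True when one channel is louder than the other by more than
    `margin_db`, suggesting the sound source is very close to that mic."""
    e_a = np.mean(np.asarray(sig_a, dtype=float) ** 2) + 1e-12
    e_b = np.mean(np.asarray(sig_b, dtype=float) ** 2) + 1e-12
    diff_db = abs(10.0 * np.log10(e_a / e_b))
    return diff_db > margin_db

near = [0.8, -0.7, 0.9, -0.8]     # loud on microphone A (mouth nearby)
bleed = [0.05, -0.04, 0.06, -0.05]  # faint pickup on microphone B
print(close_to_one_mic(near, bleed))  # large level difference: near-field
print(close_to_one_mic(near, near))   # identical channels: far-field
```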
In one example, when the smart portable electronic device, by analyzing the voice signal, detects that the sound originates near the device itself, i.e. the device is close to the user's mouth, it takes the sound signal as voice input and, depending on the task and context, applies natural language processing to understand the input and complete the corresponding task.
The microphone is not limited to the foregoing examples and may include one or a combination of: a single built-in microphone; dual built-in microphones; a built-in multi-microphone array; an external wireless microphone; and an external wired microphone.
As mentioned above, the smart portable electronic device may be a mobile phone equipped with binaural Bluetooth earphones, a wired headset with a microphone, or another microphone sensor.
The smart portable electronic device may also be a wearable device such as a smart watch or a smart ring.
The smart portable electronic device may be a head-mounted smart display equipped with a microphone or a multi-microphone group.
In one example, after the electronic device activates the voice input application, it may produce feedback output, including one or a combination of vibration, speech, and imagery.
The solutions of the various embodiments of the present application can provide one or more of the following advantages:
1. More natural interaction. Bringing the device in front of the mouth triggers voice input, which matches user habits and expectations.
2. Higher efficiency. The technique works one-handed: there is no need to switch between user interfaces or applications, or to hold down a button; simply raising the hand to the mouth suffices.
3. High recording quality. Because the device's microphone is at the user's mouth, the captured voice input is clear and little affected by environmental sound.
4. Good privacy and social acceptability. With the device in front of the mouth, the user can complete high-quality voice input at a relatively low volume, causing little disturbance to others; postures such as cupping a hand over the mouth further protect privacy.
The embodiments of the present application have been described above. The description is illustrative rather than exhaustive and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The scope of protection of this application is therefore defined by the claims.

Claims (19)

  1. A smart electronic device equipped with a microphone, characterized in that the smart portable electronic device interacts with a user on the basis of voice input as follows:
    processing the sound signal captured by the microphone to determine whether it contains a voice signal;
    in response to confirming that a voice signal is present, further determining, based on the sound signal collected by the microphone, whether the distance between the smart electronic device and the user's mouth is less than a predetermined threshold; and
    in response to determining that the distance between the electronic device and the user's mouth is less than the predetermined threshold, processing the sound signal collected by the microphone as voice input.
  2. The smart electronic device according to claim 1, characterized in that the predetermined threshold is 3 cm.
  3. The smart electronic device according to claim 1, characterized in that the predetermined threshold is 1 cm.
  4. The smart electronic device according to claim 1, characterized in that a proximity light sensor is provided at the microphone of the electronic device, and the proximity light sensor is used to judge whether an object is approaching the electronic device.
  5. The smart electronic device according to claim 1, characterized in that a distance sensor is provided at the microphone of the electronic device, and the distance between the electronic device and the user's mouth is measured directly by the distance sensor.
  6. The smart electronic device according to claim 1, characterized in that whether the distance between the smart electronic device and the user's mouth is less than the predetermined threshold is judged from features of the sound signal collected by the microphone.
  7. The smart electronic device according to claim 1, characterized in that the voice signal includes one or a combination of the following:
    sound produced by the user speaking at normal volume;
    sound produced by the user whispering;
    sound produced by the user speaking without vocal-fold phonation.
  8. The smart electronic device according to claim 1, characterized by further comprising:
    in response to determining that the user is speaking to the electronic device at close range,
    judging which of the following ways the user is speaking in:
    speaking at normal volume,
    speaking at low volume,
    speaking without vocal-fold phonation; and
    processing the sound signal differently according to the result of the judgment.
  9. The smart electronic device according to claim 8, characterized in that the different processing consists of activating different applications to process the voice input.
  10. The smart electronic device according to claim 8, characterized in that the features used in the judgment include volume, spectral features, and energy distribution.
  11. The smart electronic device according to claim 1, characterized in that the features used to judge whether the distance between the smart electronic device and the user's mouth is less than the predetermined threshold include time-domain and frequency-domain features of the sound signal, including volume and spectral energy.
  12. The smart electronic device according to claim 11, characterized in that judging whether the distance between the smart electronic device and the user's mouth is less than the predetermined threshold comprises:
    extracting the voice signal from the sound signal collected by the microphone with a filter;
    judging whether the energy of the voice signal exceeds a certain threshold; and
    in response to the voice signal strength exceeding the threshold,
    concluding that the distance between the electronic device and the user's mouth is less than the predetermined threshold.
  13. The smart electronic device according to claim 1, characterized in that judging whether the distance between the smart electronic device and the user's mouth is less than the predetermined threshold comprises:
    processing the data collected by the microphone with a deep neural network model to judge whether the distance between the smart electronic device and the user's mouth is less than the predetermined threshold.
  14. The smart electronic device according to claim 1, characterized in that judging whether the distance between the smart electronic device and the user's mouth is less than the predetermined threshold comprises:
    recording the user's voice signal when no voice input is being made,
    comparing the voice signal currently collected by the microphone with the voice signal recorded when no voice input was being made, and
    if the volume of the currently collected voice signal exceeds the baseline volume by a certain threshold, concluding that the distance between the smart electronic device and the user's mouth is less than the predetermined threshold.
  15. The smart electronic device according to claim 1, characterized in that processing the sound signal as the user's voice input includes one or more of the following:
    storing the sound signal on a storage medium of the electronic device;
    sending the sound signal out over the Internet;
    recognizing the voice signal within the sound signal as text and storing the text on a storage medium of the electronic device;
    recognizing the voice signal within the sound signal as text and sending the text out over the Internet;
    recognizing the voice signal within the sound signal as text, interpreting the user's voice command, and performing the corresponding operation.
  16. The smart electronic device according to claim 1, characterized by further comprising identifying a specific user through voiceprint analysis and processing only sound signals containing that user's voice.
  17. The smart electronic device according to claim 1, characterized in that the electronic device is a smartphone, a smart watch, or a smart ring.
  18. A voice interaction wake-up method performed by a smart electronic device equipped with a microphone, characterized by comprising the following operations for voice-input-based interaction between the smart electronic device and a user:
    processing the sound signal captured by the microphone to determine whether it contains a voice signal;
    in response to confirming that a voice signal is present, further determining, based on the sound signal collected by the microphone, whether the distance between the smart electronic device and the user's mouth is less than a predetermined threshold; and
    in response to determining that the distance between the electronic device and the user's mouth is less than the predetermined threshold, processing the sound signal collected by the microphone as voice input.
  19. A computer-readable medium, characterized in that computer-executable instructions are stored thereon which, when executed by a computer, perform a voice interaction wake-up method comprising:
    processing the sound signal captured by the microphone to determine whether it contains a voice signal;
    in response to confirming that a voice signal is present, further determining, based on the sound signal collected by the microphone, whether the distance between the smart electronic device and the user's mouth is less than a predetermined threshold; and
    in response to determining that the distance between the electronic device and the user's mouth is less than the predetermined threshold, processing the sound signal collected by the microphone as voice input.
PCT/CN2020/089551 2019-06-03 2020-05-11 Microphone signal-based voice interaction wake-up electronic device, method, and medium WO2020244355A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910475949.XA CN110097875B (en) 2019-06-03 2019-06-03 Microphone signal based voice interaction wake-up electronic device, method, and medium
CN201910475949.X 2019-06-03

Publications (1)

Publication Number Publication Date
WO2020244355A1 true WO2020244355A1 (en) 2020-12-10

Family

ID=67450117

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/089551 WO2020244355A1 (en) 2019-06-03 2020-05-11 Microphone signal-based voice interaction wake-up electronic device, method, and medium

Country Status (2)

Country Link
CN (1) CN110097875B (en)
WO (1) WO2020244355A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110097875B (en) * 2019-06-03 2022-09-02 清华大学 Microphone signal based voice interaction wake-up electronic device, method, and medium
CN111276155B (en) * 2019-12-20 2023-05-30 上海明略人工智能(集团)有限公司 Voice separation method, device and storage medium
CN111343410A (en) * 2020-02-14 2020-06-26 北京字节跳动网络技术有限公司 Mute prompt method and device, electronic equipment and storage medium
CN111681654A (en) * 2020-05-21 2020-09-18 北京声智科技有限公司 Voice control method and device, electronic equipment and storage medium
CN111933140B (en) * 2020-08-27 2023-11-03 恒玄科技(上海)股份有限公司 Method, device and storage medium for detecting voice of earphone wearer
CN114260919B (en) * 2022-01-18 2023-08-29 华中科技大学同济医学院附属协和医院 Intelligent robot
CN117746849A (en) * 2022-09-14 2024-03-22 荣耀终端有限公司 Voice interaction method, device and terminal

Citations (9)

Publication number Priority date Publication date Assignee Title
CN105120059A (en) * 2015-07-07 2015-12-02 惠州Tcl移动通信有限公司 Mobile terminal and method of controlling noise reduction in earphone conversation according to breathing strength
CN105847584A (en) * 2016-05-12 2016-08-10 歌尔声学股份有限公司 Method for intelligent device to identify private conversations
WO2018223388A1 (en) * 2017-06-09 2018-12-13 Microsoft Technology Licensing, Llc. Silent voice input
CN109313898A (en) * 2016-06-10 2019-02-05 苹果公司 The digital assistants of voice in a low voice are provided
CN109686378A (en) * 2017-10-13 2019-04-26 华为技术有限公司 Method of speech processing and terminal
CN110097875A (en) * 2019-06-03 2019-08-06 清华大学 Interactive voice based on microphone signal wakes up electronic equipment, method and medium
CN110111776A (en) * 2019-06-03 2019-08-09 清华大学 Interactive voice based on microphone signal wakes up electronic equipment, method and medium
CN110223711A (en) * 2019-06-03 2019-09-10 清华大学 Interactive voice based on microphone signal wakes up electronic equipment, method and medium
CN110428806A (en) * 2019-06-03 2019-11-08 清华大学 Interactive voice based on microphone signal wakes up electronic equipment, method and medium

Family Cites Families (10)

Publication number Priority date Publication date Assignee Title
JP4247002B2 (en) * 2003-01-22 2009-04-02 富士通株式会社 Speaker distance detection apparatus and method using a microphone array, and voice input/output apparatus using the same
JP2013080015A (en) * 2011-09-30 2013-05-02 Toshiba Corp Speech recognition device and speech recognition method
US20150039314A1 (en) * 2011-12-20 2015-02-05 Squarehead Technology As Speech recognition method and apparatus based on sound mapping
CN105096946B (en) * 2014-05-08 2020-09-29 钰太芯微电子科技(上海)有限公司 Wake-up device and method based on voice activity detection
CN104657105B (en) * 2015-01-30 2016-10-26 腾讯科技(深圳)有限公司 Method and apparatus for enabling a terminal's voice input function
CN106254612A (en) * 2015-06-15 2016-12-21 中兴通讯股份有限公司 Voice control method and device
CN106412259B (en) * 2016-09-14 2019-04-05 Oppo广东移动通信有限公司 Call control method and device for a mobile terminal, and mobile terminal
CN106448672B (en) * 2016-10-27 2020-07-14 Tcl通力电子(惠州)有限公司 Sound system and control method
CN107889031B (en) * 2017-11-30 2020-02-14 广东小天才科技有限公司 Audio control method, audio control device and electronic device
CN109448759A (en) * 2018-12-28 2019-03-08 武汉大学 Anti-spoofing attack detection method for voice authentication based on pop noise

Also Published As

Publication number Publication date
CN110097875A (en) 2019-08-06
CN110097875B (en) 2022-09-02

Similar Documents

Publication Publication Date Title
WO2020244355A1 (en) Microphone signal-based voice interaction wake-up electronic device, method, and medium
WO2020244402A1 (en) Speech interaction wakeup electronic device and method based on microphone signal, and medium
WO2020244416A1 (en) Voice interactive wakeup electronic device and method based on microphone signal, and medium
WO2020244411A1 (en) Microphone signal-based voice interaction wakeup electronic device and method, and medium
US10276164B2 (en) Multi-speaker speech recognition correction system
US9299344B2 (en) Apparatus and method to classify sound to detect speech
CN112074900B (en) Audio analysis for natural language processing
JP6819672B2 (en) Information processing equipment, information processing methods, and programs
US8762144B2 (en) Method and apparatus for voice activity detection
EP3210205B1 (en) Sound sample verification for generating sound detection model
CN111432303B (en) Monaural headset, intelligent electronic device, method, and computer-readable medium
US20130211826A1 (en) Audio Signals as Buffered Streams of Audio Signals and Metadata
US20190355352A1 (en) Voice and conversation recognition system
US10109294B1 (en) Adaptive echo cancellation
KR102628211B1 (en) Electronic apparatus and thereof control method
KR20150112337A (en) display apparatus and user interaction method thereof
JP6585733B2 (en) Information processing device
WO2016183961A1 (en) Method, system and device for switching interface of smart device, and nonvolatile computer storage medium
JP2009178783A (en) Communication robot and its control method
CN110728993A (en) Voice change identification method and electronic equipment
KR20210042523A (en) An electronic apparatus and Method for controlling the electronic apparatus thereof
WO2021153101A1 (en) Information processing device, information processing method, and information processing program
KR102114365B1 (en) Speech recognition method and apparatus
WO2019187543A1 (en) Information processing device and information processing method
US20240079007A1 (en) System and method for detecting a wakeup command for a voice assistant

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20818432

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20818432

Country of ref document: EP

Kind code of ref document: A1