CN110097875A - Electronic device, method and medium for voice-interaction wake-up based on microphone signals - Google Patents
Electronic device, method and medium for voice-interaction wake-up based on microphone signals
- Publication number
- CN110097875A CN110097875A CN201910475949.XA CN201910475949A CN110097875A CN 110097875 A CN110097875 A CN 110097875A CN 201910475949 A CN201910475949 A CN 201910475949A CN 110097875 A CN110097875 A CN 110097875A
- Authority
- CN
- China
- Prior art keywords
- voice signal
- user
- voice
- microphone
- electronic equipment
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L2015/025—Phonemes, fenemes or fenones being the recognition units
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/223—Execution procedure of a spoken command
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/225—Feedback of the input speech
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D30/00—Reducing energy consumption in communication networks
- Y02D30/70—Reducing energy consumption in communication networks in wireless communication networks
Landscapes
- Engineering & Computer Science (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Telephone Function (AREA)
- User Interface Of Digital Computer (AREA)
Abstract
An intelligent electronic device with a built-in microphone is provided. The portable smart device interacts with the user through voice input as follows: it processes the sound signal captured by the microphone and determines whether the signal contains speech; in response to confirming that speech is present, it further determines, based on the microphone signal, whether the distance between the device and the user's mouth is below a predetermined threshold; and, in response to determining that the distance is below the threshold, it processes the microphone signal as voice input. The interaction method is well suited to voice input while carrying the device: the gesture is natural and simple, the steps needed to start voice input are reduced, and the interaction burden and difficulty are lowered, making interaction more natural.
Description
Technical field
The present invention relates generally to the field of voice input and, more specifically, to an intelligent electronic device and a voice-input triggering method.
Background technique
With the development of computer technology, speech recognition algorithms have matured, and voice input is becoming increasingly important because of its naturalness and effectiveness as an interaction mode. Users can interact with mobile devices (phones, watches, etc.) by voice to complete tasks such as command input, information queries, and voice chat.
Existing solutions for triggering voice input have several drawbacks:
1. Physical-button triggering
Voice input is activated after pressing (or holding) one or more physical buttons on the mobile device.
Drawbacks: a physical button is required; accidental triggering is easy; the user must press a key.
2. Interface-element triggering
Voice input is activated by tapping (or holding) an interface element (such as an icon) on the device screen.
Drawbacks: the device must have a screen; the trigger element occupies screen space; software UI constraints can make the trigger cumbersome; accidental triggering is easy.
3. Wake-word (voice) detection
A particular word (such as a product nickname) serves as the wake word; voice input is activated after the device detects it.
Drawbacks: poor privacy and social acceptability; lower interaction efficiency.
Summary of the invention
In view of the above, the present invention proposes the following.
According to one aspect of the invention, an intelligent electronic device with a built-in microphone is provided. The portable smart device interacts with the user through voice input as follows: it processes the sound signal captured by the microphone to determine whether it contains speech; in response to confirming that speech is present, it further determines, based on the microphone signal, whether the distance between the device and the user's mouth is below a predetermined threshold; and, in response to determining that it is, it processes the microphone signal as voice input.
Preferably, the predetermined threshold is 3 centimetres.
Preferably, the predetermined threshold is 1 centimetre.
Preferably, a proximity light sensor is also placed at the microphone of the device, and this sensor judges whether an object is approaching the device.
Preferably, a distance sensor is also placed at the microphone of the device and directly measures the distance between the device and the user's mouth.
Preferably, whether the distance between the device and the user's mouth is below the predetermined threshold is judged from the properties of the sound signal collected by the microphone.
Preferably, the speech signal includes one or a combination of the following: the user speaking at normal volume; the user speaking at low volume; and the sound produced when the user speaks without vocal-cord vibration.
Preferably, the electronic device is further operable to: in response to determining that the user is speaking close to the device, judge which of the following the user is producing: speech at normal volume; speech at low volume; or unvoiced speech without vocal-cord vibration; and, depending on the result, process the sound signal differently.
Preferably, the different processing consists of activating different applications to handle the voice input.
Preferably, the features used in the judgment include volume, spectral features, and energy distribution.
Preferably, the features used to judge whether the distance between the device and the user's mouth is below the predetermined threshold include time-domain and frequency-domain features of the sound signal, including volume and spectral energy.
Preferably, judging whether the distance between the device and the user's mouth is below the predetermined threshold includes: extracting the speech signal from the sound signal collected by the microphone with a filter; judging whether the energy of the speech signal exceeds a threshold; and, in response to the speech intensity exceeding the threshold, judging that the distance between the device and the user's mouth is below the predetermined threshold.
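The filter-then-energy-threshold test above can be sketched as follows. This is a minimal illustration under assumed values: the frame contents, function names, and the 0.01 energy threshold are not taken from the patent, and a real device would apply a band-pass filter before this step.

```python
def speech_energy(frame):
    """Mean squared amplitude of one (already filtered) audio frame."""
    return sum(s * s for s in frame) / len(frame)

def mouth_is_near(frames, energy_threshold=0.01):
    """Judge 'device is near the mouth' when the speech energy of any
    frame exceeds the threshold: close talk yields a loud signal."""
    return any(speech_energy(f) > energy_threshold for f in frames)

# Close-talking speech is much louder than far-field speech:
near = [[0.4, -0.5, 0.45, -0.35]]    # high-amplitude frame
far = [[0.01, -0.02, 0.015, -0.01]]  # low-amplitude frame
print(mouth_is_near(near), mouth_is_near(far))
```

The energy test is deliberately crude; the later "Preferably" clauses replace it with a neural network or a baseline comparison when a single fixed threshold is too brittle.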
Preferably, judging whether the distance between the device and the user's mouth is below the predetermined threshold includes processing the data acquired by the microphone with a deep neural network model.
Preferably, judging whether the distance between the device and the user's mouth is below the predetermined threshold includes: recording the user's sound signal when the user is not performing voice input; comparing the sound signal currently acquired by the microphone with that recording; and, if the current volume exceeds the recorded volume by a certain threshold, judging that the distance is below the predetermined threshold.
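The baseline-comparison variant can be sketched as a volume ratio against the earlier recording. The 12 dB margin, sample values, and function names here are illustrative assumptions, not values given in the patent.

```python
import math

def rms(samples):
    """Root-mean-square amplitude, a simple volume measure."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def is_close_talk(current, baseline, margin_db=12.0):
    """True when the current signal is markedly louder (by margin_db)
    than the baseline recorded while the user was not doing voice input."""
    gain_db = 20 * math.log10(rms(current) / rms(baseline))
    return gain_db > margin_db

baseline = [0.01, -0.01, 0.012, -0.008]   # ambient / far speech
close_talk = [0.5, -0.4, 0.45, -0.5]      # device raised to the mouth
print(is_close_talk(close_talk, baseline))
```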
Preferably, processing the sound signal as the user's voice input includes one or more of the following: storing the sound signal on a storage medium of the electronic device; sending the sound signal over the internet; recognizing the speech in the sound signal as text and storing the text on a storage medium of the device; recognizing the speech as text and sending the text over the internet; and recognizing the speech as text, interpreting the user's voice command, and executing the corresponding operation.
Preferably, the electronic device also identifies a specific user through voiceprint analysis and processes only sound signals containing that user's voice.
Preferably, the electronic device is a smartphone, smartwatch, smart ring, or the like.
Mobile devices herein include, but are not limited to, phones, head-mounted displays, watches, and small smart wearables such as rings and watches.
According to another aspect of the invention, a voice-input triggering method executed by an intelligent electronic device equipped with a microphone is provided, in which the device interacts with the user through voice input as follows: processing the sound signal captured by the microphone to determine whether it contains speech; in response to confirming that speech is present, further determining, based on the microphone signal, whether the distance between the device and the user's mouth is below a predetermined threshold; and, in response to determining that it is, processing the microphone signal as voice input.
According to another aspect of the invention, a computer-readable medium is provided, storing computer-executable instructions which, when executed by a computer, carry out a voice-interaction wake-up method comprising: processing the sound signal captured by the microphone to determine whether it contains speech; in response to confirming that speech is present, further determining, based on the microphone signal, whether the distance between the device and the user's mouth is below a predetermined threshold; and, in response to determining that it is, processing the microphone signal as voice input.
According to one aspect of the invention, an electronic device equipped with a microphone is provided. The device has a memory and a central processing unit; the memory stores computer-executable instructions which, when executed by the CPU, perform the following operations: analyzing the sound signal acquired by the microphone to identify whether it contains the voice of a person speaking and whether it contains the wind-noise sound produced when the airflow of speaking hits the microphone; and, in response to determining that the signal contains both the user's speech and such wind noise, processing the sound signal as the user's voice input.
Preferably, the user's speech includes: speaking at normal volume, speaking at low volume, and the sound produced when the user speaks without vocal-cord vibration.
Preferably, the electronic device is further operable to: in response to determining that the user is speaking close to the device, judge which of the following the user is producing: speech at normal volume, speech at low volume, or unvoiced speech without vocal-cord vibration; and process the sound signal differently depending on the result.
Preferably, the different processing consists of activating different applications to handle the voice input.
Preferably, the features used in the judgment include volume, spectral features, and energy distribution.
Preferably, processing the sound signal as the user's voice input includes one or more of the following: storing the sound signal on a storage medium of the electronic device; sending the sound signal over the internet; recognizing the speech in the sound signal as text and storing the text on a storage medium of the device; recognizing the speech as text and sending the text over the internet; and recognizing the speech as text, interpreting the user's voice command, and executing the corresponding operation.
Preferably, the electronic device is further operable to identify a specific user through voiceprint analysis and to process only sound signals containing that user's voice.
Preferably, the electronic device is one of a smartphone, a smartwatch, and a smart ring.
Preferably, the electronic device is further operable to use a neural network model to judge whether the sound signal contains the user's speech and the wind-noise sound produced when the airflow of speaking hits the microphone.
Preferably, identifying whether the sound signal contains human speech and whether it contains the wind noise produced by the airflow of speaking hitting the microphone includes: identifying whether the signal contains the user's speech; in response to determining that it does, recognizing the phonemes in the speech and representing the signal as a phoneme sequence; for each phoneme in the sequence, determining whether it is an aspirated phoneme, that is, one during which airflow leaves the mouth; cutting the sound signal into a sequence of segments of fixed window length; using frequency features to identify whether each segment contains wind noise; comparing the aspirated phonemes in the phoneme sequence with the segments identified as wind noise, and likewise comparing the non-aspirated phonemes with the wind-noise segments; and, when the overlap between aspirated phonemes and wind-noise segments is above a threshold while the overlap between non-aspirated phonemes and wind-noise segments is below a threshold, judging that the signal contains the wind noise produced when the airflow of the user's speech hits the microphone.
Preferably, identifying whether the sound signal contains human speech and whether it contains the wind noise produced by the airflow of speaking hitting the microphone includes: identifying sound features of wind noise in the signal; in response to determining that wind noise is present, identifying whether the signal contains speech; in response to determining that it does, recognizing the phoneme sequence corresponding to the speech; computing, from the wind-noise features, the wind-noise intensity at each moment; for each phoneme in the sequence, obtaining its aspiration intensity from a predefined data model; and analyzing the consistency between the wind-noise features and the phoneme sequence with a Gaussian-mixture Bayesian model, judging, when the agreement is above a threshold, that the signal contains the wind noise produced when the airflow of the user's speech hits the microphone.
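The core of both wind-noise variants is checking whether aspirated phonemes and wind-noise segments line up in time. A much-simplified sketch follows, with both label sequences assumed to be precomputed (a real system would run a phoneme recognizer and a spectral wind-noise detector); the 0.8 agreement threshold is an assumed illustrative value, not one from the patent.

```python
def agreement(flags_a, flags_b):
    """Fraction of time positions where two boolean label sequences agree."""
    assert len(flags_a) == len(flags_b)
    matches = sum(a == b for a, b in zip(flags_a, flags_b))
    return matches / len(flags_a)

def close_talk_detected(aspirated, wind_noise, threshold=0.8):
    """Close talk is inferred when aspirated phonemes coincide with
    wind-noise segments: airflow from the mouth is hitting the mic."""
    return agreement(aspirated, wind_noise) >= threshold

# Per-window labels: True = aspirated phoneme / wind noise present.
aspirated  = [True, False, True, True, False]
wind_noise = [True, False, True, True, False]  # well aligned -> close talk
print(close_talk_detected(aspirated, wind_noise))
```

A distant speaker can produce the same phoneme sequence, but the mouth airflow never reaches the microphone, so the wind-noise labels stay False and the agreement score collapses.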
According to another aspect of the invention, an electronic device with multiple built-in microphones is provided. The device has a memory and a central processing unit; the memory stores computer-executable instructions which, when executed by the CPU, perform the following operations: analyzing the sound signals acquired by the multiple microphones; judging whether the user is speaking close to the device; and, in response to determining that the user is, processing the acquired sound signal as the user's voice input.
Preferably, the multiple microphones form a microphone array.
Preferably, judging whether the user is speaking close to the device includes: using the time differences between the arrival of the speech signal at the individual microphones of the array to compute the position of the user's mouth relative to the array; and determining that the user is speaking close to the device when the distance between the mouth and the device is below a threshold.
Preferably, the distance threshold is 10 centimetres.
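A full array localizer solves for the mouth position from several such delays; the sketch below only shows the basic conversion underlying the time-difference-of-arrival idea, turning an inter-microphone sample delay into a path-length difference. The 48 kHz sample rate is an assumed value, not one specified in the patent.

```python
SPEED_OF_SOUND = 343.0   # m/s in air at ~20 C
SAMPLE_RATE = 48_000     # Hz, assumed

def path_difference_m(delay_samples):
    """Extra distance the sound travels to the farther microphone,
    given the measured delay (in samples) between the two signals."""
    return delay_samples * SPEED_OF_SOUND / SAMPLE_RATE

# A 14-sample delay at 48 kHz corresponds to roughly 0.1 m of path
# difference, on the order of the 10 cm threshold used above: large
# delays across a small device imply a source very close to one mic.
print(path_difference_m(14))
```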
Preferably, processing the sound signal as the user's voice input includes processing it differently depending on the distance between the speaker's mouth and the device.
Preferably, judging whether the user is speaking close to the device includes: judging whether the sound signal acquired by at least one microphone contains the user's speech; in response to determining that it does, extracting the speech signal from each microphone's sound signal; judging whether the amplitude difference between the speech signals extracted from different microphones exceeds a predetermined threshold; and, in response to determining that it does, confirming that the user is speaking close to the device.
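The amplitude-difference test exploits near-field acoustics: a mouth a centimetre from one microphone makes that channel much louder than the others, while a distant source reaches all microphones at nearly equal level. A minimal sketch, with the 2x ratio threshold and sample values as assumptions for illustration:

```python
import math

def rms(samples):
    """Root-mean-square amplitude of one channel's extracted speech."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def close_to_device(mic_signals, ratio_threshold=2.0):
    """True when the loudest channel exceeds the quietest by more than
    the ratio threshold, indicating a near-field (close-talking) source."""
    levels = [rms(sig) for sig in mic_signals]
    return max(levels) / max(min(levels), 1e-12) > ratio_threshold

near = [[0.5, -0.4, 0.45], [0.05, -0.04, 0.05]]   # bottom mic much louder
far = [[0.05, -0.04, 0.05], [0.05, -0.05, 0.04]]  # roughly equal levels
print(close_to_device(near), close_to_device(far))
```

Identifying which channel is loudest also supports the "responding microphone" clause below it: routing the input differently depending on which microphone the user spoke into.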
Preferably, the electronic device is further operable to: designate, among the multiple microphones, the one with the largest speech amplitude as the responding microphone; and process the user's voice input differently depending on which microphone is the responding one.
Preferably, judging whether the user is speaking close to the device includes processing the sound signals of the multiple microphones with a pre-trained machine-learning model.
Preferably, the user's speech includes: speaking at normal volume; speaking at low volume; and the sound produced when the user speaks without vocal-cord vibration.
Preferably, the electronic device is further operable to: in response to determining that the user is speaking close to the device, judge which of the following the user is producing: speech at normal volume; speech at low volume; or unvoiced speech without vocal-cord vibration; and, depending on the result, process the sound signal differently.
Preferably, the different processing consists of activating different applications to handle the voice input.
Preferably, the features used in the judgment include volume, spectral features, and energy distribution.
Preferably, processing the sound signal as the user's voice input includes one or more of the following: storing the sound signal on a storage medium of the electronic device; sending the sound signal over the internet; recognizing the speech in the sound signal as text and storing the text on a storage medium of the device; recognizing the speech as text and sending the text over the internet; and recognizing the speech as text, interpreting the user's voice command, and executing the corresponding operation.
Preferably, the electronic device is further operable to identify a specific user through voiceprint analysis and to process only sound signals containing that user's voice.
Preferably, the electronic device is one of a smartphone, a smartwatch, a smart ring, and a tablet computer.
According to another aspect of the invention, an electronic device with a built-in microphone is provided. The device has a memory and a central processing unit; the memory stores computer-executable instructions which, when executed by the CPU, perform the following operations: judging whether the sound signal acquired by the microphone contains speech; in response to confirming that it does, judging whether the user is whispering, that is, speaking at lower than normal volume; and, in response to determining that the user is whispering, processing the sound signal as voice input without any wake operation.
Preferably, the whispering includes both of two modes: whispering without vocal-cord vibration and whispering with vocal-cord vibration.
Preferably, the electronic device is further operable to: in response to determining that the user is whispering, judge whether the user is whispering without vocal-cord vibration or whispering with vocal-cord vibration; and process the sound signal differently depending on the result.
Preferably, the different processing consists of activating different applications to respond to the voice input.
Preferably, the signal features used to judge whether the user is whispering include volume, spectral features, and energy distribution.
Preferably, the signal features used to judge whether the user is whispering without vocal-cord vibration or with vocal-cord vibration include volume, spectral features, and energy distribution.
Preferably, judging whether the user is whispering includes processing the sound signal acquired by the microphone with a machine-learning model.
Preferably, the machine-learning model is a convolutional neural network model or a recurrent neural network model.
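The patent leaves the model unspecified beyond "CNN or RNN". As a stand-in, this toy sketch scores the two cues the text names most often, volume and a crude spectral feature (zero-crossing rate, since whispers are noise-like); all cutoffs, signals, and function names are assumptions for illustration, and a real implementation would train a CNN/RNN on labeled whispered versus voiced audio.

```python
import math

def rms(samples):
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def zero_crossing_rate(samples):
    """Whispers are noise-like, so they cross zero far more often than
    voiced speech of the same duration."""
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(samples) - 1)

def is_whisper(samples, volume_cutoff=0.1, zcr_cutoff=0.5):
    """Classify as whispering when the signal is both quiet and noise-like."""
    return rms(samples) < volume_cutoff and zero_crossing_rate(samples) > zcr_cutoff

# Toy signals: a loud 110 Hz voiced tone vs. a quiet noise-like signal.
voiced = [0.4 * math.sin(2 * math.pi * 110 * t / 8000) for t in range(800)]
whisper = [0.02 * (-1) ** t for t in range(800)]
print(is_whisper(voiced), is_whisper(whisper))
```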
Preferably, judging whether the user is whispering without vocal-cord vibration or with vocal-cord vibration includes processing the sound signal acquired by the microphone with a machine-learning model.
Preferably, the machine-learning model is a convolutional neural network model or a recurrent neural network model.
Preferably, processing the sound signal as the user's voice input includes one or more of the following: storing the sound signal on a storage medium of the electronic device; sending the sound signal over the internet; recognizing the speech in the sound signal as text and storing the text on a storage medium of the device; recognizing the speech as text and sending the text over the internet; and recognizing the speech as text, interpreting the user's voice command, and executing the corresponding operation.
Preferably, a specific user is identified by voiceprint analysis, and only sound signals containing that specific user's speech are processed.
Preferably, the electronic device is a smart phone, a smart watch, a smart ring, or the like.
Advantages of this scheme:
1. More natural interaction. Placing the device in front of the mouth triggers voice input, which matches users' habits and intuition.
2. Higher efficiency of use. The device can be operated with one hand; there is no need to switch between user interfaces or applications, nor to hold down a key — simply raising the device to the mouth makes it ready for use.
3. High recording quality. The device's microphone is close to the user's mouth, so the captured voice input signal is clear and less affected by ambient sound.
4. Good privacy and social acceptability. With the device in front of the mouth, the user only needs to produce a relatively quiet sound to complete high-quality voice input, causing less disturbance to others; meanwhile, the user's posture may include covering the mouth, which offers good privacy protection.
Description of the drawings
The above and other objects, features, and advantages of the present invention will become clearer and easier to understand from the following detailed description of embodiments of the invention with reference to the accompanying drawings, in which:
Fig. 1 is a schematic flow chart of a voice input interaction method according to an embodiment of the present invention.
Fig. 2 is an overview flow chart of a voice input triggering method based on differences between the sound signals received by multiple microphones, for an electronic device configured with multiple microphones, according to another embodiment of the present invention.
Fig. 3 is an overview flow chart of a voice input triggering method based on whisper-mode recognition, for an electronic device with a built-in microphone, according to an embodiment of the present invention.
Fig. 4 is an overview flow chart of a voice input triggering method based on distance judgment from the microphone's sound signal.
Fig. 5 is a schematic front view of a triggering gesture in which the microphone at the top of a mobile phone is brought close to the mouth, according to an embodiment of the present invention.
Fig. 6 is a schematic side view of a triggering gesture in which the microphone at the top of a mobile phone is brought close to the mouth, according to an embodiment of the present invention.
Fig. 7 is a schematic view of a triggering gesture in which the microphone at the bottom of a mobile phone is brought close to the mouth, according to an embodiment of the present invention.
Fig. 8 is a schematic view of a triggering gesture in which a smart watch's microphone is brought close to the mouth, according to an embodiment of the present invention.
Detailed description of embodiments
To help those skilled in the art better understand the present invention, the invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
The present disclosure triggers voice input for an intelligent electronic device based on features of the sound captured by its microphone, determining whether to launch the voice input application without the traditional physical-button trigger, interface-element trigger, or wake-word detection. Interaction is thus more natural: placing the device in front of the mouth triggers voice input, which matches users' habits and intuition.
The disclosure is presented below from the following aspects: 1. voice input triggering based on the wind-noise features of human speech — specifically, directly starting voice input by recognizing both the speech and the wind-noise sound produced when a person speaks, and taking the received sound signal as voice input; 2. voice input triggering based on differences between the sound signals received by multiple microphones; 3. voice input triggering based on whisper-mode recognition; 4. voice input triggering based on distance judgment from the microphone's sound signal.
One. Voice input triggering based on the wind-noise features of human speech
When a user speaks toward the microphone at close range, even if the sound is very quiet or the vocal cords are not engaged, the sound signal captured by the microphone contains two acoustic components: first, the sound produced by the vibration of the vocal cords and by the oral cavity; second, the wind-noise sound produced when the airflow of speech strikes the microphone. The electronic device can trigger its voice input application based on this characteristic.
Fig. 1 shows a schematic flow chart of a voice input interaction method 100 according to an embodiment of the present invention.
In step S101, the sound signal captured by the microphone is analyzed to identify whether it contains human speech and whether it contains the wind-noise sound produced by the airflow of speech striking the microphone.
In step S102, in response to determining that the sound signal contains the user's speech and contains the wind-noise sound produced by the airflow of the user's speech striking the microphone, the sound signal is processed as the user's voice input.
The voice input interaction method of this embodiment of the invention is particularly suitable for performing voice input without vocal-cord phonation in situations where privacy requirements are relatively high.
Here, the user's speech may include: the user speaking at normal volume, the user speaking at low volume, and the user mouthing words without vocal-cord phonation.
In one example, the different speaking modes above can be recognized and different feedback generated according to the recognition result; for instance, speaking normally controls the phone's voice assistant, whispering controls WeChat, and mouthing words without vocal-cord phonation produces a voice-transcription note.
As an example, processing the sound signal as the user's voice input includes one or more of:
storing the sound signal on a storage medium of the electronic device;
sending the sound signal over the internet;
recognizing the speech in the sound signal as text and storing the text on a storage medium of the electronic device;
recognizing the speech in the sound signal as text and sending the text over the internet;
recognizing the speech in the sound signal as text, understanding the user's voice command, and performing the corresponding operation.
In one example, the method further includes identifying a specific user by voiceprint analysis and processing only sound signals that contain that specific user's speech.
In one example, the electronic device is one of a smart phone, a smart watch, and a smart ring.
In one example, a neural network model is used to judge whether the sound signal contains the user's speech and the wind-noise sound produced by the airflow of speech striking the microphone. This is merely illustrative; other machine learning algorithms can be used.
In one example, identifying whether the sound signal contains human speech and whether it contains the wind-noise sound produced by the airflow of speech striking the microphone includes:
identifying whether the sound signal contains the user's speech;
in response to determining that the sound signal contains the user's speech, identifying the phonemes in the speech and representing the speech signal as a phoneme sequence;
for each phoneme in the phoneme sequence, determining whether it is an exhaled phoneme, i.e., one for which airflow leaves the mouth when the user pronounces it;
cutting the sound signal into a sequence of sound segments of fixed window length;
using frequency features to identify whether each sound segment contains wind noise;
comparing the exhaled phonemes in the phoneme sequence with the segments identified as wind noise in the segment sequence, and likewise comparing the non-exhaled phonemes with the wind-noise segments; when the overlap between exhaled phonemes and wind-noise segments is above a certain threshold and the overlap between non-exhaled phonemes and wind-noise segments is below a certain threshold, judging that the sound signal contains the wind-noise sound produced by the airflow of the user's speech striking the microphone.
In one example, identifying whether the sound signal contains human speech and whether it contains the wind-noise sound produced by the airflow of speech striking the microphone includes:
identifying acoustic features of wind noise in the sound signal;
in response to determining that the sound signal contains wind noise, identifying whether it contains a speech signal;
in response to determining that it contains a speech signal, recognizing the phoneme sequence corresponding to the speech signal;
computing the wind-noise feature strength at each moment from the wind-noise features of the sound signal;
for each phoneme in the phoneme sequence, obtaining its exhalation strength from a predetermined data model;
analyzing the consistency between the wind-noise features and the phoneme sequence with a Gaussian-mixture Bayesian model, and, when the agreement exceeds a certain threshold, judging that the sound signal contains the wind-noise sound produced by the airflow of the user's speech striking the microphone.
Two. Voice input triggering based on differences between the sound signals received by multiple microphones
Fig. 2 shows an overview flow chart of a voice input triggering method, based on differences between the sound signals received by multiple microphones, for an electronic device configured with multiple microphones according to another embodiment of the present invention.
The electronic device, such as a mobile phone, is an electronic device with multiple built-in microphones; it has a memory and a central processing unit, and computer-executable instructions are stored in the memory which, when executed by the central processing unit, can perform the voice input triggering method of this embodiment.
As shown in Fig. 2, in step S201, the sound signals captured by the multiple microphones are analyzed.
In one example, the multiple microphones include at least three microphones forming a microphone array; the spatial position of the sound source relative to the smart device can be estimated from the time differences with which the sound signal reaches each microphone.
The analyzed properties of the sound signal here include, for example, its amplitude and frequency.
In step S202, based on the sound signals captured by the multiple microphones, it is judged whether the user is speaking toward the electronic device at close range.
In one example, judging whether the user is speaking toward the electronic device at close range includes:
calculating the position of the user's mouth relative to the microphone array using the time differences between the sound signals arriving at each microphone of the array; and
when the distance of the user's mouth from the electronic device is below a certain threshold, determining that the user is speaking toward the electronic device at close range.
In one example, the distance threshold is 10 centimetres.
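As an illustration of the array-based judgment above, the following sketch estimates the mouth position from arrival-time differences by brute-force search. The microphone coordinates, grid resolution, and search extent are illustrative assumptions; a real implementation would use a closed-form or iterative multilateration solver rather than a grid.

```python
import itertools
import math

SPEED_OF_SOUND = 343.0  # m/s

# Hypothetical microphone layout on the device face (metres); the disclosure
# does not specify a geometry, so these coordinates are illustrative.
MICS = [(0.0, 0.0), (0.07, 0.0), (0.0, 0.14)]

def tdoas_for(source):
    """Arrival-time differences of each microphone relative to microphone 0."""
    d = [math.dist(source, m) for m in MICS]
    return [(di - d[0]) / SPEED_OF_SOUND for di in d[1:]]

def locate(measured_tdoas, grid_step=0.005, extent=0.3):
    """Brute-force search (over the quadrant in front of the device, an
    assumption of this sketch) for the source position whose predicted
    TDOAs best match the measured ones."""
    best, best_err = None, float("inf")
    steps = int(extent / grid_step)
    for i, j in itertools.product(range(steps + 1), repeat=2):
        cand = (i * grid_step, j * grid_step)
        err = sum((a - b) ** 2 for a, b in zip(tdoas_for(cand), measured_tdoas))
        if err < best_err:
            best, best_err = cand, err
    return best

mouth = (0.05, 0.05)            # true mouth position for the toy example
est = locate(tdoas_for(mouth))  # estimated position from the TDOAs alone
print(est)
```

The estimated position's distance to the device can then be compared against the 10 cm threshold to decide whether to trigger voice input.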
In step S203, in response to determining that the user is speaking toward the electronic device at close range, the sound signal captured by the microphones is processed as the user's voice input.
In one example, processing the sound signal as the user's voice input includes: processing the user's voice input differently according to the distance between the speaker's mouth and the electronic device. For example, when the distance is 0-3 cm, the voice assistant is activated to respond to the user's voice input; when the distance is 3-10 cm, the WeChat application is activated to respond to the user's voice input and send a voice message to a friend.
In one example, judging whether the user is speaking toward the electronic device at close range includes:
judging whether the sound signal captured by at least one microphone contains the user's speech;
in response to determining that the sound signal captured by at least one microphone contains the user's speech, extracting the speech signal from the sound signal captured by each microphone;
judging whether the amplitude difference between the speech signals extracted from the sound signals captured by different microphones exceeds a predetermined threshold; and
in response to determining that the amplitude difference exceeds the predetermined threshold, confirming that the user is speaking toward the electronic device at close range.
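A minimal sketch of this amplitude-difference test follows: when speech reaches one microphone much more strongly than another, the mouth is assumed to be close to the device. The 2x RMS ratio threshold is an illustrative assumption, not a value from the disclosure.

```python
def rms(samples):
    """Root-mean-square level of a sample list."""
    return (sum(s * s for s in samples) / len(samples)) ** 0.5

def close_range_by_amplitude(mic_signals, ratio_thresh=2.0):
    """mic_signals: list of per-microphone sample lists of equal length.
    True when the loudest microphone exceeds the quietest by ratio_thresh."""
    levels = [rms(sig) for sig in mic_signals]
    return max(levels) > ratio_thresh * min(levels)

near = [0.8, -0.7, 0.9, -0.8]        # microphone next to the mouth
far = [0.05, -0.04, 0.06, -0.05]     # microphone at the other end of the device
print(close_range_by_amplitude([near, far]))   # → True
print(close_range_by_amplitude([near, near]))  # → False
```

In the far-field case (the user speaking from across the room) all microphones receive roughly equal levels, so the ratio stays near 1 and the trigger does not fire.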
The above example can further include: defining the microphone with the largest speech signal amplitude among the multiple microphones as the responding microphone, and processing the user's voice input differently according to which microphone responds. For example, when the responding microphone is the one at the bottom of the smart phone, the voice assistant on the phone is activated; when the responding microphone is the one at the top of the smart phone, the recorder function is activated to record the user's speech to storage.
In one example, judging whether the user is speaking toward the electronic device at close range includes: processing the sound signals of the multiple microphones with a pre-trained machine learning model to judge whether the user is speaking toward the electronic device at close range. In general, training sample data is prepared and the chosen machine learning model is trained with it; in actual use (sometimes also called testing), the sound signals captured by the multiple microphones are input to the machine learning model as a test sample, and its output indicates whether the user is speaking toward the electronic device at close range. As examples, the machine learning model may be a deep learning neural network, a support vector machine, a decision tree, and so on.
In one example, the user's speech includes: the user speaking at normal volume, the user speaking at low volume, and the user mouthing words without vocal-cord phonation.
In one example, processing the sound signal as the user's voice input includes one or more of: storing the sound signal on a storage medium of the electronic device; sending the sound signal over the internet; recognizing the speech in the sound signal as text and storing the text on a storage medium of the electronic device; recognizing the speech in the sound signal as text and sending the text over the internet; recognizing the speech in the sound signal as text, understanding the user's voice command, and performing the corresponding operation.
In one example, the method further includes identifying a specific user by voiceprint analysis and processing only sound signals that contain that specific user's speech.
As an example, the electronic device is a smart phone, a smart watch, a smart ring, a tablet computer, or the like.
This embodiment uses the differences between the sound signals of different built-in microphones to recognize whether the user is speaking toward the electronic device at close range, and then decides whether to start voice input; it has advantages such as reliable recognition and a simple calculation method.
Three. Voice input triggering based on whisper-mode recognition
Whispering refers to speaking at a volume lower than normal speech (for example, a normal conversation with another person). Whispering includes two modes. One is whispering without vocal-cord vibration (commonly called a stage whisper); the other is whispering with vocal-cord vibration. In the mode without vocal-cord vibration, the sound produced mainly consists of the sound of air passing through the throat and mouth and the sound made by the tongue and teeth in the mouth. In the mode with vocal-cord vibration, the sound produced additionally includes the sound generated by the vibrating vocal cords. However, compared with speech at normal volume, the vocal cords vibrate less strongly during voiced whispering, and the vocal-cord sound produced is quieter. The frequency ranges of the sound produced by whispering without vocal-cord vibration and of the sound generated by vocal-cord vibration are different, so the two can be distinguished. Voiced whispering and normal-volume speech with vocal-cord vibration can be distinguished by a volume threshold; the specific threshold can be set in advance or set by the user.
An exemplary method: the sound signal captured by the microphone is filtered to extract two signal parts, the acoustic component V1 produced by vocal-cord vibration and the sound V2 of air passing through the throat and mouth and made by the tongue and teeth in the mouth. When the energy ratio of V1 to V2 is below a certain threshold, the user is determined to be whispering.
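The V1/V2 energy-ratio test can be sketched as follows. The band limits are assumptions of this sketch (vocal-cord energy taken below 300 Hz, breath/articulation energy above 1000 Hz), and a naive DFT stands in for the proper filter bank a real system would use.

```python
import math

def band_energy(samples, rate, f_lo, f_hi):
    """Sum of DFT power over bins whose frequency lies in [f_lo, f_hi)."""
    n = len(samples)
    total = 0.0
    for k in range(n // 2):
        f = k * rate / n
        if f_lo <= f < f_hi:
            re = sum(s * math.cos(2 * math.pi * k * t / n)
                     for t, s in enumerate(samples))
            im = -sum(s * math.sin(2 * math.pi * k * t / n)
                      for t, s in enumerate(samples))
            total += re * re + im * im
    return total

def is_whisper(samples, rate, ratio_thresh=0.5):
    """True when vocal-cord band energy V1 is weak relative to breath band V2."""
    v1 = band_energy(samples, rate, 50, 300)     # vocal-cord component
    v2 = band_energy(samples, rate, 1000, 4000)  # breath/articulation component
    return v1 < ratio_thresh * v2

RATE, N = 8000, 400
# Synthetic stand-ins: a 160 Hz "voiced" tone and a quiet 2000 Hz "breath" tone.
voiced = [math.sin(2 * math.pi * 160 * t / RATE) for t in range(N)]
breathy = [0.3 * math.sin(2 * math.pi * 2000 * t / RATE) for t in range(N)]
print(is_whisper(breathy, RATE))                                   # → True
print(is_whisper([v + b for v, b in zip(voiced, breathy)], RATE))  # → False
```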
Normally, whispering can only be detected when the user is relatively close to the microphone, for example at a distance of less than 30 centimetres. Defining close-range whispering as voice input is therefore an interaction mode that is easy for users to learn and understand and convenient to operate; it removes the need for an explicit wake-up operation such as pressing a dedicated wake-up key or speaking a wake word. Moreover, in most actual use, this mode will not be falsely triggered.
Fig. 3 shows an overview flow chart of a voice input triggering method, based on whisper-mode recognition, for an electronic device equipped with a microphone according to an embodiment of the present invention. The electronic device equipped with a microphone has a memory and a central processing unit; computer-executable instructions are stored in the memory which, when executed by the central processing unit, can perform the voice input triggering method according to this embodiment of the present invention.
As shown in Fig. 3, in step S301, it is judged whether the sound signal captured by the microphone contains a speech signal.
In step S302, in response to confirming that the sound signal captured by the microphone contains a speech signal, it is judged whether the user is whispering, i.e., speaking at a volume lower than normal.
In step S303, in response to determining that the user is whispering, the sound signal is processed as voice input without any wake-up operation.
Whispering may include two modes: whispering without vocal-cord phonation and whispering with vocal-cord phonation.
In one example, the voice input triggering method can further include: in response to determining that the user is whispering, judging whether the user is whispering without vocal-cord phonation or whispering with vocal-cord phonation, and processing the sound signal differently according to the result of the judgment.
As an example, the different processing is handing the voice input to different application programs. For instance, speaking normally controls the phone's voice assistant, whispering controls WeChat, and mouthing words without vocal-cord phonation produces a voice-transcription note.
As an example, the signal features used to judge whether the user is whispering may include volume, spectral features, energy distribution, and so on.
As an example, the signal features used to judge whether the user is whispering without vocal-cord phonation or whispering with vocal-cord phonation include volume, spectral features, energy distribution, and so on.
As an example, judging whether the user is whispering may include: processing the sound signal captured by the microphone with a machine learning model to judge whether the user is whispering.
As an example, the machine learning model can be a convolutional neural network model or a recurrent neural network model.
As an example, judging whether the user is whispering without vocal-cord phonation or whispering with vocal-cord phonation includes: processing the sound signal captured by the microphone with a machine learning model to judge which of the two whispering modes the user is using.
As an example, processing the sound signal as the user's voice input includes one or more of:
storing the sound signal on a storage medium of the electronic device;
sending the sound signal over the internet;
recognizing the speech in the sound signal as text and storing the text on a storage medium of the electronic device;
recognizing the speech in the sound signal as text and sending the text over the internet;
recognizing the speech in the sound signal as text, understanding the user's voice command, and performing the corresponding operation.
As an example, the voice input triggering method can further include identifying a specific user by voiceprint analysis and processing only sound signals that contain that specific user's speech.
As an example, the electronic device can be a smart phone, a smart watch, a smart ring, or the like.
For whisper modes and their detection methods, the following references can be consulted as examples:
Zhang, Chi, and John H. L. Hansen. "Analysis and classification of speech mode: whispered through shouted." Eighth Annual Conference of the International Speech Communication Association, 2007.
Meenakshi, G. Nisha, and Prasanta Kumar Ghosh. "Robust whisper activity detection using long-term log energy variation of sub-band signal." IEEE Signal Processing Letters 22.11 (2015): 1859-1863.
Four. Voice input triggering based on distance judgment from the microphone's sound signal
The overview flow chart of the voice input triggering method based on distance judgment from the microphone's sound signal is described below with reference to Fig. 4.
As shown in Fig. 4, in step 401, the sound signal captured by the microphone is processed to judge whether it contains a speech signal.
In step 402, in response to confirming that the sound signal contains a speech signal, it is further judged, based on the sound signal captured by the microphone, whether the distance between the intelligent electronic device and the user's mouth is below a predetermined threshold.
In step 403, in response to determining that the distance between the electronic device and the user's mouth is below the predetermined threshold, the sound signal captured by the microphone is processed as voice input.
In one example, the predetermined threshold is 10 centimetres.
The speech signal may include one or a combination of the following: sound produced by the user speaking at normal volume; sound produced by the user whispering; sound produced by the user mouthing words without vocal-cord phonation.
In one example, the features used to judge whether the distance between the intelligent electronic device and the user's mouth is below the predetermined threshold include time-domain and frequency-domain features of the sound signal, including volume and spectral energy.
In one example, judging whether the distance between the intelligent electronic device and the user's mouth is below the predetermined threshold includes: processing the data captured by the microphone with a deep neural network model to judge whether the distance between the intelligent electronic device and the user's mouth is below the predetermined threshold.
In one example, judging whether the distance between the intelligent electronic device and the user's mouth is below the predetermined threshold includes: recording the user's sound signal when no voice input is being performed, and comparing the sound signal currently captured by the microphone with the sound signal recorded when no voice input was being performed; if the volume of the currently captured sound signal exceeds the volume of the no-voice-input sound signal by a certain threshold, judging that the distance between the intelligent electronic device and the user's mouth is below the predetermined threshold.
In one example, processing the sound signal as the user's voice input includes one or more of: storing the sound signal on a storage medium of the electronic device; sending the sound signal over the internet; recognizing the speech in the sound signal as text and storing the text on a storage medium of the electronic device; recognizing the speech in the sound signal as text and sending the text over the internet; recognizing the speech in the sound signal as text, understanding the user's voice command, and performing the corresponding operation.
In one example, the voice input triggering further includes identifying a specific user by voiceprint analysis and processing only sound signals that contain that specific user's speech.
In one example, the electronic device is a smart phone, a smart watch, a smart ring, or the like.
Figs. 5 to 8 show several positions in which a user brings the microphone of a smart portable electronic device close to the mouth; the sound the user produces at this time is taken as voice input. Figs. 5 and 6 show the case in which the microphone is at the top of the mobile phone; in this case, when the user intends to interact by voice, the phone's microphone can be moved to within 0-10 centimetres of the mouth, and speaking directly serves as voice input. Fig. 7 shows the case in which the microphone is at the bottom of the mobile phone, which is similar to the top-microphone case; the two postures are not mutually exclusive, and if the phone has microphones at both top and bottom, either posture can implement the interaction scheme. Fig. 8 shows the case in which the device is a smart watch, which is similar to the case in which the device is a mobile phone. The above descriptions of triggering gestures are exemplary rather than exhaustive, and are not limited to the disclosed devices and microphone arrangements.
In one specific embodiment, voice input is received and triggered with a single microphone: the input received by the single microphone is first analyzed to judge whether it is speech; then, by analyzing features specific to close-range speech, such as microphone plosives, near-field wind noise, blowing sound, energy, spectral features, and time-domain features, it is judged whether the distance between the electronic device and the user's mouth is below a given threshold; voiceprint recognition judges whether the source of the voice input is a serviceable user; and these factors together determine whether the microphone signal is taken as voice input.
In another specific embodiment, voice input is received and triggered with dual microphones: by analyzing the feature differences between the two microphones' input signals, such as energy features and spectral features, it is judged whether the sound source is close to one of the microphones; the signal difference between the two microphones is then used to suppress environmental noise and separate the speech into a single channel; the feature analysis of the single-microphone embodiment above then judges whether the distance between the electronic device and the user's mouth is below a given threshold; voiceprint recognition judges whether the source of the voice input is a serviceable user; and these factors together determine whether the signal is taken as voice input.
In another specific embodiment, voice input is received and triggered with a multi-microphone array: by comparing the signals of the voice input received by the different microphones, near-field speech is separated from the environment to detect whether the sound signal contains speech; the sound-source localization technique of the multi-microphone array judges whether the distance between the position of the user's mouth and the device is below a predetermined threshold; voiceprint recognition judges whether the source of the voice input is a serviceable user; and these factors together determine whether the signal is taken as voice input.
In one example, when the smart portable electronic device detects by analyzing the sound signal that the articulation position is near the device itself, that is, the mobile device is close to the user's mouth, it takes the sound signal as voice input, and, combined with natural language processing techniques, understands the user's voice input and completes the corresponding task according to the task and its context.
The microphone is not limited to the foregoing examples, and may include one or a combination of the following: a single built-in microphone; dual built-in microphones; a built-in multi-microphone array; an external wireless microphone; and an external wired microphone.
As mentioned above, the smart portable electronic device can be a mobile phone, used with a binaural Bluetooth headset, a wired headset with a microphone, or another microphone sensor.
The smart portable electronic device can be a smart wearable device such as a wrist-watch or a smart ring.
The smart portable electronic device can be a head-mounted smart display device equipped with a microphone or a microphone array.
In one example, after the electronic device activates the voice input application, it can produce feedback output, the feedback output including one or a combination of vibration, voice, and image.
The schemes of the embodiments of the present invention can provide one or more of the following advantages:
1. More natural interaction. Placing the device in front of the mouth triggers voice input, which matches users' habits and intuition.
2. Higher efficiency of use. The device can be operated with one hand; there is no need to switch between user interfaces or applications, nor to hold down a key — simply raising the device to the mouth makes it ready for use.
3. High recording quality. The device's microphone is close to the user's mouth, so the captured voice input signal is clear and less affected by ambient sound.
4. Good privacy and social acceptability. With the device in front of the mouth, the user only needs to produce a relatively quiet sound to complete high-quality voice input, causing less disturbance to others; meanwhile, the user's posture may include covering the mouth, which offers good privacy protection.
The embodiments of the present invention have been described above; the above description is exemplary rather than exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be obvious to those of ordinary skill in the art without departing from the scope and spirit of the illustrated embodiments. Therefore, the protection scope of the present invention shall be subject to the scope of the appended claims.
Claims (10)
1. An intelligent electronic device configured with a microphone, the smart portable electronic device operating as follows to conduct voice-input-based interaction with a user:
processing the sound signal captured by the microphone to judge whether the sound signal contains a speech signal;
in response to confirming that the sound signal contains a speech signal, further judging, based on the sound signal captured by the microphone, whether the distance between the intelligent electronic device and the user's mouth is below a predetermined threshold; and
in response to determining that the distance between the electronic device and the user's mouth is below the predetermined threshold, processing the sound signal captured by the microphone as voice input.
2. The intelligent electronic device according to claim 1, wherein the predetermined threshold is 3 centimeters.
3. The intelligent electronic device according to claim 1, wherein the predetermined threshold is 1 centimeter.
4. The intelligent electronic device according to claim 1, further comprising a proximity light sensor at the microphone of the electronic device, the proximity light sensor being used to judge whether an object is approaching the electronic device.
5. The intelligent electronic device according to claim 1, further comprising a distance sensor at the microphone of the electronic device, the distance sensor directly measuring the distance between the electronic device and the user's mouth.
6. The intelligent electronic device according to claim 1, wherein whether the distance between the intelligent electronic device and the user's mouth is less than the predetermined threshold is judged from the characteristics of the sound signal collected by the microphone.
7. The intelligent electronic device according to claim 1, wherein the voice signal includes one or a combination of the following:
the sound produced when the user speaks at normal volume;
the sound produced when the user speaks in a low voice;
the sound produced when the user speaks without vocal-cord vibration.
8. The intelligent electronic device according to claim 1, further comprising:
in response to determining that the user is speaking close to the electronic device,
judging in which of the following ways the user is vocalizing, including:
the user speaking at normal volume,
the user speaking at low volume,
the user speaking without vocal-cord vibration; and
applying different processing to the voice signal according to the result of the judgment.
9. A voice-interaction wake-up method executed by an intelligent electronic device configured with a microphone, including the following operations for the intelligent electronic device to conduct voice-input-based interaction with a user:
processing the sound signal captured by the microphone to judge whether a voice signal is present in the sound signal;
in response to confirming that a voice signal is present in the sound signal, further judging, based on the sound signal collected by the microphone, whether the distance between the intelligent electronic device and the user's mouth is less than a predetermined threshold; and
in response to determining that the distance between the electronic device and the user's mouth is less than the predetermined threshold, processing the sound signal collected by the microphone as voice input.
10. A computer-readable medium having computer-executable instructions stored thereon, the computer-executable instructions, when executed by a computer, causing the computer to perform a voice-interaction wake-up method, the voice-interaction wake-up method including:
processing the sound signal captured by the microphone to judge whether a voice signal is present in the sound signal;
in response to confirming that a voice signal is present in the sound signal, further judging, based on the sound signal collected by the microphone, whether the distance between the intelligent electronic device and the user's mouth is less than a predetermined threshold; and
in response to determining that the distance between the electronic device and the user's mouth is less than the predetermined threshold, processing the sound signal collected by the microphone as voice input.
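The three-step flow recited in claims 1, 9, and 10 (detect speech in the captured sound, judge the mouth-to-device distance, then accept the signal as voice input) can be sketched as follows. This is a minimal illustration, not the patented implementation: the energy-based `detect_voice` check and the `near_mouth_probability` value standing in for the device's distance judgment are assumptions.

```python
# Minimal sketch of the claimed wake-up flow: (1) check whether the
# captured sound contains speech, (2) gate on an estimated
# mouth-to-device proximity, (3) only then treat the frame as voice
# input. All thresholds and helpers are illustrative assumptions.

def detect_voice(frame, energy_threshold=0.01):
    """Crude voice-activity check: mean squared amplitude of the frame."""
    energy = sum(x * x for x in frame) / len(frame)
    return energy > energy_threshold

def accept_as_voice_input(frame, near_mouth_probability, proximity_threshold=0.5):
    """Return True when the frame should be forwarded as voice input.

    near_mouth_probability stands in for whatever proximity estimate the
    device derives (proximity sensor, distance sensor, or signal
    features, per claims 4-6).
    """
    if not detect_voice(frame):
        return False  # no voice signal present in the sound signal
    if near_mouth_probability < proximity_threshold:
        return False  # device judged farther than the distance threshold
    return True       # speech present and device close to the mouth

# Illustrative frames: a loud speech-like burst and near-silence.
loud_frame = [0.3, -0.4, 0.5, -0.2] * 100
quiet_frame = [0.001, -0.001] * 200
```

Note that both gates must pass: loud speech far from the mouth, or a close-held but silent device, should not trigger voice input.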
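Claim 6 judges distance from characteristics of the collected sound signal itself. One plausible cue (an assumption here, not something the claim specifies) is that very-close-range speech is dominated by low-frequency breath and pop noise at the microphone; the sketch below tests the low-frequency share of a naive DFT spectrum.

```python
import cmath
import math

def low_freq_ratio(frame, cutoff_bin=4):
    """Share of spectral energy below cutoff_bin, via a naive DFT."""
    n = len(frame)
    spectrum = [
        abs(sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n)))
        for k in range(n // 2)
    ]
    total = sum(s * s for s in spectrum) or 1.0
    low = sum(s * s for s in spectrum[:cutoff_bin])
    return low / total

def near_mouth(frame, ratio_threshold=0.6):
    """Guess 'microphone within the distance threshold' when
    low-frequency energy dominates (close-range breath noise)."""
    return low_freq_ratio(frame) > ratio_threshold

# Example frames: a slowly varying signal (breath-like) versus a
# rapidly oscillating one (no close-range low-frequency dominance).
breathy_frame = [math.sin(2 * math.pi * t / 32) for t in range(32)]
distant_frame = [math.cos(2 * math.pi * 8 * t / 32) for t in range(32)]
```

A production system would use an FFT and a trained classifier rather than a single hand-set ratio threshold; the point is only that the decision can come from the signal alone, without the extra sensors of claims 4 and 5.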
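Claim 8 branches the processing on how the user is vocalizing. A sketch of such a three-way split is below; the zero-crossing-rate cue for unvoiced (no vocal-cord vibration) speech, the energy thresholds, and the handler names are illustrative assumptions.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose sign flips; noise-like
    unvoiced speech (no vocal-cord vibration) tends to score high."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0)
    return crossings / (len(frame) - 1)

def classify_vocalization(frame, loud_energy=0.05, unvoiced_zcr=0.4):
    """Illustrative three-way split per claim 8:
    'normal', 'soft', or 'unvoiced'."""
    if zero_crossing_rate(frame) > unvoiced_zcr:
        return "unvoiced"
    energy = sum(x * x for x in frame) / len(frame)
    return "normal" if energy > loud_energy else "soft"

def route_voice_signal(frame):
    """Apply different processing depending on the vocalization mode."""
    handlers = {
        "normal": "standard speech recognition",
        "soft": "gain-boosted recognition",
        "unvoiced": "whisper/unvoiced model",
    }
    return handlers[classify_vocalization(frame)]

# Example frames for the three modes.
normal_frame = [0.5, 0.4, 0.3, 0.4] * 50
soft_frame = [0.01, 0.02] * 100
unvoiced_frame = [0.1, -0.1] * 100
```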
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910475949.XA CN110097875B (en) | 2019-06-03 | 2019-06-03 | Microphone signal based voice interaction wake-up electronic device, method, and medium |
PCT/CN2020/089551 WO2020244355A1 (en) | 2019-06-03 | 2020-05-11 | Microphone signal-based voice interaction wake-up electronic device, method, and medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910475949.XA CN110097875B (en) | 2019-06-03 | 2019-06-03 | Microphone signal based voice interaction wake-up electronic device, method, and medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110097875A true CN110097875A (en) | 2019-08-06 |
CN110097875B CN110097875B (en) | 2022-09-02 |
Family
ID=67450117
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910475949.XA Active CN110097875B (en) | 2019-06-03 | 2019-06-03 | Microphone signal based voice interaction wake-up electronic device, method, and medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110097875B (en) |
WO (1) | WO2020244355A1 (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---|
CN111276155A (en) * | 2019-12-20 | 2020-06-12 | 上海明略人工智能(集团)有限公司 | Voice separation method, device and storage medium |
CN111343410A (en) * | 2020-02-14 | 2020-06-26 | 北京字节跳动网络技术有限公司 | Mute prompt method and device, electronic equipment and storage medium |
CN111681654A (en) * | 2020-05-21 | 2020-09-18 | 北京声智科技有限公司 | Voice control method and device, electronic equipment and storage medium |
CN111933140A (en) * | 2020-08-27 | 2020-11-13 | 恒玄科技(上海)股份有限公司 | Method, device and storage medium for detecting voice of earphone wearer |
CN111933140B (en) * | 2020-08-27 | 2023-11-03 | 恒玄科技(上海)股份有限公司 | Method, device and storage medium for detecting voice of earphone wearer |
WO2020244355A1 (en) * | 2019-06-03 | 2020-12-10 | 清华大学 | Microphone signal-based voice interaction wake-up electronic device, method, and medium |
CN114260919A (en) * | 2022-01-18 | 2022-04-01 | 华中科技大学同济医学院附属协和医院 | Intelligent robot |
CN114260919B (en) * | 2022-01-18 | 2023-08-29 | 华中科技大学同济医学院附属协和医院 | Intelligent robot |
WO2024055831A1 (en) * | 2022-09-14 | 2024-03-21 | 荣耀终端有限公司 | Voice interaction method and apparatus, and terminal |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20040141418A1 (en) * | 2003-01-22 | 2004-07-22 | Fujitsu Limited | Speaker distance detection apparatus using microphone array and speech input/output apparatus |
US20130085757A1 (en) * | 2011-09-30 | 2013-04-04 | Kabushiki Kaisha Toshiba | Apparatus and method for speech recognition |
WO2013091677A1 (en) * | 2011-12-20 | 2013-06-27 | Squarehead Technology As | Speech recognition method and system |
CN104657105A (en) * | 2015-01-30 | 2015-05-27 | 腾讯科技(深圳)有限公司 | Method and device for starting voice input function of terminal |
CN105096946A (en) * | 2014-05-08 | 2015-11-25 | 钰太芯微电子科技(上海)有限公司 | Voice activation detection based awakening device and method |
CN106254612A (en) * | 2015-06-15 | 2016-12-21 | 中兴通讯股份有限公司 | A kind of sound control method and device |
CN106412259A (en) * | 2016-09-14 | 2017-02-15 | 广东欧珀移动通信有限公司 | Mobile terminal call control method and apparatus, and mobile terminal |
CN106448672A (en) * | 2016-10-27 | 2017-02-22 | Tcl通力电子(惠州)有限公司 | Sound system and control method |
CN107889031A (en) * | 2017-11-30 | 2018-04-06 | 广东小天才科技有限公司 | A kind of audio control method, audio control apparatus and electronic equipment |
CN109448759A (en) * | 2018-12-28 | 2019-03-08 | 武汉大学 | A kind of anti-voice authentication spoofing attack detection method based on gas explosion sound |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105120059B (en) * | 2015-07-07 | 2019-03-26 | 惠州Tcl移动通信有限公司 | Mobile terminal and its method that earphone call noise reduction is controlled according to breathing power |
CN105847584B (en) * | 2016-05-12 | 2019-03-05 | 歌尔股份有限公司 | A kind of method of smart machine identification secret words |
US10192552B2 (en) * | 2016-06-10 | 2019-01-29 | Apple Inc. | Digital assistant providing whispered speech |
EP3613206A4 (en) * | 2017-06-09 | 2020-10-21 | Microsoft Technology Licensing, LLC | Silent voice input |
CN109686378B (en) * | 2017-10-13 | 2021-06-08 | 华为技术有限公司 | Voice processing method and terminal |
CN110097875B (en) * | 2019-06-03 | 2022-09-02 | 清华大学 | Microphone signal based voice interaction wake-up electronic device, method, and medium |
CN110428806B (en) * | 2019-06-03 | 2023-02-24 | 交互未来(北京)科技有限公司 | Microphone signal based voice interaction wake-up electronic device, method, and medium |
CN110111776A (en) * | 2019-06-03 | 2019-08-09 | 清华大学 | Interactive voice based on microphone signal wakes up electronic equipment, method and medium |
CN110223711B (en) * | 2019-06-03 | 2021-06-01 | 清华大学 | Microphone signal based voice interaction wake-up electronic device, method, and medium |
- 2019-06-03 CN CN201910475949.XA patent/CN110097875B/en active Active
- 2020-05-11 WO PCT/CN2020/089551 patent/WO2020244355A1/en active Application Filing
Also Published As
Publication number | Publication date |
---|---|
CN110097875B (en) | 2022-09-02 |
WO2020244355A1 (en) | 2020-12-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110097875A (en) | Interactive voice based on microphone signal wakes up electronic equipment, method and medium | |
CN110223711A (en) | Interactive voice based on microphone signal wakes up electronic equipment, method and medium | |
CN110428806A (en) | Interactive voice based on microphone signal wakes up electronic equipment, method and medium | |
CN110111776A (en) | Interactive voice based on microphone signal wakes up electronic equipment, method and medium | |
US10276164B2 (en) | Multi-speaker speech recognition correction system | |
CN107481718B (en) | Audio recognition method, device, storage medium and electronic equipment | |
CN103095911B (en) | Method and system for finding mobile phone through voice awakening | |
CN104168353B (en) | Bluetooth headset and its interactive voice control method | |
CN111432303B (en) | Monaural headset, intelligent electronic device, method, and computer-readable medium | |
CN109074806A (en) | Distributed audio output is controlled to realize voice output | |
CN108735209A (en) | Wake up word binding method, smart machine and storage medium | |
CN110164440A (en) | Electronic equipment, method and medium are waken up based on the interactive voice for sealing mouth action recognition | |
CN107978316A (en) | The method and device of control terminal | |
CN108346425A (en) | A kind of method and apparatus of voice activity detection, the method and apparatus of speech recognition | |
CN110364156A (en) | Voice interactive method, system, terminal and readable storage medium storing program for executing | |
CN111105796A (en) | Wireless earphone control device and control method, and voice control setting method and system | |
CN111798850B (en) | Method and system for operating equipment by voice and server | |
EP4002363A1 (en) | Method and apparatus for detecting an audio signal, and storage medium | |
US11626104B2 (en) | User speech profile management | |
CN112102850A (en) | Processing method, device and medium for emotion recognition and electronic equipment | |
CN109036410A (en) | Audio recognition method, device, storage medium and terminal | |
CN110728993A (en) | Voice change identification method and electronic equipment | |
CN111835522A (en) | Audio processing method and device | |
CN107403623A (en) | Store method, terminal, Cloud Server and the readable storage medium storing program for executing of recording substance | |
KR102037789B1 (en) | Sign language translation system using robot |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||