CN107705785A

CN107705785A - Sound localization method, intelligent sound box and the computer-readable medium of intelligent sound box

Info

Publication number: CN107705785A
Application number: CN201710647123.8A
Authority: CN
Inventors: 高聪
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Baidu Online Network Technology Beijing Co Ltd; Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2017-08-01
Filing date: 2017-08-01
Publication date: 2018-02-16

Abstract

The present invention provides a sound localization method for an intelligent sound box, an intelligent sound box, and a computer-readable medium. The method includes: when it is determined that a voice signal sent by a target sound source needs to be collected, obtaining a first voice signal that carries a preset wake-up word, was sent by the target sound source, and was received by at least two signal receiving module pairs on the intelligent sound box; obtaining, for each signal receiving module pair, the time difference with which its two signal receiving modules received the first voice signal; and determining, according to the first voice signal and the time difference of each signal receiving module pair, the direction of the target sound source that sent the first voice signal. With this technical solution, the target sound source can be located even when many sound sources are present, so the intelligent sound box can collect only the voice signal arriving from the located direction and serve the user corresponding to that target sound source. The solution also enriches the functions of the intelligent sound box and makes it more flexible and convenient to use.

Description

Sound localization method, intelligent sound box and the computer-readable medium of intelligent sound box

【Technical field】

The present invention relates to the field of computer application technology, and in particular to a sound localization method for an intelligent sound box, an intelligent sound box, and a computer-readable medium.

【Background technology】

With the development of science and technology, smart devices have gradually entered users' homes and formed intelligent home environments. For example, an intelligent sound box, as one kind of smart device in a smart home, can help a user look up music, check the weather, chat, hold a conversation, and so on. An intelligent sound box therefore needs functions such as speech recognition, semantic parsing, content services, response-script generation, and text-to-speech (TTS) broadcast of feedback.

In the prior art, every intelligent sound box is given a default wake-up word, and the intelligent sound box may stay in a sleep state while it is not working. When a user needs the intelligent sound box to start, the user can call its wake-up word by voice; on detecting its own wake-up word, the intelligent sound box wakes up and enters the working state. It then receives the voice query (Query) input by the user, performs speech recognition and semantic parsing, retrieves the result of the voice Query, generates the wording of the feedback information from the query result, converts that wording by TTS, and reports the feedback information to the user by voice.

However, an intelligent sound box in the prior art cannot locate a sound source. If several users around the intelligent sound box call it at the same time, those users amount to multiple sound sources, which throws the operation of the intelligent sound box into confusion, so that the voice Query cannot be served.

【Summary of the invention】

The present invention provides a sound localization method for an intelligent sound box, an intelligent sound box, and a computer-readable medium, which enable the intelligent sound box to locate a sound source.

The present invention provides a sound localization method for an intelligent sound box, the method including:

when it is determined that a voice signal sent by a target sound source needs to be collected, obtaining a first voice signal that carries a preset wake-up word, was sent by the target sound source, and was received by at least two signal receiving module pairs on the intelligent sound box, the preset wake-up word being used by the target sound source to wake up the intelligent sound box;

obtaining, for each signal receiving module pair, the time difference with which the two signal receiving modules of that pair received the first voice signal;

determining, according to the first voice signal and the time difference with which each signal receiving module pair received the first voice signal, the direction of the target sound source that sent the first voice signal.

Optionally, in the method described above, after determining the direction of the target sound source that sent the first voice signal according to the first voice signal and the time difference with which each signal receiving module pair received the first voice signal, the method further includes:

rotating a positioning indication mark to the direction of the target sound source and/or lighting a positioning indicator lamp toward the direction of the target sound source, to inform the user corresponding to the target sound source that the direction of the target sound source has been determined.

Optionally, in the method described above, obtaining the time difference with which the two signal receiving modules of each signal receiving module pair received the first voice signal specifically includes:

taking the first signal receiving module pair of the at least two signal receiving module pairs as a reference, and selecting a candidate direction θ of the target sound source;

for each signal receiving module pair, obtaining, according to the candidate direction θ of the target sound source, the time difference t0 with which the two signal receiving modules of that pair receive the first voice signal, where t0 is a function of θ.

Optionally, in the method described above, determining the direction of the target sound source that sent the first voice signal according to the first voice signal and the time difference with which each signal receiving module pair received the first voice signal specifically includes:

aligning in time, according to the time difference t0 with which each signal receiving module pair receives the first voice signal, the first voice signals received by the two signal receiving modules of that pair;

calculating, for each signal receiving module pair, the correlation between its two aligned first voice signals;

summing the correlations of all signal receiving module pairs to obtain an overall correlation;

taking the candidate direction θ at which the overall correlation reaches its maximum as the target direction of the target sound source.

Optionally, in the method described above, aligning in time, according to the time difference t0 with which each signal receiving module pair receives the first voice signal, the first voice signals received by the two signal receiving modules of that pair specifically includes:

delaying by the time difference t0 the first voice signal that was received earlier of the two first voice signals received by each signal receiving module pair, or advancing by the time difference t0 the first voice signal that was received later, so that the two first voice signals are aligned in time.
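The alignment step can be sketched on sampled signals as follows. This is a minimal illustration rather than the implementation of the invention: the function name `shift`, the whole-sample delay, and the zero padding at the edges are assumptions made for the example.

```python
def shift(signal, n):
    """Return a copy of `signal` shifted by n samples, keeping its length.

    n > 0 delays the signal (zeros are inserted at the front);
    n < 0 advances it (zeros are appended at the end).
    Zero padding is an assumption of this sketch.
    """
    length = len(signal)
    if n >= 0:
        return [0.0] * min(n, length) + list(signal[:max(length - n, 0)])
    return list(signal[-n:]) + [0.0] * min(-n, length)


# Hypothetical pair: the second module hears the same samples 2 samples later.
earlier = [1.0, 2.0, 3.0, 4.0, 0.0, 0.0]
later = [0.0, 0.0, 1.0, 2.0, 3.0, 4.0]

# Option 1: delay the earlier signal by t0 (here 2 samples).
aligned_by_delay = shift(earlier, 2) == later
# Option 2: advance the later signal by t0.
aligned_by_advance = shift(later, -2) == earlier
```

Either option leaves the two signals sample-aligned; which one is used only changes which signal acts as the time reference.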

Optionally, in the method described above, before obtaining the first voice signal that carries the preset wake-up word, was sent by the target sound source, and was received by the at least two signal receiving module pairs on the intelligent sound box, the method further includes:

determining that the voice signal sent by the target sound source needs to be collected.

Optionally, determining that the voice signal sent by the target sound source needs to be collected specifically includes:

obtaining the first voice signal, carrying the preset wake-up word, sent by the target sound source;

extracting the preset wake-up word from the first voice signal;

extracting the voiceprint feature of the first voice signal;

judging, according to a pre-stored correspondence between the preset wake-up word and the voiceprint feature of the target sound source, whether the preset wake-up word matches the voiceprint feature of the first voice signal;

if they match, determining that the voice signal sent by the target sound source needs to be collected.

Optionally, in the method described above, before obtaining the first voice signal, carrying the preset wake-up word, sent by the target sound source, the method further includes:

receiving a second voice signal, carrying the preset wake-up word, input by voice by the user corresponding to the target sound source;

extracting the preset wake-up word from the second voice signal;

extracting the voiceprint feature of the second voice signal as the voiceprint feature of the target sound source;

establishing and storing the correspondence between the preset wake-up word and the voiceprint feature of the target sound source.
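The enrollment and matching clauses above can be sketched as a small registry. Everything concrete here is hypothetical: the text does not say how a voiceprint feature is represented or compared, so the fixed-length feature vectors, the cosine-similarity comparison, the threshold value, and the names `WakeWordRegistry`, `enroll`, and `should_collect` are assumptions chosen only to make the flow visible. Wake-up word extraction and voiceprint extraction are outside the sketch.

```python
import math

MATCH_THRESHOLD = 0.8  # assumed similarity threshold, not taken from the source


def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors (assumed comparison)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class WakeWordRegistry:
    """Stores the correspondence between a wake-up word and a voiceprint."""

    def __init__(self):
        self._voiceprints = {}

    def enroll(self, wake_word, voiceprint):
        # Second voice signal: store the voiceprint of the target sound source.
        self._voiceprints[wake_word] = list(voiceprint)

    def should_collect(self, wake_word, voiceprint):
        # First voice signal: collect only if the voiceprint matches the stored one.
        stored = self._voiceprints.get(wake_word)
        if stored is None:
            return False
        return cosine_similarity(stored, voiceprint) >= MATCH_THRESHOLD


registry = WakeWordRegistry()
registry.enroll("great Bai", [0.9, 0.1, 0.3])        # made-up feature vector
same_user = registry.should_collect("great Bai", [0.88, 0.12, 0.31])
other_user = registry.should_collect("great Bai", [0.1, 0.9, 0.2])
```

With these made-up vectors the enrolled user is accepted and the other speaker is rejected, which is the decision the matching clause describes.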

The present invention provides an intelligent sound box, the intelligent sound box including:

a signal acquisition module, configured to obtain, when it is determined that a voice signal sent by a target sound source needs to be collected, a first voice signal that carries a preset wake-up word, was sent by the target sound source, and was received by at least two signal receiving module pairs on the intelligent sound box, the preset wake-up word being used by the target sound source to wake up the intelligent sound box;

a time difference acquisition module, configured to obtain, for each signal receiving module pair, the time difference with which the two signal receiving modules of that pair received the first voice signal;

a locating module, configured to determine, according to the first voice signal and the time difference with which each signal receiving module pair received the first voice signal, the direction of the target sound source that sent the first voice signal.

Optionally, the intelligent sound box described above further includes:

a positioning indication module, configured to rotate a positioning indication mark to the direction of the target sound source and/or light a positioning indicator lamp toward the direction of the target sound source, to inform the user corresponding to the target sound source that the direction of the target sound source has been determined.

Optionally, in the intelligent sound box described above, the time difference acquisition module is specifically configured to:

Optionally, in the intelligent sound box described above, the locating module is specifically configured to:

Optionally, in the intelligent sound box described above, the locating module is specifically configured to delay by the time difference t0 the first voice signal that was received earlier of the two first voice signals received by each signal receiving module pair, or to advance by the time difference t0 the first voice signal that was received later, so that the two first voice signals are aligned in time.

Optionally, the intelligent sound box described above further includes:

a determining module, configured to determine that the voice signal sent by the target sound source needs to be collected;

Further, the determining module is specifically configured to:

extract the preset wake-up word from the first voice signal;

extract the voiceprint feature of the first voice signal;

a receiving module, configured to receive a second voice signal, carrying the preset wake-up word, input by voice by the user corresponding to the target sound source;

an extraction module, configured to extract the preset wake-up word from the second voice signal;

the extraction module being further configured to extract the voiceprint feature of the second voice signal as the voiceprint feature of the target sound source;

an establishing module, configured to establish and store the correspondence between the preset wake-up word and the voiceprint feature of the target sound source.

The present invention also provides an intelligent sound box that includes multiple microphones for receiving and transmitting signals; the intelligent sound box further includes:

one or more processors;

a memory for storing one or more programs,

where, when the one or more programs are executed by the one or more processors, the one or more processors implement the sound localization method of the intelligent sound box described above.

The present invention also provides a computer-readable medium on which a computer program is stored; when executed by a processor, the program implements the sound localization method of the intelligent sound box described above.

With the sound localization method of the intelligent sound box, the intelligent sound box, and the computer-readable medium of the present invention, when it is determined that a voice signal sent by a target sound source needs to be collected, a first voice signal that carries a preset wake-up word, was sent by the target sound source, and was received by at least two signal receiving module pairs on the intelligent sound box is obtained; the time difference with which the two signal receiving modules of each pair received the first voice signal is obtained; and the direction of the target sound source that sent the first voice signal is determined according to the first voice signal and the time difference of each signal receiving module pair. With this technical solution, the target sound source can be located even when many sound sources are present, so the intelligent sound box can collect only the voice signal arriving from the located direction and serve the user corresponding to that target sound source. The solution also enriches the functions of the intelligent sound box and makes it more flexible and convenient to use.

【Brief description of the drawings】

Fig. 1 is a flow chart of embodiment one of the sound localization method of the intelligent sound box of the present invention.

Fig. 2 is an application scenario diagram of the sound localization method of the intelligent sound box of the present invention.

Fig. 3 is another application scenario diagram of the sound localization method of the intelligent sound box of the present invention.

Fig. 4 is a flow chart of embodiment two of the sound localization method of the intelligent sound box of the present invention.

Fig. 5 is a structural diagram of embodiment one of the intelligent sound box of the present invention.

Fig. 6 is a structural diagram of embodiment two of the intelligent sound box of the present invention.

Fig. 7 is a structural diagram of embodiment three of the intelligent sound box of the present invention.

Fig. 8 is an example diagram of an intelligent sound box provided by the present invention.

【Embodiment】

To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in detail below with reference to the accompanying drawings and specific embodiments.

Fig. 1 is a flow chart of embodiment one of the sound localization method of the intelligent sound box of the present invention. As shown in Fig. 1, the sound localization method of the intelligent sound box of this embodiment may specifically include the following steps:

100. When it is determined that a voice signal sent by a target sound source needs to be collected, obtain a first voice signal that carries a preset wake-up word, was sent by the target sound source, and was received by at least two signal receiving module pairs on the intelligent sound box;

The sound localization method of this embodiment is executed by the intelligent sound box. The preset wake-up word of this embodiment is used by the target sound source to wake up the intelligent sound box. The target sound source of this embodiment is preferably a user interacting with the intelligent sound box. To receive the user's voice Query and report feedback information to the user based on it, the intelligent sound box of this embodiment is provided with signal receiving modules and signal transmitting modules. For example, a signal receiving module and a signal transmitting module on the intelligent sound box may be integrated into one unit, such as a microphone built into the intelligent sound box, so that signals can be received from all directions and feedback information can be broadcast in all directions. Optionally, in this embodiment an even number of microphones may be arranged symmetrically on the intelligent sound box, for example four microphones, or four groups of microphones with two in each group, evenly spaced around the intelligent sound box.

In the usage scenario of this embodiment, the intelligent sound box may be called by multiple sound sources. Because the intelligent sound box cannot process several voice Queries at the same time, it determines through its own analysis which target sound source's voice signal needs to be collected. For example, it may decide to collect the voice signal that was received first, or it may pick the target sound source from the multiple sound sources by some other strategy and decide to collect the voice signal sent by that target sound source. The target sound source then needs to be located. In this embodiment, the first step is to obtain the first voice signal that carries the preset wake-up word, was sent by the target sound source, and was received by at least two signal receiving module pairs on the intelligent sound box, so that the target sound source can be located by means of the first voice signal received by each signal receiving module pair.

A signal receiving module pair of this embodiment may be a microphone pair on the intelligent sound box. Each microphone pair includes two microphones, which may be any two microphones on the intelligent sound box. A single microphone pair may place the target sound source at either of two symmetrical directions, so its localization is not accurate enough. In this embodiment, therefore, at least two microphone pairs must be selected to locate the direction of the target sound source. The relationship between the selected microphone pairs is not restricted. For example, one pair may be two adjacent microphones, and another pair may also be two adjacent microphones or may be two opposite microphones on a diagonal. Because different microphones are at different distances from the target sound source, they receive the same voice signal sent by the target sound source at different moments.

In this embodiment, when it is determined that a voice signal sent by the target sound source needs to be collected, the first voice signal that carries the preset wake-up word, was sent by the target sound source, and was received by at least two signal receiving module pairs on the intelligent sound box must be obtained. For example, if the preset wake-up word of the intelligent sound box is "great Bai", the voice signal "great Bai, great Bai" uttered by the user can serve as the first voice signal sent by the target sound source. In other words, in this embodiment the target sound source can be located from the first voice signal with which it wakes up the intelligent sound box using the preset wake-up word, and no further voice signal from the target sound source needs to be collected.

101. Obtain, for each signal receiving module pair, the time difference with which the two signal receiving modules of that pair received the first voice signal;

For example, when the signal receiving module pairs are microphone pairs, the microphones are at different distances from the target sound source, so the two microphones of each microphone pair receive the first voice signal of the target sound source with a time difference. Optionally, step 101 may specifically include: taking the first signal receiving module pair of the at least two signal receiving module pairs as a reference, and selecting a candidate direction θ of the target sound source; and, for each signal receiving module pair, obtaining, according to the candidate direction θ of the target sound source, the time difference t0 with which the two signal receiving modules of that pair receive the first voice signal, where t0 is a function of θ.

Specifically, because a microphone pair cannot learn the time difference accurately by itself, in this embodiment a candidate direction θ of the target sound source is first assumed, and the time difference t0 with which the two microphones of each microphone pair receive the first voice signal of the target sound source is then expressed in terms of θ. The target sound source in this embodiment is a far-field sound source, that is, the distance between the target sound source and the microphones is far greater than the distance between the microphones, so the voice signal sent by the target sound source can be regarded as travelling to the microphones along parallel lines. The first microphone pair of the at least two microphone pairs can then be selected as a reference, and the candidate direction of the target sound source relative to this reference pair is taken as θ. Because the distance between the two microphones of each microphone pair is known, the time difference t0 with which the two microphones of each pair receive the first voice signal can be expressed from the candidate direction of the target sound source and the distance between the two microphones of that pair; t0 is then a function of the candidate direction θ. The candidate direction θ may be any angle from 0 to 360 degrees, that is, the target sound source may lie in any direction within the full 0 to 360 degree range.

For example, Fig. 2 is an application scenario diagram of the sound localization method of the intelligent sound box of the present invention. As shown in Fig. 2, A, B, and C are microphones on the intelligent sound box; A and B form one microphone pair, and A and C form another. In this embodiment, the first voice signal sent by the target sound source can be taken to travel to the microphones as a parallel wave. As shown in Fig. 2, with microphone pair A and B as the reference, the candidate direction of the target sound source is taken as θ. An auxiliary line BO is then drawn perpendicular to the propagation direction of the parallel wave from the target sound source, so that AO is the difference between the distance the first voice signal travels to reach microphone B and the distance it travels to reach microphone A. The length of AO is Δd = L × cos θ, where L is the distance between microphones A and B. The time difference t0 with which microphones A and B receive the first voice signal sent by the target sound source equals the length of AO divided by the speed of sound V, so the time difference can be expressed as t0 = Δd / V = L × cos θ / V; that is, t0 is a function of θ.
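The relation t0 = L × cos θ / V for the reference pair can be evaluated directly. The sketch below only restates that formula; the function name, the 0.1 m spacing, and the value 343 m/s for the speed of sound V are assumptions for illustration.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s in air at about 20 degrees C (assumed value for V)


def time_difference_ab(spacing_m, theta_deg, speed=SPEED_OF_SOUND):
    """t0 = L * cos(theta) / V for the reference microphone pair A and B."""
    path_difference = spacing_m * math.cos(math.radians(theta_deg))  # length of AO
    return path_difference / speed


t0_along_ab = time_difference_ab(0.1, 0.0)        # wave travels along the line AB
t0_perpendicular = time_difference_ab(0.1, 90.0)  # wave arrives perpendicular to AB
```

At θ = 0 the path difference is the full spacing L, which gives the largest time difference; at θ = 90 degrees both microphones receive the wave at essentially the same moment.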

For the other microphone pair in Fig. 2, A and C, the distance AC is also equal to L. On the intelligent sound box, the line through microphones A and C is perpendicular to the line through microphones A and B, so, by the geometry of the triangle, the time difference t0 with which microphones A and C receive the first voice signal sent by the target sound source can be expressed as t0 = L × sin θ / V.

As another example, Fig. 3 is another application scenario diagram of the sound localization method of the intelligent sound box of the present invention. As in the processing for Fig. 2 above, the time difference t0 with which microphones A and B receive the first voice signal sent by the target sound source can be expressed as t0 = Δd / V = L × cos θ / V. If the straight line through microphones A and C is parallel to the direction in which the parallel wave of the first voice signal travels, the time difference t0 with which microphones A and C receive the first voice signal sent by the target sound source equals the length L' of AC divided by the speed of sound V.

Fig. 2 and Fig. 3 above are only two special scenarios. In practice, for at least two signal receiving module pairs of an intelligent sound box in any scenario, the first signal receiving module pair can always be chosen as the reference to obtain the candidate direction θ of the target sound source, and, from the positional relationship of the signal receiving module pairs on the intelligent sound box, the time difference with which the two signal receiving modules of each pair receive the first voice signal can be expressed in terms of the candidate direction θ of the target sound source.
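One way to write the time difference for an arbitrarily placed pair is sketched below. It assumes the far-field model described above and describes each pair by the angle of its axis relative to the reference pair; that parameterization, the sign convention, and the function name are assumptions of this sketch, not something stated in the text.

```python
import math


def pair_time_difference(spacing_m, pair_axis_deg, theta_deg, speed=343.0):
    """Far-field time difference for one microphone pair.

    spacing_m     -- distance between the two microphones of the pair
    pair_axis_deg -- angle of the pair's axis relative to the reference pair
                     (0 for the reference pair itself; assumed convention)
    theta_deg     -- candidate direction of the target sound source, measured
                     relative to the reference pair
    speed         -- assumed speed of sound V in m/s
    """
    return spacing_m * math.cos(math.radians(theta_deg - pair_axis_deg)) / speed


# Reference pair A-B (axis at 0 degrees): reduces to L * cos(theta) / V.
t_ab = pair_time_difference(0.1, 0.0, 30.0)
# Perpendicular pair A-C (axis at 90 degrees): reduces to L * sin(theta) / V.
t_ac = pair_time_difference(0.1, 90.0, 30.0)
# Wave travelling along the pair's own axis: reduces to L' / V.
t_along = pair_time_difference(0.2, 45.0, 45.0)
```

Under these assumptions the three special cases from Fig. 2 and Fig. 3 fall out of a single expression.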

102. Determine, according to the first voice signal and the time difference with which each signal receiving module pair received the first voice signal, the direction of the target sound source that sent the first voice signal.

When the signal receiving module pairs are microphone pairs, for each microphone pair the time difference between the first voice signals received by its two microphones can be expressed as a function of the candidate direction θ of the target sound source, and θ can be any angle in the range 0 to 360 degrees. Moreover, the first voice signals received by the two microphones can be aligned in time, and once aligned the two first voice signals should have the strongest correlation. Therefore, by traversing the angles of all directions, the angle at which the correlation of the two first voice signals is greatest is found, and that angle is the direction of the target sound source.

That is, step 102 may specifically include the following steps:

(a1) according to the time difference t0 with which each signal receiving module pair receives the first voice signal, align in time the first voice signals received by the two signal receiving modules of that pair;

In this embodiment, during alignment, the first voice signal that was received earlier of the two first voice signals received by each signal receiving module pair can be delayed by the time difference t0, or the first voice signal that was received later can be advanced by the time difference t0, so that the two first voice signals are aligned in time.

(a2) calculate, for each signal receiving module pair, the correlation between its two aligned first voice signals;

(a3) sum the correlations of all signal receiving module pairs to obtain the overall correlation;

(a4) take the candidate direction θ at which the overall correlation reaches its maximum as the target direction of the target sound source.
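Steps (a1) to (a4) can be sketched end to end on simulated data. The sketch assumes the geometry of Fig. 2 (pair A-B with t0 = L × cos θ / V and a perpendicular pair A-C with t0 = L × sin θ / V), whole-sample delays, a zero-lag inner product as the correlation measure, and white noise standing in for the wake-up word. The sample rate, the spacing, and all function names are hypothetical.

```python
import math
import random

SPEED_OF_SOUND = 343.0  # m/s, assumed value for V
SAMPLE_RATE = 48_000    # Hz, assumed
SPACING = 0.2           # m, assumed distance L for both pairs


def shift(signal, n):
    """Shift by n samples with zero padding: n > 0 delays, n < 0 advances."""
    length = len(signal)
    if n >= 0:
        return [0.0] * min(n, length) + list(signal[:max(length - n, 0)])
    return list(signal[-n:]) + [0.0] * min(-n, length)


def pair_delays(theta_deg):
    """Delays in samples for pair A-B (L*cos/V) and pair A-C (L*sin/V)."""
    scale = SPACING / SPEED_OF_SOUND * SAMPLE_RATE
    theta = math.radians(theta_deg)
    return [round(scale * math.cos(theta)), round(scale * math.sin(theta))]


def correlation(a, b):
    """Zero-lag inner product, used here as the correlation measure."""
    return sum(x * y for x, y in zip(a, b))


def locate(pairs, candidates=range(360)):
    """pairs: (first_signal, second_signal) per pair, ordered as in pair_delays."""
    def overall_correlation(theta_deg):
        total = 0.0
        for (first, second), n in zip(pairs, pair_delays(theta_deg)):
            aligned_first = shift(first, n)              # (a1) align in time
            total += correlation(aligned_first, second)  # (a2) and (a3)
        return total
    return max(candidates, key=overall_correlation)      # (a4) best candidate


# Simulation: the second microphone of each pair hears the source later by
# the delay that the true direction implies (assumed sign convention).
rng = random.Random(0)
source = [rng.gauss(0.0, 1.0) for _ in range(1000)]
true_theta = 60
pairs = [(source, shift(source, n)) for n in pair_delays(true_theta)]
estimate = locate(pairs)
```

Because the delays are rounded to whole samples, neighbouring candidate angles can produce identical delays, so the estimate is only as fine as that quantization allows; a finer angular resolution would need a higher sample rate, a larger spacing, or fractional-delay alignment.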

If only one microphone pair were selected when determining the direction of the target sound source that sent the first voice signal, for example microphones A and C in Fig. 2, then, as shown in Fig. 2, there could be a candidate target sound source on the left of AC that mirrors the target sound source about AC. The correlation of the aligned first voice signals would also reach its maximum for that candidate, so the direction of the target sound source could not be determined uniquely. If the microphone pair A and B is chosen as well, so that the two pairs determine the direction jointly, the direction of the target sound source can be determined uniquely. This embodiment therefore needs at least two signal receiving module pairs, such as microphone pairs, to uniquely determine the direction of the target sound source that sent the first voice signal.
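The mirror ambiguity can be checked numerically. The sketch below uses the Fig. 2 relations with the delays normalized by L / V, so that only cos θ and sin θ remain; the function names and the 1-degree candidate grid are assumptions for illustration.

```python
import math


def delay_ab(theta_deg):
    """Normalized delay of pair A-B: cos(theta), i.e. t0 * V / L."""
    return math.cos(math.radians(theta_deg))


def delay_ac(theta_deg):
    """Normalized delay of the perpendicular pair A-C: sin(theta)."""
    return math.sin(math.radians(theta_deg))


def matching_directions(delay_fn, true_theta_deg):
    """Integer candidate angles whose predicted delay equals the true one."""
    target = delay_fn(true_theta_deg)
    return [c for c in range(360)
            if math.isclose(delay_fn(c), target, abs_tol=1e-9)]


ac_only = matching_directions(delay_ac, 60)      # pair A-C alone
ab_only = matching_directions(delay_ab, 60)      # pair A-B alone
both = sorted(set(ac_only) & set(ab_only))       # both pairs together
```

For a source at 60 degrees, pair A-C alone cannot tell 60 from its mirror image at 120 degrees, pair A-B alone cannot tell 60 from 300 degrees, and only the two pairs together leave the single direction 60.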

Specifically, for each signal receiving module pair, the first voice signals received by its two signal receiving modules are aligned in the manner described above. For example, for a microphone pair a and b, suppose a receives the first voice signal Y1 first and b then receives the first voice signal Y2, with time difference t0. The first voice signal Y1 received by a can be delayed by t0, or the first voice signal Y2 received by b can be advanced by t0. Taking the case in which the signal received earlier is delayed, the delayed first voice signal is Y1', and the correlation between Y1' and Y2 can then be calculated. A correlation is obtained in this way for every microphone pair, and the correlations of all microphone pairs are added to give the overall correlation. Because Y1' is delayed by t0, and t0 is in turn a function of θ, step (a4) can traverse the values of θ to determine the overall correlation for each candidate direction θ of the target sound source, and take the candidate direction θ at which the overall correlation is greatest as the target direction of the target sound source.

That is, if the selected candidate direction θ is exactly the true direction of the sound source, then after the first voice signals received by the two microphones are aligned on the time axis using the time difference calculated from θ, the two signals should have the strongest correlation. Conversely, if the selected candidate direction θ is not the true direction of the sound source, the correlation between the first voice signals received by the two microphones weakens after alignment with the time difference calculated from θ. Therefore, by measuring the correlation of the aligned first voice signals of the two microphones for each candidate direction θ, the likelihood that the first voice signal comes from each candidate direction θ can be judged: the stronger the correlation, the more likely it is that the sound arrived from direction θ.
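The Y1, Y2, and Y1' example can be reproduced with a short simulation. White noise stands in for the first voice signal, delays are whole samples, and the zero-lag inner product serves as the correlation measure; these choices, the seed, and the names are assumptions of the sketch.

```python
import random


def delayed(signal, n):
    """Delay by n >= 0 samples with zero padding (assumed edge handling)."""
    return [0.0] * n + list(signal[:len(signal) - n])


def correlation(a, b):
    """Zero-lag inner product of two equally long signals."""
    return sum(x * y for x, y in zip(a, b))


rng = random.Random(1)
y1 = [rng.gauss(0.0, 1.0) for _ in range(500)]  # received first, by microphone a
true_delay = 5                                  # t0 in samples (made up)
y2 = delayed(y1, true_delay)                    # received later, by microphone b

# Try several candidate delays: Y1' = Y1 delayed by the candidate, then
# correlate Y1' with Y2.
scores = {n: correlation(delayed(y1, n), y2) for n in range(0, 11)}
best_delay = max(scores, key=scores.get)
```

The score peaks at the candidate that equals the true delay and drops for every other candidate, which is the behaviour that lets correlation strength rank the candidate directions.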

Still optionally further, step 102 is " according to the first voice signal and each signal receiving module to receiving the first voice After the time difference of signal, it is determined that sending the orientation of the target sound source of the first voice signal ", it can also include：Rotational positioning refers to Indicating, which is remembered in the orientation of target sound source and/or to the orientation of target sound source, lights positioning light, to inform target sound source pair The user answered, the orientation of target sound source have been determined.

That is, after the intelligent sound box has located the target sound source, it needs to give the user of the target sound source some feedback, to inform the user that the target sound source has been located and that the user can now interact with the intelligent sound box and be served by it. In this embodiment, a positioning indicator mark may be provided on the intelligent sound box; the positioning indicator mark may be a rotatable pointer. After the intelligent sound box locates the target sound source, the positioning indicator mark can be rotated to the orientation of the target sound source, so that the user in that orientation can see that the target sound source has been located. Alternatively, a positioning indicator light may be provided on the intelligent sound box, for example on the pointer of the positioning indicator mark, so that after the intelligent sound box determines the orientation of the target sound source, the positioning indicator light can be lit toward that orientation to inform the user corresponding to the target sound source that the orientation has been determined. The two approaches can be used separately or in combination.
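As a small illustration of rotating the pointer to the determined orientation, the following sketch computes the shortest signed rotation between the pointer's current angle and the target orientation. The function name shortest_rotation and the convention of measuring both angles in degrees on the same scale are assumptions made for this example; the present embodiment does not specify how the pointer is driven.

```python
def shortest_rotation(current_deg, target_deg):
    """Signed rotation in degrees, in [-180, 180), that moves the pointer from current to target."""
    return (target_deg - current_deg + 180.0) % 360.0 - 180.0
```

For example, a pointer at 350 degrees that must indicate a target sound source at 10 degrees is rotated by +20 degrees rather than by -340 degrees.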

In the sound localization method of the intelligent sound box of this embodiment, when it is determined that the voice signal sent by the target sound source needs to be collected, the first voice signal carrying the preset wake-up word, sent by the target sound source and received by at least two signal receiving module pairs on the intelligent sound box, is obtained; the time difference with which the two signal receiving modules of each signal receiving module pair receive the first voice signal is obtained; and the orientation of the target sound source that sent the first voice signal is determined according to the first voice signal and the time differences with which the signal receiving module pairs receive it. With the technical scheme of this embodiment, the target sound source can be located in a scene with many sound sources, so that the intelligent sound box can collect only the voice signal of the located target sound source and then provide service to the user corresponding to that target sound source. The scheme also enriches the functions of the intelligent sound box, making it more flexible and convenient to use.

Fig. 4 is a flow chart of embodiment two of the sound localization method of the intelligent sound box of the present invention. The sound localization method of this embodiment describes the technical scheme of the present invention in further detail on the basis of the technical scheme of the above embodiment. As shown in Fig. 4, the sound localization method of the intelligent sound box of this embodiment may specifically include the following steps:

200. Receive a second voice signal carrying the preset wake-up word, input by voice by the user corresponding to the target sound source;

201. Extract the preset wake-up word from the second voice signal;

202. Extract the voiceprint feature of the second voice signal as the voiceprint feature of the target sound source;

203. Establish and store the correspondence between the preset wake-up word and the voiceprint feature of the target sound source;

In the application scenario of the sound localization method of this embodiment, the intelligent sound box can support adding and setting preset wake-up words. For example, whereas the wake-up word of a prior-art intelligent sound box cannot be changed, in this embodiment the owner of the intelligent sound box can set a preset wake-up word in the intelligent sound box for himself or herself or for family members. For example, if the default wake-up word of the intelligent sound box is "small A" and the owner is the first person to talk to the intelligent sound box after buying it, the default wake-up word can serve as the owner's private wake-up word. With the owner's consent, other family members can also set their own private wake-up words on the intelligent sound box. Similarly, when the user corresponding to the target sound source sets a private wake-up word, the user can input by voice a second voice signal carrying the preset wake-up word; for example, the user calls "De-Lovely", and "De-Lovely" is the preset wake-up word that the user of the target sound source sets on the intelligent sound box. The intelligent sound box then extracts the preset wake-up word from the second voice signal, extracts the voiceprint feature of the second voice signal as the voiceprint feature of the target sound source, and establishes and stores the correspondence between the preset wake-up word and the voiceprint feature of the target sound source. That is, the voiceprint feature of a voice signal that calls the preset wake-up word must be the voiceprint feature in the correspondence, or, put the other way round, the wake-up word carried in a voice signal having that voiceprint feature must be the preset wake-up word in the correspondence; otherwise the intelligent sound box ignores the voice signal.

The above process of establishing and storing the correspondence between the preset wake-up word and the voiceprint feature of the target sound source can be carried out in advance, so that the correspondence can be used directly for subsequent detection.
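Steps 200 to 203 can be sketched as a small registry that maps each wake-up word to the voiceprint feature of the user who set it. This is only a sketch: the class name WakeWordRegistry is hypothetical, and the wake-up word recognizer and voiceprint extractor are passed in as callables because the present embodiment does not specify how either is implemented.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Sequence, Tuple


@dataclass
class WakeWordRegistry:
    """Stores the correspondence between wake-up words and voiceprint features."""
    extract_wake_word: Callable[[Sequence[float]], str]                   # hypothetical recognizer
    extract_voiceprint: Callable[[Sequence[float]], Tuple[float, ...]]    # hypothetical extractor
    entries: Dict[str, Tuple[float, ...]] = field(default_factory=dict)

    def enroll(self, second_voice_signal: Sequence[float]) -> str:
        wake_word = self.extract_wake_word(second_voice_signal)    # step 201
        voiceprint = self.extract_voiceprint(second_voice_signal)  # step 202
        self.entries[wake_word] = voiceprint                       # step 203
        return wake_word
```

Keying the stored correspondence by wake-up word keeps the later lookup to a single dictionary access; a real device would also need to persist the entries across restarts.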

204. Obtain the first voice signal carrying the preset wake-up word sent by the target sound source;

205. Extract the preset wake-up word from the first voice signal;

206. Extract the voiceprint feature of the first voice signal;

207. According to the pre-stored correspondence between the preset wake-up word and the voiceprint feature of the target sound source, judge whether the preset wake-up word matches the voiceprint feature of the first voice signal; if they match, perform step 208; otherwise, perform step 209;

208. Determine that the voice signal sent by the target sound source needs to be collected, and end.

That is, according to the correspondence between the preset wake-up word and the voiceprint feature of the target sound source obtained in step 203, it is judged whether the preset wake-up word obtained in step 205 matches the voiceprint feature extracted in step 206. If they match, it is determined that the voice signal sent by the target sound source needs to be collected, and the technical scheme of the embodiment shown in Fig. 1 above can then be performed.

209. Determine that the preset wake-up word and the voiceprint feature do not match, and perform no operation for the time being.

Alternatively, if the intelligent sound box determines that the current preset wake-up word and the voiceprint feature do not match, it can also issue a voice prompt such as "Sorry, the wake-up word you used is wrong, and service cannot be provided to you for now" or a similar prompt message.
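The matching decision of step 207 can be sketched as follows. The present embodiment does not say how two voiceprint features are compared, so the cosine similarity measure, the threshold of 0.8 and the names cosine_similarity and should_collect are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (0.0 if either vector is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def should_collect(wake_word, voiceprint, stored, threshold=0.8):
    """Step 207: True only if the wake-up word is stored and the voiceprint matches its entry."""
    enrolled = stored.get(wake_word)
    if enrolled is None:
        return False  # unknown wake-up word: step 209, do nothing
    return cosine_similarity(voiceprint, enrolled) >= threshold
```

When should_collect returns True the flow proceeds to step 208; otherwise the intelligent sound box can stay silent or issue the voice prompt described above.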

By adopting the above scheme, the sound localization method of this embodiment can further use the voiceprint together with the preset wake-up word to decide whether a voice signal should be collected. In this embodiment, different users of the same intelligent sound box can set different wake-up words, and the correspondence between each user's wake-up word and that user's voiceprint feature is stored in the intelligent sound box. Each user can therefore wake up and interact with the intelligent sound box only by using his or her own private wake-up word. The intelligent sound box thus discriminates well between the voiceprint features and wake-up words of different voice signals, which not only improves the accuracy with which it locates voice signals but also greatly improves the user experience.

Fig. 5 is a structure diagram of embodiment one of the intelligent sound box of the present invention. As shown in Fig. 5, the intelligent sound box of this embodiment may specifically include: a signal acquisition module 10, a time difference acquisition module 11 and a locating module 12.

The signal acquisition module 10 is used to obtain, when it is determined that the voice signal sent by the target sound source needs to be collected, the first voice signal carrying the preset wake-up word that is sent by the target sound source and received by at least two signal receiving module pairs on the intelligent sound box; the preset wake-up word is used by the target sound source to wake up the intelligent sound box;

The time difference acquisition module 11 is used to obtain the time difference with which the two signal receiving modules of each signal receiving module pair receive the first voice signal obtained by the signal acquisition module 10;

The locating module 12 is used to determine the orientation of the target sound source that sent the first voice signal, according to the first voice signal obtained by the signal acquisition module 10 and the time differences, obtained by the time difference acquisition module 11, with which the signal receiving module pairs receive the first voice signal.

The intelligent sound box of this embodiment implements sound localization with the above modules; its implementation principle and technical effect are the same as those of the related method embodiments above, to which reference may be made for details, and they are not repeated here.

Fig. 6 is a structure diagram of embodiment two of the intelligent sound box of the present invention. As shown in Fig. 6, the intelligent sound box of this embodiment, on the basis of the technical scheme of the embodiment shown in Fig. 5 above, may further include the following technical scheme.

As shown in Fig. 6, the intelligent sound box of this embodiment may also include a positioning indicating module 13.

The positioning indicating module 13 is used to rotate the positioning indicator mark to the orientation of the target sound source determined by the locating module 12 and/or to light the positioning indicator light toward the orientation of the target sound source determined by the locating module 12, to inform the user corresponding to the target sound source that the orientation of the target sound source has been determined.

Further optionally, in the intelligent sound box of this embodiment, the time difference acquisition module 11 is specifically used to:

Choose a candidate direction θ of the target sound source, taking the first of the at least two signal receiving module pairs as the reference;

For each signal receiving module pair, obtain, according to the candidate direction θ of the target sound source, the time difference t0 with which the two signal receiving modules of that pair receive the first voice signal, where t0 is a function of θ.
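The text states only that t0 is a function of θ and does not give its form. One common form, offered here purely as an assumption, is the far-field model, in which the sound arrives as a plane wave, the two signal receiving modules of a pair are a distance d apart, c is the speed of sound, and θ is measured from the axis joining the two modules:

```latex
t_0(\theta) = \frac{d \cos\theta}{c},
\qquad
t_0^{(k)}(\theta) = \frac{d_k \cos(\theta - \varphi_k)}{c}
```

The second expression is the same model for a k-th pair whose axis is rotated by an angle φ_k relative to the reference pair. As a worked example under these assumptions, with d = 0.1 m, c ≈ 343 m/s and θ = 60°, t0 = 0.1 × 0.5 / 343 ≈ 1.46 × 10⁻⁴ s, or about 0.146 ms.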

Further optionally, in the intelligent sound box of this embodiment, the locating module 12 is specifically used to:

Align in time, according to the time difference t0 with which each signal receiving module pair receives the first voice signal, the first voice signals received by the two signal receiving modules of that pair;

Calculate, for each signal receiving module pair, the correlation of the corresponding two aligned first voice signals;

Add up the correlations corresponding to all signal receiving module pairs to obtain the overall correlation;

Take the candidate direction θ of the target sound source at which the overall correlation is maximal as the target direction of the target sound source.

Further optionally, in the intelligent sound box of this embodiment, the locating module 12 is specifically used to delay by the time difference t0 the earlier-received of the two first voice signals received by each signal receiving module pair, or to advance by the time difference t0 the later-received of the two first voice signals received by each signal receiving module pair, so that the two first voice signals are aligned in time.
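For sampled signals, delaying the earlier-received signal by t0 and advancing the later-received signal by t0 select the same overlapping samples. The sketch below illustrates this with a whole-sample lag; the function name align and the use of an integer lag (t0 multiplied by the sampling rate and rounded) are assumptions made for this example.

```python
def align(first, second, lag):
    """Return equal-length, time-aligned slices of two recordings.

    `first` was received `lag` samples earlier than `second` (lag >= 0).
    """
    if lag < 0:
        raise ValueError("lag must be non-negative; swap the two signals instead")
    n = min(len(first), len(second)) - lag
    # Sample k of `first` corresponds to sample k + lag of `second`.
    return first[:n], second[lag:lag + n]
```

After alignment, the two slices can be passed directly to a correlation measure, and the samples that have no counterpart in the other recording are simply dropped.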

Further optionally, as shown in Fig. 6, the intelligent sound box of this embodiment also includes:

A determining module 14, used to determine that the voice signal sent by the target sound source needs to be collected.

Further optionally, in the intelligent sound box of this embodiment, the determining module 14 is specifically used to:

Obtain the first voice signal carrying the preset wake-up word sent by the target sound source;

Extract the preset wake-up word from the first voice signal;

Extract the voiceprint feature of the first voice signal;

Judge, according to the pre-stored correspondence between the preset wake-up word and the voiceprint feature of the target sound source, whether the preset wake-up word matches the voiceprint feature of the first voice signal;

If they match, determine that the voice signal sent by the target sound source needs to be collected.

Correspondingly, when the determining module 14 determines that the voice signal sent by the target sound source needs to be collected, it triggers the signal acquisition module 10 to obtain the first voice signal carrying the preset wake-up word that is sent by the target sound source and received by at least two signal receiving module pairs on the intelligent sound box.

The receiving module 15 is used to receive a second voice signal carrying the preset wake-up word, input by voice by the user corresponding to the target sound source;

The extraction module 16 is used to extract the preset wake-up word from the second voice signal received by the receiving module 15;

The extraction module 16 is also used to extract the voiceprint feature of the second voice signal received by the receiving module 15, as the voiceprint feature of the target sound source;

The establishing module 17 is used to establish and store the correspondence between the preset wake-up word extracted by the extraction module 16 and the voiceprint feature of the target sound source.

Correspondingly, the determining module 14 is used to judge, according to the pre-stored correspondence, established by the establishing module 17, between the preset wake-up word and the voiceprint feature of the target sound source, whether the preset wake-up word matches the voiceprint feature of the first voice signal.

Fig. 7 is a structure diagram of embodiment three of the intelligent sound box of the present invention. As shown in Fig. 7, the intelligent sound box of this embodiment includes multiple microphones (not shown) for receiving and sending signals. For example, the multiple microphones can be evenly distributed on the housing of the intelligent sound box; they are used to receive the user's voice query and also to report feedback information to the user based on that voice query. The intelligent sound box of this embodiment also includes one or more processors 30 and a memory 40. The memory 40 is used to store one or more programs; when the one or more programs stored in the memory 40 are executed by the one or more processors 30, the one or more processors 30 implement the sound localization method of the intelligent sound box of the embodiments shown in Fig. 1 to Fig. 4 above. The embodiment shown in Fig. 7 takes multiple processors 30 as an example.

For example, Fig. 8 is an example diagram of an intelligent sound box provided by the present invention. Fig. 8 shows a block diagram of an exemplary intelligent sound box 12a suitable for implementing embodiments of the present invention. The intelligent sound box 12a shown in Fig. 8 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present invention.

As shown in Fig. 8, the intelligent sound box 12a of this embodiment takes the form of a general-purpose computing device. The components of the intelligent sound box 12a can include, but are not limited to: one or more processors 16a, a system memory 28a, and a bus 18a connecting the different system components (including the system memory 28a and the processors 16a).

Bus 18a represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. For example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA (EISA) bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.

The intelligent sound box 12a typically includes a variety of computer-system-readable media. These media can be any available media that can be accessed by the intelligent sound box 12a, including volatile and non-volatile media, and removable and non-removable media.

The system memory 28a can include computer-system-readable media in the form of volatile memory, such as a random access memory (RAM) 30a and/or a cache memory 32a. The intelligent sound box 12a may further include other removable/non-removable, volatile/non-volatile computer system storage media. Only as an example, the storage system 34a can be used to read from and write to a non-removable, non-volatile magnetic medium (not shown in Fig. 8, commonly referred to as a "hard disk drive"). Although not shown in Fig. 8, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (such as a "floppy disk") and an optical disc drive for reading from and writing to a removable non-volatile optical disc (such as a CD-ROM, DVD-ROM or other optical medium) can be provided. In these cases, each drive can be connected to the bus 18a through one or more data media interfaces. The system memory 28a can include at least one program product having a set of (for example, at least one) program modules that are configured to perform the functions of the embodiments of Fig. 1 to Fig. 6 of the present invention described above.

A program/utility 40a having a set of (at least one) program modules 42a can be stored, for example, in the system memory 28a. Such program modules 42a include, but are not limited to, an operating system, one or more application programs, other program modules and program data; each or some combination of these examples may include an implementation of a network environment. The program modules 42a generally perform the functions and/or methods of the embodiments of Fig. 1 to Fig. 6 described in the present invention.

The intelligent sound box 12a can also communicate with one or more external devices 14a (such as a keyboard, a pointing device, a display 24a, etc.), with one or more devices that enable a user to interact with the intelligent sound box 12a, and/or with any device (such as a network card, a modem, etc.) that enables the intelligent sound box 12a to communicate with one or more other computing devices. Such communication can take place through the input/output (I/O) interface 22a. The intelligent sound box 12a can also communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN) and/or a public network such as the Internet) through the network adapter 20a. As shown in the figure, the network adapter 20a communicates with the other modules of the intelligent sound box 12a through the bus 18a. It should be understood that, although not shown in the figure, other hardware and/or software modules can be used in conjunction with the intelligent sound box 12a, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID systems, tape drives and data backup storage systems.

The processor 16a runs the programs stored in the system memory 28a, thereby performing various functional applications and data processing, for example implementing the sound localization method of the intelligent sound box shown in the above embodiments.

The present invention also provides a computer-readable medium on which a computer program is stored; when the program is executed by a processor, the sound localization method of the intelligent sound box shown in the above embodiments is implemented.

The computer-readable medium of this embodiment can include the RAM 30a, and/or the cache memory 32a, and/or the storage system 34a in the system memory 28a of the embodiment shown in Fig. 8 above.

With the development of science and technology, the distribution of computer programs is no longer limited to tangible media; programs can also be downloaded directly from a network or obtained in other ways. Therefore, the computer-readable medium in this embodiment can include not only tangible media but also intangible media.

The computer-readable medium of this embodiment can be any combination of one or more computer-readable media. A computer-readable medium can be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium can be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection with one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In this document, a computer-readable storage medium can be any tangible medium that contains or stores a program, which can be used by or in connection with an instruction execution system, apparatus or device.

A computer-readable signal medium can include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal can take various forms, including, but not limited to, an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can send, propagate or transmit a program for use by or in connection with an instruction execution system, apparatus or device.

The program code contained on a computer-readable medium can be transmitted over any appropriate medium, including, but not limited to, wireless, wire, optical cable, RF, etc., or any suitable combination of the above.

Computer program code for performing the operations of the present invention can be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The program code can execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or can be connected to an external computer (for example, through the Internet using an Internet service provider).

In the several embodiments provided by the present invention, it should be understood that the disclosed systems, apparatuses and methods can be implemented in other ways. For example, the apparatus embodiments described above are only schematic; the division into units is only a division by logical function, and other divisions are possible in an actual implementation.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they can be located in one place or distributed over multiple network elements. Some or all of the units can be selected according to actual needs to achieve the purpose of the scheme of this embodiment.

In addition, the functional units in the embodiments of the present invention can be integrated in one processing unit, each unit can exist physically on its own, or two or more units can be integrated in one unit. The integrated unit can be implemented in the form of hardware or in the form of hardware plus software functional units.

An integrated unit implemented in the form of a software functional unit can be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes instructions that cause a computer device (which can be a personal computer, a server, a network device, etc.) or a processor to perform some of the steps of the methods of the embodiments of the present invention. The storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, and other media that can store program code.

The foregoing describes only preferred embodiments of the present invention and is not intended to limit the present invention; any modification, equivalent substitution, improvement and the like made within the spirit and principle of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A sound localization method of an intelligent sound box, characterized in that the method includes:

when it is determined that a voice signal sent by a target sound source needs to be collected, obtaining a first voice signal carrying a preset wake-up word that is sent by the target sound source and received by at least two signal receiving module pairs on the intelligent sound box; the preset wake-up word being used by the target sound source to wake up the intelligent sound box;

obtaining the time difference with which the two signal receiving modules of each signal receiving module pair receive the first voice signal;

determining, according to the first voice signal and the time differences with which the signal receiving module pairs receive the first voice signal, the orientation of the target sound source that sent the first voice signal.

2. The method according to claim 1, characterized in that, after the orientation of the target sound source that sent the first voice signal is determined according to the first voice signal and the time differences with which the signal receiving module pairs receive the first voice signal, the method also includes:

rotating a positioning indicator mark to the orientation of the target sound source and/or lighting a positioning indicator light toward the orientation of the target sound source, to inform the user corresponding to the target sound source that the orientation of the target sound source has been determined.

3. The method according to claim 1, characterized in that obtaining the time difference with which the two signal receiving modules of each signal receiving module pair receive the first voice signal specifically includes:

choosing a candidate direction θ of the target sound source, taking the first of the at least two signal receiving module pairs as the reference;

for each signal receiving module pair, obtaining, according to the candidate direction θ of the target sound source, the time difference t0 with which the two signal receiving modules of the corresponding signal receiving module pair receive the first voice signal, where t0 is a function of θ.

4. The method according to claim 3, characterized in that determining, according to the first voice signal and the time differences with which the signal receiving module pairs receive the first voice signal, the orientation of the target sound source that sent the first voice signal specifically includes:

aligning in time, according to the time difference t0 with which each signal receiving module pair receives the first voice signal, the first voice signals received by the two signal receiving modules of that signal receiving module pair;

calculating, for each signal receiving module pair, the correlation of the corresponding two aligned first voice signals;

taking the candidate direction θ of the target sound source at which the overall correlation is maximal as the target direction of the target sound source.

5. The method according to claim 4, characterized in that aligning in time, according to the time difference t0 with which each signal receiving module pair receives the first voice signal, the first voice signals received by the two signal receiving modules of each signal receiving module pair specifically includes:

delaying by the time difference t0 the earlier-received of the two first voice signals received by each signal receiving module pair, or advancing by the time difference t0 the later-received of the two first voice signals received by each signal receiving module pair, so that the two first voice signals are aligned in time.

6. The method according to claim 1, characterized in that, before the first voice signal carrying the preset wake-up word that is sent by the target sound source and received by at least two signal receiving module pairs on the intelligent sound box is obtained, the method also includes:

further, determining that the voice signal sent by the target sound source needs to be collected specifically includes:

extracting the preset wake-up word from the first voice signal;

extracting the voiceprint feature of the first voice signal;

judging, according to the pre-stored correspondence between the preset wake-up word and the voiceprint feature of the target sound source, whether the preset wake-up word matches the voiceprint feature of the first voice signal;

7. The method according to claim 6, characterized in that, before the first voice signal carrying the preset wake-up word sent by the target sound source is obtained, the method also includes:

extracting the preset wake-up word from the second voice signal;

8. An intelligent sound box, characterized in that the intelligent sound box comprises:

A signal acquisition module, configured to, if it is determined that a voice signal sent by a target sound source needs to be collected, obtain a first voice signal carrying a preset wake-up word that is sent by the target sound source and received by at least two signal receiving module pairs on the intelligent sound box; the preset wake-up word is used by the target sound source to wake up the intelligent sound box;

A time difference acquisition module, configured to obtain the time difference with which the two signal receiving modules of each signal receiving module pair receive the first voice signal;

A locating module, configured to determine, according to the first voice signal and the time difference with which each signal receiving module pair receives the first voice signal, the orientation of the target sound source that sent the first voice signal.
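The claim above does not state how the time difference is measured or how it is turned into an orientation. As an illustrative sketch only, one common approach estimates the arrival offset of a pair by cross-correlation and converts it to an angle with a far-field model. The names `estimate_lag` and `bearing_from_lag`, the speed-of-sound constant, and the module spacing are assumptions and do not come from the source.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature


def estimate_lag(sig_a, sig_b, max_lag):
    """Return the lag in samples that maximizes sum(a[i] * b[i + lag]).

    A positive result means sig_b received the sound later than sig_a.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(len(sig_a)):
            j = i + lag
            if 0 <= j < len(sig_b):
                score += sig_a[i] * sig_b[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def bearing_from_lag(lag, sample_rate, spacing_m):
    """Far-field angle in degrees, measured from the pair's broadside.

    Uses sin(theta) = c * t / d, clamped to [-1, 1] to absorb
    measurement noise.
    """
    t = lag / sample_rate
    x = SPEED_OF_SOUND * t / spacing_m
    x = max(-1.0, min(1.0, x))
    return math.degrees(math.asin(x))


a = [0, 1, 2, 1, 0, 0, 0, 0]
b = [0, 0, 0, 1, 2, 1, 0, 0]
lag = estimate_lag(a, b, 3)
angle = bearing_from_lag(lag, 16000, 0.1)
```

A single pair only constrains the angle relative to its own axis, so the estimate is ambiguous between mirror-image directions; combining the angles from at least two differently oriented pairs is what allows one orientation to be selected.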

9. The intelligent sound box according to claim 8, characterized by further comprising:

A positioning indication module, configured to rotate a positioning indication mark to the orientation of the target sound source and/or light a positioning light toward the orientation of the target sound source, so as to inform the user corresponding to the target sound source that the orientation of the target sound source has been determined.
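As a small illustration of the positioning light, the sketch below assumes the lights form an evenly spaced ring with light 0 at bearing 0 degrees and indices increasing with the bearing; the ring layout, the default of 12 lights, and the function name `led_index_for_bearing` are hypothetical and are not specified in the source.

```python
def led_index_for_bearing(bearing_deg, led_count=12):
    """Pick the ring light closest to the given bearing in degrees."""
    step = 360.0 / led_count
    # Wrap the bearing into [0, 360), then round to the nearest light.
    return int(round((bearing_deg % 360.0) / step)) % led_count


front = led_index_for_bearing(0)
right = led_index_for_bearing(90)
```

The final modulo handles bearings just below 360 degrees, which round up to the light at index 0.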

10. The intelligent sound box according to claim 8, characterized in that the time difference acquisition module is specifically configured to:

11. The intelligent sound box according to claim 10, characterized in that the locating module is specifically configured to:

12. The intelligent sound box according to claim 11, characterized in that the locating module is specifically configured to delay, by the time difference t0, the first voice signal that is received earlier of the two first voice signals received by each signal receiving module pair, or to advance, by the time difference t0, the first voice signal that is received later of the two first voice signals received by each signal receiving module pair, so that the two first voice signals are aligned in time.

13. The intelligent sound box according to claim 8, characterized in that the intelligent sound box further comprises:

Further, the determining module is specifically configured to:

Extract the preset wake-up word from the first voice signal;

Extract the voiceprint feature of the first voice signal;

14. The intelligent sound box according to claim 13, characterized in that the intelligent sound box further comprises:

A receiving module, configured to receive a second voice signal carrying the preset wake-up word that is input by voice by the user corresponding to the target sound source;

The extraction module is further configured to extract the voiceprint feature of the second voice signal as the voiceprint feature of the target sound source;

An establishing module, configured to establish and store the correspondence between the preset wake-up word and the voiceprint feature of the target sound source.
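The enrollment and matching flow described in the claims can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: a voiceprint feature is represented as a plain list of floats, matching uses cosine similarity against a threshold of 0.8, and the class `VoiceprintRegistry` and its methods are hypothetical names. The source does not specify the feature type or the matching rule.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


class VoiceprintRegistry:
    """Stores the wake-up word to voiceprint correspondence (hypothetical helper)."""

    def __init__(self, threshold=0.8):  # threshold is an assumed value
        self._store = {}
        self.threshold = threshold

    def enroll(self, wake_word, voiceprint):
        """Record the voiceprint extracted from the second voice signal."""
        self._store[wake_word] = list(voiceprint)

    def matches(self, wake_word, voiceprint):
        """Check a first voice signal's voiceprint against the enrolled one."""
        enrolled = self._store.get(wake_word)
        if enrolled is None:
            return False
        return cosine_similarity(enrolled, voiceprint) >= self.threshold


registry = VoiceprintRegistry()
registry.enroll("hello box", [1.0, 0.0, 1.0])
same_speaker = registry.matches("hello box", [0.9, 0.1, 1.1])
other_speaker = registry.matches("hello box", [0.0, 1.0, 0.0])
```

Keying the store by wake-up word mirrors the correspondence named in the claim; a device serving several users would need a different key, which the source does not address.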

15. An intelligent sound box, comprising multiple microphones for receiving and transmitting signals, characterized in that the intelligent sound box further comprises:

One or more processors;

A memory, for storing one or more programs,

When the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method according to any one of claims 1-7.

16. A computer-readable medium, storing a computer program, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1-7.