CN116819441A - Sound source positioning and controlling method, device, equipment and storage medium - Google Patents
- Publication number: CN116819441A (application CN202310508840.8A)
- Authority
- CN
- China
- Prior art keywords
- sound source
- audio signals
- covariance matrix
- intelligent
- signals
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01S—RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
- G01S5/00—Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
- G01S5/18—Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations using ultrasonic, sonic, or infrasonic waves
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
- H04R1/00—Details of transducers, loudspeakers or microphones
- H04R1/20—Arrangements for obtaining desired frequency or directional characteristics
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/223—Execution procedure of a spoken command
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Signal Processing (AREA)
- Computational Linguistics (AREA)
- Multimedia (AREA)
- General Physics & Mathematics (AREA)
- Radar, Positioning & Navigation (AREA)
- Remote Sensing (AREA)
- Measurement Of Velocity Or Position Using Acoustic Or Ultrasonic Waves (AREA)
Abstract
The present disclosure relates to a sound source localization and control method, apparatus, device, and storage medium. The method detects a wake-up event from the multi-channel audio signals collected by a smart device and, when the wake-up event occurs, determines the covariance matrix of the sound-source signal that controls the smart device from the covariance matrix of the multi-channel audio signals and the covariance matrix of the interference signals. It then determines the directional response at each of a plurality of anchor points distributed within a preset range of the smart device, using the sound-source covariance matrix and the steering vector of each anchor point, and takes the anchor point with the largest directional response as the location of the sound source. Because the occurrence of a wake-up event means that the smart device can detect the sound-source signal within the multi-channel audio, localizing the sound source only when a wake-up event is detected effectively avoids or reduces the influence of interference on localization and improves its accuracy.
Description
Technical Field
The disclosure relates to the field of information technology, and in particular to a sound source localization and control method, apparatus, device, and storage medium.
Background
Currently, some smart devices, such as smart speakers, can locate the sound source that controls them by means of a certain algorithm, for example by determining the distance, azimuth, and elevation of the sound source relative to the smart speaker.
However, interference signals are often present around a smart device, such as noise from household appliances (an electric fan or a television) or conversation between users, which can prevent the smart device from accurately locating the sound source.
Disclosure of Invention
In order to solve the above technical problems, or at least partially solve them, the present disclosure provides a sound source localization method, a control method, an apparatus, a device, and a storage medium that locate a sound source accurately.
In a first aspect, an embodiment of the present disclosure provides a sound source localization method, including:
acquiring multi-channel audio signals collected by a smart device;
detecting, from the multi-channel audio signals, the occurrence of an event that wakes up the smart device;
determining the covariance matrix of the sound-source signal that controls the smart device according to the covariance matrix of the multi-channel audio signals and the covariance matrix of the interference signals obtained from the multi-channel audio signals;
determining the directional responses of a plurality of anchor points distributed within a preset range of the smart device according to the covariance matrix of the sound-source signal and the steering vectors of the anchor points, and taking the anchor point with the largest directional response as the location of the sound source.
In a second aspect, an embodiment of the present disclosure provides a control method for a smart device, including:
acquiring multi-channel audio signals collected by the smart device;
detecting, from the multi-channel audio signals, the occurrence of an event that wakes up the smart device;
determining the covariance matrix of the sound-source signal that controls the smart device according to the covariance matrix of the multi-channel audio signals and the covariance matrix of the interference signals obtained from the multi-channel audio signals;
determining the directional responses of a plurality of anchor points distributed within a preset range of the smart device according to the covariance matrix of the sound-source signal and the steering vectors of the anchor points, and taking the anchor point with the largest directional response as the location of the sound source;
and controlling the smart device to indicate the azimuth of the sound source relative to the smart device according to the located anchor point.
In a third aspect, an embodiment of the present disclosure provides a method for controlling a smart speaker, including:
acquiring multi-channel audio signals collected by the smart speaker;
detecting, from the multi-channel audio signals, the occurrence of an event that wakes up the smart speaker;
determining the covariance matrix of the sound-source signal that controls the smart speaker according to the covariance matrix of the multi-channel audio signals and the covariance matrix of the interference signals obtained from the multi-channel audio signals;
determining the directional responses of a plurality of anchor points distributed within a preset range of the smart speaker according to the covariance matrix of the sound-source signal and the steering vectors of the anchor points, and taking the anchor point with the largest directional response as the location of the sound source;
determining the azimuth of the sound source relative to the smart speaker according to the located anchor point;
and, according to that azimuth, lighting the indicator lamp of the smart speaker that corresponds to it.
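As a minimal sketch of the indicator-lamp step in the third aspect, assuming a hypothetical ring of evenly spaced LEDs (the patent does not specify the lamp layout; the function name and LED count are illustrative), the azimuth-to-lamp mapping might look like:

```python
def led_index_for_azimuth(azimuth_deg: float, num_leds: int = 12) -> int:
    """Map a sound-source azimuth in degrees to the nearest lamp on a
    hypothetical ring of num_leds evenly spaced indicator lights."""
    azimuth_deg %= 360.0           # normalize into [0, 360)
    sector = 360.0 / num_leds      # angular width covered by one lamp
    return int(round(azimuth_deg / sector)) % num_leds

print(led_index_for_azimuth(0))    # lamp 0, facing the x-axis
print(led_index_for_azimuth(95))   # lamp 3, the nearest to 95 degrees
```

The modulo after rounding wraps azimuths near 360 degrees back to lamp 0.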
In a fourth aspect, embodiments of the present disclosure provide a sound source localization apparatus, including:
an acquisition module for acquiring multi-channel audio signals collected by a smart device;
a detection module for detecting, from the multi-channel audio signals, the occurrence of an event that wakes up the smart device;
a first determining module for determining the covariance matrix of the sound-source signal that controls the smart device according to the covariance matrix of the multi-channel audio signals and the covariance matrix of the interference signals obtained from the multi-channel audio signals;
and a second determining module for determining the directional responses of a plurality of anchor points distributed within a preset range of the smart device according to the covariance matrix of the sound-source signal and the steering vectors of the anchor points, and taking the anchor point with the largest directional response as the location of the sound source.
In a fifth aspect, embodiments of the present disclosure provide an electronic device, including:
a memory;
a processor; and
a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor to implement the method of any of the first, second, or third aspects.
In a sixth aspect, embodiments of the present disclosure provide a computer-readable storage medium having stored thereon a computer program that, when executed by a processor, implements the methods of the first, second, and third aspects.
According to the sound source localization and control method, apparatus, device, and storage medium of this disclosure, a wake-up event is detected from the multi-channel audio signals collected by the smart device, and when the wake-up event occurs, the covariance matrix of the sound-source signal that controls the smart device is determined from the covariance matrix of the multi-channel audio signals and the covariance matrix of the interference signals. The directional response at each of a plurality of anchor points distributed within a preset range of the smart device is then determined from the sound-source covariance matrix and the steering vector of each anchor point, and the anchor point with the largest directional response is taken as the location of the sound source. Because the occurrence of a wake-up event means that the smart device can detect the sound-source signal within the multi-channel audio, i.e., the interference is not strong enough to prevent the smart device from recognizing and analyzing the sound-source signal, localizing the sound source only when a wake-up event is detected effectively avoids or reduces the influence of interference on localization. Therefore, even when interference signals are present around the smart device, the embodiments can locate the sound source accurately.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure.
In order to more clearly illustrate the embodiments of the present disclosure or the solutions in the prior art, the drawings that are required for the description of the embodiments or the prior art will be briefly described below, and it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a schematic diagram of a coordinate system provided by an embodiment of the present disclosure;
fig. 2 is a flowchart of a sound source localization method provided in an embodiment of the present disclosure;
fig. 3 is a schematic diagram of an application scenario provided in an embodiment of the present disclosure;
fig. 4 is a schematic diagram of an application scenario provided in another embodiment of the present disclosure;
FIG. 5 is a schematic diagram of an anchor point provided by another embodiment of the present disclosure;
FIG. 6 is a flow chart of a sound source localization method provided in another embodiment of the present disclosure;
fig. 7 is a schematic diagram of an application scenario provided in another embodiment of the present disclosure;
FIG. 8 is a flow chart of a sound source localization method provided in another embodiment of the present disclosure;
fig. 9 is a schematic diagram of an application scenario provided in another embodiment of the present disclosure;
Fig. 10 is a schematic structural diagram of a sound source positioning device according to an embodiment of the present disclosure;
fig. 11 is a schematic structural diagram of a control device of an intelligent device according to an embodiment of the disclosure;
fig. 12 is a schematic structural diagram of a control device of an intelligent sound box according to an embodiment of the disclosure;
fig. 13 is a schematic structural diagram of an embodiment of an electronic device according to an embodiment of the disclosure.
Detailed Description
In order that the above objects, features and advantages of the present disclosure may be more clearly understood, a further description of aspects of the present disclosure will be provided below. It should be noted that, without conflict, the embodiments of the present disclosure and features in the embodiments may be combined with each other.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure, but the present disclosure may be practiced otherwise than as described herein; it will be apparent that the embodiments in the specification are only some, but not all, embodiments of the disclosure.
It should be noted that the user information (including but not limited to user device information and user personal information) and the data (including but not limited to data for analysis, stored data, and displayed data) involved in the present application are information and data authorized by the user or fully authorized by all parties; the collection, use, and processing of such data must comply with the relevant laws, regulations, and standards of the relevant countries and regions, and a corresponding operation entry is provided for the user to choose to authorize or refuse.
In addition, the method provided by the present application may involve the following terms, explained in detail as follows:
sound source localization: and determining a coordinate system based on the device-side microphone array, and determining the azimuth angle, the elevation angle and the distance of the sound source signal. It is assumed that a smart device, such as a smart box, comprises a plurality of microphones, such as microphone 1, microphone 2, … microphone N shown in fig. 1, which constitute a microphone array. A three-dimensional coordinate system is established by taking the center of the microphone array as an origin O, x represents the x-axis of the three-dimensional coordinate system, y represents the y-axis of the three-dimensional coordinate system, and z represents the z-axis of the three-dimensional coordinate system. The sound source is, for example, a user, who controls the smart device by means of a speech signal. For example, controlling a smart device to turn on, turn off, play music, broadcast weather, etc. The intelligent device can locate the sound source through a certain algorithm. For example, the distance, azimuth, elevation, etc. of the sound source relative to the smart speaker is determined. The distance, azimuth and elevation are specifically shown in fig. 1. The angle between Ox and OA may be used as the azimuth angle, assuming that the projection point of the sound source on the xOy plane is a. The length of the line between the origin O and the sound source may be taken as the distance. The angle between the line and the OA may be referred to as the elevation angle.
However, interference signals are often present around a smart device, such as noise from household appliances (an electric fan or a television) or conversation between users, which can prevent the smart device from accurately locating the sound source.
In view of this problem, embodiments of the present disclosure provide a sound source localization method, which is described below in connection with specific embodiments.
Fig. 2 is a flowchart of a sound source localization method according to an embodiment of the present disclosure. The method can be executed by a sound source localization apparatus, which may be implemented in software and/or hardware and configured in an electronic device such as a server or a terminal; the terminal may specifically be a smart device such as a mobile phone, a computer, a tablet computer, or a smart speaker. The server may specifically be a cloud server, in which case the sound source localization method is executed in the cloud, where a plurality of computing nodes (cloud servers) may be deployed, each with processing resources such as computation and storage. In the cloud, a service may be provided jointly by multiple computing nodes, although one computing node may also provide one or more services. The cloud may provide services through externally exposed service interfaces, which a user invokes to use the corresponding service; such interfaces include software development kits (SDKs), application programming interfaces (APIs), and the like.
In addition, the sound source localization method of this embodiment may also be executed by a smart device such as a smart speaker. Specifically, it may be applied to the application scenario shown in Fig. 3, which includes a smart device 31 and a user 32, where the user 32 is the sound source controlling the smart device 31. The method is described in detail below with reference to Fig. 3; as shown in Fig. 2, it includes the following steps:
s201, acquiring multiple paths of audio signals acquired by the intelligent equipment.
The smart device 31 shown in fig. 3 is provided with a plurality of audio acquisition modules, such as a plurality of microphones, which constitute a microphone array. In this embodiment, if an audio signal collected by one microphone is referred to as a single audio signal, audio signals collected by a plurality of microphones may be referred to as multiple audio signals. For each audio signal, the audio signal may include both the control voice of the user 32 to the smart device 31 and the interference signal in the environment surrounding the smart device 31. The interfering signal may include noise from a household appliance such as an electric fan, a television, etc., and/or the interfering signal may include interfering speech, for example, when the user 32 is talking with other users, the speech of the other users may be considered to be interfering speech. In addition, if the smart device 31 further includes an audio playing module, for example, a speaker, the interference signal may further include an audio signal played by the speaker, that is, an echo.
S202, detecting, from the multi-channel audio signals, the occurrence of an event that wakes up the smart device.
Specifically, the smart device 31 may detect in real time, from the multi-channel audio signals it collects, whether a wake-up event has occurred, a wake-up event being an event that wakes up the smart device 31. When the smart device 31 detects a wake-up event, the following steps are performed.
S203, determining the covariance matrix of the sound-source signal that controls the smart device according to the covariance matrix of the multi-channel audio signals and the covariance matrix of the interference signals obtained from the multi-channel audio signals.
When the smart device 31 detects a wake-up event, the covariance matrix of the sound-source signal that controls the smart device 31 may be determined from the covariance matrix of the multi-channel audio signals and the covariance matrix of the interference signals, the latter being obtained from the multi-channel audio signals. The sound-source signal may be the user 32's control voice directed at the smart device 31.
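As a minimal numerical sketch of the idea behind S203, assuming the source covariance is estimated by subtracting the interference covariance from the mixture covariance (the patent does not commit to a specific formula; the variable names and matrices below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4  # number of microphones (illustrative)

# Hypothetical N x N Hermitian covariance matrices: one estimated from
# the observed multi-channel mixture, one from interference-only frames.
A = rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))
B = rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))
R_x = A @ A.conj().T          # covariance of the multi-channel audio
R_i = 0.1 * (B @ B.conj().T)  # covariance of the interference

# One common estimate of the source covariance: subtract the
# interference covariance from the mixture covariance.
R_s = R_x - R_i
```

The difference of two Hermitian matrices is still Hermitian, so `R_s` remains a valid covariance-matrix candidate for the steering-vector computation that follows.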
S204, determining the directional responses of a plurality of anchor points distributed within a preset range of the smart device according to the covariance matrix of the sound-source signal and the steering vectors of the anchor points, and taking the anchor point with the largest directional response as the location of the sound source.
Specifically, the smart device 31 may determine a preset range around itself, for example a three-dimensional region centered on the coordinate origin O, which may be spherical, cubic, or cuboid. The smart device 31 then selects a plurality of anchor points within this range and computes the steering vector of each. From the steering vectors and the covariance matrix of the sound-source signal, the directional response at each anchor point can be computed. Since the responses generally differ across anchor points, some will be larger and some smaller; this embodiment selects the anchor point with the largest directional response and takes it as the location of the sound source.
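The search in S204 can be read as a steered-response computation, with directional response P_k = a_k^H R_s a_k for steering vector a_k, followed by an argmax over anchor points. This is an illustrative interpretation, not the patent's exact algorithm:

```python
import numpy as np

def locate_source(R_s: np.ndarray, steering_vectors: np.ndarray) -> int:
    """Return the index of the anchor point with the largest directional
    response P_k = a_k^H R_s a_k, given the source covariance matrix R_s
    (N x N) and the anchor points' steering vectors stacked as rows
    (K x N)."""
    responses = np.einsum("kn,nm,km->k",
                          steering_vectors.conj(), R_s,
                          steering_vectors).real
    return int(np.argmax(responses))

# Toy check: with R_s = s s^H, the response peaks at the candidate
# steering vector most aligned with s.
N = 4
s = np.ones(N) / np.sqrt(N)
R_s = np.outer(s, s.conj())
A = np.vstack([np.eye(N)[0], s])  # two candidate anchor points
best = locate_source(R_s, A)      # the aligned candidate (index 1) wins
```

The `einsum` evaluates a_k^H R_s a_k for all K anchor points in one vectorized call instead of looping.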
In the embodiment of this disclosure, a wake-up event is detected from the multi-channel audio signals collected by the smart device, and when the wake-up event occurs, the covariance matrix of the sound-source signal that controls the smart device is determined from the covariance matrix of the multi-channel audio signals and the covariance matrix of the interference signals. The directional responses of a plurality of anchor points distributed within a preset range of the smart device are then determined from the sound-source covariance matrix and the anchor points' steering vectors, and the anchor point with the largest directional response is taken as the location of the sound source. Because the occurrence of a wake-up event means that the smart device can detect the sound-source signal within the multi-channel audio, i.e., the interference is not strong enough to prevent the device from recognizing and analyzing it, localizing the sound source only when a wake-up event is detected effectively avoids or reduces the influence of interference on localization. Even when interference is present around the smart device, this embodiment can therefore locate the sound source accurately.
The method described in the above embodiment may also be applied to the application scenario shown in Fig. 4, where the smart device is a smart speaker comprising a microphone array and a loudspeaker. Around the smart speaker there is not only the sound source that controls it but also interference, for example noise from household appliances such as an electric fan or a television, other users' interfering speech, and the audio played by the loudspeaker. Each of the multi-channel audio signals collected by the microphone array therefore contains both the sound-source signal and interference. The smart speaker can perform echo cancellation on each channel, i.e., cancel the echo (the audio signal played by the loudspeaker) in each channel, obtaining the echo-cancelled data streams. It can then process the echo-cancelled data streams with a voice wake-up algorithm to obtain the existence probability of the wake-up keyword, i.e., the wake word, and detect from this probability whether a wake-up event has occurred. When a wake-up event is detected, the smart speaker can perform sound source localization based on the existence probability of the wake-up keyword and the echo-cancelled data streams to obtain the localization result.
Optionally, detecting the occurrence of an event that wakes up the smart device from the multi-channel audio signals includes: detecting the existence probability of the wake-up keyword from the multi-channel audio signals; and, if the existence probability satisfies a preset condition, determining that an event waking up the smart device has occurred.
For example, the smart speaker shown in Fig. 4 may monitor the multi-channel audio signals in real time and detect from them the existence probability of the wake-up keyword, i.e., the wake word that wakes up the smart speaker. When that probability satisfies a preset condition, for example when it exceeds a preset threshold, a wake-up event is determined to have occurred.
Optionally, detecting the existence probability of the wake-up keyword from the multi-channel audio signals includes: performing echo cancellation on each channel to obtain the multi-channel first time-frequency domain signals; and detecting the existence probability of the wake-up keyword from these first time-frequency domain signals.
For example, the smart speaker includes N microphones, i.e., N channels, and the microphone array they form may collect N audio signals. After echo cancellation, each channel yields one time-frequency domain signal, recorded as a first time-frequency domain signal; echo cancellation on the N audio signals therefore yields N first time-frequency domain signals, which may be the echo-cancelled data streams described above. The existence probability of the wake-up keyword is then detected from the N first time-frequency domain signals.
Optionally, echo cancellation is performed on the multiple paths of audio signals to obtain multiple paths of first time-frequency domain signals, including: performing Fourier transform on the multiple paths of audio signals respectively to obtain second time-frequency domain signals corresponding to the multiple paths of audio signals respectively; and carrying out echo cancellation on the second time-frequency domain signals corresponding to the multiple paths of audio signals respectively to obtain multiple paths of first time-frequency domain signals.
For example, each audio signal may be denoted as x_n(t), where n represents the channel index and t represents the time index. After a Fourier transform, x_n(t) yields a time-frequency domain signal X_n(t, f), where f represents the frequency-domain index. This time-frequency domain signal is denoted as a second time-frequency domain signal. That is, one audio signal is Fourier transformed to obtain one second time-frequency domain signal, and the N audio signals are Fourier transformed to obtain N second time-frequency domain signals.
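As an illustrative sketch only (not part of the patent's disclosed implementation), the per-channel Fourier-transform step can be written with NumPy; the Hann window, frame length, hop size, and the function name `stft_multichannel` are assumptions:

```python
import numpy as np

def stft_multichannel(x, frame_len=512, hop=256):
    """Fourier-transform each channel of an (N, T) multi-channel signal
    into its second time-frequency domain signal X_n(t, f).

    Returns a complex array of shape (N, frames, frame_len // 2 + 1),
    where axis 1 is the frame (time) index t and axis 2 the frequency
    index f.
    """
    n_ch, n_samples = x.shape
    window = np.hanning(frame_len)
    n_frames = 1 + (n_samples - frame_len) // hop
    X = np.empty((n_ch, n_frames, frame_len // 2 + 1), dtype=complex)
    for t in range(n_frames):
        frame = x[:, t * hop:t * hop + frame_len] * window
        X[:, t, :] = np.fft.rfft(frame, axis=-1)  # one bin per frequency f
    return X
```

Echo cancellation would then be applied per time-frequency bin to turn each X_n(t, f) into the corresponding first time-frequency domain signal S_n(t, f).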
Further, echo cancellation is performed on the second time-frequency domain signal corresponding to each audio signal. For example, the second time-frequency domain signal corresponding to the audio signal on the n-th channel is denoted as X_n(t, f). After the echo cancellation algorithm, X_n(t, f) yields the first time-frequency domain signal, denoted S_n(t, f). Therefore, passing the N second time-frequency domain signals through the echo cancellation algorithm yields N first time-frequency domain signals.
Further, after the N first time-frequency domain signals pass through a voice wake-up algorithm, the existence probability p(t) of the wake-up keyword is obtained. At a given time, if p(t) satisfies a preset condition, for example if p(t) is greater than a preset threshold, it is determined that a wake-up event has occurred.
Optionally, the covariance matrix of the multiple audio signals is determined according to the multiple first time-frequency domain signals.
For example, the smart speaker may determine the covariance matrix of the N audio signals from the N first time-frequency domain signals. The covariance matrix contains N×N elements, i.e. the covariance matrix has N rows and N columns. Letting m index the rows and n index the columns, the element Cor1_{m,n}(t, f) in the m-th row and n-th column of the covariance matrix can be expressed as the following formula (1):

Cor1_{m,n}(t, f) = E[S_m(t, f) · S_n*(t, f)]   (1)

where S_m(t, f) represents the m-th first time-frequency domain signal of the N first time-frequency domain signals, S_n*(t, f) represents the complex conjugate of the n-th first time-frequency domain signal S_n(t, f), and E represents the expectation.
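A minimal NumPy sketch of formula (1) (not from the patent; in practice the expectation E is approximated per frame, and the recursive smoothing factor used here for that purpose is an assumption):

```python
import numpy as np

def audio_covariance(S, smoothing=0.95):
    """Covariance of the audio signals, formula (1):
    Cor1_{m,n}(t, f) = E[S_m(t, f) * conj(S_n(t, f))],
    with the expectation approximated by exponential averaging over frames.

    S: complex array of shape (N, T, F) -- the first time-frequency domain
       signals S_n(t, f) after echo cancellation.
    Returns an array of shape (T, F, N, N).
    """
    n_ch, n_frames, n_freq = S.shape
    # Instantaneous outer products S_m(t, f) * conj(S_n(t, f))
    inst = np.einsum('mtf,ntf->tfmn', S, np.conj(S))
    cor1 = np.zeros_like(inst)
    cor1[0] = inst[0]
    for t in range(1, n_frames):
        cor1[t] = smoothing * cor1[t - 1] + (1 - smoothing) * inst[t]
    return cor1
```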
Optionally, the covariance matrix of the interference signal is determined according to a preset probability and the multiple first time-frequency domain signals, where the preset probability is related to the existence probability.
For example, while the covariance matrix of the N audio signals is calculated, the covariance matrix of the interference signal may also be calculated. This covariance matrix likewise contains N×N elements, i.e. it has N rows and N columns. Letting m index the rows and n index the columns, the element Cor2_{m,n}(t, f) in the m-th row and n-th column of the covariance matrix can be expressed as the following formula (2):

Cor2_{m,n}(t, f) = E[q(t) · S_m(t, f) · S_n*(t, f)]   (2)

where S_m(t, f) represents the m-th first time-frequency domain signal of the N first time-frequency domain signals, S_n*(t, f) represents the complex conjugate of the n-th first time-frequency domain signal S_n(t, f), E represents the expectation, and q(t) represents a preset probability related to the existence probability p(t) of the wake-up keyword described above.
Optionally, if the existence probability is greater than a preset value, the preset probability is 0; if the existence probability is smaller than or equal to a preset value, the preset probability is 1.
For example, q(t) = 0 when p(t) is greater than the preset value (threshold), and q(t) = 1 when p(t) is less than or equal to the preset value.
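Not part of the patent text: a hedged sketch of the gating rule above and of formula (2), which excludes frames likely to contain the wake-up keyword from the interference estimate. The threshold value and the smoothing factor approximating the expectation E are assumptions:

```python
import numpy as np

def preset_probability(p, threshold=0.5):
    """q(t) = 0 where the wake-word presence probability p(t) exceeds the
    preset value, else q(t) = 1."""
    return np.where(p > threshold, 0.0, 1.0)

def interference_covariance(S, q, smoothing=0.95):
    """Interference covariance, formula (2):
    Cor2_{m,n}(t, f) = E[q(t) * S_m(t, f) * conj(S_n(t, f))],
    with the expectation approximated by exponential averaging.

    S: complex array of shape (N, T, F); q: array of shape (T,).
    Returns an array of shape (T, F, N, N).
    """
    n_ch, n_frames, n_freq = S.shape
    inst = np.einsum('t,mtf,ntf->tfmn', q, S, np.conj(S))
    cor2 = np.zeros_like(inst)
    cor2[0] = inst[0]
    for t in range(1, n_frames):
        cor2[t] = smoothing * cor2[t - 1] + (1 - smoothing) * inst[t]
    return cor2
```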
When the wake-up event is detected, the smart speaker can calculate the covariance matrix of the sound source signal from the covariance matrix of the N audio signals and the covariance matrix of the interference signal. This covariance matrix contains N×N elements, i.e. it has N rows and N columns. Letting m index the rows and n index the columns, the element Cor_{m,n}(t, f) in the m-th row and n-th column of the covariance matrix can be expressed as the following formula (3):
Cor_{m,n}(t, f) = Cor1_{m,n}(t, f) − α · Cor2_{m,n}(t, f)   (3)
where α represents a weight factor.
It will be appreciated that each element in the covariance matrix of the sound source signal is calculated in the manner shown in formula (3), so all N×N elements of the covariance matrix of the sound source signal can be obtained from formula (3). This embodiment denotes the covariance matrix of the sound source signal as Cor(t, f).
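Formula (3) applied to the full matrices can be sketched in one line (the default value of the weight factor α is an assumption for illustration):

```python
import numpy as np

def source_covariance(cor1, cor2, alpha=1.0):
    """Formula (3): Cor(t, f) = Cor1(t, f) - alpha * Cor2(t, f).
    alpha is the weight factor; its default value here is assumed."""
    return np.asarray(cor1) - alpha * np.asarray(cor2)
```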
In this embodiment, a coordinate system similar to that shown in fig. 1 can be established with the center of the microphone array in the smart speaker as the origin of coordinates. In the coordinate system, the smart speaker may determine a preset range around it; for example, the preset range may be a three-dimensional space region centered on the origin of coordinates O, and the three-dimensional space region may be a spherical region, a cubic region, a rectangular parallelepiped region, or the like. Further, the smart speaker selects a plurality of positioning points from the preset range and calculates the guide vectors respectively corresponding to the positioning points. According to the guide vectors respectively corresponding to the positioning points and the covariance matrix of the sound source signal, the directional responses respectively corresponding to the positioning points can be calculated. In addition, locating the sound source includes determining the distance, azimuth, elevation, and the like of the sound source relative to the smart speaker. Therefore, this embodiment is schematically described taking the azimuth angle as an example.
As shown in fig. 5, a dashed line 51 represents a 360-degree azimuth range centered on the origin of coordinates O. In this embodiment, a number of azimuth points can be selected uniformly from the 360-degree azimuth range, for example P azimuth points; the point shown at 51 is any one of the P azimuth points, which is only schematically illustrated here and is not particularly limited. Taking the point 51 as an example, the angle between Ox and the line connecting the origin of coordinates O and the point 51 may be regarded as the azimuth angle corresponding to that point. Similarly, the azimuth angles corresponding to the other azimuth points can be determined. Further, for the P azimuth points, the guide vector corresponding to each point is calculated; for example, the guide vector corresponding to the p-th of the P azimuth points is denoted v_p(f). Then, from v_p(f) and the covariance matrix Cor(t, f) of the sound source signal, the directional response R_p(t) corresponding to that azimuth point is calculated. R_p(t) can be expressed as the following formula (4):

R_p(t) = Σ_f v_p^H(f) · Cor(t, f) · v_p(f)   (4)

where v_p^H(f) represents the conjugate transpose of the guide vector v_p(f).
It will be appreciated that, for the azimuth points other than the p-th one, the directional responses can likewise be calculated with reference to the calculation shown in formula (4). After the directional responses corresponding to the P azimuth points are calculated, this embodiment may determine the azimuth point with the largest directional response among the P points; for example, if the point 52 shown in fig. 5 corresponds to the largest directional response, the azimuth angle corresponding to the point 52 may be taken as the azimuth of the sound source.
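Putting formula (4) and the search over the P azimuth points together, a hedged sketch follows. The far-field guide-vector model, the planar microphone geometry, the speed of sound, and the function names are illustrative assumptions, not the patent's specification:

```python
import numpy as np

SOUND_SPEED = 343.0  # m/s, assumed

def steering_vectors(mic_xy, azimuths, freqs):
    """Far-field guide vector v_p(f) for each candidate azimuth.

    mic_xy: (N, 2) microphone coordinates in metres
    azimuths: (P,) candidate azimuth angles in radians
    freqs: (F,) frequency-bin centres in Hz
    Returns a complex array of shape (P, F, N).
    """
    directions = np.stack([np.cos(azimuths), np.sin(azimuths)], axis=1)  # (P, 2)
    delays = mic_xy @ directions.T / SOUND_SPEED                          # (N, P)
    phase = -2j * np.pi * freqs[None, :, None] * delays.T[:, None, :]    # (P, F, N)
    return np.exp(phase)

def localize_azimuth(cor, v):
    """Directional response per formula (4), summed over frequency:
    R_p(t) = sum_f v_p(f)^H Cor(t, f) v_p(f).
    Returns, per frame, the index of the azimuth point with the
    largest directional response.

    cor: (T, F, N, N) sound-source covariance; v: (P, F, N) guide vectors.
    """
    R = np.einsum('pfm,tfmn,pfn->tp', np.conj(v), cor, v).real
    return np.argmax(R, axis=1)
```

As a sanity check, a covariance built from a single guide vector should be localized back to that azimuth point.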
This embodiment uses the existence probability of the wake-up keyword output by the voice wake-up model to assist the sound source localization process, so no additional noise-reduction model is needed. Meanwhile, since the voice wake-up model responds to the preset wake-up keyword, this embodiment can accurately locate the sound source in scenes with voice interference and a low signal-to-noise ratio. Experiments show that the accuracy of wake-up sound source localization exceeds 90% at a Signal-to-Noise Ratio (SNR) of 0 dB under both diffuse noise interference and point-source interference. This embodiment therefore provides a noise-robust wake-up sound source localization algorithm with high accuracy and low complexity.
Fig. 6 is a flowchart of a control method of an intelligent device according to another embodiment of the disclosure. The method may be performed by a smart device or may be performed by a cloud server. For example, a cloud server is schematically illustrated. As shown in fig. 7, the cloud server 71 may communicate with the smart device 31, and the specific communication manner may be a wireless communication manner or other communication manners, which is not limited herein. In this embodiment, the method specifically includes the following steps:
S601, acquiring multiple paths of audio signals acquired by intelligent equipment.
For example, a microphone array is provided in the smart device 31, which can collect multiple audio signals. Further, the smart device 31 may send the multiplexed audio signal to the cloud server 71.
S602, detecting event occurrence of waking up the intelligent equipment according to the multipath audio signals.
The cloud server 71 may detect whether a wake-up event occurs in real time according to the multiple audio signals collected by the smart device 31, where the wake-up event may specifically be an event of waking up the smart device 31. When the cloud server 71 detects that a wake event occurs, the following steps are performed.
S603, determining a covariance matrix for controlling the sound source signal of the intelligent device according to the covariance matrix of the multipath audio signals and the covariance matrix of the interference signals obtained according to the multipath audio signals.
For example, when the cloud server 71 detects that a wake-up event occurs, a covariance matrix for controlling the sound source signal of the smart device 31 may be determined according to the covariance matrix of the multi-channel audio signal and the covariance matrix of the interference signal. The sound source signal may be a control voice of the user 32 to the smart device 31. The calculation process of the covariance matrix of the multipath audio signal, the covariance matrix of the interference signal, and the covariance matrix of the sound source signal is described above, and will not be repeated here.
S604, determining the directional responses respectively corresponding to the plurality of positioning points according to the covariance matrix of the sound source signal and the guide vectors respectively corresponding to the plurality of positioning points distributed in the preset range of the intelligent equipment, and taking the positioning point with the largest directional response among the plurality of positioning points as the positioning point of the sound source.
Specifically, the cloud server 71 may establish a coordinate system with the center of the smart device 31 as the origin of coordinates, for example, the coordinate system shown in fig. 1. Further, the cloud server 71 may determine a preset range around the smart device 31 in the coordinate system; for example, the preset range may be a three-dimensional space region centered on the origin of coordinates O, and the three-dimensional space region may be a spherical region, a cubic region, a rectangular parallelepiped region, or the like. Further, the cloud server 71 selects a plurality of positioning points from the preset range and calculates the guide vectors respectively corresponding to the positioning points. According to the guide vectors respectively corresponding to the positioning points and the covariance matrix of the sound source signal, the directional responses respectively corresponding to the positioning points can be calculated, and the positioning point with the largest directional response is used as the positioning point of the sound source.
And S605, controlling the intelligent equipment to indicate the azimuth of the sound source relative to the intelligent equipment according to the positioning point of the sound source.
For example, after the cloud server 71 determines the location point of the sound source, the cloud server 71 may send a control instruction to the smart device 31, where the control instruction is used to control the smart device 31 to indicate the azimuth of the sound source relative to the smart device 31. For example, a preset button 72 is provided on the smart device 31, and the cloud server 71 may control the smart device 31 to rotate such that the preset button 72 is aligned with or directed toward the user 32.
Fig. 8 is a flowchart of a control method of an intelligent sound box according to another embodiment of the disclosure. The method may be performed by a smart speaker or may be performed by a cloud server. For example, a smart speaker implementation is schematically illustrated. In this embodiment, the method specifically includes the following steps:
s801, acquiring multiple paths of audio signals acquired by the intelligent sound box.
For example, a microphone array is arranged in the intelligent sound box, and the microphone array can collect multiple paths of audio signals.
S802, detecting the occurrence of an event of waking up the intelligent sound box according to the multipath audio signals.
For example, the smart speaker may detect, in real time, whether a wake-up event, which may specifically be an event that wakes up the smart speaker, occurs based on the multiple audio signals. When the intelligent sound box detects that the wake-up event occurs, the following steps are executed.
S803, determining a covariance matrix for controlling the sound source signal of the intelligent sound box according to the covariance matrix of the multipath audio signals and the covariance matrix of the interference signals obtained according to the multipath audio signals.
For example, when the smart speaker detects the occurrence of a wake-up event, a covariance matrix for controlling a sound source signal of the smart speaker may be determined according to a covariance matrix of the multi-path audio signal and a covariance matrix of an interference signal. The sound source signal may be control speech of a sound source. The calculation process of the covariance matrix of the multipath audio signal, the covariance matrix of the interference signal, and the covariance matrix of the sound source signal is described above, and will not be repeated here.
S804, determining the corresponding directional responses of the plurality of positioning points according to the covariance matrix of the sound source signal and the guide vectors respectively corresponding to the plurality of positioning points distributed in the preset range of the intelligent sound box, and taking the positioning point with the largest directional response among the plurality of positioning points as the positioning point of the sound source.
Specifically, the smart speaker may establish a coordinate system using its own center as the origin of coordinates, for example, the coordinate system shown in fig. 9. Further, the smart speaker may determine a preset range around it in the coordinate system; for example, the preset range may be a three-dimensional space region centered on the origin of coordinates O, and the three-dimensional space region may be a spherical region, a cubic region, a rectangular parallelepiped region, or the like. Further, the smart speaker selects a plurality of positioning points from the preset range and calculates the guide vectors respectively corresponding to the positioning points. According to the guide vectors respectively corresponding to the positioning points and the covariance matrix of the sound source signal, the directional responses respectively corresponding to the positioning points can be calculated, and the positioning point with the largest directional response is used as the positioning point of the sound source; for example, the positioning point 90 shown in fig. 9 is the positioning point of the sound source.
S805, determining the azimuth of the sound source relative to the intelligent sound box according to the locating point of the sound source.
For example, from the positioning point 90, the azimuth of the sound source relative to the smart speaker may be determined.
S806, according to the azimuth, the indication lamp corresponding to the azimuth in the intelligent sound box is controlled to be turned on.
In addition, the smart speaker in this embodiment may further be provided with a plurality of indicator lamps; for example, 91 shown in fig. 9 is any one of the plurality of indicator lamps. Further, the smart speaker can control the corresponding indicator lamp 92 to light up according to the azimuth of the sound source relative to the smart speaker.
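For illustration only (the patent does not specify the lamp layout), mapping the estimated azimuth to the nearest of a ring of evenly spaced indicator lamps could look like the following; the lamp count and spacing are assumptions:

```python
def lamp_index(azimuth_deg, n_lamps=12):
    """Map a sound-source azimuth in degrees to the index of the nearest
    of n_lamps indicator lamps spaced evenly around the speaker, with
    lamp 0 assumed to sit at azimuth 0."""
    spacing = 360.0 / n_lamps
    return int(round((azimuth_deg % 360.0) / spacing)) % n_lamps
```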
It can be understood that, according to the positioning point of the sound source, the control mode of the intelligent sound box or the intelligent device is not limited to the above embodiments, and in other embodiments, the intelligent sound box or the intelligent device may be controlled by other control modes, which is not described herein.
Fig. 10 is a schematic structural diagram of a sound source positioning device according to an embodiment of the present disclosure. The sound source localization apparatus provided in the embodiment of the present disclosure may execute the processing flow provided in the embodiment of the sound source localization method, as shown in fig. 10, the sound source localization apparatus 100 includes:
An acquisition module 101, configured to acquire multiple paths of audio signals acquired by an intelligent device;
the detection module 102 is configured to detect, according to the multiple audio signals, an event occurrence that wakes up the intelligent device;
a first determining module 103, configured to determine a covariance matrix for controlling a sound source signal of the smart device according to the covariance matrix of the multiple audio signals and a covariance matrix of an interference signal obtained according to the multiple audio signals;
and the second determining module 104 is configured to determine, according to the covariance matrix of the sound source signal and the guide vectors respectively corresponding to the plurality of positioning points distributed in the preset range of the intelligent device, the directional responses respectively corresponding to the plurality of positioning points, and use the positioning point with the largest directional response among the plurality of positioning points as the positioning point of the sound source.
Optionally, when the detection module 102 detects that an event of waking up the smart device occurs according to the multiple audio signals, the detection module is specifically configured to:
detecting the existence probability of the wake-up keyword according to the multipath audio signals;
and if the existence probability meets a preset condition, determining that an event for waking up the intelligent equipment occurs.
Optionally, the detecting module 102 is specifically configured to, when detecting the existence probability of the wake-up keyword according to the multiple audio signals:
Respectively carrying out echo cancellation on the multiple paths of audio signals to obtain multiple paths of first time-frequency domain signals;
and detecting the existence probability of the wake-up keyword according to the multiple first time-frequency domain signals.
Optionally, the detection module 102 performs echo cancellation on the multiple paths of audio signals respectively, so as to obtain multiple paths of first time-frequency domain signals, which are specifically configured to:
performing Fourier transform on the multiple paths of audio signals respectively to obtain second time-frequency domain signals corresponding to the multiple paths of audio signals respectively;
and carrying out echo cancellation on the second time-frequency domain signals corresponding to the multiple paths of audio signals respectively to obtain multiple paths of first time-frequency domain signals.
Optionally, the covariance matrix of the multiple audio signals is determined according to the multiple first time-frequency domain signals.
Optionally, the covariance matrix of the interference signal is determined according to a preset probability and the multiple first time-frequency domain signals, where the preset probability is related to the existence probability.
Optionally, if the existence probability is greater than a preset value, the preset probability is 0;
if the existence probability is smaller than or equal to a preset value, the preset probability is 1.
The sound source positioning device of the embodiment shown in fig. 10 may be used to implement the technical solution of the above-mentioned method embodiment, and its implementation principle and technical effects are similar, and will not be described herein again.
Fig. 11 is a schematic structural diagram of a control device of an intelligent device according to an embodiment of the disclosure. The control device of the smart device provided in the embodiment of the present disclosure may execute the processing flow provided in the control method embodiment of the smart device, as shown in fig. 11, where the control device 110 of the smart device includes:
an acquisition module 111, configured to acquire multiple paths of audio signals acquired by the intelligent device;
the detection module 112 is configured to detect, according to the multiple audio signals, an event occurrence of waking up the smart device;
a first determining module 113, configured to determine a covariance matrix for controlling a sound source signal of the smart device according to the covariance matrix of the multiple audio signals and a covariance matrix of an interference signal obtained according to the multiple audio signals;
the second determining module 114 is configured to determine, according to the covariance matrix of the sound source signal and the guide vectors respectively corresponding to the plurality of positioning points distributed in the preset range of the intelligent device, the directional responses respectively corresponding to the plurality of positioning points, and take the positioning point with the largest directional response among the plurality of positioning points as the positioning point of the sound source;
and the control module 115 is used for controlling the intelligent device to indicate the azimuth of the sound source relative to the intelligent device according to the positioning point of the sound source.
The control device of the intelligent device in the embodiment shown in fig. 11 may be used to implement the technical solution of the above method embodiment, and its implementation principle and technical effects are similar, and are not described herein again.
Fig. 12 is a schematic structural diagram of a control device for an intelligent sound box according to an embodiment of the disclosure. The control device for an intelligent sound box provided in the embodiments of the present disclosure may execute the processing flow provided in the control method embodiment of the intelligent sound box, as shown in fig. 12, where the control device 120 for an intelligent sound box includes:
the acquisition module 121 is configured to acquire multiple paths of audio signals acquired by the intelligent sound box;
the detection module 122 is configured to detect, according to the multiple audio signals, an event occurrence of waking up the smart speaker;
a first determining module 123, configured to determine a covariance matrix for controlling a sound source signal of the smart speaker according to the covariance matrix of the multiple audio signals and a covariance matrix of an interference signal obtained according to the multiple audio signals;
the second determining module 124 is configured to determine, according to the covariance matrix of the sound source signal and the guide vectors respectively corresponding to the plurality of positioning points distributed in the preset range of the intelligent sound box, the directional responses respectively corresponding to the plurality of positioning points, and take the positioning point with the largest directional response among the plurality of positioning points as the positioning point of the sound source;
A third determining module 125, configured to determine, according to a positioning point of the sound source, a position of the sound source relative to the intelligent sound box;
and the control module 126 is used for controlling the indication lamp corresponding to the azimuth in the intelligent sound box to be lightened according to the azimuth.
The control device of the intelligent sound box in the embodiment shown in fig. 12 may be used to implement the technical solution of the above method embodiment, and its implementation principle and technical effects are similar, and are not repeated here.
The above describes the internal functions and structures of a sound source localization device, a control device for an intelligent device, and a control device for an intelligent sound box, which can be implemented as an electronic device. Fig. 13 is a schematic structural diagram of an embodiment of an electronic device according to an embodiment of the disclosure. As shown in fig. 13, the electronic device includes a memory 131 and a processor 132.
The memory 131 is used to store programs. In addition to the programs described above, the memory 131 may also be configured to store other various data to support operations on the electronic device. Examples of such data include instructions for any application or method operating on the electronic device, contact data, phonebook data, messages, pictures, videos, and the like.
The memory 131 may be implemented by any type of volatile or non-volatile memory device or combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The processor 132 is coupled to the memory 131, and executes a program stored in the memory 131 for:
acquiring multiple paths of audio signals acquired by intelligent equipment;
detecting event occurrence of waking up the intelligent device according to the multi-channel audio signals;
determining a covariance matrix for controlling a sound source signal of the intelligent device according to the covariance matrix of the multipath audio signals and the covariance matrix of the interference signals obtained according to the multipath audio signals;
according to the covariance matrix of the sound source signal and the guide vectors respectively corresponding to a plurality of positioning points distributed in the preset range of the intelligent equipment, determining the directional responses respectively corresponding to the plurality of positioning points, and taking the positioning point with the largest directional response among the plurality of positioning points as the positioning point of the sound source.
Alternatively, the processor 132 is further configured to:
acquiring multiple paths of audio signals acquired by intelligent equipment;
detecting event occurrence of waking up the intelligent device according to the multi-channel audio signals;
determining a covariance matrix for controlling a sound source signal of the intelligent device according to the covariance matrix of the multipath audio signals and the covariance matrix of the interference signals obtained according to the multipath audio signals;
according to the covariance matrix of the sound source signal and the guide vectors respectively corresponding to a plurality of positioning points distributed in the preset range of the intelligent equipment, determining the directional responses respectively corresponding to the plurality of positioning points, and taking the positioning point with the largest directional response among the plurality of positioning points as the positioning point of the sound source;
and controlling the intelligent equipment to indicate the azimuth of the sound source relative to the intelligent equipment according to the positioning point of the sound source.
Alternatively, the processor 132 is further configured to:
acquiring multiple paths of audio signals collected by an intelligent sound box;
detecting the occurrence of an event of waking up the intelligent sound box according to the multipath audio signals;
determining a covariance matrix for controlling a sound source signal of the intelligent sound box according to the covariance matrix of the multipath audio signals and the covariance matrix of the interference signals obtained according to the multipath audio signals;
According to the covariance matrix of the sound source signal and the guide vectors respectively corresponding to a plurality of positioning points distributed in the preset range of the intelligent sound box, determining the directional responses respectively corresponding to the plurality of positioning points, and taking the positioning point with the largest directional response among the plurality of positioning points as the positioning point of the sound source;
determining the azimuth of the sound source relative to the intelligent sound box according to the positioning point of the sound source;
and according to the azimuth, controlling an indicator lamp corresponding to the azimuth in the intelligent sound box to be lightened.
Further, as shown in fig. 13, the electronic device may further include: communication component 133, power component 134, audio component 135, display 136, and other components. Only some of the components are schematically shown in fig. 13, which does not mean that the electronic device only comprises the components shown in fig. 13.
The communication component 133 is configured to facilitate communication between the electronic device and other devices, either wired or wireless. The electronic device may access a wireless network based on a communication standard, such as WiFi,2G, or 3G, or a combination thereof. In one exemplary embodiment, the communication component 133 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 133 further includes a Near Field Communication (NFC) module to facilitate short range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, ultra Wideband (UWB) technology, bluetooth (BT) technology, and other technologies.
The power component 134 provides power to the various components of the electronic device. The power component 134 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device.
The audio component 135 is configured to output and/or input audio signals. For example, the audio component 135 includes a Microphone (MIC) configured to receive external audio signals when the electronic device is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signal may be further stored in the memory 131 or transmitted via the communication component 133. In some embodiments, audio component 135 further comprises a speaker for outputting audio signals.
The display 136 includes a screen, which may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, it may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or swipe action, but also the duration and pressure associated with the touch or swipe operation.
In addition, the embodiments of the present disclosure further provide a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the sound source localization method, the control method of the intelligent device, or the control method of the intelligent sound box described in the foregoing embodiments.
It should be noted that in this document, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The foregoing is merely a specific embodiment of the disclosure to enable one skilled in the art to understand or practice the disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown and described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (12)
1. A sound source localization method, wherein the method comprises:
acquiring multi-channel audio signals collected by an intelligent device;
detecting, according to the multi-channel audio signals, the occurrence of an event that wakes up the intelligent device;
determining a covariance matrix of a sound source signal for controlling the intelligent device, according to the covariance matrix of the multi-channel audio signals and the covariance matrix of interference signals obtained from the multi-channel audio signals;
determining directional responses respectively corresponding to a plurality of localization points distributed within a preset range of the intelligent device, according to the covariance matrix of the sound source signal and the steering vectors respectively corresponding to the localization points, and taking the localization point with the largest directional response among the plurality of localization points as the localization point of the sound source.
2. The method of claim 1, wherein detecting, according to the multi-channel audio signals, the occurrence of an event that wakes up the intelligent device comprises:
detecting the existence probability of a wake-up keyword according to the multi-channel audio signals;
and if the existence probability meets a preset condition, determining that an event waking up the intelligent device has occurred.
3. The method of claim 2, wherein detecting the existence probability of the wake-up keyword according to the multi-channel audio signals comprises:
performing echo cancellation on the multi-channel audio signals respectively to obtain multi-channel first time-frequency domain signals;
and detecting the existence probability of the wake-up keyword according to the multi-channel first time-frequency domain signals.
4. The method according to claim 3, wherein performing echo cancellation on the multi-channel audio signals respectively to obtain the multi-channel first time-frequency domain signals comprises:
performing Fourier transform on the multi-channel audio signals respectively to obtain second time-frequency domain signals respectively corresponding to the multi-channel audio signals;
and performing echo cancellation on the second time-frequency domain signals respectively corresponding to the multi-channel audio signals to obtain the multi-channel first time-frequency domain signals.
5. The method of claim 4, wherein the covariance matrix of the multi-channel audio signals is determined from the multi-channel first time-frequency domain signals.
6. The method of claim 4, wherein the covariance matrix of the interference signals is determined from a preset probability and the multi-channel first time-frequency domain signals, the preset probability being related to the existence probability.
7. The method of claim 6, wherein the preset probability is 0 if the existence probability is greater than a preset value;
and the preset probability is 1 if the existence probability is less than or equal to the preset value.
8. A control method of an intelligent device, the method comprising:
acquiring multi-channel audio signals collected by an intelligent device;
detecting, according to the multi-channel audio signals, the occurrence of an event that wakes up the intelligent device;
determining a covariance matrix of a sound source signal for controlling the intelligent device according to the covariance matrix of the multi-channel audio signals and the covariance matrix of interference signals obtained from the multi-channel audio signals;
determining directional responses respectively corresponding to a plurality of localization points distributed within a preset range of the intelligent device according to the covariance matrix of the sound source signal and the steering vectors respectively corresponding to the localization points, and taking the localization point with the largest directional response among the plurality of localization points as the localization point of the sound source;
and controlling, according to the localization point of the sound source, the intelligent device to indicate the azimuth of the sound source relative to the intelligent device.
9. A control method of an intelligent sound box, characterized by comprising:
acquiring multi-channel audio signals collected by an intelligent sound box;
detecting, according to the multi-channel audio signals, the occurrence of an event that wakes up the intelligent sound box;
determining a covariance matrix of a sound source signal for controlling the intelligent sound box according to the covariance matrix of the multi-channel audio signals and the covariance matrix of interference signals obtained from the multi-channel audio signals;
determining directional responses respectively corresponding to a plurality of localization points distributed within a preset range of the intelligent sound box according to the covariance matrix of the sound source signal and the steering vectors respectively corresponding to the localization points, and taking the localization point with the largest directional response among the plurality of localization points as the localization point of the sound source;
determining the azimuth of the sound source relative to the intelligent sound box according to the localization point of the sound source;
and controlling, according to the azimuth, an indicator lamp corresponding to the azimuth on the intelligent sound box to light up.
10. A sound source localization apparatus, comprising:
an acquisition module, configured to acquire multi-channel audio signals collected by an intelligent device;
a detection module, configured to detect, according to the multi-channel audio signals, the occurrence of an event that wakes up the intelligent device;
a first determining module, configured to determine a covariance matrix of a sound source signal for controlling the intelligent device according to the covariance matrix of the multi-channel audio signals and the covariance matrix of interference signals obtained from the multi-channel audio signals;
and a second determining module, configured to determine directional responses respectively corresponding to a plurality of localization points distributed within a preset range of the intelligent device according to the covariance matrix of the sound source signal and the steering vectors respectively corresponding to the localization points, and to take the localization point with the largest directional response among the plurality of localization points as the localization point of the sound source.
11. An electronic device, comprising:
a memory;
a processor; and
a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor to implement the method of any one of claims 1-9.
12. A computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the method of any of claims 1-9.
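Claims 6-7 define a hard 0/1 gate on the interference-covariance estimate: frames in which the wake-keyword existence probability exceeds the preset value (preset probability 0) must not contaminate the interference estimate, while all other frames (preset probability 1) update it. One way to apply that gate is sketched below; the recursive-averaging update and the `forget` factor are assumptions for illustration — the claims fix only the gate itself:

```python
import numpy as np

def update_interference_cov(cov, frame, existence_prob, preset_value=0.5, forget=0.95):
    """Gate the interference-covariance update as in claims 6-7 (the update
    rule and parameter names are illustrative assumptions).

    cov:            (M, M) running interference covariance estimate
    frame:          (M,)   one multi-channel time-frequency frame
    existence_prob: wake-keyword existence probability for this frame
    """
    # Claim 7: preset probability is 0 above the preset value, else 1.
    preset_prob = 0.0 if existence_prob > preset_value else 1.0
    if preset_prob == 0.0:
        # Keyword likely present: freeze the interference estimate.
        return cov
    # Keyword likely absent: recursive average with the instantaneous
    # outer product of the frame (assumed update rule).
    return forget * cov + (1.0 - forget) * np.outer(frame, frame.conj())
```

The frozen branch keeps the wake word itself out of the interference model, so the subsequent directional-response search emphasizes the direction of the speaker rather than the noise field.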
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310508840.8A CN116819441A (en) | 2023-05-08 | 2023-05-08 | Sound source positioning and controlling method, device, equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116819441A (en) | 2023-09-29 |
Family
ID=88141993
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310508840.8A (Pending) | Sound source positioning and controlling method, device, equipment and storage medium | 2023-05-08 | 2023-05-08 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116819441A (en) |
2023-05-08: Application CN202310508840.8A filed (CN); published as CN116819441A; status: Pending
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |