CN112799016A - Sound source positioning method, sound source positioning device, computer-readable storage medium and electronic equipment


Info

Publication number
CN112799016A
Authority
CN
China
Prior art keywords
audio signal
channel
sound source
word
frequency
Prior art date
Legal status
Granted
Application number
CN202011552864.6A
Other languages
Chinese (zh)
Other versions
CN112799016B (en)
Inventor
胡玉祥
Current Assignee
Beijing Horizon Information Technology Co Ltd
Original Assignee
Beijing Horizon Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Horizon Information Technology Co Ltd
Priority to CN202011552864.6A
Publication of CN112799016A
Application granted
Publication of CN112799016B
Status: Active


Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01S RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S5/00 Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
    • G01S5/18 Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations, using ultrasonic, sonic, or infrasonic waves
    • G01S5/20 Position of source determined by a plurality of spaced direction-finders

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Measurement Of Velocity Or Position Using Acoustic Or Ultrasonic Waves (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

Embodiments of the present disclosure provide a sound source localization method, a sound source localization apparatus, a computer-readable storage medium, and an electronic device. The sound source localization method includes: performing voice separation on an original mixed audio signal of a sound source collected by a microphone array to obtain a multi-channel separated audio signal; determining the time period in which a wake-up word is located from the multi-channel separated audio signal; determining, from the original mixed audio signal, a mixed multi-channel audio signal corresponding to the time period in which the wake-up word is located; determining, from the multi-channel separated audio signal, the single-channel audio signal in which the wake-up word is located; and localizing the sound source based on the mixed multi-channel audio signal and the single-channel audio signal in which the wake-up word is located. This scheme can greatly improve the accuracy of sound source localization.

Description

Sound source positioning method, sound source positioning device, computer-readable storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of sound source localization, and in particular, to a sound source localization method, a sound source localization apparatus, a computer-readable storage medium, and an electronic device.
Background
With the continuous development of intelligent voice interaction technology, more and more intelligent interactive devices have emerged, such as smart televisions, smart speakers, smart home appliances, smart robots, and in-vehicle smart interactive devices. After waking an intelligent interactive device with a wake-up word, a user can interact with it by voice and instruct it to complete operations such as playing music or broadcasting the weather.
After the intelligent interactive device is woken up, the direction of the wake-up word can be determined from the voice signal picked up by the microphone, and voice can then be picked up directionally in that direction to reduce noise interference. However, when an external interfering sound source is louder than the wake-up word uttered by the user, the localization result of the intelligent interactive device is usually the direction of the interfering sound source, which greatly reduces the accuracy of sound source localization and degrades the human-computer interaction experience.
Disclosure of Invention
In view of this, embodiments of the present disclosure provide a sound source positioning method, a sound source positioning device, a computer-readable storage medium, and an electronic device, which can greatly improve the accuracy of sound source positioning.
According to a first aspect of the embodiments of the present disclosure, there is provided a sound source localization method, including: performing voice separation on an original mixed audio signal of a sound source collected by a microphone array to obtain a multi-channel separated audio signal; determining the time period in which a wake-up word is located from the multi-channel separated audio signal; determining, from the original mixed audio signal, a mixed multi-channel audio signal corresponding to the time period in which the wake-up word is located; determining, from the multi-channel separated audio signal, the single-channel audio signal in which the wake-up word is located; and localizing the sound source based on the mixed multi-channel audio signal and the single-channel audio signal in which the wake-up word is located.
According to a second aspect of the embodiments of the present disclosure, there is provided a sound source localization apparatus, including: a voice separation module for performing voice separation on an original mixed audio signal of a sound source collected by a microphone array to obtain a multi-channel separated audio signal; a first determining module for determining the time period in which a wake-up word is located from the multi-channel separated audio signal; a second determining module for determining, from the original mixed audio signal, a mixed multi-channel audio signal corresponding to the time period in which the wake-up word is located; a third determining module for determining, from the multi-channel separated audio signal, the single-channel audio signal in which the wake-up word is located; and a localization module for localizing the sound source based on the mixed multi-channel audio signal and the single-channel audio signal in which the wake-up word is located.
According to a third aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium storing a computer program for executing the sound source localization method according to any one of the above.
According to a fourth aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including: a processor; a memory for storing the processor-executable instructions; the processor is configured to execute any one of the sound source localization methods described above.
According to the sound source localization method, apparatus, computer-readable storage medium, and electronic device of the embodiments of the present disclosure, determining the time period in which the wake-up word is located and performing sound source localization with the mixed multi-channel audio signal corresponding to that time period eliminates interference from audio signals in other time periods, while also improving the efficiency of subsequent processing and reducing the amount of processing. In addition, determining the single-channel audio signal in which the wake-up word is located from the voice-separated multi-channel audio signal yields a single-channel signal that contains only, or mainly, the wake-up word, so that the wake-up word can be localized in a targeted manner, improving the accuracy of sound source localization.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing in more detail embodiments of the present disclosure with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the principles of the disclosure and not to limit the disclosure. In the drawings, like reference numbers generally represent like parts or steps.
Fig. 1a is a schematic view of an intelligent home scene to which an embodiment of the present disclosure is applied.
Fig. 1b is a schematic view of a vehicle scene to which an embodiment of the present disclosure is applied.
Fig. 2 is a schematic flow chart of a sound source localization method according to an exemplary embodiment of the present disclosure.
Fig. 3 is a schematic structural diagram of a sound source localization apparatus according to an exemplary embodiment of the present disclosure.
Fig. 4 is a schematic flowchart illustrating a single-channel audio signal for determining a wakeup word from multiple channels of separated audio signals in a sound source localization method according to an exemplary embodiment of the present disclosure.
Fig. 5 is a schematic flowchart illustrating a sound source localization method according to another exemplary embodiment of the present disclosure.
Fig. 6 is a schematic structural diagram of a sound source localization apparatus according to an exemplary embodiment of the present disclosure.
Fig. 7 is a block diagram illustrating a sound source localization apparatus according to an embodiment of the present disclosure.
Fig. 8 is a block diagram illustrating a third determining module of a sound source localization apparatus according to an exemplary embodiment of the disclosure.
Fig. 9 is a block diagram illustrating a positioning module of a sound source positioning device according to an exemplary embodiment of the present disclosure.
Fig. 10 is a block diagram illustrating a positioning unit of a sound source positioning device according to an exemplary embodiment of the present disclosure.
Fig. 11 is a block diagram illustrating an electronic device according to an exemplary embodiment of the disclosure.
Detailed Description
Hereinafter, example embodiments according to the present disclosure will be described in detail with reference to the accompanying drawings. It is to be understood that the described embodiments are merely a subset of the embodiments of the present disclosure and not all embodiments of the present disclosure, with the understanding that the present disclosure is not limited to the example embodiments described herein.
Summary of the application
In existing sound source localization approaches, when an external interfering sound source is louder than the wake-up word uttered by the user, the localization result of the intelligent interactive device is usually the direction of the interfering sound source, because the device cannot localize the wake-up word signal in a targeted manner; this reduces localization accuracy and degrades the human-computer interaction experience.
In view of the above technical problems, the basic concept of the present disclosure is to provide a sound source localization method, a sound source localization apparatus, a computer-readable storage medium, and an electronic device. Determining the time period in which the wake-up word is located and performing sound source localization with the mixed multi-channel audio signal corresponding to that time period eliminates interference from audio signals in other time periods, while also improving the efficiency of subsequent processing and reducing the amount of processing. In addition, determining the single-channel audio signal in which the wake-up word is located from the voice-separated multi-channel audio signal yields a single-channel signal that contains only, or mainly, the wake-up word, so that the wake-up word can be localized in a targeted manner, improving the accuracy of sound source localization.
It should be noted that the sound source positioning method may be applied to a scene that needs a wakeup word to wake up an intelligent interactive device, such as an intelligent home, an intelligent automobile, and an intelligent robot, and the disclosure is not limited to this specifically.
The specific implementation form of the intelligent interaction device may be adjusted according to an actual application scenario, for example, the intelligent interaction device may be an intelligent interaction robot, an intelligent vehicle, or the like.
Exemplary System
Fig. 1a is a schematic view of a smart home scene to which an embodiment of the present disclosure is applied. As shown in fig. 1a, the scene includes an intelligent interactive robot 10, a wake-up word sound source 20, and at least one interfering sound source 30. The intelligent interactive robot 10 includes a microphone array for acquiring audio signals, for example a linear microphone array comprising a plurality of microphone units (e.g., the microphone units 11, 12, 13, 14 in fig. 1a) distributed at predetermined intervals along the same straight line. The wake-up word sound source 20 may be a user uttering a wake-up word signal; when the microphone array in the intelligent interactive robot 10 collects the wake-up word signal uttered by the user, the intelligent interactive robot 10 is woken up and performs the corresponding operation in response to the specific wake-up word. The interfering sound source 30 may be a device emitting an interfering signal (e.g., a television or a game machine), a person, etc.
By adopting the sound source localization method provided by the embodiments of the present disclosure, even if the interfering sound source 30 (e.g., a television) is louder than the wake-up word sound source 20 (e.g., a user uttering a wake-up word signal), the intelligent interactive robot 10 can accurately locate the wake-up word sound source 20 and will not mistake the position of the louder interfering sound source 30 for the position of the wake-up word sound source 20.
Fig. 1b is a schematic view of a vehicle scene to which an embodiment of the present disclosure is applied. As shown in fig. 1b, the scene includes a smart vehicle 40, a wake-up word sound source 20, and at least one interfering sound source 30. The smart vehicle 40 includes a distributed microphone array for collecting wake-up word signals uttered by occupants of the vehicle. For example, the distributed microphone array comprises 4 microphone units (e.g., the microphone units 41, 42, 43, 44 in fig. 1b), which can be arranged on the four doors or at the four seats; it should be understood that the present disclosure does not specifically limit the arrangement position or arrangement mode of the microphone array. The wake-up word sound source 20 may be a person inside the vehicle uttering a wake-up word signal, and the interfering sound source 30 may be a person inside the vehicle emitting an interfering signal, wind noise, etc., which the present disclosure does not specifically limit.
By adopting the sound source localization method provided by the embodiments of the present disclosure, even if the interfering sound source 30 is louder than the wake-up word sound source 20, the smart vehicle 40 can accurately locate the wake-up word sound source 20 and will not mistake the position of the louder interfering sound source 30 for the position of the wake-up word sound source 20.
Exemplary method
Fig. 2 is a schematic flow chart of a sound source localization method according to an exemplary embodiment of the present disclosure. The embodiment can be applied to an intelligent interactive device, as shown in fig. 2, the sound source localization method includes the following steps.
Step 201, performing voice separation on an original mixed audio signal of a sound source collected by a microphone array to obtain a multi-channel separated audio signal.
Referring to fig. 1a and 1b, the microphone array is composed of a plurality of microphone units arranged according to a certain rule, the number of the microphone units is two or more, for example, 2, 4, 6, etc., and the present disclosure does not specifically limit the number of the microphone units in the microphone array. In addition, the microphone array may be a linear microphone array, a distributed microphone array, or the like, and the specific arrangement form of the microphone array is not limited in the present disclosure. It should be understood that the distributed microphone array may also be arranged in a regular pattern, such as a ring, a sphere, etc., and the present disclosure does not specifically limit the arrangement of the microphone elements.
The microphone array can collect audio signals within a certain range, and in the process of collecting the wake-up word signal it may also collect interference signals other than the wake-up word signal (such as television sound, music, or the chatter of people nearby). Therefore, the original mixed audio signal collected by the microphone array may include the wake-up word signal and the interference signals mixed together.
The number of channels of the original mixed audio signal is determined by the number of microphone elements in the microphone array. Specifically, if the number of microphone units in the microphone array is N, the acquired original mixed audio signal corresponds to N channels, where each microphone unit corresponds to an original mixed audio signal of one channel, and a wake-up word signal and an interference signal in the original mixed audio signal of each channel are mixed together.
In order to distinguish the wake-up word signal from the interference signal for the purpose of positioning the wake-up word signal in a targeted manner, it is necessary to perform voice separation on the original mixed audio signal. In an embodiment of the present disclosure, a speech separation algorithm may be used to separate the wake-up word signal from the interference signal, for example, the speech separation algorithm may include dereverberation processing, beamforming/blind source separation processing, noise suppression processing, and the like, which is not specifically limited by the present disclosure.
The original mixed audio signal is subjected to voice separation to obtain a multi-channel separated audio signal, the multi-channel separated audio signal comprises a plurality of single-channel separated audio signals, and the awakening word signal and the interference signal are respectively in different single-channel separated audio signals.
The number of channels of the multi-channel separated audio signal is likewise determined by the number of microphone units in the microphone array. For example, if the number of microphone units is N and the collected original mixed audio signal corresponds to N channels, then performing voice separation on the N-channel original mixed audio signal yields a multi-channel separated audio signal that also corresponds to N channels, i.e., one that includes N single-channel separated audio signals. In other words, the number of channels is the same before and after voice separation, and equals the number N of microphone units in the microphone array.
Through the voice separation processing, the awakening word signal and the interference signal can be separated, so that the awakening word signal and the interference signal are respectively positioned in different channels of the N channels corresponding to the multi-channel separation audio signal. For example, the wake-up word signal is in one of the N channels corresponding to the multi-channel split audio signal and the interference signal is in another of the N channels corresponding to the multi-channel split audio signal.
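As a non-limiting sketch, the separation of step 201 can be prototyped as follows. FastICA is an assumed stand-in for the unspecified voice separation algorithm: it handles only instantaneous mixtures, whereas a deployed system would also need the dereverberation, beamforming/blind source separation, and noise suppression mentioned above.

```python
# Illustrative sketch of step 201: blind source separation of an N-channel
# mixture. FastICA is an assumed stand-in, not the algorithm specified by
# this disclosure.
import numpy as np
from sklearn.decomposition import FastICA

def separate_channels(x_mix: np.ndarray) -> np.ndarray:
    """x_mix: (num_samples, N) time-domain signal from an N-microphone array.
    Returns a (num_samples, N) multi-channel separated audio signal,
    one estimated source per channel."""
    n_channels = x_mix.shape[1]
    ica = FastICA(n_components=n_channels, max_iter=500)
    return ica.fit_transform(x_mix)
```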
Step 202, determining a time period of the wakeup word from the multi-channel separated audio signal.
Specifically, each single-channel separated audio signal in the multi-channel separated audio signals may be respectively input to the wakeup word decoder for wakeup word recognition, so as to obtain a time period in which the wakeup word is located.
For example, a single-channel separated audio signal corresponding to one of the N channels corresponding to the multi-channel separated audio signal may be input to a first wakeup word decoder for wakeup word recognition, and a single-channel separated audio signal corresponding to another one of the N channels corresponding to the multi-channel separated audio signal may be input to a second wakeup word decoder for wakeup word recognition, etc. That is, the multi-channel separated audio signal may have a one-to-one correspondence with the plurality of wakeup word decoders. When any awakening word decoder in the plurality of awakening word decoders recognizes the awakening word, the continuous recognition of the awakening word by other awakening word decoders can be stopped.
Specifically, the wakeup word decoder may recognize the wakeup word through the neural network model to obtain a time period of the wakeup word, where the time period may include a start time point and an end time point of the wakeup word, and the present disclosure does not specifically limit an expression form of the time period.
Specifically, each of the single-channel separated audio signals may be respectively input to a neural network model for wake-up word recognition, and when a wake-up word is recognized by the neural network model, a time period of the wake-up word signal may be output. It should be understood that the present disclosure does not specifically limit the training process, model type, model structure of the neural network model for wake word recognition.
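A minimal sketch of step 202 under a hypothetical decoder interface: each separated channel is fed to its own wake-up word decoder, and recognition stops at the first channel that reports a hit. `WakeWordDecoder.detect` is an assumed interface returning `(t_start, t_end)` in seconds or `None`; the sequential loop approximates the parallel decoders described above.

```python
from typing import Optional, Tuple

def find_wake_word(y_sep, decoders) -> Optional[Tuple[int, float, float]]:
    """y_sep: (num_samples, N) multi-channel separated signal.
    decoders: list of N hypothetical WakeWordDecoder objects, one per channel.
    Returns (channel_index, t_start, t_end), or None if no wake word found."""
    for ch, decoder in enumerate(decoders):
        hit = decoder.detect(y_sep[:, ch])
        if hit is not None:
            t_start, t_end = hit
            return ch, t_start, t_end  # remaining decoders need not run
    return None
```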
Step 203, determining a mixed multi-channel audio signal corresponding to the time period of the wakeup word from the original mixed audio signal.
Specifically, according to the time period output by the wakeup word decoder, a portion corresponding to the time period in which the wakeup word is located, that is, the mixed multi-channel audio signal, may be extracted from the original mixed audio signal. Because the wake-up word signal and the interference signal in the original mixed audio signal are mixed, the wake-up word signal and the interference signal in the mixed multi-channel audio signal corresponding to the time period of the wake-up word in the original mixed audio signal are also mixed.
Step 204, determining a single-channel audio signal where the awakening word is located from the multi-channel separated audio signal.
As described above, by performing voice separation on the original mixed audio signal, a multi-channel separated audio signal can be obtained, where the multi-channel separated audio signal includes a plurality of single-channel separated audio signals, and the wake-up word signal and the interference signal are respectively in different single-channel separated audio signals in the multi-channel separated audio signal. Therefore, the single-channel audio signal where the awakening word is located can be extracted from the multi-channel separated audio signal, wherein the single-channel audio signal where the awakening word is located is a signal which only includes or mainly includes the awakening word, so that the sound source of the awakening word can be located according to the single-channel audio signal where the awakening word is located in the following process.
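A minimal sketch of the two extraction operations (steps 203 and 204), assuming the timestamp is given in seconds and the signals are sample-indexed arrays; the 16 kHz rate matches the example used later in this disclosure.

```python
def extract_wake_word_segments(x_mix, y_sep, ch_wkp, t_start, t_end, fs=16000):
    """Cut the wake-word time period out of the original mixture (X_t) and
    out of the separated channel that contains the wake word (y_wkp)."""
    i0, i1 = int(t_start * fs), int(t_end * fs)
    x_t = x_mix[i0:i1, :]         # mixed multi-channel audio signal X_t
    y_wkp = y_sep[i0:i1, ch_wkp]  # single-channel audio signal y_wkp
    return x_t, y_wkp
```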
Step 205, based on the mixed multi-channel audio signal and the single-channel audio signal where the wake-up word is located, a sound source is located.
For example, the mixed multi-channel audio signal and the single-channel audio signal in which the wake-up word is located may be input directly into a trained neural network model to obtain the direction of the sound source output by the model; alternatively, they may first be preprocessed, for example by time-frequency transformation and frequency-domain feature extraction, and the preprocessed data then input into the trained neural network model to obtain the direction of the sound source. It should be understood that the present disclosure does not specifically limit the manner in which the sound source is localized based on the mixed multi-channel audio signal and the single-channel audio signal in which the wake-up word is located.
According to the sound source localization method provided by the embodiments of the present disclosure, determining the time period in which the wake-up word is located and performing sound source localization with the mixed multi-channel audio signal corresponding to that time period eliminates interference from audio signals in other time periods, while also improving the efficiency of subsequent processing and reducing the amount of processing. In addition, determining the single-channel audio signal in which the wake-up word is located from the voice-separated multi-channel audio signal yields a single-channel signal that contains only, or mainly, the wake-up word, so that the wake-up word can be localized in a targeted manner, improving the accuracy of sound source localization.
Step 204 is described in detail below in conjunction with fig. 3 and 4. Fig. 3 is a schematic structural diagram of a sound source localization apparatus according to an exemplary embodiment of the present disclosure. The sound source localization apparatus includes a voice separation module 310, a wake-up word decoder 320, a data extraction module 330, a data extraction module 340, and a sound source localization module 350. The voice separation module 310 performs voice separation on the original mixed audio signal X to obtain a multi-channel separated audio signal Y. The multi-channel separated audio signal Y is input into the wake-up word decoder 320 for wake-up word recognition, which outputs the audio signal channel ch_wkp in which the wake-up word is located and the timestamp t_stamp of the wake-up word. The data extraction module 330 extracts, according to the timestamp t_stamp, the mixed multi-channel audio signal X_t corresponding to the time period in which the wake-up word is located from the original mixed audio signal X. The data extraction module 340 extracts, according to the timestamp t_stamp and the channel ch_wkp, the single-channel audio signal y_wkp in which the wake-up word is located from the multi-channel separated audio signal Y. The mixed multi-channel audio signal X_t and the single-channel audio signal y_wkp are input into the sound source localization module 350, which outputs the orientation of the sound source relative to the microphone array.
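Tying the modules of fig. 3 together, an end-to-end sketch might look as follows, reusing the hypothetical helpers sketched above, with `localize` standing in for the sound source localization module 350.

```python
def locate_wake_word_source(x_mix, decoders, localize, fs=16000):
    y_sep = separate_channels(x_mix)              # voice separation module 310
    hit = find_wake_word(y_sep, decoders)         # wake-up word decoder 320
    if hit is None:
        return None                               # no wake word detected
    ch_wkp, t_start, t_end = hit
    x_t, y_wkp = extract_wake_word_segments(      # data extraction 330 / 340
        x_mix, y_sep, ch_wkp, t_start, t_end, fs)
    return localize(x_t, y_wkp)                   # localization module 350
```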
Fig. 4 is a schematic flowchart illustrating a single-channel audio signal for determining a wakeup word from multiple channels of separated audio signals in a sound source localization method according to an exemplary embodiment of the present disclosure.
As shown in fig. 4, step 204 may include steps 2041 to 2043.
Step 2041, identify a wake-up word from the multi-channel separated audio signal.
Specifically, as shown in fig. 3, the wake-up word decoder 320 may be used to sequentially identify the wake-up words in the multi-channel separated audio signal Y, for example, each single-channel separated audio signal in the multi-channel separated audio signal Y may be input to the wake-up word decoder 320 for wake-up word identification. Taking a microphone array with the number of microphone units being 2 as an example, an original 2-channel mixed audio signal X is processed by a voice separation algorithm to obtain a 2-channel separated audio signal Y, a first single-channel separated audio signal in the 2-channel separated audio signal Y is input into a first awakening word decoder to perform awakening word recognition, a second single-channel separated audio signal is input into a second awakening word decoder to perform awakening word recognition, that is, the 2-channel separated audio signal and the 2 awakening word decoders are in one-to-one correspondence. When any one of the 2 awakening word decoders recognizes the awakening word, the other awakening word decoder can stop continuously recognizing the awakening word. Similarly, when the number of the microphone units is N, the original N-channel mixed audio signal can obtain an N-channel separated audio signal through a voice separation algorithm, the N-channel separated audio signal is respectively input into the N wakeup word decoders to perform wakeup word recognition, and when one of the wakeup word decoders recognizes a wakeup word, the other wake word decoders can stop continuously recognizing the wakeup word.
Step 2042, determine the time period of the wake-up word and the audio signal channel of the wake-up word.
As shown in fig. 3, the wake-up word decoder 320 can recognize the wake-up word through a neural network model to obtain the audio signal channel ch_wkp in which the wake-up word is located and the timestamp t_stamp of the wake-up word. The timestamp t_stamp can include the start time point and end time point of the wake-up word, so that the time period in which the wake-up word is located can be determined from the timestamp t_stamp; the audio signal channel ch_wkp is the channel of the multi-channel separated audio signal Y that corresponds to the wake-up word.

Taking a microphone array with 2 microphone units as an example, if the first single-channel separated audio signal contains the wake-up word signal and the second contains the interference signal, the wake-up word decoder recognizes the wake-up word in the first single-channel separated audio signal. Therefore, the audio signal channel ch_wkp output by the wake-up word decoder 320 is channel 1.
Step 2043, determining a single-channel audio signal from the audio signal channel where the wakeup word is located according to the time period where the wakeup word signal is located.
Specifically, as shown in fig. 3, the data extraction module 340 may be adopted to extract, from the first single-channel separated audio signal in the multi-channel separated audio signal Y, the part corresponding to the wake-up word timestamp t_stamp, i.e., the single-channel audio signal y_wkp.
The multichannel separated audio signals Y are respectively input into the plurality of awakening word decoders to obtain the audio signal channels where the awakening words are located and the timestamps corresponding to the awakening words, so that single-channel audio signals corresponding to the timestamps can be accurately extracted.
In an embodiment of the disclosure, the wake-up word decoder 320 may be a trained wake-up word recognition neural network model, through which the audio signal channel ch_wkp in which the wake-up word is located and the timestamp t_stamp of the wake-up word are obtained from the multi-channel separated audio signal Y. It should be understood that the present disclosure does not specifically limit the training process, model type, or model structure of the wake-up word recognition neural network model.

Performing wake-up word recognition on the multi-channel separated audio signal Y through the wake-up word recognition neural network model, and thereby obtaining the audio signal channel in which the wake-up word is located and the corresponding timestamp, can further improve the accuracy of extracting the single-channel audio signal in which the wake-up word is located.
In an embodiment of the present disclosure, the step 205 includes: extracting a first frequency domain characteristic from the mixed multi-channel audio signal, wherein the first frequency domain characteristic is used for representing a relative transfer function between frequency domain signals corresponding to different microphone units in a microphone array; extracting a second frequency domain feature from the single-channel audio signal where the awakening word is located, wherein the second frequency domain feature is used for representing the frequency domain energy value of the single-channel audio signal; based on the first frequency-domain feature and the second frequency-domain feature, the sound source is localized.
Specifically, the mixed multi-channel audio signal and the single-channel audio signal in which the wake-up word is located can be converted from time-domain signals to frequency-domain signals through time-frequency transformation, so that the subsequent neural network model can operate on a middle sub-band of these signals. For example, when the time-domain signal is sampled at 16 kHz, the effective band of the resulting frequency-domain signal is 0-8 kHz. Because low frequencies are not robust, and high frequencies do not satisfy the Nyquist spatial sampling theorem when the microphone spacing is large, in an embodiment of the present disclosure the subsequent neural network model may perform sound source localization based on the 100 Hz-4 kHz band of the frequency-domain signals of the mixed multi-channel audio signal and the single-channel audio signal in which the wake-up word is located, thereby effectively reducing the amount of computation.
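A sketch of that band selection for one channel, using the stated 16 kHz sampling rate; the frame length `nperseg=512` is an assumption, not a value given by this disclosure.

```python
# STFT followed by retention of the robust 100 Hz - 4 kHz mid band.
import numpy as np
from scipy.signal import stft

def band_limited_spectrum(x, fs=16000, nperseg=512, f_lo=100.0, f_hi=4000.0):
    f, _, X = stft(x, fs=fs, nperseg=nperseg)  # X: (freq_bins, frames)
    keep = (f >= f_lo) & (f <= f_hi)
    return f[keep], X[keep, :]
```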
In addition, because the relative transfer functions between the frequency-domain signals corresponding to different microphone units reflect the spatial information of the sound source, and the frequency-domain energy values of the single-channel audio signal in which the wake-up word is located represent the content of the wake-up word more accurately, this frequency-domain feature extraction allows the sound source to be localized more accurately while reducing the computation of the subsequent neural network model.
In one embodiment of the present disclosure, where the microphone array is a linear microphone array, each of the plurality of predetermined azimuth classes corresponds to an angular range of the sound source relative to the linear microphone array.
For example, for the full 180° space, assuming a spatial resolution of 5° (i.e., one preset azimuth class for every 5°), there are 36 preset azimuth classes. The neural network model can output the probabilities of these 36 classes; the angle range corresponding to the class with the maximum probability is the localization result, and the intelligent interactive device can directionally pick up audio signals within that angle range and respond accordingly.
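For illustration, mapping the model's class probabilities back to an angle range could look like the following sketch, assuming classes indexed from 0 at 5° resolution.

```python
import numpy as np

def class_to_angle_range(probs, resolution_deg=5.0):
    """probs: length-36 vector of preset azimuth class probabilities.
    Returns the angle range of the most probable class, e.g. (45.0, 50.0)."""
    c = int(np.argmax(probs))
    return c * resolution_deg, (c + 1) * resolution_deg
```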
According to the technical scheme provided by the embodiment of the disclosure, the space where the sound source is located is divided into a plurality of preset azimuth categories of the relative linear microphone array, and the probability of the preset azimuth categories is output by utilizing the neural network model based on the single-channel audio signal where the mixed multi-channel audio signal and the awakening word are located, so that the complex sound source positioning problem is converted into the classification problem, and the complexity of the neural network model is reduced.
In another embodiment of the present disclosure, where the microphone array is a distributed microphone array, each of the plurality of preset orientation classes corresponds to a location area in space in which the distributed microphone array is located.
For example, a distributed microphone array applied in a vehicle includes 4 microphone units, respectively disposed at the upper sides of the 4 doors. The vehicle interior may be divided into 4 location areas (e.g., a driver location area, a front-passenger location area, a second-row left location area, and a second-row right location area), the 4 location areas being in one-to-one correspondence with the 4 preset orientation classes.
The trained neural network model can output the probability that the sound source is respectively located in the 4 position areas (namely, the probability of the sound source belonging to the 4 preset position categories) based on the first frequency domain feature and the second frequency domain feature, wherein the position area corresponding to the preset position category with the highest probability is the position of the sound source. For example, the result output by the neural network model is (0.6, 0.2, 0.2, 0), wherein the probability that the sound source is located in the main driving position area is 0.6, and the probability is the maximum, so that the main driving position area is determined to be the position of the sound source.
According to the technical scheme provided by the embodiment of the disclosure, the distributed space of the distributed microphone array is divided into a plurality of preset position areas, and the probability of the plurality of preset position areas is output by utilizing the neural network model based on the single-channel audio signal where the mixed multi-channel audio signal and the awakening word are located, so that the complex sound source positioning problem is converted into the classification problem, and the complexity of the neural network model is reduced.
When training the neural network model for sound source localization, for wake-up word signals uttered by sound sources located in each preset direction, a microphone array is used to collect the mixed multi-channel audio signals; wake-up word recognition is performed on them in the same way as described above to obtain the single-channel audio signal in which the corresponding wake-up word is located and the mixed multi-channel audio signal corresponding to the time period of the wake-up word; these are input into the neural network model as sample data, and the preset direction is used as the label to train the neural network model.
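A sketch of that training-data assembly under the same assumptions as the helpers above; `extract_features` is a hypothetical stand-in for the feature pipeline of fig. 6, and each recording is assumed to carry its ground-truth azimuth class.

```python
def build_training_set(recordings, decoders, extract_features, fs=16000):
    """recordings: iterable of (x_mix, azimuth_class) pairs, where
    azimuth_class is the preset-direction label for the wake-word source."""
    samples = []
    for x_mix, azimuth_class in recordings:
        y_sep = separate_channels(x_mix)
        hit = find_wake_word(y_sep, decoders)
        if hit is None:
            continue  # skip recordings where the wake word was not found
        ch, t0, t1 = hit
        x_t, y_wkp = extract_wake_word_segments(x_mix, y_sep, ch, t0, t1, fs)
        samples.append((extract_features(x_t, y_wkp), azimuth_class))
    return samples
```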
Step 205 is described in detail below in conjunction with fig. 5 and 6. Fig. 5 is a schematic flowchart illustrating a sound source localization method according to another exemplary embodiment of the present disclosure. Fig. 6 is a schematic structural diagram of a sound source localization apparatus according to an exemplary embodiment of the present disclosure. As shown in fig. 6, the sound source localization apparatus includes a time-frequency transform module 610, a time-frequency transform module 620, a feature extraction module 630, a normalization module 640, and a neural network model 650.
As shown in fig. 5, in the sound source positioning method provided by the embodiment of the present disclosure, the step 205 may include steps 2051 to 2053.
Step 2051, extracting a first frequency domain feature from the mixed multi-channel audio signal, where the first frequency domain feature is used to characterize a relative transfer function between frequency domain signals corresponding to different microphone units in the microphone array.
Specifically, the time-frequency transform module 610 may be employed to perform time-frequency transformation on the mixed multi-channel audio signal X_t to obtain a first frequency-domain signal at a plurality of frequency points, and the feature extraction module 630 may be used to perform feature extraction on the first frequency-domain signal at each frequency point, obtaining the real part and imaginary part of the relative transfer function values between the frequency-domain signals corresponding to different microphone units of the microphone array; these are used as the first frequency-domain feature.

Because the information in the mixed multi-channel audio signal X_t is presented as a time series, it is a time-domain signal. To obtain the frequency, amplitude, and phase of the signal, the mixed multi-channel audio signal X_t must be converted into the first frequency-domain signal by time-frequency transformation. Specifically, the time-frequency transform module 610 may convert the mixed multi-channel audio signal X_t from a time-domain signal to a frequency-domain signal through a Fourier transform, such as the short-time Fourier transform (STFT); it should be understood that other time-frequency transforms may also be employed, and the disclosure is not limited in this regard.

As shown in fig. 6, the mixed multi-channel audio signal X_t comprises time-domain audio signals of M channels (M being the number of microphone units in the microphone array), namely x_1(t), x_2(t), x_3(t), ..., x_M(t). Performing the STFT on the M time-domain audio signals yields the first frequency-domain signal at a plurality of frequency points, that is, X(k) = [X_1(k), X_2(k), ..., X_M(k)], where k is the frequency point index, k = 1, 2, 3, ..., K, and K is the maximum frequency index, i.e., there are K frequency points.

Further, the feature extraction module 630 is used to perform feature extraction on the first frequency-domain signal at each of the K frequency points, yielding the first frequency-domain feature.

In an embodiment of the present disclosure, the real part and imaginary part of the relative transfer function values between the frequency-domain signals corresponding to different microphone units of the microphone array may be used as features. The relative transfer function between the frequency-domain signals corresponding to the m-th and n-th microphone units can be expressed as

$$\mathrm{RTF}_{mn}(k) = \frac{X_m(k)\,X_n^*(k)}{X_n(k)\,X_n^*(k)}$$

where X_m(k) is the frequency-domain signal of the k-th frequency point corresponding to the m-th microphone unit, X_n(k) is the frequency-domain signal of the k-th frequency point corresponding to the n-th microphone unit, k is the frequency point index, k = 1, 2, 3, ..., K, K is the maximum frequency index, and (·)* denotes the complex conjugate.

For M microphone units, each frequency point corresponds to M(M-1)/2 relative transfer function values, and the K frequency points correspond to K·M(M-1)/2 relative transfer function values. Because RTF_mn(k) is complex, the real part and imaginary part of each relative transfer function value are input separately; this yields a first frequency-domain feature X_in of dimension K·M(M-1) × 1.
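A sketch of this feature computation, following the reconstructed estimator above (which reduces to X_m(k)/X_n(k)); the epsilon term is added here only to guard against division by zero.

```python
import numpy as np

def rtf_features(X, eps=1e-8):
    """X: (M, K) complex STFT of the mixed signal, M mics, K frequency bins.
    Returns the real/imag parts of the M*(M-1)/2 pairwise relative transfer
    function values per bin, flattened to K*M*(M-1) real numbers."""
    M, _ = X.shape
    feats = []
    for m in range(M):
        for n in range(m + 1, M):
            rtf = X[m] * np.conj(X[n]) / (np.abs(X[n]) ** 2 + eps)
            feats.append(rtf.real)
            feats.append(rtf.imag)
    return np.concatenate(feats)  # first frequency-domain feature X_in
```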
Step 2052, extracting a second frequency-domain feature from the single-channel audio signal in which the wake-up word is located, where the second frequency-domain feature is used to characterize the frequency-domain energy values of the single-channel audio signal.
Specifically, the time-frequency conversion module 620 may be used to perform time-frequency conversion processing on the single-channel audio signal where the wakeup word is located, so as to obtain a second frequency domain signal of the multiple frequency points, and the normalization module 640 is used to perform normalization processing on the second frequency domain signal for each frequency point of the multiple frequency points, so as to obtain a second frequency domain characteristic.
The single-channel audio signal y_wkp in which the wake-up word is located, extracted from the multi-channel separated audio signal Y, is a time-domain signal. To obtain the frequency, amplitude, and phase of the single-channel audio signal y_wkp, it must be converted into a second frequency-domain signal Y_wkp(k) by time-frequency transformation, where k denotes the frequency point index. Specifically, the time-frequency transform module 620 may convert the single-channel audio signal y_wkp from a time-domain signal to a frequency-domain signal through a Fourier transform, such as the short-time Fourier transform (STFT); it should be understood that other time-frequency transforms may also be employed, and the disclosure is not limited in this regard.
Specifically, the normalization processing on the second frequency domain signal by using the normalization module 640 may include: the second frequency domain signal is subjected to amplitude normalization processing or energy normalization processing, etc., which is not specifically limited by the present disclosure.
Specifically, the normalization formula can be as follows:

$$w(k) = \frac{\left|Y_{wkp}(k)\right|}{\left(\sum_{k'=1}^{K}\left|Y_{wkp}(k')\right|^{p}\right)^{1/p}}$$

where k is the frequency point index; K is the maximum frequency index; different values of p represent different normalization modes, with p = 1 denoting normalization by amplitude and p = 2 denoting normalization by energy; and Y_wkp(k) is the complex frequency-domain signal of the k-th frequency point.
By carrying out normalization processing on the second frequency domain signals of different frequency points, the influence caused by different magnitude between the frequency domain signals of different frequency points can be eliminated, and thus the convergence rate of the subsequent neural network training process is increased.
It should be understood that the above formula is merely an exemplary description, and the present disclosure does not specifically limit the normalization formula.
Since w (K) is a real number, the second frequency domain characteristics y corresponding to the K frequency pointsin(k) Has a dimension of K × 1.
Step 2053, a sound source is localized based on the first frequency domain feature and the second frequency domain feature.
Specifically, the first frequency-domain feature X_in and the second frequency-domain feature y_in described above may be input into the trained neural network model 650 to obtain the azimuth of the sound source output by the neural network model 650. The neural network model may be a classification model such as a DNN, CNN, or RNN; the structure and type of the neural network model 650 are not specifically limited by this disclosure.
As described above, X_in has dimension K·M(M-1) × 1 and y_in has dimension K × 1, so the input feature dimension of the neural network model 650 is K·(M(M-1)+1) × 1.
The output of the neural network model 650 may be probabilities of a plurality of preset bearing classes of the sound source; and the azimuth corresponding to the preset azimuth class with the highest probability in the probabilities of the plurality of preset azimuth classes is the azimuth of the sound source.
It should be understood that the present disclosure does not specifically limit the structure of the neural network model 650. For example, in an embodiment of the present disclosure, the neural network model 650 includes an input layer, a hidden layer 1, a hidden layer 2, and an output layer, and the neural network parameters may be as shown in Table 1. The input layer takes the first frequency-domain feature X_in and the second frequency-domain feature y_in, of dimension K·(M(M-1)+1) × 1, and the output layer outputs the probabilities of the 36 classes (0 to 180 degrees, at 5-degree intervals).
TABLE 1: layer parameters of the neural network model 650 (original table rendered as an image; parameter values not recoverable)
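Since the parameter values of Table 1 are not recoverable, the following PyTorch sketch is only a guessed stand-in consistent with the described topology (input layer, two hidden layers, 36-class output); the hidden width of 512 is an assumption.

```python
import torch.nn as nn

def build_localization_model(input_dim: int, num_classes: int = 36) -> nn.Module:
    # Outputs class probabilities, matching the description above. For
    # training, one would typically drop the Softmax and use
    # nn.CrossEntropyLoss on the raw logits instead.
    return nn.Sequential(
        nn.Linear(input_dim, 512), nn.ReLU(),
        nn.Linear(512, 512), nn.ReLU(),
        nn.Linear(512, num_classes), nn.Softmax(dim=-1),
    )
```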
Exemplary devices
The disclosed apparatus embodiments may be used to perform the disclosed method embodiments. For details not disclosed in the embodiments of the apparatus of the present disclosure, refer to the embodiments of the method of the present disclosure.
Fig. 7 is a schematic structural diagram of a sound source positioning device according to an embodiment of the present disclosure. The sound source positioning device has the function of implementing the embodiment shown in fig. 2, and the function can be implemented by hardware, and can also be implemented by hardware executing corresponding software. As shown in fig. 7, the sound source localization apparatus 700 may include: a voice separation module 710, a first determination module 720, a second determination module 730, a third determination module 740, and a location module 750.
And the voice separation module 710 is configured to perform voice separation on the original mixed audio signal of the sound source collected by the microphone array to obtain a multi-channel separated audio signal.
A first determining module 720, configured to determine a time period in which the wake-up word is located from the multi-channel separated audio signal.
The second determining module 730 is configured to determine, from the original mixed audio signal, a mixed multi-channel audio signal corresponding to a time period in which the wakeup word is located.
The third determining module 740 is configured to determine a single-channel audio signal in which the wake-up word is located from the multi-channel separated audio signal.
And a positioning module 750 configured to position a sound source based on the mixed multi-channel audio signal and the single-channel audio signal where the wake-up word is located.
According to the sound source localization apparatus provided by the embodiments of the present disclosure, determining the time period in which the wake-up word is located and performing sound source localization with the mixed multi-channel audio signal corresponding to that time period eliminates interference from audio signals in other time periods, while also improving the efficiency of subsequent processing and reducing the amount of processing. In addition, determining the single-channel audio signal in which the wake-up word is located from the voice-separated multi-channel audio signal yields a single-channel signal that contains only, or mainly, the wake-up word, so that the wake-up word can be localized in a targeted manner, improving the accuracy of sound source localization.
Fig. 8 is a block diagram illustrating a third determining module of a sound source localization apparatus according to an exemplary embodiment of the disclosure. The embodiment shown in fig. 8 of the present disclosure is extended on the basis of the embodiment shown in fig. 7 of the present disclosure, and the differences between the embodiment shown in fig. 8 and the embodiment shown in fig. 7 are emphasized below, and the descriptions of the same parts are omitted.
As shown in fig. 8, in the sound source localization apparatus provided in the embodiment of the present disclosure, the third determining module 740 may include: a recognition unit 7310, a first determination unit 7320, and a second determination unit 7330.
A recognition unit 7310 for recognizing the wake-up word from the multi-channel separated audio signal.
The first determining unit 7320 is configured to determine a time period in which the wake-up word is located and an audio signal channel in which the wake-up word is located.
The second determining unit 7330 is configured to determine a single-channel audio signal from the audio signal channel where the wakeup word is located according to the time period where the wakeup word signal is located.
In another embodiment of the present disclosure, the identifying unit 7310 is configured to identify a wake-up word from the multi-channel separated audio signal by using a neural network model.
Fig. 9 is a block diagram illustrating a positioning module of a sound source positioning device according to an exemplary embodiment of the present disclosure. The embodiment shown in fig. 9 of the present disclosure is extended on the basis of the embodiment shown in fig. 7 of the present disclosure, and the differences between the embodiment shown in fig. 9 and the embodiment shown in fig. 7 are emphasized below, and the descriptions of the same parts are omitted.
As shown in fig. 9, in the sound source positioning device provided in the embodiment of the present disclosure, the positioning module 750 may include: a first extraction unit 7510, a second extraction unit 7520, and a positioning unit 7530.
A first extraction unit 7510 configured to extract a first frequency-domain feature from the mixed multi-channel audio signal, where the first frequency-domain feature is used to characterize a relative transfer function between frequency-domain signals corresponding to different microphone elements in the microphone array.
The second extracting unit 7520 is configured to extract a second frequency domain feature from the single-channel audio signal where the wakeup word is located, where the second frequency domain feature is used to represent a frequency domain energy value of the single-channel audio signal.
A localization unit 7530 for localizing the sound source based on the first frequency domain feature and the second frequency domain feature.
Fig. 10 is a block diagram illustrating a positioning unit of a sound source positioning device according to an exemplary embodiment of the present disclosure. The embodiment shown in fig. 10 of the present disclosure is extended on the basis of the embodiment shown in fig. 9 of the present disclosure, and the differences between the embodiment shown in fig. 10 and the embodiment shown in fig. 9 are emphasized below, and the descriptions of the same parts are omitted.
As shown in fig. 10, in the sound source localization apparatus provided in the embodiment of the present disclosure, the localization unit 7530 may include: an acquisition subunit 7531 and a positioning subunit 7532.
An obtaining subunit 7531, configured to obtain probabilities of a plurality of preset bearing classes of the sound source based on the first frequency domain feature and the second frequency domain feature.
A positioning subunit 7532, configured to position the sound source according to a preset bearing category corresponding to a maximum probability in the probabilities of the multiple preset bearing categories.
In one embodiment of the present disclosure, where the microphone array is a linear microphone array, each of the plurality of preset azimuth categories corresponds to an angle range of the sound source relative to the linear microphone array.
In one embodiment of the present disclosure, where the microphone array is a distributed microphone array, each of the plurality of preset azimuth categories corresponds to a position area in the space in which the distributed microphone array is located.
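As an illustration of the two subunits, the sketch below assumes a linear array whose preset azimuth categories are twelve 15-degree angle ranges covering 0 to 180 degrees, with a random probability vector standing in for the classifier output; both the bin layout and the probabilities are assumptions made only for this example.

```python
import numpy as np

# Hypothetical preset azimuth categories: twelve 15-degree angle ranges
# over 0-180 degrees for a linear microphone array.
bin_edges = np.arange(0.0, 181.0, 15.0)        # 13 edges -> 12 categories

def position_sound_source(azimuth_probs):
    """Pick the preset azimuth category with the maximum probability and
    return its index together with its angle range in degrees."""
    k = int(np.argmax(azimuth_probs))
    return k, (bin_edges[k], bin_edges[k + 1])

probs = np.random.dirichlet(np.ones(12))        # stand-in for model output
category, (lo, hi) = position_sound_source(probs)
print(f"sound source in category {category}: {lo:.0f}-{hi:.0f} degrees")
```

For a distributed array, the same argmax step applies; only the meaning of each category changes from an angle range to a position area in space.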
In an embodiment of the present disclosure, the first extraction unit 7510 is configured to perform time-frequency transform processing on the mixed multi-channel audio signal to obtain first frequency domain signals at a plurality of frequency points, and to perform, for each of the plurality of frequency points, feature extraction on the first frequency domain signal to obtain the real part and the imaginary part of the relative transfer function value between the frequency domain signals corresponding to different microphone units of the microphone array, the real part and the imaginary part serving as the first frequency domain feature. The second extraction unit 7520 is configured to perform time-frequency transform processing on the single-channel audio signal in which the wake-up word is located to obtain second frequency domain signals at the plurality of frequency points, and to perform, for each of the plurality of frequency points, normalization processing on the second frequency domain signal to obtain the second frequency domain feature.
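The sketch below shows one plausible realization of the two extraction units, under assumptions the text does not fix: scipy's STFT as the time-frequency transform, the first microphone as the reference for the relative transfer function, per-frame transfer-function estimates averaged over the wake-up-word frames, and sum-to-one normalization of the per-frequency-point energy.

```python
import numpy as np
from scipy.signal import stft

def first_frequency_domain_feature(mixed, fs=16000, nperseg=512):
    """mixed: (n_mics, n_samples) mixed multi-channel segment. Returns the
    real and imaginary parts of the relative transfer function of every
    non-reference microphone at each frequency point."""
    _, _, spec = stft(mixed, fs=fs, nperseg=nperseg)   # (n_mics, n_bins, n_frames)
    ref = spec[0]                                      # reference mic (assumed: mic 0)
    rtf = (spec[1:] * np.conj(ref)) / (np.abs(ref) ** 2 + 1e-8)
    rtf = rtf.mean(axis=-1)                            # average over frames (assumed)
    return np.concatenate([rtf.real.ravel(), rtf.imag.ravel()])

def second_frequency_domain_feature(single, fs=16000, nperseg=512):
    """single: (n_samples,) single-channel wake-up-word segment. Returns
    the energy at each frequency point, normalized to sum to one."""
    _, _, spec = stft(single, fs=fs, nperseg=nperseg)  # (n_bins, n_frames)
    energy = (np.abs(spec) ** 2).mean(axis=-1)         # energy per frequency point
    return energy / (energy.sum() + 1e-8)              # normalization (assumed form)
```

The two feature vectors would then feed the classifier that produces the preset azimuth category probabilities; how they are combined at its input is not prescribed here.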
It should be noted that the first determining module, the second determining module, and the third determining module 730 may in practice be the same software or hardware module or different software or hardware modules; the same holds for the first determining unit 7320 and the second determining unit 7330, and for the first extraction unit 7510 and the second extraction unit 7520. The embodiments of the present disclosure are not limited in this respect.
Exemplary electronic device
Next, an electronic apparatus according to an embodiment of the present disclosure is described with reference to fig. 11. Fig. 11 is a block diagram illustrating an electronic device according to an exemplary embodiment of the disclosure. As shown in fig. 11, electronic device 1100 includes one or more processors 1110 and memory 1120.
The processor 1110 may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 1100 to perform desired functions.
The memory 1120 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM), cache memory (cache), and/or the like. The non-volatile memory may include, for example, Read Only Memory (ROM), hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer-readable storage medium and executed by processor 1110 to implement the sound source localization methods of the various embodiments of the present disclosure described above and/or other desired functions. Various contents such as an input signal, a signal component, a noise component, etc. may also be stored in the computer-readable storage medium.
In one example, the electronic device 1100 may further include: an input device 1130 and an output device 1140, which are interconnected by a bus system and/or other form of connection mechanism (not shown).
For example, when the electronic device is a smart home device, the input device 1130 may be a microphone or a microphone array for collecting the voice of the user. When the electronic device is a stand-alone device, the input device 1130 may be a communication network connector for receiving the collected input signal from an external mobile device. The input device 1130 may also include, for example, a keyboard, a mouse, and the like.
The output device 1140 may output various kinds of information to the outside, including the determined position information, direction information, and the like. The output device 1140 may include, for example, a display, speakers, a printer, a communication network, and remote output devices connected thereto.
Of course, for simplicity, only some of the components of the electronic device 1100 relevant to the present disclosure are shown in fig. 11, omitting components such as buses, input/output interfaces, and the like. In addition, electronic device 1100 may include any other suitable components depending on the particular application.
Exemplary computer program product and computer-readable storage Medium
In addition to the above-described methods and apparatus, embodiments of the present disclosure may also be a computer program product comprising computer program instructions which, when executed by a processor, cause the processor to perform the steps in the sound source localization method according to various embodiments of the present disclosure described in the above-mentioned "exemplary methods" section of this specification.
Program code for carrying out operations of embodiments of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's computing device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present disclosure may also be a computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, cause the processor to perform the steps in the sound source localization method according to various embodiments of the present disclosure described in the "exemplary methods" section above in this specification.
The computer-readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing describes the general principles of the present disclosure in conjunction with specific embodiments. It should be noted, however, that the advantages and effects mentioned in the present disclosure are merely examples, not limitations, and should not be considered essential to the various embodiments of the present disclosure. Furthermore, the specific details disclosed above are provided for illustration and ease of understanding only; the disclosure is not limited to those details.
The block diagrams of devices, apparatuses, and systems referred to in this disclosure are given only as illustrative examples and are not intended to require or imply that the connections, arrangements, and configurations must be made in the manner shown in the block diagrams. These devices, apparatuses, and systems may be connected, arranged, and configured in any manner, as will be appreciated by those skilled in the art. Words such as "including," "comprising," "having," and the like are open-ended words that mean "including, but not limited to," and are used interchangeably therewith. The word "or" as used herein means, and is used interchangeably with, the word "and/or," unless the context clearly dictates otherwise. The phrase "such as" is used herein to mean, and is used interchangeably with, the phrase "such as but not limited to."
It is also noted that in the devices, apparatuses, and methods of the present disclosure, each component or step can be decomposed and/or recombined. These decompositions and/or recombinations are to be considered equivalents of the present disclosure.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit embodiments of the disclosure to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.

Claims (10)

1. A sound source positioning method, comprising:
performing voice separation on an original mixed audio signal of a sound source collected by a microphone array to obtain a multi-channel separated audio signal;
determining, from the multi-channel separated audio signal, a time period in which a wake-up word is located;
determining, from the original mixed audio signal, a mixed multi-channel audio signal corresponding to the time period in which the wake-up word is located;
determining, from the multi-channel separated audio signal, a single-channel audio signal in which the wake-up word is located;
and positioning the sound source based on the mixed multi-channel audio signal and the single-channel audio signal in which the wake-up word is located.
2. The method of claim 1, wherein the determining the single-channel audio signal in which the wake-up word is located from the multi-channel separated audio signal comprises:
identifying a wake-up word from the multi-channel separated audio signal;
determining the time period in which the wake-up word is located and the audio signal channel in which the wake-up word is located;
and determining the single-channel audio signal from the audio signal channel in which the wake-up word is located, according to the time period in which the wake-up word is located.
3. The method of claim 2, wherein the identifying a wake-up word from the multi-channel separated audio signal comprises:
identifying the wake-up word from the multi-channel separated audio signal by using a neural network model.
4. The method of any one of claims 1 to 3, wherein the positioning the sound source based on the mixed multi-channel audio signal and the single-channel audio signal in which the wake-up word is located comprises:
extracting a first frequency domain feature from the mixed multi-channel audio signal, wherein the first frequency domain feature is used to characterize a relative transfer function between the frequency domain signals corresponding to different microphone units in the microphone array;
extracting a second frequency domain feature from the single-channel audio signal in which the wake-up word is located, wherein the second frequency domain feature is used to represent a frequency domain energy value of the single-channel audio signal;
and positioning the sound source based on the first frequency domain feature and the second frequency domain feature.
5. The method of claim 4, wherein the positioning the sound source based on the first frequency domain feature and the second frequency domain feature comprises:
obtaining probabilities of a plurality of preset azimuth categories of the sound source based on the first frequency domain feature and the second frequency domain feature;
and positioning the sound source according to the preset azimuth category corresponding to the maximum probability among the probabilities of the plurality of preset azimuth categories.
6. The method of claim 5, wherein, in the case that the microphone array is a linear microphone array, each of the plurality of preset azimuth categories corresponds to an angle range of the sound source relative to the linear microphone array; and, in the case that the microphone array is a distributed microphone array, each of the plurality of preset azimuth categories corresponds to a position area in the space in which the distributed microphone array is located.
7. The method of claim 4, wherein the extracting a first frequency domain feature from the mixed multi-channel audio signal comprises:
performing time-frequency transform processing on the mixed multi-channel audio signal to obtain first frequency domain signals at a plurality of frequency points, and performing, for each of the plurality of frequency points, feature extraction on the first frequency domain signal to obtain a real part and an imaginary part of a relative transfer function value between the frequency domain signals corresponding to different microphone units of the microphone array, the real part and the imaginary part serving as the first frequency domain feature;
and wherein the extracting a second frequency domain feature from the single-channel audio signal in which the wake-up word is located comprises:
performing time-frequency transform processing on the single-channel audio signal in which the wake-up word is located to obtain second frequency domain signals at the plurality of frequency points, and performing, for each of the plurality of frequency points, normalization processing on the second frequency domain signal to obtain the second frequency domain feature.
8. A sound source positioning apparatus, comprising:
a voice separation module, configured to perform voice separation on an original mixed audio signal of a sound source collected by a microphone array to obtain a multi-channel separated audio signal;
a first determining module, configured to determine, from the multi-channel separated audio signal, a time period in which a wake-up word is located;
a second determining module, configured to determine, from the original mixed audio signal, a mixed multi-channel audio signal corresponding to the time period in which the wake-up word is located;
a third determining module, configured to determine, from the multi-channel separated audio signal, a single-channel audio signal in which the wake-up word is located;
and a positioning module, configured to position the sound source based on the mixed multi-channel audio signal and the single-channel audio signal in which the wake-up word is located.
9. A computer-readable storage medium storing a computer program for performing the method of any one of claims 1 to 7.
10. An electronic device, comprising:
a processor; and
a memory for storing instructions executable by the processor;
wherein the processor is configured to perform the method of any one of claims 1 to 7.
CN202011552864.6A 2020-12-24 2020-12-24 Sound source positioning method, sound source positioning device, computer readable storage medium and electronic equipment Active CN112799016B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011552864.6A CN112799016B (en) 2020-12-24 2020-12-24 Sound source positioning method, sound source positioning device, computer readable storage medium and electronic equipment


Publications (2)

Publication Number Publication Date
CN112799016A true CN112799016A (en) 2021-05-14
CN112799016B CN112799016B (en) 2024-04-26

Family

ID=75804223

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011552864.6A Active CN112799016B (en) 2020-12-24 2020-12-24 Sound source positioning method, sound source positioning device, computer readable storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN112799016B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023273469A1 (en) * 2021-06-30 2023-01-05 达闼机器人股份有限公司 Model training method, voice detection and localization method, apparatus, device, and medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110673096A (en) * 2019-09-30 2020-01-10 北京地平线机器人技术研发有限公司 Voice positioning method and device, computer readable storage medium and electronic equipment
US20200219493A1 (en) * 2019-01-07 2020-07-09 2236008 Ontario Inc. Voice control in a multi-talker and multimedia environment
CN111667843A (en) * 2019-03-05 2020-09-15 北京京东尚科信息技术有限公司 Voice wake-up method and system for terminal equipment, electronic equipment and storage medium


Also Published As

Publication number Publication date
CN112799016B (en) 2024-04-26


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant