CN115825853A - Sound source orientation method and device, sound source separation and tracking method and chip - Google Patents

Sound source orientation method and device, sound source separation and tracking method and chip

Info

Publication number
CN115825853A
CN115825853A
Authority
CN
China
Prior art keywords
sound source
source data
microphone
target
signal
Legal status
Pending
Application number
CN202310109065.9A
Other languages
Chinese (zh)
Inventors
Saeid Haghighatshoar
Dylan Richard Muir
Ning Qiao
Current Assignee
Shenzhen Shizhi Technology Co., Ltd.
Original Assignee
Shenzhen Shizhi Technology Co., Ltd.
Application filed by Shenzhen Shizhi Technology Co., Ltd.
Priority to CN202310109065.9A
Publication of CN115825853A

Landscapes

  • Measurement Of Velocity Or Position Using Acoustic Or Ultrasonic Waves (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention discloses a sound source orientation method and device, a sound source separation and tracking method, and a chip, aiming to solve the technical problems of existing sound source orientation methods: complex computation, poor interference resistance, and difficult hardware implementation. The invention performs zero-crossing pulse coding on the sound source data to be processed to obtain a pulse signal of the sound source data to be processed, and then, based on a spiking neural network, performs direction estimation on the pulse signals obtained by zero-crossing pulse coding to obtain the target sound source direction of the sound source data to be processed. The method is simple, offers good real-time performance at low cost, and can easily be implemented in low-power hardware; the chip test results are almost identical to the computer simulation results, so the method has commercial application value. The invention is applicable to the field of brain-inspired computing.

Description

Sound source orientation method and device, sound source separation and tracking method and chip
Technical Field
The invention relates to a sound source orientation method and device, a sound source separation and tracking method, and a chip; in particular, to a sound source orientation method and device and a sound source separation and tracking method and chip that perform sound source orientation at low power consumption and low cost based on a spiking neural network (SNN).
Background
Discriminating the position/direction of a sound by listening is an instinct shaped by biological evolution: it can quickly and effectively identify the direction of a sound source in complex environments such as noise. With the development of artificial intelligence technology, bionic machine vision and machine hearing have found wide application at the edge, in fields such as video conferencing, intelligent robots, smart homes, intelligent video monitoring systems, and the intelligent Internet of Things.
Some existing methods perform sound source localization (SSL) based on deep-learning artificial neural networks (ANNs, RNNs, etc.). On the one hand, these techniques lack the internal dynamic mechanisms of neurons, are insufficiently bionic/intelligent, and leave room for improvement in real-time performance; on the other hand, they place heavy demands on energy consumption and storage space, are mainly used on networked, high-compute terminals, and are unsuited to edge-computing and Internet-of-Things scenarios.
Because the distance and direction between the sound source and each microphone in a microphone array differ, every microphone in the array is likely to receive the speech signal of the sound source, and the quality, speech clarity, and accuracy of sound source orientation inevitably degrade as the sound source moves, as the room reverberates, and under interference from other sound sources and noise (including, but not limited to, environmental noise and/or internal noise of the electronic equipment); moreover, current sound source orientation technology is not bionic and lacks high sensitivity and robustness. These factors increase the difficulty of sound source orientation, reduce its real-time performance, impair the audiovisual effect, and degrade the performance of electronic equipment in voice-interactive modes. It is therefore generally necessary, after determining the position of the sound source, to perform processing such as speech-signal noise reduction and sound source separation.
In addition, in conventional sound source positioning/orientation methods, accuracy mostly has to be improved with algorithms such as singular value decomposition (SVD), subspace methods, beamforming, and generalized cross-correlation with phase transform. This increases the data-processing load and places high demands on the computing performance of the device, and the complex computation not only consumes large amounts of storage resources and power but is also all the more difficult to implement in low-power hardware.
If sound-direction discrimination could be realized in a biological or bionic way that is sensitive to the relative delays of the incoming signals at the different microphones, the sound source could be detected quickly and in real time and its position or direction identified effectively; a sound source orientation scheme that consumes few computing and storage resources, draws little power, costs little, and is easy to implement would be a major advance for machine hearing in commercial applications of edge intelligent computing.
Disclosure of Invention
In order to solve or alleviate some or all of the above technical problems, the invention is realized through the following technical solutions:
a first type of sound source direction finding method, said method comprising: carrying out zero-crossing pulse coding on sound source data to be processed to obtain a pulse signal of the sound source data to be processed;
and based on a pulse neural network, performing direction estimation on pulse signals obtained by zero-crossing pulse coding to obtain the target sound source direction of the sound source data to be processed.
In a certain embodiment, the sound source data to be processed is received based on microphones, and the sound source data to be processed received by each microphone is preprocessed and then subjected to zero-crossing pulse coding; wherein the zero-crossing pulse encoding comprises:
performing zero crossing point detection on the sound source data after the preprocessing of each microphone to obtain zero crossing points in the sound source data to be processed received by each microphone and time information corresponding to the zero crossing points;
and carrying out pulse coding on the basis of zero crossing points in the sound source data to be processed received by each microphone and time information corresponding to the zero crossing points to obtain pulse signals of the sound source data to be processed received by each microphone.
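Purely as an illustration, a minimal zero-crossing pulse encoder for a single channel might look like the following sketch; the function name, the uniform sample rate fs, and the treatment of samples lying exactly at zero are assumptions rather than the claimed implementation.

```python
import numpy as np

def zero_crossing_encode(x, fs, direction="both"):
    """Emit a pulse (event) at each zero crossing of a 1-D signal x.

    x: samples of one (preprocessed) microphone channel; fs: sample rate (Hz);
    direction: "up" (negative -> positive), "down", or "both".
    Returns pulse times in seconds.
    """
    s = np.sign(x)
    up = np.where((s[:-1] <= 0) & (s[1:] > 0))[0] + 1    # upward crossings
    down = np.where((s[:-1] >= 0) & (s[1:] < 0))[0] + 1  # downward crossings
    if direction == "up":
        idx = up
    elif direction == "down":
        idx = down
    else:
        idx = np.sort(np.concatenate([up, down]))
    return idx / fs  # the zero-crossing times are the pulse times
```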
In a certain embodiment, performing zero crossing point detection on the sound source data after the pre-processing of each microphone to obtain a zero crossing point in the sound source data to be processed received by each microphone and time information corresponding to each zero crossing point, includes:
for the sound source data preprocessed by each microphone, determining a plurality of target signal point sets of the sound source data preprocessed by the microphone according to the signal value of each signal point in the sound source data preprocessed by the microphone;
obtaining the sum of corresponding signal values of each signal point in each target signal point set according to the signal value of each signal point in each target signal point set;
comparing the sum of corresponding signal values of each signal point in each target signal point set, and determining a target signal point with a local maximum value in each target signal point set and time information corresponding to the target signal point;
and determining zero crossing points in the sound source data to be processed received by the microphone and time information corresponding to the zero crossing points according to target signal points with local maximum values in each target signal point set and time information corresponding to each target signal point.
In some kind of embodiments, the determining, according to the signal value of each signal point in the preprocessed sound source data of the microphone, multiple target signal point sets of the preprocessed sound source data of the microphone includes:
comparing the signal values of the signal points in the sound source data of the microphone after the microphone is preprocessed, and determining the signal points of which the signal values in the sound source data of the microphone are continuously decreased;
and according to the time information corresponding to the signal points of which the signal values continuously decrease in the sound source data after the microphone is preprocessed, grouping the signal points of which the signal values continuously decrease in the sound source data after the microphone is preprocessed to obtain a plurality of target signal point sets.
In some kind of embodiments, the comparing the sums of the corresponding signal values of the signal points in each of the target signal point sets to determine a target signal point having a local maximum value in each of the target signal point sets includes:
aiming at each target signal point set, comparing the sum of corresponding signal values of each signal point in the target signal point set, and determining a candidate target signal point with the sum of initial maximum signal values in the target signal point set;
determining a candidate time period with a local maximum value in the target signal point set according to the time information corresponding to the candidate target signal point;
and comparing the sums of the corresponding signal values of the signal points corresponding to each time information in the candidate time period, determining the signal point with the maximum sum of the signal values, and determining the signal point with the maximum sum of the signal values as the target signal point with the local maximum value in the target signal point set.
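The steps above are stated at the claim level and leave the exact grouping rule open. One possible reading, sketched below purely as a hypothesis, assumes that a "target signal point set" is a run of continuously decreasing samples and that the "sum of signal values" is a short windowed sum:

```python
import numpy as np

def robust_zero_crossings(x, fs, win=5):
    """Hypothetical reading of the local-maximum-based detection above."""
    score = np.convolve(x, np.ones(win), mode="same")  # windowed sum of signal values
    dec = np.where(np.diff(x) < 0)[0]                  # samples where the value decreases
    runs = np.split(dec, np.where(np.diff(dec) > 1)[0] + 1)  # consecutive runs = point sets
    times = []
    for run in runs:
        if run.size == 0:
            continue
        best = run[np.argmax(score[run])]              # target point with local-maximum score
        times.append(best / fs)                        # its time stamps the zero crossing
    return np.array(times)
```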
In some kind of embodiments, the preprocessing includes performing channel decomposition on the to-be-processed sound source data received by each microphone, so that the to-be-processed sound source data received by each microphone is decomposed into multiple frequency channels.
In some embodiments, the preprocessing further includes performing activity detection on the channel component after channel decomposition based on the to-be-processed sound source data received by each microphone to obtain a target frequency; the target frequency is one or more than one frequency;
and determining the sound source component of the sound source data to be processed received by each microphone in the target frequency channel as the sound source data preprocessed by each microphone.
In some kind of embodiments, the performing channel decomposition on the sound source data to be processed received by each microphone includes: and aiming at the sound source data to be processed received by each microphone, filtering the sound source data to be processed received by the microphone through a band-pass filter group, and dividing the sound source data to be processed received by the microphone into a plurality of frequency channels.
In some embodiments, the energy, the energy sum, or the average energy of the channel components of the sound source data to be processed received by each microphone at the different frequencies after channel decomposition is calculated within the same time window, so as to obtain the target frequency meeting the preset condition.
In a certain class of embodiments, the preset condition is that the energy, the energy sum, or the average energy is greater than or equal to a first threshold; alternatively, that it is greater than or equal to the first threshold and less than or equal to a second threshold.
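For concreteness, a sketch of this selection rule (the array layout and names are assumptions; th1 and th2 play the roles of the first and second thresholds):

```python
import numpy as np

def select_target_channels(chan_sig, th1, th2=None):
    """chan_sig: (n_channels, n_samples) band-decomposed signals in one time
    window. Returns the indices of the target frequency channels whose
    average energy satisfies the preset condition."""
    energy = np.mean(chan_sig ** 2, axis=1)  # average energy per frequency channel
    mask = energy >= th1                     # first threshold
    if th2 is not None:
        mask &= energy <= th2                # optional second threshold
    return np.where(mask)[0]
```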
In a certain embodiment, the performing direction estimation on a pulse signal obtained by zero-cross pulse coding based on a pulse neural network to obtain a target sound source direction of the sound source data to be processed includes:
inputting the pulse signal of the sound source data to be processed into a feature extraction module, and performing feature extraction on the pulse signal of the sound source data to be processed through the feature extraction module to obtain a pulse feature sequence; the feature extraction module is constructed on the basis of a long short-term memory (LSTM) network;
and inputting the pulse characteristic sequence into the pulse neural network for direction estimation to obtain the target direction of the sound source data to be processed.
In some embodiments, the sound source data may be replaced by electromagnetic-wave and/or seismic-wave and/or radar and/or physiological signals; accordingly, the microphone may be replaced by a sensor corresponding to the electromagnetic-wave, seismic-wave, radar, or physiological signal.
A first type of sound source direction finding device, said sound source direction finding device comprising: the encoding module is used for carrying out zero crossing encoding on the sound source data to be processed to obtain a pulse signal of the sound source data to be processed;
and the estimation module is used for carrying out direction estimation on the pulse signals obtained by the zero-crossing pulse coding based on the pulse neural network to obtain the target sound source direction of the sound source data to be processed.
In some kind of embodiments, the sound source direction finding apparatus further includes: the preprocessing module is used for preprocessing the sound source data received by the microphones to obtain the sound source data of each microphone after preprocessing;
and the coding module performs zero cross point coding on the sound source data preprocessed by the microphones to obtain pulse signals of the sound source data to be processed received by the microphones.
In some class of embodiments, the preprocessing module comprises: the channel decomposition module is used for carrying out channel decomposition on the sound source data to be processed received by each microphone;
the activity detection module is coupled with the channel decomposition module and is used for carrying out activity detection on the channel components subjected to channel decomposition on the basis of the sound source data to be processed and received by each microphone so as to obtain a target frequency; the target frequency is one or more than one frequency;
and determining the sound source component of the sound source data to be processed received by each microphone in the target frequency channel as the sound source data preprocessed by each microphone.
In a certain class of embodiments, the energy, the energy sum, or the average energy of the channel components of the sound source data to be processed received by each microphone at the different frequencies after channel decomposition is calculated within the same time window, so as to obtain the target frequency meeting the preset condition.
In a certain class of embodiments, the preset condition is that the energy, the energy sum, or the average energy is greater than or equal to a first threshold; alternatively, that it is greater than or equal to the first threshold and less than or equal to a second threshold.
In a certain type of embodiment, the encoding module is configured to detect a zero crossing point in the sound source data to be processed received by each microphone and time information corresponding to each zero crossing point;
performing pulse coding based on zero crossing points in the sound source data to be processed received by each microphone and time information corresponding to each zero crossing point,
determining a plurality of target signal point sets of the sound source data preprocessed by the microphones according to the signal values of the signal points in the sound source data preprocessed by the microphones;
obtaining the sum of corresponding signal values of each signal point in each target signal point set according to the signal value of each signal point in each target signal point set;
comparing the sum of corresponding signal values of each signal point in each target signal point set, and determining a target signal point with a local maximum value in each target signal point set and time information corresponding to the target signal point;
and determining zero crossing points in the sound source data to be processed received by the microphone and time information corresponding to the zero crossing points according to target signal points with local maximum values in each target signal point set and the time information corresponding to each target signal point.
In some kind of embodiments, determining a plurality of target signal point sets of the preprocessed sound source data according to the signal value of each signal point in the preprocessed sound source data of the microphone includes:
comparing the signal values of the signal points in the sound source data of the microphone after the microphone is preprocessed, and determining the signal points of which the signal values in the sound source data of the microphone are continuously decreased;
and according to the time information corresponding to the signal points of which the signal values continuously decrease in the sound source data after the microphone is preprocessed, grouping the signal points of which the signal values continuously decrease in the sound source data after the microphone is preprocessed to obtain a plurality of target signal point sets.
In some kind of embodiments, the comparing the sums of the corresponding signal values of the signal points in each of the target signal point sets to determine a target signal point having a local maximum value in each of the target signal point sets includes:
aiming at each target signal point set, comparing the sum of corresponding signal values of each signal point in the target signal point set, and determining a candidate target signal point with the sum of initial maximum signal values in the target signal point set;
determining a candidate time period with a local maximum value in the target signal point set according to the time information corresponding to the candidate target signal point;
and comparing the sums of the corresponding signal values of the signal points corresponding to each time information in the candidate time period, determining the signal point with the maximum sum of the signal values, and determining the signal point with the maximum sum of the signal values as the target signal point with the local maximum value in the target signal point set.
In some kind of embodiments, the sound source direction finding apparatus further includes: the characteristic extraction module is coupled between the coding module and the pulse neural network, and is used for extracting the characteristics of the pulse signals of the sound source data to be processed, which are received by each microphone and generated by the coding module, so as to obtain a pulse characteristic sequence;
and the pulse neural network carries out direction estimation based on the pulse characteristic sequence to obtain the target direction of the sound source data to be processed.
A sound source separation method, comprising: performing sound source direction estimation on sound source data to be separated by the first-class sound source orientation method, and determining candidate sound sources corresponding to the sound source data to be separated and a target sound source direction of each candidate sound source;
carrying out sound source separation according to the positions of the sound channels for collecting the sound source data to be separated and the target sound source directions of the candidate sound sources to obtain sound signals of each candidate sound source;
and determining a target sound source from a plurality of candidate sound sources according to the sound signals of the candidate sound sources.
A sound source tracking method, the sound source tracking method comprising: determining a target sound source direction of the sound source data by the first type of sound source orientation method as described above; and carrying out sound source tracking based on the target sound source direction of the sound source data.
A first type of chip comprising a first type of sound source direction finding device as described above.
A first type of electronic equipment comprising a first type of sound source direction finding device as described above, or comprising a first type of chip as described above.
A preprocessing device is used for preprocessing the sound source data to be processed received by the microphones to obtain the preprocessed sound source data of each microphone; the preprocessing device comprises: a channel decomposition module for performing channel decomposition on the sound source data to be processed received by each microphone; and an activity detection module, coupled with the channel decomposition module, for performing activity detection on the channel components after channel decomposition based on the sound source data to be processed received by each microphone so as to obtain a target frequency, the target frequency being one or more frequencies; and the sound source component of the sound source data to be processed received by each microphone in the target frequency channel is determined as the preprocessed sound source data of each microphone.
In some class of embodiments, the activity detection module performs independent activity detection or joint activity detection or local joint activity detection;
the independent activity detection is: counting the energy or the energy average value of sound source data to be processed received by each microphone under different frequency channels after channel decomposition to obtain an initial activity frequency meeting a first preset condition; obtaining the target frequency based on the initial activity frequency corresponding to each microphone;
the joint activity detection is: counting the energy sum of all the microphones under different frequency channels respectively, or counting the average value of the signal energy of all the microphones under different frequency channels respectively, so as to obtain a target frequency meeting a second preset condition;
the local joint activity detection is: grouping sound source data to be processed received by all microphones to obtain at least two sound source data combinations; the sound source data combination comprises at least one sound source data to be processed; counting the energy sum of all sound source data in each sound source data combination under different frequency channels, or counting the signal energy average value of all sound source data in each sound source data combination under different frequency channels to obtain the active frequency of each sound source data combination; and determining to obtain a target frequency based on the activity frequency of each sound source data combination.
In a certain class of embodiments, the first preset condition for independent activity detection is that energy or an energy average is maximum, or that energy or an energy average is greater than or equal to a second threshold;
the second preset condition of the joint activity detection is that the energy sum or the energy average value is maximum, or the energy sum or the energy average value is greater than or equal to a second threshold.
In some embodiments, the obtaining the target frequency based on the initial active frequency corresponding to each microphone includes one of:
selecting at least one initial activity frequency as a target frequency based on the frequency of occurrence of each initial activity frequency;
selecting at least one initial activity frequency as a target frequency based on the frequency value of each initial activity frequency;
and clustering the initial activity frequencies corresponding to the microphones, and selecting at least one initial activity frequency as a target frequency.
In some embodiments, the determining a target frequency based on the active frequency of each sound source data combination includes one of:
selecting at least one activity frequency as a target frequency based on the frequency of occurrence of each activity frequency;
selecting at least one active frequency as a target frequency based on the frequency value of each active frequency;
and clustering the activity frequencies corresponding to the sound source data combinations, and selecting at least one activity frequency as a target frequency.
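To make the independent and joint variants above concrete, a sketch follows; the per-window channel energies are assumed precomputed, and the majority vote used to merge per-microphone results is only one of the selection rules listed above:

```python
import numpy as np

def detect_target_channel(mic_chan_energy, mode="joint"):
    """mic_chan_energy: (n_mics, n_channels) energies per microphone and
    frequency channel within one time window."""
    if mode == "joint":
        total = mic_chan_energy.sum(axis=0)      # energy sum over all microphones
        return int(np.argmax(total))             # channel with maximal energy sum
    # independent: an initial active frequency per microphone, then a vote
    init = np.argmax(mic_chan_energy, axis=1)
    vals, counts = np.unique(init, return_counts=True)
    return int(vals[np.argmax(counts)])          # most frequently selected channel
```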
In a certain type of embodiments, the channel decomposition module includes two or more filter banks for preprocessing sound source data to be processed received by each microphone in a microphone array composed of two or more microphones; wherein the number of filter banks is equal to or less than the number of microphones in the array of microphones, each filter bank being coupled to one of the microphones in the array of microphones;
and the filter bank carries out filtering processing, and divides the sound source data to be processed received by the corresponding microphone into a plurality of frequency channels.
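In software terms, such a filter bank might be sketched as follows; the filter order and band edges are illustrative assumptions, and a hardware implementation would differ:

```python
import numpy as np
from scipy.signal import butter, sosfilt

def bandpass_filterbank(x, fs, edges):
    """Decompose one microphone signal x into frequency channels.
    edges: list of (f_lo, f_hi) band edges in Hz."""
    chans = [sosfilt(butter(4, (f_lo, f_hi), btype="bandpass", fs=fs,
                            output="sos"), x)
             for f_lo, f_hi in edges]
    return np.stack(chans)  # (n_channels, n_samples)

# e.g. eight log-spaced channels between 100 Hz and 4 kHz (assumed layout):
# f = np.geomspace(100.0, 4000.0, 9); edges = list(zip(f[:-1], f[1:]))
```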
In some embodiments, the microphone array is linear or circular or spherical or cross-shaped or spiral.
In some class of embodiments, the microphone array comprises a circular array of 8 microphones.
In some kind of embodiments, the sound source data may be replaced by electromagnetic-wave and/or seismic-wave and/or radar and/or physiological signals; accordingly, the microphone is replaced by a sensor corresponding to the electromagnetic-wave, seismic-wave, radar, or physiological signal.
A preprocessing method is used for preprocessing the sound source data to be processed received by the microphones to obtain the preprocessed sound source data of each microphone; the preprocessing method comprises the following steps:
performing channel decomposition on the to-be-processed sound source data received by each microphone, so that the to-be-processed sound source data received by each microphone is decomposed into a plurality of frequency channels;
performing activity detection on the channel component subjected to channel decomposition based on the to-be-processed sound source data received by each microphone to obtain a target frequency; the target frequency is one or more than one frequency;
and determining the sound source component of the sound source data to be processed received by each microphone in the target frequency channel as the sound source data preprocessed by each microphone.
In some kind of embodiments, the performing channel decomposition on the sound source data to be processed received by each microphone includes: and aiming at the sound source data to be processed received by each microphone, filtering the sound source data to be processed received by the microphone through a band-pass filter group, and dividing the sound source data to be processed received by the microphone into a plurality of frequency channels.
In a certain type of embodiment, the performing activity detection on the channel component after channel decomposition based on the sound source data to be processed received by each microphone to obtain a target frequency includes: carrying out independent activity detection on channel components obtained by carrying out channel decomposition on sound source data to be processed received by each microphone to obtain initial activity frequency corresponding to each microphone; the independent activity detection is to respectively count the energy or the energy average value of each microphone under different frequency channels to obtain the initial activity frequency meeting a first preset condition;
and obtaining the target frequency based on the initial activity frequency corresponding to each microphone.
In a certain class of embodiments, the first preset condition is that the energy or the energy average is maximum, or that the energy or the energy average is greater than or equal to a second threshold.
In some kind of embodiments, the obtaining the target frequency based on the initial active frequency corresponding to each microphone includes one of:
selecting at least one initial activity frequency as a target frequency based on the frequency of occurrence of each initial activity frequency;
selecting at least one initial activity frequency as a target frequency based on the frequency value of each initial activity frequency;
and clustering the initial activity frequencies corresponding to the microphones, and selecting at least one initial activity frequency as a target frequency.
In a certain type of embodiment, the performing activity detection on the channel component after channel decomposition based on the sound source data to be processed received by each microphone to obtain a target frequency includes: carrying out joint activity detection on channel components obtained by carrying out channel decomposition on sound source data to be processed received by all microphones;
and the joint activity detection is to count the energy sum of all the microphones under different frequency channels respectively, or count the average value of the signal energy of all the microphones under different frequency channels respectively, so as to obtain the target frequency meeting a second preset condition.
In a certain class of embodiments, the second preset condition is that the energy sum or the energy average value is maximum, or the energy sum or the energy average value is greater than or equal to a second threshold.
In a certain type of embodiment, the performing activity detection on the channel component after channel decomposition based on the sound source data to be processed received by each microphone to obtain a target frequency includes: performing local joint activity detection on channel components obtained by performing channel decomposition on sound source data to be processed received by all microphones, wherein the local joint activity detection is as follows:
grouping sound source data to be processed received by all microphones to obtain at least two sound source data combinations; the sound source data combination comprises at least one sound source data to be processed;
counting the energy sum of all sound source data in each sound source data combination under different frequency channels, or counting the signal energy average value of all sound source data in each sound source data combination under different frequency channels to obtain the active frequency of each sound source data combination;
determining to obtain a target frequency based on the activity frequency of each sound source data combination; and determining and obtaining a target frequency channel of the sound source data to be processed received by each microphone according to the target frequency.
In some kind of embodiments, the sound source data may be replaced by electromagnetic-wave and/or seismic-wave and/or radar and/or physiological signals; accordingly, the microphone is replaced by a sensor corresponding to the electromagnetic-wave, seismic-wave, radar, or physiological signal.
A second type of sound source direction finding device, comprising the preprocessing device described above; and
the coding module is coupled with the preprocessing device and used for carrying out pulse coding on the sound source data preprocessed by each microphone to obtain pulse signals corresponding to the sound source data to be processed received by each microphone;
and the estimation module is coupled with the coding module and used for carrying out direction estimation on the pulse signals obtained by the pulse coding based on the pulse neural network to obtain the target sound source direction of the sound source data to be processed.
In a certain class of embodiments, the encoding module performs zero-crossing pulse encoding.
In some embodiments, a zero crossing point of the sound source data preprocessed by each microphone is detected on the target frequency channel, and a pulse is generated at the zero crossing point; wherein the zero-crossing point is an upward and/or downward zero-crossing point;
the upward zero-crossing point is a signal point at which the signal amplitude changes from negative to positive, and the downward zero-crossing point is a signal point at which the signal amplitude changes from positive to negative.
In certain class of embodiments, the encoding module performs the steps of:
performing zero crossing point detection on the sound source data after the preprocessing of each microphone to obtain zero crossing points in the sound source data to be processed received by each microphone and time information corresponding to the zero crossing points;
and carrying out pulse coding on the basis of zero crossing points in the sound source data to be processed received by each microphone and time information corresponding to the zero crossing points to obtain pulse signals of the sound source data to be processed received by each microphone.
A second type of sound source orientation method, the method including the preprocessing method described above and obtaining the preprocessed sound source data of each microphone;
carrying out pulse coding on the sound source data after the preprocessing of each microphone to obtain pulse signals corresponding to the sound source data to be processed received by each microphone;
and based on the pulse neural network, performing direction estimation on the pulse signals obtained by pulse coding to obtain the target sound source direction of the sound source data to be processed.
In some class of embodiments, the pulse encoding is zero-crossing pulse encoding.
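Putting the three stages of this second-type method together, a minimal end-to-end sketch follows; everything beyond the claimed stages, including the joint activity detection and the black-box SNN classifier, is an assumption:

```python
import numpy as np
from scipy.signal import butter, sosfilt

def estimate_doa(mics, fs, edges, snn_classify):
    """mics: (M, T) microphone signals; edges: list of (f_lo, f_hi) band
    edges in Hz; snn_classify: a trained SNN treated as a black box that
    maps per-microphone pulse-time lists to a DoA class."""
    # 1) preprocessing: channel decomposition + joint activity detection
    chans = np.stack([
        np.stack([sosfilt(butter(4, e, btype="bandpass", fs=fs, output="sos"), x)
                  for e in edges])
        for x in mics])                              # (M, n_chan, T)
    energy = (chans ** 2).mean(axis=2)               # per-mic, per-channel energy
    k = int(np.argmax(energy.sum(axis=0)))           # target frequency channel
    # 2) zero-crossing pulse coding on the target channel of each microphone
    pulses = []
    for c in chans[:, k, :]:
        s = np.sign(c)
        idx = np.where(s[:-1] * s[1:] < 0)[0] + 1    # sign change = zero crossing
        pulses.append(idx / fs)
    # 3) SNN direction estimation as a classification over DoA angles
    return snn_classify(pulses)
```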
A second type of chip comprising a preprocessing device as described above, or a sound source direction finding device of the second type as described above.
A second type of electronic device comprising a sound source direction finding device of the second type as described above, or comprising a chip of the second type as described above.
Some or all embodiments of the invention have the following beneficial technical effects:
1) The sound source direction estimation scheme of the invention needs no complex algorithms such as beamforming or subspace methods; it realizes sound source estimation with an event-based spiking neural network. The method is simple, easy to orient with, low in power consumption, and easy to implement in hardware.
2) The zero-crossing-based pulse coding method effectively captures the phase information required in sound source direction estimation and performs the estimation from relative delay information, improving the real-time performance, interference resistance, and accuracy of sound source estimation. Robust zero-crossing pulse coding further improves robustness.
3) The invention applies an adaptive broadband orientation technique for sound source orientation: it identifies the active frequency components in the signal within each time interval and uses them for localization. This enhances real-time performance, effectively overcomes the non-stationarity of speech signals, adapts strongly to the environment, and suits a variety of complex environments.
4) The activity detection of the invention has multiple implementations (independent, joint, or local joint activity detection), giving strong flexibility.
5) The invention can effectively realize DoA estimation for both narrowband and broadband sound signals. Moreover, the sound source direction estimation scheme of the invention responds very quickly to a sudden change in DoA (e.g., a speaker change in a conference room), can rapidly output the DoA angle after the switch, and tracks quickly, efficiently, and accurately when the sound source moves (e.g., the speaker walks).
6) The sound source orientation technology of the invention obtains good sound source orientation results on chip, and in tests of sudden DoA changes and tracking with the chip the difference from the computer simulation results is negligibly small. The sound source orientation technology can thus be implemented effectively in hardware and has commercial application value.
Further advantages will be further described in the preferred embodiments.
The technical solutions/features disclosed above are intended to summarize the detailed description, and the scopes may therefore not be exactly identical. The technical features disclosed in this section, together with the technical features disclosed in the subsequent detailed description and with parts of the drawings not explicitly described in the specification, disclose further aspects in mutually reasonable combination.
The technical solution formed by combining technical features disclosed at any position of the invention serves to support the generalization of the technical solutions, the amendment of the patent document, and the disclosure of the technical solutions.
Drawings
FIG. 1 is a schematic view of the direction of arrival in a circular array;
FIG. 2 is a schematic flow chart of a sound source orientation method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a pulse sequence of a low frequency channel after zero crossing pulse coding according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a pulse sequence of the high frequency channel after zero-crossing pulse coding according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a spiking neural network;
fig. 6 is a schematic diagram of array resolution of a linear microphone array provided by an embodiment of the present invention;
FIG. 7 is a schematic time-frequency diagram of a speech signal;
FIG. 8 is a schematic diagram of sound source orientation in an embodiment of the present invention;
fig. 9 is a schematic diagram of sound source orientation after pre-processing of signals received by microphones in accordance with an embodiment of the present invention.
FIG. 10 is a schematic diagram of independent activity detection provided by embodiments of the present invention;
FIG. 11 is a schematic diagram of joint activity detection provided by embodiments of the present invention;
FIG. 12 is a schematic structural diagram of a sound source localization model based on a long-short term memory network and an impulse neural network according to an embodiment of the present invention;
FIG. 13 is a test result of the sound source orientation simulation in the low frequency channel of the present invention;
FIG. 14 shows the comparison result of the sound source orientation test using the low frequency channel between the brain-like chip implemented by the low power consumption hardware and the simulation model according to the present invention;
FIG. 15 shows the results of the sound source orientation simulation test in the high frequency channel of the present invention;
FIG. 16 is the comparison result of the sound source orientation test using the high frequency channel between the brain-like chip implemented by the low power consumption hardware and the simulation model according to the present invention;
fig. 17 is a schematic flow chart of a sound source signal separation method according to an embodiment of the present invention;
FIG. 18 is a schematic flow chart of a sound source tracking method provided by an embodiment of the present invention;
FIG. 19 shows the test results of sound source tracking performed by the brain-like chip implemented in low-power hardware according to the present invention.
Detailed Description
Since various alternatives cannot be exhaustively described, the following will clearly and completely describe the gist of the technical solution in the embodiment of the present invention with reference to the drawings in the embodiment of the present invention. It is to be understood that the invention is not limited to the details disclosed herein, which may vary widely from one implementation to another.
In the present invention, "/" at any position indicates a logical "or" unless it is a division meaning. The ordinal numbers "first," "second," etc. in any position of the invention are used merely as distinguishing labels in description and do not imply an absolute sequence in time or space, nor that the terms in which such a number is prefaced must be read differently than the terms in which it is prefaced by the same term in another definite sentence.
The present invention may be described in terms of various elements combined into various embodiments, which may be combined into various methods, articles of manufacture. In the present invention, even if the points are described only when introducing the method/product scheme, it means that the corresponding product/method scheme explicitly includes the technical features.
The presence or inclusion of a step, module, feature in any location in the disclosure does not imply that such presence is the only exclusive presence, and those skilled in the art are fully enabled to derive other embodiments based on the teachings herein, along with other techniques. The embodiments disclosed herein are generally for the purpose of disclosing preferred embodiments, but this does not imply that the opposite embodiment to the preferred embodiment is excluded/excluded from the present invention, and it is intended to cover the present invention as long as such opposite embodiment solves at least some technical problem of the present invention. Based on the point described in the embodiments of the present invention, those skilled in the art can completely apply the means of substitution, deletion, addition, combination, and order change to some technical features to obtain a technical solution still following the concept of the present invention. Such solutions are also within the scope of protection of the present invention, without departing from the technical idea of the invention. Some important terms and symbols are explained:
Neuromorphic (brain-mimicking) chips: event-driven circuits, with the characteristic that computation or processing is triggered when an event occurs, thereby achieving ultra-high real-time performance and ultra-low power consumption in the hardware circuit. By circuit type, neuromorphic chips are classified as based on analog, digital, or mixed-signal circuits.
Spiking Neural Network (SNN): an event-driven, third-generation artificial neural network with rich spatiotemporal dynamics, diverse coding mechanisms, event-driven characteristics, low computation cost, and low power consumption. Compared with artificial neural networks (ANNs), SNNs are more bionic and more advanced, and SNN-based brain-inspired computing or neuromorphic computing offers better performance and computing cost than traditional AI chips. It should be noted that the embodiments of the present invention do not specifically limit the type of spiking neural network: any neural network driven by pulse signals or events is applicable to the sound source orientation method provided in the embodiments of the present invention, and the spiking neural network may be built according to the actual application scenario, for example a spiking convolutional neural network (SCNN), a spiking recurrent neural network (SRNN), a long short-term memory (LSTM) network, and the like.
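The embodiments do not fix a particular neuron model; purely to illustrate the event-driven computation style, a minimal discrete-time leaky integrate-and-fire (LIF) layer might be:

```python
import numpy as np

def lif_forward(spikes_in, w, tau=0.02, dt=1e-3, v_th=1.0):
    """spikes_in: (T, n_in) binary spike trains; w: (n_in, n_out) weights.
    Returns (T, n_out) output spikes (illustrative model only)."""
    decay = np.exp(-dt / tau)                 # membrane leak per time step
    v = np.zeros(w.shape[1])
    out = np.zeros((spikes_in.shape[0], w.shape[1]))
    for t in range(spikes_in.shape[0]):
        v = decay * v + spikes_in[t] @ w      # integrate weighted input spikes
        fired = v >= v_th                     # threshold crossing emits a spike
        out[t] = fired
        v[fired] = 0.0                        # reset membrane after spiking
    return out
```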
Direction of Arrival (DoA): the direction angle at which the audio signal output by a sound source reaches the microphone array; for sound sources with different directions of arrival, the delays with which the audio signal reaches the microphone array differ. The microphone array may have different spatial shapes, such as circular, linear, spherical, cross-shaped, or spiral. Taking a circular microphone array as an example, as shown in fig. 1 (it being understood that the microphone array shown in fig. 1 is only an example and is not specifically limited by the embodiments of the present invention): fig. 1 is a schematic diagram of the direction of arrival in a circular array, in which the projection of the array elements along the DoA serves as a measure of the relative time at which the signal is received at each element. An array element refers to a microphone in the microphone array.
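For the circular geometry of fig. 1, the relative delay of a far-field plane wave at each element is the projection of the element position onto the DoA direction divided by the speed of sound; as a sketch (the names and the far-field assumption are illustrative):

```python
import numpy as np

def circular_array_delays(n_mics, radius, doa_deg, c=343.0):
    """Relative arrival times (s) of a far-field plane wave at the elements
    of a uniform circular microphone array."""
    ang = 2 * np.pi * np.arange(n_mics) / n_mics         # element angles on the circle
    pos = radius * np.stack([np.cos(ang), np.sin(ang)])  # (2, n_mics) element positions
    doa = np.deg2rad(doa_deg)
    n = np.array([np.cos(doa), np.sin(doa)])             # unit DoA vector
    return -(n @ pos) / c                                # projection along DoA / sound speed
```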
It should be noted that the DoA estimation method of the present invention is applicable not only to audio waves but also to electromagnetic waves, seismic waves, radar, and similar or one-dimensional waves for finding the direction or position of a target.
Narrow-band: the bandwidth of the signal is much lower than its center frequency, e.g., the bandwidth of a narrowband signal is in the range of 10-100 MHz.
Narrowband positioning: a narrowband signal has a relatively simple spectrum and can be regarded as a single-frequency signal. In the narrowband case, a source located in a direction n produces, at the microphones of the microphone array, phase shifts of the incident harmonic signal; given the arrangement {p_1, ..., p_M} of the microphone array, these phase shifts are encoded in the M-dimensional array response vector a(n), whose kth component can be written as

a_k(n) = e^{j 2\pi \langle p_k, n \rangle / \lambda},

where n is the unit-norm DoA vector on the unit circle S^1 representing the direction of the incident wave, λ is the wavelength, M is the number of microphones in the array, and p_k is the position of the kth microphone in the array arrangement. Understandably, the array response as a function of the DoA vector is indeed a spatial harmonic signal whose frequency depends on the geometry of the microphone array arrangement {p_k}.
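A direct transcription of this array response, for illustration (element positions and wavelength are supplied by the caller):

```python
import numpy as np

def array_response(pos, doa_deg, lam):
    """pos: (2, M) microphone positions; lam: wavelength. Returns the
    M-dimensional narrowband array response vector a(n)."""
    doa = np.deg2rad(doa_deg)
    n = np.array([np.cos(doa), np.sin(doa)])             # unit-norm DoA vector
    return np.exp(2j * np.pi * (n @ pos) / lam)          # a_k(n) = exp(j*2*pi*<p_k, n>/lambda)
```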
Array resolution: characterizes how far apart two targets must be for their DoAs to be distinguished in the presence of noise. This resolution depends on the array geometry and, more importantly, on the spatial size of the array; for example, a linear array of size L has an angular resolution on the order of λ/L. In general, the larger the array, the better its angular resolution.
Grating lobes: although using a larger array can yield better resolution, it can lead to grating-lobe problems when the number of array elements, relative to the spatial span of the array as a whole, is limited. When this occurs, the array response vector exhibits aliasing effects: two different DoA vectors n_1 and n_2 may have the same array response vector, i.e., a(n_1) = a(n_2), which makes it impossible to determine the angle at which the sound source is located, and hence to distinguish and find the correct DoA. For a microphone array with a limited number of microphones there is therefore a trade-off between increasing the array resolution and avoiding grating lobes: for example, the distance d between the microphones in the array is determined based on d ≤ λ/2, where λ is the wavelength of the audio signal, to obtain the array resolution of the microphone array (at f = 4 kHz in air, λ ≈ 8.6 cm, so the spacing should satisfy d ≤ 4.3 cm).
Broadband positioning: a broadband signal has a relatively complex spectrum containing abundant frequency components, and broadband positioning can be regarded as a generalization of the narrowband case, i.e., positioning can be performed by processing the received signal at several frequencies. A fixed array with a given element configuration can only handle signals in a limited frequency range: when the frequency exceeds an upper limit f_max, the signal wavelength is very small, in particular smaller than the spacing between the array elements, and the grating-lobe effect produced then limits the positioning performance; when the frequency is lower than a lower limit f_min, the signal wavelength is very large, in particular longer than the entire array span, so that the angular resolution of the array is limited and a single target cannot be positioned with sufficient accuracy in the presence of noise.
As mentioned in the background, existing sound source orientation methods are mainly implemented with beamforming techniques. Beamforming defines x_k(t), k = 1, ..., M, as the signal received at the kth microphone, where M is the number of microphones in the array and τ_k(n) denotes the relative delay at microphone k as a function of the DoA n of the audio signal. In beamforming, the received signals at the different microphones are weighted, delayed, and accumulated for different candidate DoAs (so-called spatial matched filtering) in order to find the DoA with the maximum power, which is the target sound source direction; the target sound source direction may be determined by, for example, the Delay-and-Sum algorithm, the Minimum Variance Distortionless Response (MVDR) algorithm, or the Steered Response Power with Phase Transform (SRP-PHAT) algorithm. These methods are mainly applied to narrowband signals: in the narrowband case, the input audio signal is concentrated around the carrier frequency, the energy of the audio signals received by the different microphones in the microphone array is calculated (for example, the power of the signal received by each microphone is calculated after a Fourier transform), the DoA producing the maximum power is found from the calculated powers, and the target sound source direction is thereby determined.
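As a sketch of the delay-and-sum idea just described (frequency-domain fractional delays; the delay model and the scan over candidate DoAs are assumptions of the illustration):

```python
import numpy as np

def delay_and_sum_power(x, fs, delays):
    """x: (M, T) microphone signals; delays: (M,) relative delays in seconds
    for one candidate DoA. Returns the power of the steered output; scanning
    candidate DoAs and taking the argmax gives the classical estimate."""
    T = x.shape[1]
    f = np.fft.rfftfreq(T, 1.0 / fs)
    X = np.fft.rfft(x, axis=1)
    aligned = X * np.exp(2j * np.pi * f * delays[:, None])  # undo each mic's delay
    y = np.fft.irfft(aligned.sum(axis=0), n=T)              # beamformed time signal
    return float(np.mean(y ** 2))
```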
It can be seen that existing sound source orientation methods are computationally complex and place high demands on the computing performance of the device; the complex computation consumes large amounts of storage resources and power, harms real-time performance, and is all the harder to implement in low-power hardware.
On this basis, in order to provide a sound source orientation scheme with low power consumption, low cost, strong real-time performance, and easy hardware implementation, embodiments of the present invention provide a sound source orientation method, apparatus, chip, and electronic device that capture the relative delay information in the sound source data to be processed through pulse coding and perform sound source direction estimation based on that relative delay information, improving the accuracy of sound source estimation. Specifically, the zero-crossing-based pulse coding method captures the delay information required in sound source direction estimation, which guarantees the accuracy of the estimate; direction estimation is then performed with a spiking neural network to obtain the target direction of the sound source data to be processed, which preserves accuracy while reducing power consumption and provides better robustness and higher processing speed.
In order to facilitate understanding of the technical solution of the present invention, the sound source orientation method, apparatus, chip and electronic device provided by the present invention will be described below with reference to practical application scenarios.
In order to improve the real-time performance of sound source orientation and reduce its power consumption and complexity, so that the sound source orientation method can easily be applied in low-power hardware, and further to improve orientation performance when the sound source is switched and/or changed and/or moved, the method performs sound source direction estimation based on the spiking neural network (SNN) and converts DoA estimation into a classification task for the spiking neural network. The method comprises at least the steps shown in fig. 2, which is a schematic flow diagram of the sound source orientation method provided by an embodiment of the invention.
Step S100: perform zero-crossing pulse coding on the sound source data to be processed to obtain a pulse signal of the sound source data to be processed. Considering the pulse-based communication mechanism of the spiking neural network, the sound source data to be processed needs to be converted into a set of pulse features in advance.
In some embodiments, the sound source data to be processed may be a speech signal in a time domain or an audio signal in a frequency domain.
In some embodiments, the sound source data to be processed may be collected voice signals in the current environment, which include voice signals of the sound source and/or noise present in the environment.
Preferably, the sound source data to be processed is the speech signal in the current environment collected in real time; alternatively, it may be the speech signal in the current environment collected over a past period of time. The past period of time may be, for example, the past 1 s or the past 1 min; the embodiment of the present invention is not specifically limited in this respect.
Optionally, the speech signal in the environment is collected by a microphone array. The microphone array may be a circular microphone array as shown in fig. 1, a linear microphone array, a distributed microphone array, or a cross-shaped microphone array. The microphone array includes at least one microphone, and the microphone may be a noise reduction microphone, which is not limited in the present invention. In addition, the sound source data to be processed may be obtained by performing filtering, noise reduction, time-frequency analysis, and the like on the collected voice signal in the current environment, which is not limited in the present invention.
Considering that the sound source data to be processed collected in practical applications is a broadband signal, using it directly for zero-crossing coding may introduce interference and thereby reduce the accuracy of the sound source direction estimate. For this reason, in order to improve the accuracy of sound source direction estimation, in a certain class of embodiments the sound source data to be processed is preprocessed before zero-crossing pulse coding, where the preprocessing includes channel decomposition: the broadband signal is decomposed into a plurality of frequency channels, the signal in each frequency channel is a narrowband signal, and each narrowband signal covers a different frequency range. Specifically, channel decomposition is performed on the sound source data to be processed received by each microphone in the microphone array, the broadband sound source data to be processed is decomposed into a plurality of narrowband signals, and zero-crossing coding is performed on the narrowband signal of each frequency channel.
Optionally, the sound source data received by each microphone may be subjected to channel decomposition by a filter bank, where the filter bank includes, but is not limited to, a band pass filter bank, a narrow band filter bank, and the like. In addition, the channel decomposition may further include other modules, such as a low noise amplifier LNA coupled to a filter bank, and the like, and the invention is not limited thereto.
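As a concrete illustration, the following Python sketch shows one possible filter-bank channel decomposition. It is illustrative only and not part of this disclosure: the use of SciPy Butterworth band-pass filters, the filter order, the sample rate, and the logarithmic channel edges are all assumptions.

```python
# Illustrative sketch (not part of this disclosure): channel decomposition of
# one microphone's wideband signal with a Butterworth band-pass filter bank.
# Filter order, sample rate, and channel edges are assumed values.
import numpy as np
from scipy.signal import butter, sosfilt

def channel_decompose(x, fs, band_edges, order=4):
    """Split signal x (1-D array) into one narrowband signal per channel.

    band_edges: list of (low_hz, high_hz) tuples, one per frequency channel.
    Returns an array of shape (num_channels, len(x)).
    """
    channels = []
    for lo, hi in band_edges:
        sos = butter(order, [lo, hi], btype="bandpass", fs=fs, output="sos")
        channels.append(sosfilt(sos, x))
    return np.stack(channels)

# Example: 16 channels spanning 100 Hz .. 8 kHz on a logarithmic scale.
fs = 16000
edges = np.logspace(np.log10(100), np.log10(8000), 17)
band_edges = list(zip(edges[:-1], edges[1:]))
```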
If every one of the plurality of narrowband signals obtained after channel decomposition were zero-crossing encoded, the data volume would be large, increasing the processing time for sound source orientation and reducing its real-time performance. Moreover, each narrowband signal obtained after channel decomposition has a different frequency range, and the energy of the sound signal arriving from the sound source direction is high and has specific frequencies; that is, the energy of the sound source data to be processed received by each microphone in the microphone array is mainly concentrated in one or several frequency ranges. It is therefore possible to determine, through activity detection, the target frequency channel with the maximum energy among the plurality of narrowband signals obtained after channel decomposition, or to select target frequency channels whose energy is greater than or equal to a preset energy threshold, and to perform zero-crossing encoding only on the narrowband signals of the target frequency channels. Based on this, in order to implement fast sound source orientation, in a certain type of embodiment the preprocessing includes channel decomposition and activity detection. Specifically, channel decomposition is performed on the sound source data to be processed received by each microphone in the microphone array, decomposing the broadband sound source data into a plurality of frequency channels; activity detection is performed based on the energy of the narrowband signals of the plurality of frequency channels obtained after channel decomposition to determine the target frequency channel; and zero-crossing encoding is performed on the narrowband signal of the target frequency channel.
The activity detection may select the target frequency channel with the maximum energy from the plurality of narrowband signals obtained after channel decomposition; it may select several (two or more) channels with the highest energies as target frequency channels; or it may select target frequency channels whose energy is greater than or equal to a preset energy threshold. The preset energy threshold may be any one of the average value, median, mode, second maximum value, or third maximum value of the energies of the narrowband signals of the multiple frequency channels, which this embodiment of the invention does not specifically limit.
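A minimal sketch of these selection rules for a single microphone follows, assuming the channels array produced by the channel decomposition sketch above; the function and parameter names are hypothetical.

```python
# Illustrative sketch of the three selection rules described above, for a
# single microphone; the function and parameter names are hypothetical.
import numpy as np

def select_target_channels(channels, mode="top1", n=2, threshold=None):
    """channels: array (num_channels, num_samples) of narrowband signals."""
    energy = np.sum(channels ** 2, axis=1)        # per-channel energy
    if mode == "top1":                            # channel with maximum energy
        return [int(np.argmax(energy))]
    if mode == "topn":                            # several most energetic channels
        return list(np.argsort(energy)[::-1][:n])
    if threshold is None:                         # e.g. mean energy as preset threshold
        threshold = float(np.mean(energy))
    return [i for i, e in enumerate(energy) if e >= threshold]
```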
Pulse coding includes rate coding, temporal coding, burst coding, and population coding. Pulse coding can be viewed as a feature extractor whose output features are then processed by the SNN. However, some pulse codes extract incoherent (phase-discarding) characteristics of the input sound source signal, such as the short-time Fourier transform (STFT) magnitude of the input signal. Since incoherent characteristics cannot capture phase information, they are insensitive to the relative arrival times of the received signals at different microphones. Moreover, in practical application scenarios such as conference rooms and other reflective (or reverberant) propagation environments, the input sound source signal undergoes large frequency-domain distortion, disturbing the extracted features. For these reasons, incoherent characteristics of the sound source signal cannot be used for sound source direction estimation.
To address these problems, the invention adopts a different type of pulse coding, namely zero-crossing pulse coding, and encodes the sound source data to be processed based on it to obtain the pulse signal of the sound source data to be processed. Zero-crossing pulse coding is sensitive to the relative delays of the incoming signals at different microphones and can capture this information quickly and without large disturbance in a reflective propagation environment, improving both the real-time performance and the accuracy of sound source orientation.
Specifically, the pulse coding method based on zero-crossing coding at least comprises the following steps of SA 111-SA 113:
step SA111, performing zero crossing point detection on the sound source data to be processed to obtain a zero crossing point in each of the sound source data to be processed and time information corresponding to each of the zero crossing points.
In addition, the sound source data to be processed received by each microphone may first be preprocessed as described in step S100, yielding preprocessed sound source data for each microphone.
Step SA112, performing zero crossing point detection on the to-be-processed sound source data received by each microphone or the pre-processed sound source data to obtain a zero crossing point in each to-be-processed sound source data and time information corresponding to each zero crossing point.
The zero-crossing point may be a signal point with a signal value of 0, or a signal point with an abrupt change in signal value, for example, a signal point with a signal value changing from a positive number to a negative number or/and changing from a negative number to a positive number, or a signal point with a product between signal values of adjacent signal points smaller than 0.
In certain embodiments of the present invention, the zero crossing point may be determined by multiplying the signal value at each time point by the signal value at the adjacent next time point.
Optionally, the sound source data to be processed received by each microphone is preprocessed to obtain preprocessed sound source data, and the product of the value at the current time point and the value at the adjacent next time point is computed sequentially, starting from the first time point of each microphone's preprocessed sound source data. If the product is smaller than 0, it is determined that at least one zero crossing point exists between the current time point and the adjacent next time point; if the product is greater than 0, it is determined that no zero crossing point exists between them; and if the product equals 0 while the value at the current time point is not 0, the adjacent next time point is determined to be a zero crossing point of the microphone.
Optionally, if the product of the value at the current time point and the value at the adjacent next time point is less than 0, a zero crossing point with value 0 is located by bisection between the current time point and the adjacent next time point, and the corresponding time is taken as the time information of that zero crossing point.
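The following sketch illustrates this sign-product rule, assuming discretely sampled data; linear interpolation is used here in place of the bisection refinement, which is an assumption for sampled signals rather than the disclosed method.

```python
# Illustrative sketch of the sign-product rule for zero-crossing detection.
# Linear interpolation stands in for the bisection refinement, an assumption
# made here because the signal is discretely sampled.
import numpy as np

def detect_zero_crossings(x, t):
    """x: signal values, t: matching time stamps. Returns crossing times."""
    crossings = []
    for i in range(len(x) - 1):
        p = x[i] * x[i + 1]
        if p < 0:
            # A crossing lies strictly between t[i] and t[i+1]: interpolate.
            frac = x[i] / (x[i] - x[i + 1])
            crossings.append(t[i] + frac * (t[i + 1] - t[i]))
        elif p == 0 and x[i] != 0:
            # The next sample itself sits exactly on zero.
            crossings.append(t[i + 1])
    return crossings
```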
In some embodiments, in order to implement sparse processing and improve real-time performance, only downward zero-crossing points, i.e., signal points at which the signal values change from positive numbers to negative numbers, are detected. Alternatively, only upward zero-crossing points, i.e. signal points where the signal value changes from negative to positive, are detected. The downward zero-crossing point may be a zero-crossing point in a time period in which the signal value decreases, or a signal point in which the signal value changes from a positive number to a negative number; the upward zero-crossing point may be a zero-crossing point in a time period in which the signal value is incremented, or a signal point in which the signal value changes from a negative number to a positive number.
In order to improve the accuracy of the zero crossing points in the sound source data to be processed, and thereby the accuracy of the pulse signal, some embodiments of the present invention detect upward or/and downward zero crossing points as follows. Taking the detection of a downward zero crossing point as an example: candidate sound source data whose signal value change rate is less than or equal to 0 are selected from the narrowband signal of the target frequency channel; the cumulative sum of values corresponding to each time point in the candidate sound source data is obtained from the signal values at those time points; the target time point with the maximum cumulative sum is selected; and the zero crossing point and its corresponding time information are determined from that target time point. Here, the cumulative sum at a time point refers to the sum of the values at that time point and all preceding time points in the candidate sound source data; it can be understood that, for the first time point in the candidate sound source data, the cumulative sum is simply the value at that first time point.
Alternatively, there may be at least one set of candidate sound source data, each comprising a set of time points over which the signal monotonically decreases and the signal value corresponding to each of those time points.
Optionally, the signal point having the maximum value cumulative sum in the candidate sound source data may be determined as a zero crossing point in the candidate sound source data, and the target time point having the maximum value cumulative sum in the candidate sound source data is time information corresponding to the zero crossing point.
Alternatively, the zero-crossing point and the time information corresponding to the zero-crossing point may be determined from a time period formed by the target time point having the maximum cumulative sum, the time point before the target time point, and the time point after the target time point in the candidate sound source data.
Specifically, for each set of candidate sound source data, the candidate time period containing its zero crossing point may be determined from the target time point with the maximum cumulative sum, the time point before it, and the time point after it. The candidate time period may then be divided by a preset time window into a plurality of candidate times, and the zero crossing point and its corresponding time information determined from the value at each candidate time. It will be appreciated that the length of the preset time window is less than the time between the target time point and the time point immediately preceding it.
For example, a sum of accumulated values corresponding to each candidate time may be obtained according to a value corresponding to each candidate time in a candidate time period in which a zero crossing point in the candidate sound source data is located, a signal point corresponding to the maximum value sum corresponding to the candidate time is determined as the zero crossing point in the candidate sound source data, and the candidate time corresponding to the maximum value sum is time information corresponding to the zero crossing point.
Alternatively, when detecting an upward zero crossing point with this zero-crossing detection method, a plurality of monotonically increasing time point sets are selected from the preprocessed sound source data. For each monotonically increasing set, the cumulative sum at each time point is obtained from the values at the time points in the set; the target time point with the minimum cumulative sum is then selected; and the zero crossing point and its corresponding time information are determined from that target time point.
Step SA113, performing pulse coding based on zero crossing points in each to-be-processed sound source data and time information corresponding to each zero crossing point, to obtain pulse signals of the to-be-processed sound source data of each microphone.
In some embodiments, after the zero crossing points of the sound source data to be processed of each microphone in the microphone array and their corresponding time information are determined, a pulse may be generated according to the time information of each zero crossing point, for example one pulse at each zero crossing point, yielding the pulse signal of that microphone's sound source data to be processed.
Optionally, after the zero crossing points of each microphone's preprocessed sound source data and their corresponding time information are determined, the pulse pattern at each zero crossing point is set to 1 and the pulse pattern at non-zero-crossing points is set to 0, thereby generating pulses. The time information of each zero crossing point is used as its pulse time, and the zero crossing points are arranged on a common time axis based on that time information, yielding the pulse signal of each microphone's preprocessed sound source data.
Optionally, after determining a zero crossing point in the preprocessed sound source data of each microphone and time information corresponding to each zero crossing point, determining a pulse trigger time point according to the time information corresponding to each zero crossing point, and generating a pulse according to the pulse trigger time point to obtain a pulse signal of the preprocessed sound source data of each microphone.
Optionally, after determining a zero crossing point in the preprocessed sound source data of each microphone and time information corresponding to each zero crossing point, comparing a frequency value corresponding to the zero crossing point in the preprocessed sound source data with a preset threshold value, and if the frequency value corresponding to the zero crossing point in the preprocessed sound source data is greater than the preset threshold value, setting a pulse mode of the zero crossing point to 1; and if the frequency value corresponding to the zero crossing point in the preprocessed sound source data is smaller than or equal to a preset threshold value, setting the pulse mode of the zero crossing point to be 0, and arranging the zero crossing points in the preprocessed sound source data on the same time axis according to the time sequence in such a way to obtain the pulse signal of the preprocessed sound source data of each microphone.
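A minimal sketch of this rasterization onto a shared time axis follows, with an assumed time resolution dt; all names are illustrative.

```python
# Illustrative sketch: rasterize one microphone's zero-crossing times onto a
# shared discrete time axis as a binary pulse train (1 at a crossing, 0
# elsewhere). The time resolution dt is an assumed value.
import numpy as np

def to_pulse_train(crossing_times, t_end, dt=1e-4):
    """crossing_times: zero-crossing instants (seconds) for one microphone."""
    n_bins = int(np.ceil(t_end / dt))
    train = np.zeros(n_bins, dtype=np.uint8)
    for tc in crossing_times:
        train[min(int(tc / dt), n_bins - 1)] = 1
    return train

# One pulse train per microphone, all on the same axis, so the relative
# inter-microphone delays survive the encoding:
# trains = np.stack([to_pulse_train(zc, t_end) for zc in per_mic_crossings])
```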
In the multi-microphone array, the zero crossing points of each microphone's preprocessed sound source data are detected and zero-crossing pulse encoded to obtain the pulse sequence corresponding to that microphone. Because the sound signal from the sound source reaches the different microphones in the array at different times, there are relative delays between the microphones, so the time information of the zero crossing points of the preprocessed sound source data differs from microphone to microphone. The pulse signals from different microphones therefore carry different delays; that is, the delay between microphones is captured by the zero-crossing pulse coding.
Illustratively, a circular array including 8 microphones is taken as an example, as shown in fig. 3 and fig. 4, fig. 3 is a schematic diagram of a pulse sequence of a low-frequency channel encoded by a zero-crossing pulse according to an embodiment of the present invention, and fig. 4 is a schematic diagram of a pulse sequence of a high-frequency channel encoded by a zero-crossing pulse according to an embodiment of the present invention. As can be seen from fig. 3 and 4, the zero-crossing pulse from Mic 1 is generated at time 12 and the zero-crossing pulse from Mic 2 is generated at time 20, the zero-crossing pulse from Mic 2 being delayed by 8 unit time periods compared to Mic 1.
Step S200: based on the spiking neural network, direction estimation is performed on the pulse signals obtained by zero-crossing pulse coding to obtain the target sound source direction of the sound source data to be processed.
FIG. 5 is a schematic diagram of a spiking neural network, which includes an input layer, an intermediate layer, and an output layer. Neurons are arranged in each of these layers; the intermediate layer comprises at least one hidden layer, and each hidden layer contains a plurality of neurons.
The input layer is used for carrying out pulse excitation according to the pulse number of an input pulse signal in a preset clock period to obtain a pulse sequence and transmitting the pulse sequence to the middle layer; the middle layer is used for carrying out pulse excitation according to the number of pulses in the received pulse sequence within a preset time period to obtain a new pulse sequence and transmitting the new pulse sequence to the output layer; the output layer is used for generating a target pulse signal based on the pulse sequence transmitted by the middle layer, making direction decision based on the target pulse signal and determining the sound source direction corresponding to the target pulse signal.
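For illustration only, the following sketches a stack of leaky integrate-and-fire layers driven frame by frame by binary pulse trains. This is a generic toy model under assumed layer sizes, decay, and threshold values, not the patented network or its hardware implementation.

```python
# Generic toy model (not the patented network or its hardware): a stack of
# leaky integrate-and-fire layers driven frame by frame by binary pulse
# trains. Layer sizes, decay, and threshold are assumed values.
import numpy as np

class LIFLayer:
    def __init__(self, n_in, n_out, decay=0.9, threshold=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.w = rng.normal(0.0, 0.5, size=(n_in, n_out))  # synapse weights
        self.decay, self.threshold = decay, threshold
        self.v = np.zeros(n_out)                           # membrane potentials

    def step(self, spikes_in):
        self.v = self.decay * self.v + spikes_in @ self.w  # leak + integrate
        spikes_out = (self.v >= self.threshold).astype(float)
        self.v[spikes_out > 0] = 0.0                       # reset after firing
        return spikes_out

def run_snn(pulse_trains, layers):
    """pulse_trains: array (T, n_mics). Returns output spike counts per class."""
    counts = np.zeros(layers[-1].v.shape)
    for frame in pulse_trains.astype(float):
        s = frame
        for layer in layers:
            s = layer.step(s)
        counts += s
    return counts  # the class with the most output spikes is the DoA sector

# Example: 8 microphone inputs, one hidden layer, 16 DoA sector outputs.
# layers = [LIFLayer(8, 64), LIFLayer(64, 16)]
```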
A spiking neural network implemented on a chip or in hardware cannot work directly (i.e., cannot complete accurate inference on input environmental signals): the neuron and synapse modules/units are merely circuit-hardware implementations, and their grouping, connection relationships, the weight values stored in the synapse circuits, the corresponding time constants, and so on must first be defined. Therefore, the corresponding parameters must be obtained by training in advance, which may be supervised or unsupervised, on-chip or off-chip. The present invention takes supervised training as an example, without being limited thereto: the trained network configuration parameters are mapped to hardware such as a chip, and after the chip receives signals collected from the environment, the spiking neural network in the chip runs and automatically completes the inference process based on the received signals.
In a preferred embodiment of the present invention, the spiking neural network is trained on sample sound source data; for example, the SNN is trained with pulse code sequences at specific frequencies as input and the DoA direction as target, yielding the sound source localization model. The specific frequency is at least one frequency value, for example f1 or/and f2 or/and f3, where f1, f2, and f3 are mutually unequal. The sample sound source data comprise audio signals at each microphone of the microphone array for different DoA directions.
For each sample sound source data, the audio signals at different microphones in the sample sound source data are subjected to channel decomposition, zero-crossing pulse coding is performed on the audio component (sound source component) on each frequency channel or target frequency channel (for example, obtained by the above-mentioned activity detection), so as to obtain a sample pulse signal or pulse sequence, that is, the sample pulse signal is input to the input layer of the impulse neural network model, and after being processed by the impulse neural network, the output layer of the impulse neural network model is subjected to sound source direction estimation, so as to output a prediction sector corresponding to the sample sound source data.
The localization loss of the spiking neural network is then determined from the sector label and the predicted sector corresponding to the sample sound source data, and the configuration parameters of the spiking neural network are adjusted according to this loss, iterating until the spiking neural network meets the preset convergence condition and thereby obtaining the sound source localization model.
The configuration parameters of the spiking neural network comprise one or more of the following: synapse weights, firing times, thresholds, decay time constants, and the like of the neurons in the input, intermediate, and output layers, which the present invention does not limit. The preset convergence condition may be that the localization loss is less than or equal to a preset localization loss threshold, or that the number of iterations reaches a preset number threshold.
In a preferred embodiment, a circular microphone array is taken as an example. The microphone array includes M microphones, and the full range of the direction of arrival (DoA) is quantized into ηM sectors that serve as classification candidates, where η is an angular oversampling parameter (η = 1, 2, 3, …). The ηM labels correspond to the ηM quantized DoA angular sectors, from which it follows that the DoA resolution is approximately 360°/(ηM).
In a preferred embodiment, taking a circular array as an example, a spatial coordinate system may be established with the geometric center of the microphone array as the origin. On a circle centered at the origin with a preset distance as radius, positions are selected clockwise or counterclockwise at intervals of the angular resolution, and the azimuth angle of each selected position is taken as a sample sound source direction. The angular resolution may be determined according to the type of microphone array and the number of microphones it contains. For example, for a circular microphone array with 8 microphones and η = 1, the DoA may be quantized into 8 sectors, each with an angular resolution of 45 degrees. The microphone array may have other shapes, and the invention is not limited in this respect.
Optionally, in order to improve the accuracy of the target sound source direction and reduce the grating lobe influence, when determining the angular resolution, the angular resolution may be determined according to the type of the microphone array, the number of microphones in the microphone array, and a preset oversampling parameter. Preferably, for a circular microphone array with a number of microphones of 8, when the oversampling parameter is η =2, the DoA may be quantized to 2 x 8 =16 sectors, each sector having an angular resolution of 22.5 degrees.
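A small sketch of this sector quantization follows; the function name is hypothetical.

```python
# Small sketch of the sector quantization; the function name is hypothetical.
def azimuth_to_sector(azimuth_deg, num_mics=8, eta=2):
    resolution = 360.0 / (eta * num_mics)   # DoA resolution per sector
    return int(azimuth_deg % 360.0 // resolution)

assert azimuth_to_sector(0.0) == 0
assert azimuth_to_sector(23.0) == 1         # 22.5 deg <= 23 deg < 45 deg
```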
Based on the sound source localization model obtained by training the spiking neural network, or on spiking-neural-network hardware configured with the same or similar configuration parameters as the trained network, direction estimation is performed on the sound source data to be processed to obtain its target sound source direction. The trained network configuration parameters may be deployed to the spiking-neural-network hardware directly, or after processing such as quantization.
For example, a microphone array is taken as a circular array for explanation, the direction of arrival may be divided into sectors according to a preset angular resolution and a preset oversampling parameter, the divided sectors are taken as labels, and sample sound source data of each sector is collected, so that each sample sound source data corresponds to one sector label. For example, for a circular microphone array with a number of microphones of 8, when the oversampling parameter is 2, the DoAs may be quantized to 2 x 8 =16 sectors, each sector having an angular resolution of 22.5 degrees.
In an optional embodiment, the localization loss of the spiking neural network, determined from the sector label and the predicted sector corresponding to the sample sound source data, may adopt a mean square error (MSE) function or an MSE surround loss. The plain MSE function treats sector labels as ordinary numbers and ignores the circular adjacency of sectors. The MSE surround loss instead distinguishes geometrically adjacent sectors: for a circular microphone array, class-0 and class-1 are at the same distance as class-0 and class-15 (taking the circular shape of the array into account), while the distance between class-0 and class-2 is larger.
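One plausible reading of the surround loss is a wrap-around class distance, sketched below; this interpretation and the helper names are assumptions, not a verbatim formula from this disclosure.

```python
# One plausible reading of the MSE surround loss (an assumption, not a
# verbatim formula from this disclosure): the class error is the wrap-around
# distance on the circle of sectors, then squared and averaged.
def circular_distance(pred_sector, true_sector, num_sectors=16):
    d = abs(pred_sector - true_sector) % num_sectors
    return min(d, num_sectors - d)

def mse_surround_loss(pred_sectors, true_sectors, num_sectors=16):
    return sum(circular_distance(p, t, num_sectors) ** 2
               for p, t in zip(pred_sectors, true_sectors)) / len(pred_sectors)

assert circular_distance(0, 15) == 1   # adjacent on the circular array
assert circular_distance(0, 2) == 2    # genuinely farther apart
```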
In certain embodiments of the invention, considering that directly training the SNN is difficult, an ANN can be established based on the parameters of the spiking neural network; the target parameters of the spiking neural network are obtained by training the ANN, and the parameters of the SNN in the electronic device are adjusted based on these target parameters to obtain the sound source localization model.
Optionally, during ANN training, in order to keep the parameters of the SNN in the electronic device consistent with those of the ANN, the ANN parameters may be synchronized into the SNN after each training iteration. The sample pulse signal is then input into the spiking neural network of the electronic device, which outputs the predicted sector corresponding to the sample sound source data; the localization loss is determined from this predicted sector and the sector label of the sample sound source data; the ANN parameters are adjusted according to the loss; and the adjusted ANN parameters are synchronized into the SNN in the electronic device.
Alternatively, the training method described in the applicant's prior application (Chinese patent application publication No. CN114861892A) may be used directly. That prior application is hereby incorporated by reference in its entirety.
The sound source orientation method provided by the embodiment of the invention carries out direction estimation on the pulse signal of the sound source data to be processed based on the pulse neural network to obtain the target direction of the sound source data to be processed, can reduce power consumption while ensuring the accuracy of sound source direction estimation, has better robustness and higher processing speed, and is easy to implement in low-power-consumption hardware.
The geometry of the microphone array affects its array resolution, which in turn is related to the estimation accuracy of the direction of arrival. All the microphones of a linear microphone array lie on one line; although good array resolution can be obtained, that resolution is asymmetric. For example, as shown in fig. 6, a schematic diagram of the array resolution of a linear microphone array provided by an embodiment of the present invention, the linear array achieves high array resolution when the sound source is in front of it but low array resolution when the sound source is to its side; that is, the positional relationship between the sound source and the linear array affects the array resolution. Similarly, distributed microphone arrays and cross-shaped microphone arrays obtain good array resolution at some angles but have angles of low array resolution, so the applicability of linear, distributed, and cross-shaped microphone arrays is limited. As described above, the array resolution of a circular microphone array is related only to the number of microphones and the oversampling parameter, and is unaffected by the positional relationship between the sound source and the array. Therefore, to ensure the accuracy of sound source orientation, in some embodiments the speech signals in the environment are collected by a circular microphone array as shown in fig. 1.
Further, since in practical application scenarios such as a voice conference in a closed room the sound source cannot be localized with a previously designed waveform, its position or direction of arrival must be found from the speaker's voice signal itself. However, the speech signal is highly non-stationary, with various sudden jumps in the time-frequency domain; the speaker's speech is therefore not a harmonic signal of fixed frequency, and the dominant frequency of the sound source changes over time, as shown in the time-frequency diagram of the speech signal in fig. 7.
In a preferred embodiment of the invention, adaptive wide-band localization/orientation techniques are applied to sound source orientation, i.e. active frequency components in the signal are identified in each time interval and used for localization.
Specifically, as shown in fig. 8, fig. 8 is a schematic diagram of sound source orientation in an embodiment of the present invention. The arrangement comprises a preprocessing module, a pulse coding module, and a spiking neural network processor, coupled in sequence. The preprocessing module preprocesses the signals received by the plurality of microphones, for example by time-domain or/and frequency-domain analysis, and further identifies active frequency components, also referred to as target frequency components (elsewhere herein, target frequency channels). The pulse coding module is coupled with the preprocessing module and performs zero-crossing pulse coding to obtain pulse signals. The spiking neural network processor is provided with a spiking neural network and performs sound source orientation based on the plurality of pulse signals.
In an embodiment, the preprocessing module comprises a channel decomposition module containing a plurality of (two or more) filter banks, each coupled to a different microphone, for performing time-frequency analysis of the audio signal received by each microphone.
Optionally, the number of filter banks is less than or equal to the number of microphones, and the number of pulse signals is less than or equal to the number of filter banks. The embodiments of the present disclosure are described taking the number of filter banks equal to the number of microphones as an example, but the present invention is not limited thereto.
Fig. 9 is a schematic diagram of sound source orientation after preprocessing of the signals received by the microphones in an embodiment of the present invention. As shown in fig. 9, the preprocessing module includes a plurality of (two or more) filter banks and an activity detector. The activity detection module is coupled to each filter bank, performs activity detection across all microphones, and detects the activity of the signals received by all microphones at different frequencies to identify a target frequency component (also referred to as a target frequency or active frequency component), i.e., to identify the one or more most active frequency channels in the signals.
In this embodiment, each filter bank corresponds to one microphone, and the filter bank performs channel decomposition on a signal received by the corresponding microphone, and decomposes sound source data to be processed into a plurality of frequency channels. The activity detection module performs activity detection based on the energy of the narrowband signal after channel decomposition to determine a target frequency channel.
The pulse coding module is coupled to the activity detection module, and performs zero-crossing pulse coding on the signal filtered by each filter bank based on a target frequency to obtain a pulse signal or a pulse sequence corresponding to the signal received by the microphone corresponding to each filter bank, in other words, the pulse coding module performs zero-crossing pulse coding on an audio component of the signal received by each microphone at the target frequency. Specifically, the pulse coding module comprises at least one zero-crossing pulse coding unit, wherein an input end of each zero-crossing pulse coding unit is coupled with an output end of a corresponding filter bank through a corresponding activity detection unit, and each zero-crossing pulse coding unit carries out zero-crossing pulse coding on a component output by each filter bank on a target frequency channel so as to generate a corresponding pulse signal or pulse sequence.
In some embodiments of the present invention, other circuits may be coupled between the filter bank and the microphone, such as a low noise amplifier LNA, which is used for low noise amplification of the input audio. In addition, each parallel channel may further include a rectifier coupled after the filter of the channel for rectifying the output of the channel filter.
Further, the filter may be a band pass filter, a band stop filter, a narrow band filter, or the like. Optionally, the channel decomposition of the sound source data to be processed may also be performed by dividing the sound source data to be processed into different frequency intervals through a band-pass filter bank, where each frequency interval corresponds to one frequency channel; optionally, the channel decomposition may be performed on the sound source data to be processed, and the sound source data to be processed may also be divided into a plurality of narrow bands by a narrow band filter bank according to the bandwidth of the sound source data to be processed, where each narrow band corresponds to one frequency channel.
Illustratively, taking the case where the number of filter banks equals the number of microphones in the microphone array as an example: when each filter bank includes 16 parallel band-pass filters (BPFs), each BPF corresponds to one channel, and each filter bank performs channel decomposition on the sound source data to be processed received by its corresponding microphone to obtain a plurality of frequency channels. The parallel channels filter by frequency band and detect time-varying signal activity in the different bands; the BPF of each channel retains only the signal components matching its center frequency. Time-frequency analysis of the audio signal received by the corresponding microphone thus yields the audio components on the 16 channels; for example, the audio signal received by microphone Mic 1 passes through BPF0 (center frequency f0), BPF1 (center frequency f1), and so on, yielding components N1_0, N1_1, ….
Optionally, based on the signals on the frequency channels in the different frequency ranges output by each filter bank, the activity detection module selects the target frequency channels that meet a preset energy threshold. Specifically, each filter bank in the preprocessing module performs channel decomposition on the sound source data to be processed received by its corresponding microphone to obtain a plurality of frequency channels per set of sound source data; activity detection is performed on the energies of the narrowband signals of these frequency channels to determine the target frequency channel; and zero-crossing encoding is performed on the narrowband signal of the target frequency channel. The audio component in the target frequency channel of each set of sound source data to be processed is taken as that microphone's preprocessed sound source data.
In certain types of embodiments, the activity detection module includes a plurality of (more than one) activity detection units, one activity detection unit for each filter bank, each activity detection unit coupled with a corresponding filter bank. And each activity detection unit carries out independent activity detection on the signals decomposed by the filter bank channels and determines a target frequency channel of the sound source data to be processed received by each microphone.
FIG. 10 is a schematic diagram of independent activity detection provided by an embodiment of the present invention. An activity detection unit performs activity detection on the audio components of each microphone's sound source data to be processed over the plurality of frequency channels, and the one or several (two or more) channels with the highest energies among the frequency channels of each microphone's received sound source data are selected as that microphone's initial target frequency channels.
In some embodiments, after an initial target frequency channel of each microphone is obtained, an audio component of to-be-processed sound source data received by each microphone after channel decomposition on the target frequency channel is pulse-coded, so that the data sparsity and the real-time performance are improved while the accuracy of sound source orientation is not reduced.
In some embodiments, the initial target frequency channel of each microphone may be obtained according to the activity detection unit, the frequency range of the initial target frequency channel of each microphone is screened, the activity frequency (i.e., the target frequency) is determined, and the target frequency channel of each microphone is determined according to the activity frequency, so as to ensure that the target frequency channel of each microphone has the same center frequency. Wherein the active frequency may be a center frequency of the signal in the frequency channel.
Alternatively, the frequency of occurrence of each center frequency may be counted across the initial target frequency channels of the microphones, and the one or several most frequently occurring center frequencies determined as the active frequencies. Alternatively, the one or several center frequencies with the largest frequency values may be determined as the active frequencies.
Optionally, the center frequency of the initial target frequency channel of each microphone is clustered, the initial target frequency channel is divided into at least one cluster, a target cluster with the largest number of elements in the cluster is selected, and the center frequency corresponding to the target cluster is determined as the activity frequency.
In some embodiments, after determining the active frequency, the center frequency of the initial target frequency channel of each microphone may be compared with the active frequency, if the center frequency of the initial target frequency channel of the microphone matches the active frequency, the initial target frequency channel of the microphone is determined as the target frequency channel of the microphone, and if the center frequency of the initial target frequency channel of the microphone does not match the active frequency, the target frequency channel whose center frequency matches the active frequency is selected from the multiple frequency channels of the microphone. The matching may be that the center frequency of the initial target frequency channel of the microphone is the same as the active frequency, or that the difference between the center frequency of the initial target frequency channel of the microphone and the active frequency is less than or equal to a preset frequency difference.
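A sketch of this reconciliation step follows, assuming each microphone proposes the center frequency of its most energetic channel; the consensus rule shown (majority vote, then nearest-channel matching) is one of the alternatives described above, and all names are hypothetical.

```python
# Illustrative sketch of reconciling independent per-microphone detections:
# majority vote over proposed center frequencies, then nearest-channel
# matching. Names are hypothetical; the vote is one alternative named above.
from collections import Counter
import numpy as np

def consensus_active_frequency(proposed_center_freqs):
    """proposed_center_freqs: one proposed center frequency per microphone."""
    return Counter(proposed_center_freqs).most_common(1)[0][0]

def matching_channel(center_freqs, active_freq):
    """Index of the channel whose center frequency is closest to active_freq."""
    diffs = np.abs(np.asarray(center_freqs) - active_freq)
    return int(np.argmin(diffs))
```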
Because the multiple frequency channels of each microphone are activity-detected independently, it is difficult to guarantee that the initial target frequency channels of all microphones share the same center frequency, and a subsequent processing step is required to ensure this, which adds processing steps to the preprocessing. Based on this, to ensure the accuracy of the pulse coding while simplifying the operation, in a preferred embodiment of the present invention the activity detection module performs joint activity detection downstream of the multiple filter banks.
FIG. 11 is a schematic diagram of joint activity detection provided by an embodiment of the present invention. Each filter bank performs channel decomposition on the sound source data to be processed received by its corresponding microphone, and the activity detection module performs joint activity detection over the frequency channels of all microphone signals after channel decomposition to determine the target frequency. That is, the activity detection module performs joint activity detection across all microphones corresponding to the filter banks and identifies the target frequency based on the activity of the audio signals received by all of those microphones at different frequencies. The target frequency is one or several frequencies, obtained by the joint activity detection, that meet a preset condition; the target frequency channel of each filter bank, or of the microphone corresponding to it, is then determined from the active frequency.
Alternatively, the joint activity detection may be to count the energy sum of all microphones in different frequency channels, in other words, calculate the energy sum of audio signals received by all microphones by frequency; the preset condition may be that the energy accumulation sum is greater than or equal to a preset energy accumulation sum threshold, and the preset condition may also be a maximum energy accumulation sum. The preset threshold may be any one of an average value, a median, a mode, a second highest value, a third highest value, and a fourth highest value of the energy sum of all the microphones in different frequency channels.
Alternatively, the joint activity detection may be to count the average of signal energy of all microphones under different frequency channels; the preset condition may be that the average value of the signal energy is greater than or equal to a preset average threshold, and the preset condition may also be a maximum average value of the signal energy.
Optionally, taking joint activity detection as counting the energy accumulation sums of all microphones in the different frequency channels as an example: the activity detection module accumulates the energies of all microphones at each frequency, according to the frequency corresponding to each of the multiple frequency channels of each set of sound source data to be processed, to obtain the energy accumulation sum for each frequency; determines the active frequencies whose energy accumulation sums meet the preset condition; and determines the target frequency channel of each set of sound source data to be processed from the active frequencies.
The frequency corresponding to each frequency channel in the multiple frequency channels of the sound source data to be processed received by each microphone may be compared with the active frequency, and the frequency channel, in the sound source data to be processed received by each microphone, whose frequency matches the active frequency is determined as the target frequency channel of the sound source data to be processed.
For example, a circular microphone array with 8 microphones is taken as an example to describe, and the to-be-processed sound source data received by each microphone in the microphone array is subjected to channel decomposition according to the filter bank corresponding to the microphone, so as to obtain audio components of the to-be-processed sound source data of the microphone on multiple frequency channels. The activity detection module performs joint activity detection, selects activity frequencies meeting preset conditions according to the energy sum of audio frequency components of signals received by all the microphones on each frequency channel, and obtains a target frequency channel of each microphone according to the activity frequencies.
For example, for each frequency channel f, the energy accumulation sum of all microphones over a window of size w is detected as

E_f(t) = \sum_{m=1}^{M} \sum_{\tau=t-w+1}^{t} |x_{m,f}(\tau)|^2,

where M is the number of microphones in the microphone array, t is discrete time, and x_{m,f}(t) denotes the signal component of center frequency f received by the m-th microphone at time t. After the energy accumulation sums of all microphones under the different frequency channels are computed, the k frequency channels with the largest energy accumulation sums are selected as the active frequencies, where k is a positive integer greater than or equal to 1 and less than or equal to M.
Optionally, when k =1, that is, selecting an active frequency with the largest energy accumulation sum from the energy accumulation sums of all microphones in each frequency channel, determining a frequency channel with a center frequency matching the active frequency in the multiple frequency channels of each microphone as a target frequency channel of each microphone, and performing zero-crossing pulse encoding on a signal in the target frequency channel of each microphone to obtain a pulse sequence. Wherein the number of pulse signals in the pulse sequence is the same as the number of microphones in the microphone array. For example, for a circular array comprising 8 microphones, if k =1, 8 pulse signals with different delays are obtained.
Optionally, when k is greater than or equal to 2, that is, when at least two active frequencies with the largest energy accumulation sums are selected across the frequency channels of all microphones, the frequency channel of each microphone whose center frequency matches each active frequency is determined as a target frequency channel of that microphone, and zero-crossing pulse coding is performed on the signals in the target frequency channels of each microphone to obtain a pulse sequence. The number of pulse signals in the pulse sequence equals the number of microphones in the microphone array multiplied by k. For example, when k = 2, 8 × 2 = 16 pulse signals are obtained, of which 8 are pulse signals with different delays associated with the first active frequency component and the other 8 are pulse signals with different delays associated with the second active frequency component.
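The windowed joint energy detection formalized above can be sketched directly; the (microphone, channel, time) array layout and the function name below are assumptions.

```python
# Illustrative sketch of the windowed joint energy detection formalized
# above; the (microphone, channel, time) array layout is an assumption.
import numpy as np

def joint_activity_detection(x, w, k=1):
    """x: array (M, F, T) of per-microphone, per-channel narrowband signals.

    Returns the indices of the k frequency channels whose energy over the
    last w samples, summed across all M microphones, is largest.
    """
    window = x[:, :, -w:]                       # last w samples per channel
    energy = np.sum(window ** 2, axis=(0, 2))   # E_f: sum over mics and time
    return list(np.argsort(energy)[::-1][:k])   # top-k active frequencies
```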
When the number of filter banks is large, or each filter bank contains many filters, performing joint activity detection over all frequency channels of all microphones increases the data processing load of the activity detection module and lengthens the preprocessing time, which in turn degrades the real-time performance of the sound source orientation method.
In order to ensure the real-time performance of the sound source orientation method, in a certain type of embodiment the activity detection module includes at least two activity detection units, each coupled to at least one filter bank. Each activity detection unit performs local joint activity detection on the frequency channels output by its corresponding filter bank(s) to obtain the active frequencies among those channels; the target frequency is then determined from the active frequencies across all filter banks, and the target frequency channel of each microphone is obtained from the target frequency.
Optionally, the microphones may be grouped according to the number of activity detection units in the activity detection module to obtain at least two microphone combinations, each corresponding to one activity detection unit. Each activity detection unit determines the target frequency of each microphone in its corresponding combination, according to the joint activity detection method, from the center frequencies of the frequency channels obtained by channel decomposition for those microphones. Each combination comprises at least one microphone; the numbers of microphones in the combinations may be the same or different.
Optionally, after the target frequency of each microphone is obtained, the target frequencies of the microphones may be clustered into at least one cluster, the target cluster with the largest number of elements selected, and the target frequency corresponding to the target cluster determined as the active frequency.
Optionally, the frequency of occurrence of each target frequency may be counted according to the target frequency of each microphone, and the target frequency with the highest frequency may be determined as the active frequency.
Alternatively, the maximum target frequency may be determined as the active frequency based on the target frequency of each microphone.
In some embodiments of the present invention, a filter bank is adopted to perform channel decomposition on sound source data to be processed received by each microphone to obtain audio components on a plurality of frequency channels, active frequency components are identified through joint activity detection, and zero-crossing pulse coding is performed on the audio components of each microphone on the active frequency component channels, so that the speed and accuracy of sound source orientation are improved, and power consumption and cost are reduced.
Exemplarily, taking one microphone in the microphone array as an example for explanation, the method at least includes the following steps: the sound source data received by the microphone is subjected to channel decomposition to obtain audio components (or sound source components) on a plurality of channels. According to the target frequency (frequency satisfying the preset condition) identified by the activity detection, a channel corresponding to the activity frequency is selected from a plurality of channels, and the channel can be called as a target frequency channel. And carrying out zero-crossing pulse coding on the audio frequency component on the target frequency channel corresponding to each microphone to obtain a pulse signal or a pulse sequence corresponding to the microphone received signal. Optionally, the activity frequency is identified based on independent activity detection or joint activity detection or local activity detection.
For example, if the activity detection selects a single active frequency: at the starting time t1, f4 is identified as the active frequency, so a pulse signal is generated from the audio component on the BPF4 (center frequency f4) channel of each of the microphones. At a later time t2, f5 is identified as the active frequency, so a pulse signal is generated from the audio component on the BPF5 (center frequency f5) channel of each microphone. These pulses are concatenated over time to form a pulse sequence.
Alternatively, the preset condition may be that the energy sums of one or several (two or more) frequency channels are maximal, or that the energy sums of one or several frequency channels exceed a preset energy threshold. Understandably, the frequency channels satisfying the preset condition are the target frequency channels. As described above, whether the preset condition is satisfied may be determined based on the first threshold and the second threshold. Within the target frequency channel, the audio component of the signal received by each microphone is an effective frequency component.
When estimating the sound source direction based on the spiking neural network, the sound source data to be processed must be converted into a set of pulse features, that is, into a pulse signal, whose direction is then estimated by the spiking neural network to obtain the target sound source direction of the sound source data to be processed. In order to improve the accuracy of the sound source direction, in some embodiments the preprocessed sound source data may also be zero-cross coded using a local-maximum approach. Specifically, the zero-crossing coding method based on the local maximum comprises at least the following steps SB111 to SB115:
step SB111 determines a plurality of target signal point sets of the sound source data preprocessed by the microphones based on the signal values of the signal points in the sound source data preprocessed by the microphones. Optionally, the target signal point set includes a combination of signal points whose signal values continuously decrease for a preset time period.
In some embodiments of the present invention, to reduce data throughput, downward zero crossing points may be extracted from the preprocessed sound source data; for example, signal points with continuously decreasing signal values may be selected from the preprocessed sound source data to obtain target signal point sets. Specifically, determining the target signal point sets based on downward zero crossings includes the following steps:
and comparing the signal values of the signal points in the sound source data preprocessed by the microphone, and determining the signal points of which the signal values in the sound source data preprocessed by the microphone continuously decrease.
And according to the time information corresponding to the signal points of which the signal values continuously decrease in the sound source data after the microphone is preprocessed, grouping the signal points of which the signal values continuously decrease in the sound source data after the microphone is preprocessed to obtain a plurality of target signal point sets.
In some embodiments, the sound source data to be processed received by each microphone may be subjected to channel decomposition and activity detection according to the preprocessing method described above, and the plurality of target signal point sets of the preprocessed sound source data determined from the signal values of the signal points in the audio component on each microphone's target frequency channel.
Optionally, the target signal point set includes a combination of signal points whose signal values continuously increase within a preset time period.
In some embodiments of the present invention, to reduce data throughput, upward zero crossing points may be extracted from the preprocessed sound source data; for example, signal points with continuously increasing signal values may be selected from the preprocessed sound source data to obtain target signal point sets (see the sketch after this list). Specifically, determining the target signal point sets based on upward zero crossings includes the following steps:
and comparing the signal values of the signal points in the sound source data preprocessed by the microphone, and determining the signal points of which the signal values are continuously increased progressively in the sound source data preprocessed by the microphone.
And according to the time information corresponding to the signal points of which the signal values continuously increase progressively in the sound source data after the microphone is preprocessed, grouping the signal points of which the signal values continuously increase progressively in the sound source data after the microphone is preprocessed to obtain a plurality of target signal point sets.
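The two variants above differ only in the sign of the monotonic run being collected. A minimal sketch is shown below, with a direction parameter selecting decreasing (downward) or increasing (upward) runs; the minimum run length is an illustrative assumption.

```python
import numpy as np

def monotonic_runs(x, direction=-1, min_len=2):
    """Return (start, end) sample-index runs where x is strictly monotonic
    in the given direction (-1: decreasing, +1: increasing)."""
    d = np.sign(np.diff(x))
    runs, start = [], None
    for i, s in enumerate(d):
        if s == direction:
            if start is None:
                start = i                      # run starts at sample x[start]
        elif start is not None:
            if i - start + 1 >= min_len:       # run covers samples x[start] .. x[i]
                runs.append((start, i))
            start = None
    if start is not None and len(x) - start >= min_len:
        runs.append((start, len(x) - 1))
    return runs

x = np.array([0.1, 0.3, 0.2, -0.1, -0.4, -0.2, 0.2, 0.5, 0.1])
print(monotonic_runs(x, direction=-1))         # decreasing stretches: [(1, 4), (7, 8)]
```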
Step SB112: obtain, for each target signal point set, the sum of the corresponding signal values of the signal points in the set, using each point's signed signal value (which may be positive, negative, or zero, not an absolute value).
Step SB113: compare the sums of the corresponding signal values across the target signal point sets, and determine the target signal point having a local maximum in each target signal point set together with the time information corresponding to that target signal point.
In some embodiments, the absolute values of the sums of the corresponding signal values may be compared; the point whose sum has the largest absolute value is determined as the target signal point having the local maximum, and its time information is obtained from the time point corresponding to that target signal point.
In some embodiments, the sums of the corresponding signal values in each target signal point set may be compared directly; the signal point with the largest sum is determined as the target signal point with the local maximum, and its time information is obtained from the corresponding time point.
In some embodiments, the signal point with the smallest sum of signal values may instead be determined as the target signal point with the local maximum: for downward runs the signal values are predominantly negative, so the smallest (most negative) sum corresponds to the largest magnitude. The time information is again obtained from the corresponding time point.
For example, taking the case where the signal point with the largest sum of signal values is determined as the target signal point with the local maximum, step SB113 includes:
and aiming at each target signal point set, comparing the sum of corresponding signal values of each signal point in the target signal point set, and determining a candidate target signal point with the sum of initial maximum signal values in the target signal point set.
And determining a candidate time period with a local maximum value in the target signal point set according to the time information corresponding to the candidate target signal point.
And comparing the sums of the corresponding signal values of the signal points at each time instant within the candidate time period, and determining the signal point with the largest sum as the target signal point with the local maximum in the target signal point set.
Step SB114: determine the zero crossing points in the sound source data to be processed received by the microphone, together with the time information corresponding to each zero crossing point, according to the target signal point having a local maximum in each target signal point set and the time information corresponding to each target signal point.
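Steps SB112 to SB114 admit more than one reading; the sketch below takes one plausible interpretation, reusing monotonic_runs from the previous sketch: each decreasing run is scored by the magnitude of its summed signal values, weak (likely noise-induced) runs are discarded, and each surviving run contributes the sample index where its sign flips as a downward zero-crossing time. The scoring rule and the rejection ratio are illustrative assumptions.

```python
import numpy as np

def downward_zero_crossings(x, runs, reject_ratio=0.25):
    """Gate runs by summed magnitude, then return downward zero-crossing indices."""
    scores = [abs(np.sum(x[a:b + 1])) for a, b in runs]  # |sum of signal values| per set
    if not scores:
        return []
    gate = reject_ratio * max(scores)                    # reject weak, noisy runs
    crossings = []
    for (a, b), score in zip(runs, scores):
        if score < gate:
            continue
        seg = x[a:b + 1]
        neg = np.flatnonzero(seg < 0)
        if len(neg) and neg[0] > 0:                      # sign flips inside the run
            crossings.append(a + neg[0])                 # zero-crossing time index
    return crossings

fs = 16000
t = np.arange(fs // 100) / fs                            # 10 ms of signal
x = np.sin(2 * np.pi * 400 * t) + 0.02 * np.random.randn(len(t))
runs = monotonic_runs(x, direction=-1)                   # from the previous sketch
print(downward_zero_crossings(x, runs))                  # roughly one per 400 Hz period
```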
In some embodiments of the present invention, pulse coding may then be performed according to the zero crossing points in the preprocessed sound source data and the time information corresponding to each zero crossing point, so as to obtain the pulse signal of the sound source data to be processed as in step SA113 above.
By performing zero-crossing pulse coding based on local maxima, the embodiment of the invention can suppress spurious zero crossings caused by noise, improving the quality of the pulse signal and thereby ensuring the accuracy of the target sound source direction.
To improve the performance of the impulse neural network and ensure accurate sound source direction estimation, in some embodiments a long short-term memory (LSTM) network may further be used to correct the pulse signal after zero-crossing pulse coding; the corrected pulse sequence is then input to the impulse neural network for direction estimation, yielding the target sound source direction of the sound source data to be processed.
FIG. 12 is a schematic structural diagram of sound source localization based on a long-short term memory network and an impulse neural network according to an embodiment of the present invention. The step of obtaining the target sound source direction of the sound source data to be processed based on the embodiment may further include:
inputting the pulse signal of the sound source data to be processed to a preset feature extraction module, and performing feature extraction on the pulse signal of the sound source data to be processed through the preset feature extraction module to obtain a pulse feature sequence.
And inputting the pulse characteristic sequence into a pulse neural network for direction estimation to obtain the target direction of the sound source data to be processed.
The preset feature extraction module is constructed on the basis of a long short-term memory network. The pulse signal of the sound source data to be processed is input into the feature extraction module; a hidden state is extracted from the pulse signal according to the input gate, output gate, and forget gate in the module; the pulse signal is corrected based on the hidden state; and a pulse feature sequence is generated. In certain embodiments of the invention, the feature extraction module may be trained in a supervised manner.
After the pulse characteristic sequence is obtained, the pulse characteristic sequence can be input into a pulse neural network, and the direction of the pulse characteristic sequence is estimated through an input layer, an intermediate layer and an output layer in the pulse neural network, so that the target direction of the sound source data to be processed is obtained.
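A minimal PyTorch sketch of the Fig. 12 pipeline is given below: an LSTM plays the role of the feature extraction module and a hand-rolled leaky integrate-and-fire (LIF) layer stands in for the impulse neural network, with one output neuron per candidate direction. The layer sizes, LIF constants, and the 36-way direction grid are illustrative assumptions; training such a model would additionally require surrogate gradients for the threshold nonlinearity.

```python
import torch
import torch.nn as nn

class LSTMSNNLocalizer(nn.Module):
    def __init__(self, n_mics=4, hidden=64, n_directions=36,
                 beta=0.9, threshold=1.0):
        super().__init__()
        self.lstm = nn.LSTM(n_mics, hidden, batch_first=True)  # feature extraction
        self.fc = nn.Linear(hidden, n_directions)
        self.beta, self.threshold = beta, threshold

    def forward(self, spikes):                  # spikes: (batch, time, n_mics), 0/1
        feats, _ = self.lstm(spikes)            # corrected pulse feature sequence
        v = torch.zeros(spikes.size(0), self.fc.out_features)
        out_spikes = []
        for t in range(feats.size(1)):          # one LIF neuron per candidate direction
            v = self.beta * v + self.fc(feats[:, t])      # leaky integration
            s = (v >= self.threshold).float()             # fire on threshold
            v = v - s * self.threshold                    # soft reset after firing
            out_spikes.append(s)
        return torch.stack(out_spikes, dim=1).sum(dim=1)  # spike count per direction

model = LSTMSNNLocalizer()
pulses = (torch.rand(1, 200, 4) < 0.05).float()  # toy pulse trains from 4 microphones
doa_idx = model(pulses).argmax(dim=1)            # most active direction neuron wins
print(doa_idx)
```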
To better explain the sound source orientation technology provided by the embodiment of the present invention, a preferred embodiment is applied to an audio-conference scenario in a closed room, and the target direction of the sound source data to be processed in that scenario is obtained, as shown in figs. 13 to 16.
Fig. 13 shows simulation test results of sound source orientation in a low-frequency channel: the effect of the sound source orientation method provided by the embodiment of the present invention in the low-frequency channel when the DoA of the sound source signal changes abruptly from 90° to -90°; angles are expressed in radians (1 radian is approximately 57.3°). From bottom to top, fig. 13 shows the signals of the sound source data received by each microphone, the target sound source direction detected by the algorithm, and the zero-crossing pulse-coded pulse signal corresponding to each microphone.
Fig. 14 compares sound source orientation tests using the low-frequency channel between a brain-like chip implemented in low-power hardware and the simulation model. It shows the test result of the sound source orientation model running on a computer device and the test result of the chip implemented in low-power hardware, both performing sound source orientation with the low-frequency channel; angles are expressed in radians (1 radian is approximately 57.3°). As can be seen from fig. 14, in a real environment the sound source orientation technology provided by the embodiment of the present invention obtains a good orientation result on the chip, and the difference from the computer simulation result is very small and negligible.
Fig. 15 shows simulation test results of sound source orientation in a high-frequency channel: the effect of the sound source orientation method provided by the embodiment of the present invention in the high-frequency channel when the DoA of the sound source signal changes abruptly from 90° to -90°; angles are expressed in radians (1 radian is approximately 57.3°). From bottom to top, fig. 15 shows the signals of the sound source data received by each microphone, the target sound source direction detected by the algorithm, and the zero-crossing pulse-coded pulse signal corresponding to each microphone.
FIG. 16 compares sound source orientation tests using the high-frequency channel between the brain-like chip implemented in low-power hardware and the simulation model. From bottom to top, fig. 16 shows the test result of the sound source orientation model running on a computer device and the test result of the chip implemented in low-power hardware, both performing sound source orientation with the high-frequency channel; angles are expressed in radians (1 radian is approximately 57.3°). As can be seen from fig. 16, in a real environment the sound source orientation technology provided by the embodiment of the present invention obtains a good orientation result on the chip, and the difference from the computer simulation result is very small and negligible.
As can be seen from figs. 13 to 16, the sound source orientation technique implemented in the low-power hardware of the present invention responds quickly to abrupt DoA changes in both the high-frequency and low-frequency channels.
According to the sound source orientation method provided by the embodiment of the invention, direction estimation is performed on the pulse signal of the sound source data to be processed by an impulse neural network, yielding the target direction of the sound source data. This ensures the accuracy of the direction estimate while reducing power consumption, and offers good robustness and high processing speed. Pulse coding captures the relative time-delay information in the sound source data, and the direction is estimated from that information, improving estimation accuracy; in particular, zero-crossing-based pulse coding effectively captures the phase information required for direction estimation, further ensuring accuracy.
In order to better implement the sound source orientation method provided by the embodiment of the invention, on the basis of the sound source orientation method, a sound source orientation device is provided, and the sound source orientation device comprises:
the encoding module, used for performing zero-crossing pulse coding on the sound source data to be processed to obtain a pulse signal of the sound source data to be processed;
and the estimation module is used for carrying out direction estimation on the pulse signals obtained by the zero-crossing pulse coding based on the pulse neural network to obtain the target sound source direction of the sound source data to be processed.
In certain embodiments of the invention, the encoding module comprises:
a zero crossing point detection unit, configured to perform zero crossing point detection on the sound source data after the pre-processing of each microphone, so as to obtain a zero crossing point in the sound source data to be processed received by each microphone and time information corresponding to each zero crossing point;
and the encoding unit is used for carrying out pulse encoding on the basis of zero crossing points in the sound source data to be processed received by each microphone and time information corresponding to the zero crossing points to obtain pulse signals of the sound source data to be processed received by each microphone.
In some embodiments of the present invention, the zero crossing point detecting unit is configured to: for the sound source data preprocessed by each microphone, determine a plurality of target signal point sets according to the signal value of each signal point in that microphone's preprocessed sound source data; obtain the sum of the corresponding signal values of the signal points in each target signal point set; compare these sums to determine the target signal point having a local maximum in each set and the time information corresponding to it; and determine the zero crossing points in the sound source data to be processed received by the microphone, together with their time information, from the target signal points having local maxima and their time information.
In some embodiments of the present invention, the zero crossing point detecting unit is configured to compare the signal values of the signal points in the microphone's preprocessed sound source data and determine the signal points whose signal values continuously decrease; and to group those signal points according to their corresponding time information, obtaining a plurality of target signal point sets.
In some embodiments of the present invention, the zero crossing point detection unit is configured to: for each target signal point set, compare the sums of the corresponding signal values and determine a candidate target signal point with the initially largest sum; determine a candidate time period containing a local maximum according to the time information corresponding to that candidate; and compare the sums of signal values at each time instant within the candidate time period, determining the signal point with the largest sum as the target signal point with the local maximum in the set.
In some embodiments of the invention, the sound source direction finding device further comprises:
and the preprocessing module, used for performing channel decomposition on the sound source data to be processed received by each microphone, decomposing it into a plurality of frequency channels.
In some embodiments of the present invention, the preprocessing module is configured to perform activity detection on the channel components after channel decomposition of the sound source data to be processed received by each microphone, so as to obtain a target frequency (one or more frequencies), and to determine the sound source component of each microphone's sound source data on the target frequency channel as that microphone's preprocessed sound source data.
In some embodiments of the present invention, the estimation module is configured to input the pulse signal of the sound source data to be processed to the feature extraction module, which is constructed on the basis of a long short-term memory network and performs feature extraction on the pulse signal to obtain a pulse feature sequence; the pulse feature sequence is then input into the sound source localization model (the impulse neural network) for direction estimation, yielding the target direction of the sound source data to be processed.
According to the sound source orientation device provided by the embodiment of the invention, direction estimation is performed on the pulse signal of the sound source data to be processed by an impulse neural network, yielding the target direction of the sound source data. This ensures the accuracy of the direction estimate while reducing power consumption, and offers good robustness and high processing speed. Pulse coding captures the relative time-delay information in the sound source data, and the direction is estimated from that information, improving estimation accuracy; in particular, zero-crossing-based pulse coding effectively captures the phase information required for direction estimation, further ensuring accuracy.
In order to better implement the sound source orientation method provided by the embodiment of the present invention, based on the sound source orientation method, a sound source signal separation method is provided, and specifically, as shown in fig. 17, fig. 17 is a schematic flow diagram of the sound source signal separation method provided by the embodiment of the present invention, where the sound source signal separation method at least includes steps S300 to S500:
step S300, carrying out sound source direction estimation on the sound source data to be separated, and determining candidate sound sources corresponding to the sound source data to be separated and the target sound source direction of each candidate sound source.
In some embodiments of the present invention, the sound source direction of the sound source data to be separated may be estimated according to the sound source orientation method in any of the above embodiments, and the candidate sound sources corresponding to the sound source data to be separated and the target sound source direction of each candidate sound source determined.
Step S400, sound source separation is carried out according to the positions of the sound channels for collecting the sound source data to be separated and the target sound source directions of the candidate sound sources, and the sound signals of the candidate sound sources are obtained.
A candidate sound source is a sound source that may exist in the current environment as estimated from the sound source data to be separated; the candidate sound sources include the target sound source.
In some embodiments of the present invention, there are various ways to perform sound source separation processing on sound source data to be separated, which exemplarily include:
(1) Sound source separation may be performed, according to the positions of the sound channels collecting the sound source data to be separated and the target sound source directions of the candidate sound sources, by a separation method based on independent subspace analysis, obtaining the sound signal of each candidate sound source.
(2) Sound source separation may likewise be performed, according to the channel positions and the candidate target sound source directions, by a separation method based on non-negative matrix factorization, obtaining the sound signal of each candidate sound source.
(3) Sound source separation may also be performed, according to the channel positions and the candidate target sound source directions, by a separation method based on independent vector analysis optimized with an auxiliary function, obtaining the sound signal of each candidate sound source.
It should be noted that these sound source separation methods are only exemplary and do not limit the sound signal processing method provided by the embodiment of the present invention. For example, sound source separation may also be performed, according to the channel positions and the candidate target sound source directions, by a separation method based on overdetermined independent vector analysis optimized with an auxiliary function.
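The methods named above (independent subspace analysis, non-negative matrix factorization, auxiliary-function IVA) are substantial algorithms in their own right. As a much simpler stand-in that illustrates the same interface, channel positions plus candidate directions in and one signal per candidate out, the sketch below steers a delay-and-sum beamformer for a linear microphone array toward each candidate direction; the array geometry and sample rate are illustrative assumptions, and this is not one of the separation methods listed in the patent.

```python
import numpy as np

def delay_and_sum(mic_signals, mic_positions, doa_deg, fs, c=343.0):
    """Steer a linear array toward doa_deg and return the beamformed signal."""
    theta = np.deg2rad(doa_deg)
    delays = mic_positions * np.sin(theta) / c        # per-mic delay, seconds
    n = mic_signals.shape[1]
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    out = np.zeros(n)
    for sig, tau in zip(mic_signals, delays):
        spec = np.fft.rfft(sig) * np.exp(2j * np.pi * freqs * tau)  # fractional shift
        out += np.fft.irfft(spec, n)
    return out / len(mic_signals)

def separate_candidates(mic_signals, mic_positions, candidate_doas, fs):
    return [delay_and_sum(mic_signals, mic_positions, d, fs)
            for d in candidate_doas]                  # one signal per candidate DoA

fs = 16000
mics = np.array([0.00, 0.05, 0.10, 0.15])             # 4-mic line, 5 cm spacing
sigs = np.random.randn(4, fs)                         # placeholder multichannel capture
estimates = separate_candidates(sigs, mics, candidate_doas=[-30, 0, 45], fs=fs)
```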
Step S500: determine the target sound source from the plurality of candidate sound sources according to the sound signal of each candidate sound source.
In some embodiments of the present invention, the sound quality of the sound signal of each candidate sound source may be scored, and the target sound source determined from the plurality of candidates according to those sound quality scores. For example, the candidate sound source with the highest sound quality score may be selected as the target sound source.
Alternatively, the quality of the sound signal of each candidate sound source may be assessed in various ways, examples of which include:
(1) The sound quality score of the sound signal of each candidate sound source may be determined by performing a quality assessment of the sound signal of each candidate sound source by calculating a signal-to-interference ratio of the sound signal of each candidate sound source.
(2) The sound quality score of the sound signal of each candidate sound source may be determined by performing a quality assessment of the sound signal of each candidate sound source by calculating a signal distortion ratio of the sound signal of each candidate sound source.
(3) The sound quality score of the sound signal of each candidate sound source may be determined by performing a quality assessment on the sound signal of each candidate sound source by calculating a maximum likelihood ratio of the sound signal of each candidate sound source.
(4) The sound quality score of the sound signal of each candidate sound source may be determined by performing a quality assessment of the sound signal of each candidate sound source by calculating the cepstral distance of the sound signal of each candidate sound source.
(5) The sound quality score of the sound signal of each candidate sound source may be determined by performing a quality assessment of the sound signal of each candidate sound source by calculating a frequency weighted segmented signal-to-noise ratio of the sound signal of each candidate sound source.
(6) The sound quality score of the sound signal of each candidate sound source may be determined by performing a quality assessment of the sound signal of each candidate sound source by calculating a speech quality perceptual evaluation score of the sound signal of each candidate sound source.
(7) The sound quality score of the sound signal of each candidate sound source may be determined by performing a quality assessment on the sound signal of each candidate sound source by calculating a kurtosis value of the sound signal of each candidate sound source.
(8) The sound quality score of the sound signal of each candidate sound source can be determined by evaluating the quality of the sound signal of each candidate sound source by calculating the probability score corresponding to the speech feature vector of the sound signal of each candidate sound source. Wherein the probability score is used to characterize the probability that the sound signal of each candidate sound source is the speech signal of the target sound source.
It should be noted that the above methods for assessing the quality of the sound signal of each candidate sound source are only exemplary and do not limit the sound signal processing method provided by the embodiment of the present invention. In practical applications, the sound quality score may be chosen according to the computational capability of the electronic device in the actual application scenario.
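As an illustration of step S500 using option (7) above, the sketch below scores each candidate by kurtosis: speech tends to be super-Gaussian, so a higher kurtosis serves as a rough proxy for a cleaner speech signal. Treating kurtosis alone as the sound quality score is an illustrative simplification.

```python
import numpy as np
from scipy.stats import kurtosis

def pick_target_source(candidate_signals):
    """Return (index, score) of the candidate with the highest quality score."""
    scores = [kurtosis(s) for s in candidate_signals]   # Fisher kurtosis per candidate
    best = int(np.argmax(scores))
    return best, scores[best]

rng = np.random.default_rng(0)
noise_like = rng.normal(size=16000)                     # kurtosis near 0
speech_like = rng.normal(size=16000) * (rng.random(16000) < 0.2)  # sparse, spiky
idx, score = pick_target_source([noise_like, speech_like])
print(idx, round(score, 2))                             # expect index 1
```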
According to the sound source signal separation method provided by the embodiment of the invention, estimating the sound source direction of the data to be separated improves the accuracy of the candidate source directions, and determining the final target sound source by scoring each candidate further improves the accuracy of the separated sound source signal and addresses the low stability of signal separation.
To better implement the sound source orientation method provided by the embodiment of the present invention, a sound source tracking method is further provided on the basis of that method, for an application scenario of an audio conference in a closed room. Specifically, as shown in fig. 18, a schematic flow diagram of the sound source tracking method provided by the embodiment of the present invention, the method includes at least steps S600 to S800:
step S600, sound source data in a conference scene is received through a microphone array.
Step S700, performing sound source direction estimation on the sound source data, and determining a target sound source direction corresponding to the sound source data.
In some embodiments of the present invention, the sound source direction of the sound source data may be estimated according to the sound source orientation method in any of the above embodiments, and the target sound source direction corresponding to the sound source data determined.
And step S800, adjusting the direction parameters of the sound source tracking equipment according to the direction of the target sound source.
The sound source tracking device is used to track the target sound source in an in-room audio conference, and includes but is not limited to a camera, a microphone, and the like.
It will be appreciated that the sound source tracking device is in the same plane as the microphone array receiving the sound source.
In some embodiments of the present invention, the current direction parameters of the sound source tracking device may be obtained; target direction parameters are derived from the target sound source direction corresponding to the sound source data; and the device's direction parameters are adjusted according to the target and current direction parameters. The direction parameters include, but are not limited to, azimuth and pitch. Illustratively, as shown in fig. 19, which presents sound source tracking test results of a brain-like chip implemented in low-power hardware according to the present invention, angles are expressed in radians (1 radian is approximately 57.3°, so about 1.57 radians corresponds to 90°). From top to bottom, the first graph in fig. 19 shows the target sound source direction of the real sound source at different times, and the second and third graphs show the tracking result before and after smoothing, respectively. As can be seen from fig. 19, the sound source tracking method provided by the embodiment of the invention responds quickly to the sound source direction, achieving fast tracking.
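A minimal sketch of the adjustment and smoothing step is given below, assuming exponential smoothing of the azimuth estimates and a bounded per-update steering step; the smoothing factor and slew limit are illustrative assumptions, and angle wraparound is ignored for brevity.

```python
class AzimuthTracker:
    def __init__(self, alpha=0.2, max_step_deg=10.0):
        self.alpha = alpha                 # smoothing factor for DoA estimates
        self.max_step = max_step_deg       # mechanical slew limit per update
        self.smoothed = None
        self.azimuth = 0.0                 # current device azimuth, degrees

    def update(self, doa_deg):
        if self.smoothed is None:
            self.smoothed = doa_deg
        else:                              # exponential smoothing suppresses jitter
            self.smoothed += self.alpha * (doa_deg - self.smoothed)
        error = self.smoothed - self.azimuth
        step = max(-self.max_step, min(self.max_step, error))
        self.azimuth += step               # issue a bounded steering command
        return self.azimuth

tracker = AzimuthTracker()
for est in [88, 91, 90, -89, -90, -91]:    # abrupt 90 -> -90 degree DoA change
    print(round(tracker.update(est), 1))
```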
According to the sound source tracking method provided by the embodiment of the invention, estimating the sound source direction of the sound source data improves the accuracy of the tracked direction, and the direction parameters of the sound source tracking device are adjusted quickly based on the target sound source direction, achieving fast tracking of the sound source.
The embodiment of the present invention further provides a chip that uses any one of the sound source orientation methods described above, or the sound source signal separation method.
The chip is a neuromorphic or brain-like chip, that is, a chip developed by emulating the working mode of biological neuron morphology; such chips are usually event-triggered and feature low power consumption, low-latency response, and no privacy leakage. Current neuromorphic chips include Intel's Loihi, IBM's TrueNorth, and SynSense's DYNAP-CNN, but the invention is not limited thereto.
An embodiment of the present invention further provides an electronic device that implements any of the sound source orientation methods or the sound source signal separation method described above, or that comprises the chip described above.
While the present invention has been described with reference to particular features and embodiments thereof, various modifications, combinations, and substitutions may be made without departing from the invention. The scope of the present application is not limited to the particular embodiments of the process, machine, manufacture, or composition of matter described in the specification; the methods and means described may be practiced in association with, dependent on, or in cooperation with other products and methods.
Therefore, the specification and drawings should be considered simply as a description of some embodiments of the technical solutions defined by the appended claims; the appended claims should be interpreted under the principle of the broadest reasonable interpretation and are intended to cover all modifications, variations, combinations, or equivalents within the scope of the disclosure, while avoiding unreasonable interpretations.
To achieve better technical results or to meet the needs of certain applications, a person skilled in the art may further improve the technical solution on the basis of the present invention. However, even if such a partial improvement or design is inventive or advanced, a technical solution that covers the technical features defined in the claims falls within the protection scope of the present invention.
Several technical features mentioned in the appended claims may be replaced by alternative technical features, or the order of certain technical processes or the organization of materials may be rearranged. Those skilled in the art will readily appreciate such modifications, changes, and substitutions without departing from the scope of the present invention; solutions that solve substantially the same technical problems by substantially the same means remain within that scope.
The method steps or modules described in connection with the embodiments disclosed herein may be embodied in hardware, software, or a combination of both. For clarity in describing the interchangeability of hardware and software, the steps and components of the embodiments have been described above in terms of their functions. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints of the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention as claimed.

Claims (10)

1. A method of sound source orientation, the method comprising:
carrying out zero-crossing pulse coding on sound source data to be processed to obtain a pulse signal of the sound source data to be processed;
and based on a pulse neural network, performing direction estimation on pulse signals obtained by zero-crossing pulse coding to obtain the target sound source direction of the sound source data to be processed.
2. The sound source orientation method of claim 1, wherein:
the method comprises the steps that sound source data to be processed are received on the basis of microphones, and zero crossing pulse coding is carried out on the sound source data to be processed received by each microphone after preprocessing; wherein the zero-crossing pulse encoding comprises:
performing zero crossing point detection on the sound source data after the preprocessing of each microphone to obtain zero crossing points in the sound source data to be processed received by each microphone and time information corresponding to the zero crossing points;
and performing pulse coding based on zero crossing points in the sound source data to be processed received by each microphone and time information corresponding to each zero crossing point to obtain pulse signals of the sound source data to be processed received by each microphone.
3. The sound source orientation method according to claim 2, wherein performing zero crossing point detection on the sound source data preprocessed by each microphone to obtain a zero crossing point in the sound source data to be processed received by each microphone and time information corresponding to each zero crossing point comprises:
for the sound source data preprocessed by each microphone, determining a plurality of target signal point sets of the sound source data preprocessed by the microphone according to the signal value of each signal point in the sound source data preprocessed by the microphone;
obtaining the sum of corresponding signal values of each signal point in each target signal point set according to the signal value of each signal point in each target signal point set;
comparing the sum of corresponding signal values of each signal point in each target signal point set, and determining a target signal point with a local maximum value in each target signal point set and time information corresponding to the target signal point;
and determining zero crossing points in the sound source data to be processed received by the microphone and time information corresponding to the zero crossing points according to target signal points with local maximum values in each target signal point set and time information corresponding to each target signal point.
4. A sound source orientation device, characterized in that it comprises:
the encoding module, used for performing zero-crossing pulse coding on the sound source data to be processed to obtain a pulse signal of the sound source data to be processed;
and the estimation module is used for carrying out direction estimation on the pulse signals obtained by the zero-crossing pulse coding based on the pulse neural network to obtain the target sound source direction of the sound source data to be processed.
5. The sound source orientation device according to claim 4, wherein:
the sound source orienting device further includes: the preprocessing module is used for preprocessing sound source data received by the microphones to obtain preprocessed sound source data of each microphone;
and the coding module performs zero cross point coding on the sound source data preprocessed by the microphones to obtain pulse signals of the sound source data to be processed received by the microphones.
6. The sound source orientation device according to claim 5, wherein:
the preprocessing module comprises:
the channel decomposition module is used for carrying out channel decomposition on the sound source data to be processed received by each microphone;
the activity detection module is coupled with the channel decomposition module and is used for carrying out activity detection on the channel components subjected to channel decomposition based on the to-be-processed sound source data received by each microphone so as to obtain a target frequency; the target frequency is one or more than one frequency;
and determining the sound source component of the sound source data to be processed received by each microphone in a target frequency channel as the sound source data preprocessed by each microphone.
7. A sound source separation method, characterized by comprising:
performing sound source direction estimation on sound source data to be separated by the sound source orientation method according to any one of claims 1 to 3, and determining candidate sound sources corresponding to the sound source data to be separated and a target sound source direction of each candidate sound source;
carrying out sound source separation according to the positions of the sound channels for collecting the sound source data to be separated and the target sound source directions of the candidate sound sources to obtain sound signals of each candidate sound source;
and determining a target sound source from a plurality of candidate sound sources according to the sound signals of the candidate sound sources.
8. A sound source tracking method, characterized by comprising:
determining a target sound source direction of sound source data by the sound source orientation method of any one of claims 1 to 3;
and carrying out sound source tracking based on the target sound source direction of the sound source data.
9. A chip, characterized in that it comprises the sound source orientation device according to any one of claims 4 to 6.
10. An electronic device, characterized in that the electronic device comprises the sound source orientation device according to any one of claims 4 to 6, or comprises the chip according to claim 9.
CN202310109065.9A 2023-02-14 2023-02-14 Sound source orientation method and device, sound source separation and tracking method and chip Pending CN115825853A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310109065.9A CN115825853A (en) 2023-02-14 2023-02-14 Sound source orientation method and device, sound source separation and tracking method and chip


Publications (1)

Publication Number Publication Date
CN115825853A true CN115825853A (en) 2023-03-21





Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination