CN116320853A - Preprocessing device and method thereof, sound source orientation device and method thereof, and chip - Google Patents

Preprocessing device and method thereof, sound source orientation device and method thereof, and chip

Info

Publication number
CN116320853A
Authority
CN
China
Prior art keywords
sound source
source data
microphone
signal
frequency
Prior art date
Legal status
Pending
Application number
CN202310159281.4A
Other languages
Chinese (zh)
Inventor
Saeid Haghighatshoar
Dylan Richard Muir
Ning Qiao
Current Assignee
Shenzhen Shizhi Technology Co., Ltd.
Original Assignee
Shenzhen Shizhi Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Shenzhen Shizhi Technology Co., Ltd.
Priority to CN202310159281.4A
Publication of CN116320853A

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04R: LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R1/00: Details of transducers, loudspeakers or microphones
    • H04R1/08: Mouthpieces; Microphones; Attachments therefor
    • G: PHYSICS
    • G01: MEASURING; TESTING
    • G01S: RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S3/00: Direction-finders for determining the direction from which infrasonic, sonic, ultrasonic, or electromagnetic waves, or particle emission, not having a directional significance, are being received
    • G01S3/80: Direction-finders for determining the direction from which infrasonic, sonic, ultrasonic, or electromagnetic waves, or particle emission, not having a directional significance, are being received using ultrasonic, sonic or infrasonic waves
    • G01S3/802: Systems for determining direction or deviation from predetermined direction
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00: Reducing energy consumption in communication networks
    • Y02D30/70: Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Remote Sensing (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Acoustics & Sound (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention discloses a preprocessing device and method, a sound source orientation device and method, and a chip, and aims to solve the technical problems of poor real-time performance, high power consumption, and complex computation that arise when conventional sound source orientation technology is applied in real scenarios. Specifically, the preprocessing device comprises an activity detection module that performs activity detection based on the channel components, after channel decomposition, of the sound source data to be processed received by each microphone, so as to obtain a target frequency; the sound source components of the sound source data to be processed received by each microphone in the target frequency channel are determined as the preprocessed sound source data of each microphone. The invention can effectively overcome the non-stationarity of voice signals, has good real-time performance and strong environmental adaptability, and can be realized in low-cost, low-power hardware. It is suitable for the field of brain-inspired computing.

Description

Preprocessing device and method thereof, sound source orientation device and method thereof, and chip
Technical Field
The present invention relates to a preprocessing device and method, a sound source orientation device and method, and a chip, and more particularly to a preprocessing device and method, a sound source orientation device and method, and a chip that perform sound source orientation at low power consumption and low cost using an adaptive wideband orientation technique.
Background
Rapidly and effectively identifying the direction of a sound source in complex environments such as noisy ones is valuable. With the development of artificial-intelligence technology, bionic machine vision and machine hearing are widely applied in frontier fields such as video conferencing, intelligent robots, smart homes, intelligent video monitoring systems, and the intelligent Internet of Things.
Existing methods perform sound source localization (SSL) based on deep-learning artificial neural networks (ANN, RNN, etc.). On the one hand, these techniques lack neural dynamics mechanisms, are insufficiently bionic/intelligent, and their real-time performance needs improvement; on the other hand, their energy consumption and storage requirements are high, so they are mainly used on networked high-power terminals and are unsuitable for edge computing and Internet-of-Things scenarios.
Because the distance and direction between the sound source and each microphone in a microphone array differ, every microphone in the array may receive the voice signal of the sound source. As the sound source moves, room reverberation, interference from other sound sources, and noise (including but not limited to environmental noise and/or internal noise of electronic equipment) inevitably reduce the signal quality, speech intelligibility, and accuracy of sound source orientation; meanwhile, current sound source orientation techniques are not bionic and lack high sensitivity and robustness. These factors increase the difficulty of sound source orientation, harm its real-time performance, degrade the audio-visual experience, and reduce the performance of electronic equipment in voice-interaction modes. Therefore, after the position of the sound source is determined, processing such as noise reduction of the voice signal and sound source separation is usually necessary.
In addition, most existing sound source localization/orientation methods must improve accuracy by means of algorithms such as Singular Value Decomposition (SVD), subspace methods, beamforming, or Generalized Cross-Correlation with Phase Transform. This increases the amount of data processing, places high demands on the computing performance of the device, consumes a great deal of storage and power, and is all the more difficult to realize in low-power hardware.
A sound source orientation scheme that could be realized on silicon in a biological or bionic manner, that is sensitive to the relative delays of the incoming signals at different microphones, that can quickly detect a sound source in real time and effectively identify its position or direction, and that consumes few computing and storage resources, has low power consumption and low cost, and is easy to implement would be a great advance for machine hearing in the commercial field of edge-intelligence computing.
Disclosure of Invention
In order to solve or alleviate some or all of the above technical problems, the present invention is implemented by the following technical solutions:
the preprocessing device is used for preprocessing the sound source data to be processed received by the microphones to obtain the preprocessed sound source data of each microphone; the preprocessing device comprises: a channel decomposition module, used for performing channel decomposition on the sound source data to be processed received by each microphone; an activity detection module, coupled with the channel decomposition module and used for performing activity detection based on the channel components, after channel decomposition, of the sound source data to be processed received by each microphone, so as to obtain a target frequency, the target frequency being one or more frequencies; and sound source components of the sound source data to be processed received by each microphone in the target frequency channel are determined as the preprocessed sound source data of each microphone.
In certain classes of embodiments, the activity detection module performs independent activity detection or joint activity detection or local joint activity detection;
the independent activity detection is as follows: counting energy or energy average values of sound source data to be processed received by each microphone under different frequency channels after channel decomposition, and obtaining initial active frequency meeting a first preset condition; obtaining the target frequency based on the initial active frequency corresponding to each microphone;
the joint activity detection is as follows: the energy sum of all the microphones under different frequency channels is counted, or the average value of the signal energy of all the microphones under different frequency channels is counted, so that the target frequency meeting the second preset condition is obtained;
the local joint activity detection is as follows: grouping the sound source data to be processed received by all microphones to obtain at least two sound source data combinations; the sound source data combination comprises at least one sound source data to be processed; the energy sum of all the sound source data in each sound source data combination under different frequency channels is counted, or the signal energy average value of all the sound source data in each sound source data combination under different frequency channels is counted, so that the active frequency of each sound source data combination is obtained; and determining a target frequency based on the active frequency of each sound source data combination.
In some embodiments, the first preset condition of independent activity detection is that the energy or energy average value is maximum, or that the energy or energy average value is greater than or equal to a second threshold value;
the second preset condition of joint activity detection is that the energy sum or energy average value is maximum, or that the energy sum or energy average value is greater than or equal to a second threshold value. (An illustrative code sketch of these detection variants follows the selection lists below.)
In some classes of embodiments, obtaining the target frequency based on the initial active frequency corresponding to each microphone includes one of the following ways:
selecting at least one initial activity frequency as a target frequency based on the frequency of occurrence of each initial activity frequency;
selecting at least one initial activity frequency as a target frequency based on the frequency value of each initial activity frequency;
clustering the initial activity frequencies corresponding to the microphones, and selecting at least one initial activity frequency as a target frequency.
In some embodiments, the determining the target frequency based on the active frequency of each of the sound source data combinations includes one of the following ways:
selecting at least one active frequency as a target frequency based on the frequency of occurrence of each active frequency;
selecting at least one active frequency as a target frequency based on the frequency value of each active frequency;
Clustering the active frequencies corresponding to the sound source data combinations, and selecting at least one active frequency as a target frequency.
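As an illustration only (the code below is not from the patent), the independent and joint detection variants and the occurrence-frequency selection rule above reduce to simple energy statistics over the channel-decomposed signals. The following minimal numpy sketch assumes the decomposed data is stacked as an array of shape (n_mics, n_channels, n_samples); all function names are hypothetical.

```python
import numpy as np

def independent_activity_detection(channels, top_k=1):
    """Per-microphone energy statistics over frequency channels.

    channels: (n_mics, n_channels, n_samples) band-decomposed signals.
    Returns the top_k initial active channel indices per microphone.
    """
    energy = np.mean(channels ** 2, axis=-1)          # (n_mics, n_channels)
    return np.argsort(energy, axis=-1)[:, -top_k:]    # highest-energy channels

def target_by_vote(initial_active, n_channels):
    """Pick the target frequency by frequency of occurrence (majority vote)."""
    votes = np.bincount(initial_active.ravel(), minlength=n_channels)
    return int(np.argmax(votes))

def joint_activity_detection(channels):
    """Sum the energy over all microphones per channel and take the maximum."""
    energy = np.mean(channels ** 2, axis=-1).sum(axis=0)   # (n_channels,)
    return int(np.argmax(energy))
```

The local joint variant would simply apply `joint_activity_detection` to each microphone grouping and then vote over the per-group results.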
In some embodiments, the channel decomposition module includes two or more filter banks for preprocessing sound source data to be processed received by each microphone in a microphone array formed by two or more microphones; wherein the number of filter banks is equal to or less than the number of microphones in the microphone array, each filter bank being coupled to one microphone in the microphone array;
and the filter bank performs filtering processing and divides the sound source data to be processed received by the corresponding microphone into a plurality of frequency channels.
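A band-pass filter bank of this kind can be sketched as follows. This is a hedged illustration using SciPy Butterworth filters; the channel edges and filter order are arbitrary assumptions, not values from the patent.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def make_filter_bank(edges_hz, fs, order=4):
    """One band-pass filter per frequency channel (edges are illustrative)."""
    return [butter(order, (lo, hi), btype="bandpass", fs=fs, output="sos")
            for lo, hi in zip(edges_hz[:-1], edges_hz[1:])]

def channel_decompose(x, bank):
    """Split one microphone's signal into narrow-band channel components."""
    return np.stack([sosfilt(sos, x) for sos in bank])  # (n_channels, n_samples)

# Example: 16 logarithmically spaced channels between 100 Hz and 7 kHz
fs = 16000
bank = make_filter_bank(np.geomspace(100.0, 7000.0, 17), fs)
```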
In a class of embodiments, the microphone array is linear or circular or spherical or cross-shaped or spiral.
In a class of embodiments, the microphone array is a circular array including 8 microphones.
In certain classes of embodiments, the sound source data may be replaced by electromagnetic waves and/or seismic waves and/or radar and/or physiological signals, with the microphone correspondingly replaced by a sensor for electromagnetic waves, seismic waves, radar, or physiological signals.
The preprocessing method is used for preprocessing the sound source data to be processed received by the microphones to obtain the preprocessed sound source data of each microphone; the preprocessing method comprises the following steps:
performing channel decomposition on the sound source data to be processed received by each microphone, so that the sound source data to be processed received by each microphone is decomposed into a plurality of frequency channels;
performing activity detection on channel components after channel decomposition on sound source data to be processed received by each microphone to obtain target frequency; the target frequency is one or more frequencies;
and determining the sound source components of the sound source data to be processed received by each microphone in the target frequency channel as the preprocessed sound source data of each microphone.
In some embodiments, the performing channel decomposition on the sound source data to be processed received by each microphone includes: and filtering the sound source data to be processed received by each microphone through a band-pass filter bank, and dividing the sound source data to be processed received by the microphone into a plurality of frequency channels.
In some embodiments, the activity detection of the channel component after the channel decomposition based on the sound source data to be processed received by each microphone to obtain the target frequency includes: performing independent activity detection on channel components after channel decomposition of sound source data to be processed received by each microphone to obtain initial activity frequencies corresponding to each microphone; the independent activity detection is to count the energy or the energy average value of each microphone under different frequency channels respectively, so as to obtain the initial activity frequency meeting the first preset condition;
The target frequency is obtained based on the initial active frequency corresponding to each microphone.
In some embodiments, the first preset condition is that the energy or energy average value is maximum, or the energy or energy average value is greater than or equal to a second threshold value.
In some classes of embodiments, obtaining the target frequency based on the initial active frequency corresponding to each microphone includes one of the following ways:
selecting at least one initial activity frequency as a target frequency based on the frequency of occurrence of each initial activity frequency;
selecting at least one initial activity frequency as a target frequency based on the frequency value of each initial activity frequency;
clustering the initial activity frequencies corresponding to the microphones, and selecting at least one initial activity frequency as a target frequency.
In some embodiments, the activity detection of the channel component after the channel decomposition based on the sound source data to be processed received by each microphone to obtain the target frequency includes: carrying out joint activity detection on channel components after channel decomposition on sound source data to be processed received by all microphones;
the joint activity detection is to count the energy sum of all the microphones under different frequency channels respectively or count the average value of the signal energy of all the microphones under different frequency channels respectively so as to obtain the target frequency meeting the second preset condition.
In some embodiments, the second preset condition is that the energy sum or the energy average value is maximum, or the energy sum or the energy average value is greater than or equal to a second threshold value.
In some embodiments, the activity detection of the channel component after the channel decomposition based on the sound source data to be processed received by each microphone to obtain the target frequency includes: and carrying out local joint activity detection on channel components after channel decomposition on the sound source data to be processed received by all microphones, wherein the local joint activity detection is as follows:
grouping the sound source data to be processed received by all microphones to obtain at least two sound source data combinations; the sound source data combination comprises at least one sound source data to be processed;
the energy sum of all the sound source data in each sound source data combination under different frequency channels is counted, or the signal energy average value of all the sound source data in each sound source data combination under different frequency channels is counted, so that the active frequency of each sound source data combination is obtained;
determining a target frequency based on the active frequency of each sound source data combination; and determining a target frequency channel for obtaining the sound source data to be processed received by each microphone according to the target frequency.
In certain classes of embodiments, the sound source data may be replaced by electromagnetic waves and/or seismic waves and/or radar and/or physiological signals, with the microphone correspondingly replaced by a sensor for electromagnetic waves, seismic waves, radar, or physiological signals.
A first type of sound source orientation device comprises the preprocessing device as described above; and,
the encoding module is coupled with the preprocessing device and is used for carrying out pulse encoding on the sound source data preprocessed by each microphone to obtain pulse signals corresponding to the sound source data to be processed received by each microphone;
and the estimation module is coupled with the encoding module, and is used for estimating the direction of the pulse signal obtained by pulse encoding based on the pulse neural network to obtain the target sound source direction of the sound source data to be processed.
In certain classes of embodiments, the encoding module performs zero-crossing pulse encoding.
In certain embodiments, zero crossings of each microphone's preprocessed sound source data on the target frequency channel are detected, and a pulse is generated at each zero crossing; wherein the zero crossing point is an upward zero crossing point and/or a downward zero crossing point;
the upward zero crossing point is a signal point of which the signal amplitude is changed from a negative number to a positive number, and the downward zero crossing point is a signal point of which the signal amplitude is changed from a positive number to a negative number.
In certain classes of embodiments, the encoding module performs the steps of:
performing zero crossing point detection on the sound source data preprocessed by each microphone to obtain zero crossing points in the sound source data to be processed received by each microphone and time information corresponding to each zero crossing point;
and performing pulse coding based on the zero crossing points in the sound source data to be processed received by each microphone and time information corresponding to each zero crossing point to obtain pulse signals of the sound source data to be processed received by each microphone.
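The zero-crossing pulse coding above amounts to emitting a spike wherever the band-limited signal changes sign. A minimal sketch follows; the function name and the sampled-signal representation are assumptions, not the patent's implementation.

```python
import numpy as np

def zero_crossing_encode(x, fs, direction="up"):
    """Pulse times at the zero crossings of a narrow-band signal x sampled at fs.

    "up"   : amplitude goes from negative to non-negative,
    "down" : amplitude goes from positive to non-positive,
    "both" : either kind of sign change.
    """
    s = np.sign(x)
    if direction == "up":
        idx = np.where((s[:-1] < 0) & (s[1:] >= 0))[0] + 1
    elif direction == "down":
        idx = np.where((s[:-1] > 0) & (s[1:] <= 0))[0] + 1
    else:
        idx = np.where(s[:-1] * s[1:] < 0)[0] + 1
    return idx / fs  # seconds; the spike train fed to the SNN
```

Because a fixed delay between two microphones shifts every crossing by the same amount, the resulting spike trains preserve exactly the relative-delay (phase) information that DoA estimation needs.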
A first type of sound source orientation method comprises the preprocessing method as described above, so as to obtain the preprocessed sound source data of each microphone;
performing pulse coding on the preprocessed sound source data of each microphone to obtain pulse signals corresponding to the sound source data to be processed received by each microphone;
and based on a pulse neural network, estimating the direction of a pulse signal obtained by pulse coding to obtain the target sound source direction of the sound source data to be processed.
In certain classes of embodiments, the pulse code is zero-crossing pulse code.
A first type of chip is characterized in that the chip comprises the preprocessing device as described above, or the first type of sound source orientation device as described above.
A first type of electronic device comprises the first type of sound source orientation device as described above, or the first type of chip as described above.
A second type of sound source directing method, the method comprising: performing zero-crossing pulse coding on sound source data to be processed to obtain pulse signals of the sound source data to be processed;
and based on a pulse neural network, performing direction estimation on a pulse signal obtained by zero-crossing pulse coding to obtain a target sound source direction of the sound source data to be processed.
In some embodiments, the microphone is used for receiving the sound source data to be processed, and zero-crossing pulse coding is performed after the sound source data to be processed received by each microphone is preprocessed; wherein the zero-crossing pulse coding comprises:
performing zero crossing point detection on the sound source data preprocessed by each microphone to obtain zero crossing points in the sound source data to be processed received by each microphone and time information corresponding to each zero crossing point;
and performing pulse coding based on the zero crossing points in the sound source data to be processed received by each microphone and time information corresponding to each zero crossing point to obtain pulse signals of the sound source data to be processed received by each microphone.
In some embodiments, performing zero-crossing detection on the sound source data preprocessed by each microphone to obtain zero-crossing points in each sound source data to be processed and time information corresponding to each zero-crossing point, where the method includes:
for each piece of preprocessed sound source data of the microphone, determining a plurality of target signal point sets of the preprocessed sound source data of the microphone according to the signal values of the signal points in the preprocessed sound source data of the microphone;
obtaining the sum of signal values corresponding to the signal points in the target signal point sets according to the signal values of the signal points in the target signal point sets;
comparing the sum of signal values corresponding to the signal points in the target signal point sets, and determining target signal points with local maxima in the target signal point sets and time information corresponding to the target signal points;
and determining zero crossing points in the sound source data to be processed received by the microphone and time information corresponding to the zero crossing points according to the target signal points with local maxima in the target signal point sets and the time information corresponding to the target signal points.
In some embodiments, the determining a plurality of target signal point sets of the sound source data after the microphone is preprocessed according to the signal value of each signal point in the sound source data after the microphone is preprocessed includes:
comparing the signal values of all signal points in the sound source data after the microphone is preprocessed, and determining signal points with continuously decreasing signal values in the sound source data after the microphone is preprocessed;
and grouping the signal points with continuously decreasing signal values in the sound source data after the microphone is preprocessed according to the time information corresponding to the signal points with continuously decreasing signal values in the sound source data after the microphone is preprocessed, so as to obtain a plurality of target signal point sets.
In some embodiments, comparing the sums of signal values corresponding to the signal points in each of the target signal point sets to determine the target signal point having the local maximum in each target signal point set includes:
comparing, for each target signal point set, the sums of signal values corresponding to the signal points in the set, and determining the candidate target signal point whose sum of signal values is initially maximal in the set;
determining, according to the time information corresponding to the candidate target signal point, a candidate time period containing the local maximum in the target signal point set;
and comparing the sums of signal values corresponding to the signal points whose time information falls within the candidate time period, determining the signal point with the maximal sum of signal values, and taking this signal point as the target signal point having the local maximum in the target signal point set.
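The claims above are terse, so the following is only one plausible reading of the robust variant, sketched under stated assumptions: runs of consecutively decreasing samples are the "target signal point sets", each run is scored by the sum of the absolute signal values of its points, and a run emits one pulse (at its first non-positive sample) only if it actually spans a downward crossing and its score passes a threshold, thereby suppressing weak, noise-driven crossings. All names are hypothetical.

```python
import numpy as np

def robust_zero_crossings(x, fs, min_score=0.0):
    """One reading of the robust zero-crossing coder (illustrative only)."""
    dec = np.diff(x) < 0                       # decreasing steps
    runs, start = [], None
    for i, d in enumerate(dec):                # group steps into maximal runs
        if d and start is None:
            start = i
        elif not d and start is not None:
            runs.append((start, i + 1)); start = None
    if start is not None:
        runs.append((start, len(x)))
    pulses = []
    for s, e in runs:                          # one candidate set per run
        crosses = x[s] > 0 >= x[e - 1]         # spans a downward crossing?
        if crosses and np.sum(np.abs(x[s:e])) >= min_score:
            pulses.append(s + int(np.argmax(x[s:e] <= 0)))
    return np.asarray(pulses) / fs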
In certain embodiments, the preprocessing includes performing channel decomposition on the sound source data received by each microphone, so that the sound source data received by each microphone is decomposed into a plurality of frequency channels.
In some embodiments, the preprocessing further includes performing activity detection based on channel components of the sound source data to be processed received by each microphone after channel decomposition, so as to obtain a target frequency; the target frequency is one or more frequencies;
and determining the sound source components of the sound source data to be processed received by each microphone in the target frequency channel as the preprocessed sound source data of each microphone.
In some embodiments, the performing channel decomposition on the sound source data to be processed received by each microphone includes: and filtering the sound source data to be processed received by each microphone through a band-pass filter bank, and dividing the sound source data to be processed received by the microphone into a plurality of frequency channels.
In some embodiments, energy or energy sum or average energy of channel components of the sound source data to be processed, which are received by each microphone and subjected to channel decomposition, under different frequencies is calculated in the same time window, so as to obtain a target frequency meeting a preset condition.
In certain embodiments, the preset condition is that the energy or energy sum or average energy is greater than or equal to a first threshold; alternatively, the energy or energy sum or average energy is greater than or equal to the first threshold and less than or equal to the second threshold.
In some embodiments, based on a pulse neural network, performing direction estimation on a pulse signal obtained by zero-crossing pulse coding to obtain a target sound source direction of the sound source data to be processed, including:
inputting the pulse signals of the sound source data to be processed into a feature extraction module, and carrying out feature extraction on the pulse signals of the sound source data to be processed through the feature extraction module to obtain a pulse feature sequence; the feature extraction module is constructed based on a long-term and short-term memory network;
and inputting the pulse characteristic sequence into the pulse neural network to perform direction estimation to obtain the target direction of the sound source data to be processed.
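As a hedged sketch of this pipeline (layer sizes, the leaky integrate-and-fire neuron model, and all names are assumptions; the patent does not fix them), an LSTM can turn binned pulse trains into a feature sequence that a spiking readout classifies into discretized DoA angles:

```python
import torch
import torch.nn as nn

class SpikingDoAClassifier(nn.Module):
    """LSTM feature extractor + LIF spiking readout over angle classes."""
    def __init__(self, n_mics=8, hidden=64, n_angles=36,
                 beta=0.9, threshold=1.0):
        super().__init__()
        self.lstm = nn.LSTM(n_mics, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_angles)
        self.beta, self.threshold = beta, threshold

    def forward(self, pulses):                    # (batch, time, n_mics), 0/1
        feats, _ = self.lstm(pulses)              # pulse feature sequence
        mem = pulses.new_zeros(pulses.size(0), self.fc.out_features)
        counts = torch.zeros_like(mem)
        for t in range(feats.size(1)):            # LIF membrane dynamics
            mem = self.beta * mem + self.fc(feats[:, t])
            spk = (mem >= self.threshold).float()
            mem = mem - spk * self.threshold      # soft reset after a spike
            counts = counts + spk
        return counts.argmax(dim=1)               # most active angle class
```

Training such a network would require a surrogate gradient for the non-differentiable threshold; the sketch covers inference only.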
In certain classes of embodiments, the sound source data may be replaced by electromagnetic waves and/or seismic waves and/or radar and/or physiological signals, with the microphone correspondingly replaced by a sensor for electromagnetic waves, seismic waves, radar, or physiological signals.
A second type of sound source orientation device comprises: an encoding module, used for performing zero-crossing pulse coding on sound source data to be processed to obtain pulse signals of the sound source data to be processed;
and the estimation module is used for estimating the direction of the pulse signal obtained by the zero-crossing pulse coding based on the pulse neural network to obtain the target sound source direction of the sound source data to be processed.
In a class of embodiments, the sound source orientation device further comprises: a preprocessing module, used for preprocessing the sound source data received by the microphones to obtain the preprocessed sound source data of each microphone;
and the encoding module performs zero-crossing pulse coding on the preprocessed sound source data of each microphone to obtain the pulse signals of the sound source data to be processed received by each microphone.
In certain classes of embodiments, the preprocessing module comprises: the channel decomposition module is used for carrying out channel decomposition on the sound source data to be processed received by each microphone;
The activity detection module is coupled with the channel decomposition module and is used for carrying out activity detection on the basis of channel components of the sound source data to be processed received by each microphone after channel decomposition so as to obtain target frequency; the target frequency is one or more frequencies;
and determining the sound source components of the sound source data to be processed received by each microphone in the target frequency channel as the preprocessed sound source data of each microphone.
In some embodiments, energy or energy sum or average energy of channel components of the sound source data to be processed, which are received by each microphone and subjected to channel decomposition, under different frequencies is calculated in the same time window, so as to obtain a target frequency meeting a preset condition.
In certain embodiments, the preset condition is that the energy or energy sum or average energy is greater than or equal to a first threshold; alternatively, the energy or energy sum or average energy is greater than or equal to the first threshold and less than or equal to the second threshold.
In some embodiments, the encoding module is configured to detect the zero crossings in the sound source data to be processed received by each microphone and the time information corresponding to each zero crossing,
and to perform pulse coding based on the zero crossings in the sound source data to be processed received by each microphone and the time information corresponding to each zero crossing; specifically, by:
determining a plurality of target signal point sets of each microphone's preprocessed sound source data according to the signal values of the signal points in that microphone's preprocessed sound source data;
obtaining the sum of signal values corresponding to the signal points in the target signal point sets according to the signal values of the signal points in the target signal point sets;
comparing the sum of signal values corresponding to the signal points in the target signal point sets, and determining target signal points with local maxima in the target signal point sets and time information corresponding to the target signal points;
and determining zero crossing points in the sound source data to be processed received by the microphone and time information corresponding to the zero crossing points according to the target signal points with local maxima in the target signal point sets and the time information corresponding to the target signal points.
In some embodiments, determining a plurality of target signal point sets of the preprocessed sound source data of the microphone according to signal values of signal points in the preprocessed sound source data of the microphone includes:
comparing the signal values of all signal points in the sound source data after the microphone is preprocessed, and determining signal points with continuously decreasing signal values in the sound source data after the microphone is preprocessed;
And grouping the signal points with continuously decreasing signal values in the sound source data after the microphone is preprocessed according to the time information corresponding to the signal points with continuously decreasing signal values in the sound source data after the microphone is preprocessed, so as to obtain a plurality of target signal point sets.
In some embodiments, comparing the sums of signal values corresponding to the signal points in each of the target signal point sets to determine the target signal point having the local maximum in each target signal point set includes:
comparing, for each target signal point set, the sums of signal values corresponding to the signal points in the set, and determining the candidate target signal point whose sum of signal values is initially maximal in the set;
determining, according to the time information corresponding to the candidate target signal point, a candidate time period containing the local maximum in the target signal point set;
and comparing the sums of signal values corresponding to the signal points whose time information falls within the candidate time period, determining the signal point with the maximal sum of signal values, and taking this signal point as the target signal point having the local maximum in the target signal point set.
In a class of embodiments, the sound source orientation device further comprises: a feature extraction module, coupled between the encoding module and the pulse neural network, which extracts features from the pulse signals, generated by the encoding module, of the sound source data to be processed received by each microphone, so as to obtain a pulse feature sequence;
and the impulse neural network carries out direction estimation based on the impulse characteristic sequence to obtain the target direction of the sound source data to be processed.
A sound source separation method, the sound source separation method comprising: estimating the sound source direction of the sound source data to be separated by the second-class sound source orientation method, and determining the candidate sound sources corresponding to the sound source data to be separated and the target sound source direction of each candidate sound source;
performing sound source separation according to the positions of sound channels for collecting the sound source data to be separated and the target sound source directions of the candidate sound sources to obtain sound signals of each candidate sound source;
and determining a target sound source from a plurality of candidate sound sources according to the sound signals of the candidate sound sources.
A sound source tracking method, the sound source tracking method comprising: determining a target sound source direction of sound source data by a second-type sound source orientation method as described above; and performing sound source tracking based on the target sound source direction of the sound source data.
A second type of chip comprises the second type of sound source orientation device as described above.
A second type of electronic device comprises the second type of sound source orientation device as described above, or the second type of chip as described above.
Some or all embodiments of the present invention have the following beneficial technical effects:
1) The sound source direction estimation scheme of the invention needs no complex algorithms such as beamforming or subspace methods; it realizes sound source estimation with an event-driven (event-based) pulse neural network, and the method is simple, easy to orient, low in power consumption, and easy to implement in hardware.
2) The zero-crossing-based pulse coding method can effectively capture the phase information required for sound source direction estimation; sound source direction estimation is performed based on relative delay information, which improves the real-time performance, interference immunity, and accuracy of sound source estimation. Robust zero-crossing pulse coding is further adopted to improve robustness.
3) The invention applies an adaptive wideband orientation technique for sound source orientation: in each time interval, the active frequency components in the signal are identified and used for localization. This enhances real-time performance while effectively overcoming the non-stationarity of voice signals, gives strong environmental adaptability, and suits a variety of complex environments.
4) The activity detection of the invention has various embodiments, namely independent, joint, or local joint activity detection, and is therefore highly flexible.
5) The zero-crossing pulse code is sensitive to the relative delays of the incoming signals at different microphones and captures this information quickly, and its DoA estimation results show no large disturbance in reflective-propagation or noisy environments.
6) The DoA estimation method and device of the invention can effectively realize DoA estimation for narrowband and wideband sound signals. In addition, the sound source direction estimation scheme of the present invention responds very fast to an abrupt DoA change (e.g., a speaker change in a conference room), can rapidly output the switched DoA angle, and can track rapidly, effectively, and accurately when the sound source moves (e.g., a speaker walks).
7) The sound source orientation technique of the invention obtains good sound source orientation results in the chip; the chip's test results for abrupt sound source changes and tracking differ negligibly from the computer simulation results. The sound source orientation technique can thus be effectively implemented in hardware and has commercial application value.
Further advantageous effects will be further described in the preferred embodiments.
The above technical solutions/features are intended to summarize the technical solutions and features described in the detailed description, so the scope described may not be exactly the same. However, the new solutions disclosed in this section are also part of the numerous solutions disclosed in this document; the technical features disclosed in this section, those disclosed in the following detailed description, and contents of the drawings not explicitly described in the specification disclose, in reasonable combination with one another, further solutions.
Technical solutions formed by combining technical features disclosed anywhere in this document serve to support generalization of the technical solutions, amendment of this patent document, and disclosure of the technical solutions.
Drawings
FIG. 1 is a schematic illustration of directions of arrival in a circular array;
FIG. 2 is a flow chart of a sound source directing method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a pulse sequence of a low frequency channel after zero-crossing pulse encoding according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a pulse sequence of a high frequency channel after zero-crossing pulse encoding according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the structure of a pulse neural network;
FIG. 6 is a schematic diagram of array resolution of a linear microphone array provided by an embodiment of the invention;
FIG. 7 is a time-frequency schematic diagram of a speech signal;
FIG. 8 is a schematic diagram of sound source direction in an embodiment of the invention;
FIG. 9 is a schematic diagram of sound source direction after preprocessing the signals received by the microphones according to an embodiment of the present invention;
FIG. 10 is a schematic diagram of independent activity detection provided by an embodiment of the present invention;
FIG. 11 is a schematic diagram of joint activity detection provided by an embodiment of the present invention;
FIG. 12 is a schematic diagram of a sound source localization model based on a long-short-term memory network and a pulse neural network according to an embodiment of the present invention;
FIG. 13 is a simulation test result of sound source orientation under the low frequency channel of the present invention;
FIG. 14 is a comparison of sound source orientation test results on a low frequency channel between the brain-like chip implemented in low-power hardware of the invention and a simulation model;
FIG. 15 is a simulation test result of sound source orientation under the high frequency channel of the present invention;
FIG. 16 is a comparison of sound source orientation test results on a high frequency channel between the brain-like chip implemented in low-power hardware of the invention and a simulation model;
fig. 17 is a schematic flow chart of a sound source signal separation method according to an embodiment of the present invention;
FIG. 18 is a flowchart of a sound source tracking method according to an embodiment of the present invention;
fig. 19 is a test result of sound source tracking by the brain-like chip implemented by low-power hardware of the invention.
Detailed Description
Since the various alternatives cannot be exhaustively enumerated, the gist of the technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings. Other technical solutions and details not disclosed in detail below generally belong to technical objects or technical features achievable by conventional means in the art; for reasons of space, the present invention does not describe them in detail.
Unless it denotes division, the character "/" anywhere in this disclosure means a logical "or". The ordinal numbers "first", "second", etc. anywhere in the present invention are merely distinguishing labels in the description and imply no absolute order in time or space, nor that a term preceded by such an ordinal is necessarily different from the same term preceded by another ordinal.
The present invention describes various elements for use in various combinations of embodiments, and these elements may be combined into various methods and products. In the present invention, even if a technical feature is described only when introducing a method/product solution, the corresponding product/method solution also explicitly includes that technical feature.
The description of a step, module, or feature anywhere in this disclosure does not imply that it is the only step, module, or feature present; based on the disclosed technical solutions, a person skilled in the art may implement other embodiments by other technical means. The embodiments of this document are generally disclosed for the purpose of presenting preferred embodiments, but this does not mean that the opposite of a preferred embodiment is excluded from the scope of the invention, as long as such an opposite embodiment solves at least one technical problem addressed by the present invention. Based on the gist of the specific embodiments, a person skilled in the art may apply substitution, deletion, addition, combination, reordering, and similar means to certain technical features to obtain technical solutions still following the inventive concept; such solutions, which do not depart from the technical idea of the invention, are also within its scope of protection.
Some important terminology and symbols are explained below:
neuromorphic (mimicry) chips: the method has the event driving characteristic, the event is driven to be calculated or processed after the event occurs, and the ultra-high real-time performance and ultra-low power consumption are realized on a hardware circuit. Neuromorphic chips are classified, according to type, into neuromorphic chips based on analog or digital or data hybrid circuits.
Spiking neural network (SNN): an event-driven, third-generation artificial neural network with rich spatio-temporal dynamics, diverse coding mechanisms, and event-driven characteristics; its computational cost and power consumption are low. Compared with the artificial neural network (ANN), the SNN is more bionic and more advanced, and SNN-based brain-inspired computing (or neuromorphic computing) offers better performance and computational overhead than traditional AI chips. It should be noted that the embodiments of the present invention do not specifically limit the type of spiking neural network: any neural network driven by pulse signals or events can be applied to the sound source orientation method provided in the embodiments of the present invention, and a spiking neural network such as a spiking convolutional neural network (SCNN), a spiking recurrent neural network (SRNN), or a long short-term memory network (LSTM) can be built according to the actual application scenario.
Direction of arrival (DoA): the direction angle at which the audio signal output by a sound source reaches the microphone array; sound sources with different directions of arrival produce different delays of the audio signal at the microphone array. The microphone array may have different spatial shapes, such as circular, linear, spherical, cross-shaped, or spiral. Fig. 1 illustrates a circular microphone array; it should be understood that the array shown in fig. 1 is only an example, and the embodiments of the present invention do not specifically limit the microphone array. Fig. 1 is a schematic representation of the direction of arrival in a circular array: the projection of the array elements along the DoA serves as a measure of the relative time at which signals are received at the array elements, where an array element refers to a microphone in the microphone array.
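To make the "projection along the DoA" concrete, the relative reception times on a circular array can be computed as below; the array radius, element count, and angle are illustrative values, not taken from the patent.

```python
import numpy as np

# Relative arrival times on a circular 8-microphone array: projecting each
# element position onto the DoA unit vector gives the relative reception
# time of the wavefront at that element.
c = 343.0                                  # speed of sound, m/s
R = 0.05                                   # assumed 5 cm array radius
phi = 2 * np.pi * np.arange(8) / 8         # element azimuths
mic_pos = R * np.stack([np.cos(phi), np.sin(phi)], axis=1)
theta = np.deg2rad(30)                     # example DoA
n = np.array([np.cos(theta), np.sin(theta)])
rel_delay = mic_pos @ n / c                # seconds, one entry per microphone
```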
It should be noted that the DoA estimation method of the present invention is applicable not only to audio waves, but also to electromagnetic waves, seismic waves, radar, and other one-dimensional waves, for finding the direction or position of a target.
Narrowband: the bandwidth of the signal is well below its center frequency, e.g. the narrowband signal bandwidth ranges from 10-100MHz.
Narrowband positioning: the narrowband signal spectrum is relatively simple and can be treated as a single-frequency signal. In the narrowband case, the microphones of the array, located at $\{r_k : k \in [M]\}$, observe a phase shift of the incident harmonic signal of

$$\phi_k = \frac{2\pi}{\lambda}\,\langle r_k, n\rangle,$$

so the influence of the microphone array arrangement on the signal can be encoded by the $M$-dimensional array response vector

$$a(n) = \Big(e^{\,j\frac{2\pi}{\lambda}\langle r_k,\, n\rangle}\Big)_{k\in[M]},$$

where $n$ is the unit-norm vector representing the DoA of the incident wave, i.e., a DoA vector on the unit circle $\mathbb{S}^1 \subset \mathbb{R}^2$, $\lambda$ is the wavelength, $M$ is the number of microphones in the array, and $r_k$ is the position of the $k$-th microphone in the array arrangement. It will be appreciated that the array response, as a function of the DoA vector, is indeed a spatial harmonic signal whose frequency depends on the geometry $\{r_k : k\in[M]\}$ of the array arrangement of the microphones.
Array resolution: characterizes the minimum DoA separation at which two targets can be distinguished in the presence of noise. This resolution depends on the array geometry and, more importantly, on the spatial size of the array; e.g., a linear array of aperture $L$ has an angular resolution of roughly $\delta\theta \approx \lambda / L$. Generally, the larger the array, the better its angular resolution.
Grating lobes: while using a larger array may yield better resolution, it can lead to grating-lobe problems when the number of array elements is limited relative to the spatial span of the array as a whole. When this occurs, the array response vector is aliased, i.e., two different DoA vectors $n_1$ and $n_2$ may have the same array response vector, $a(n_1) = a(n_2)$; the angle at which the sound source is located then cannot be determined, so the correct DoA cannot be distinguished and found. Thus, for microphone arrays with a limited number of microphones, there is a trade-off between increasing the resolution of the array and avoiding grating lobes; e.g., the distance $d$ between the microphones in the array may be chosen according to $d \le \lambda/2$, where $\lambda$ is the wavelength of the audio signal, to obtain the array resolution of the microphone array while avoiding grating lobes.
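For a feel of the numbers, a short back-of-the-envelope computation under assumed values (speed of sound 343 m/s, a 4 kHz channel, a 30 cm linear aperture):

```python
c = 343.0                  # speed of sound, m/s
f = 4000.0                 # example channel frequency, Hz
lam = c / f                # wavelength ~ 8.6 cm

d_max = lam / 2            # spacing that avoids grating lobes ~ 4.3 cm
L = 0.30                   # linear-array aperture, m
resolution = lam / L       # angular resolution ~ 0.29 rad (~ 16 degrees)
```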
Broadband: the bandwidth of the signal is comparable to or larger than its center frequency.
Broadband positioning: the broadband signal spectrum is relatively complex and contains rich frequency components; broadband positioning can be seen as a generalization of the narrowband case, i.e., positioning can be performed by processing the multiple received frequency components. A fixed array with a given element configuration can only process signals in a limited frequency range: when the frequency exceeds $f_{\max}$, i.e., when the signal wavelength is very small (in particular smaller than the pitch of the array elements), the resulting grating-lobe effect limits the positioning performance; when the frequency is below $f_{\min}$, i.e., when the signal wavelength is very large (in particular larger than the span of the entire array), the angular resolution of the array is limited, so that in the presence of measurement noise an individual target cannot be located with sufficient accuracy.
As described in the background, existing sound source orientation methods are mainly implemented by beamforming techniques. Beamforming defines $\{x(t-\tau_i) : i\in[M]\}$ as the signals received at the microphones, where $M$ is the number of microphones in the array and $\tau_i = \tau_i(n)$ is the relative delay at microphone $i\in[M]$, a function of the DoA $n$ of the audio signal. In beamforming, the signals accumulated at the different microphones are weighted and delayed for different candidate DoAs (so-called spatial matched filtering), and the DoA yielding the largest received power is taken as the target sound source direction; for example, the target sound source direction may be determined by the Delay-and-Sum algorithm, the Minimum Variance Distortionless Response (MVDR) algorithm, or the Steered Response Power with Phase Transform (SRP-PHAT) algorithm. Such methods are mainly applied to narrowband signals: in the narrowband case, the input audio signal is concentrated around the carrier frequency, the energy of the audio signals received by the different microphones in the array is calculated (e.g., the power is computed after Fourier-transforming the received microphone signals), the DoA producing the maximum power is found from the calculated powers, and the target sound source direction is thereby determined.
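For contrast with the patent's SNN approach, a minimal narrow-band delay-and-sum scan (spatial matched filtering) looks like this; it is a textbook sketch, not the invention, and every name and the sign convention in it are illustrative:

```python
import numpy as np

def delay_and_sum_doa(X, mic_pos, fs, f0, angles, c=343.0):
    """Scan candidate angles, phase-align the microphones for each one,
    and return the angle with the largest output power.

    X: (n_mics, n_samples) band-passed signals around carrier f0.
    mic_pos: (n_mics, 2) microphone coordinates in metres.
    angles: candidate DoA angles in radians.
    """
    spectra = np.fft.rfft(X, axis=1)
    bin_f0 = int(round(f0 * X.shape[1] / fs))
    s = spectra[:, bin_f0]                         # narrow-band snapshot
    powers = []
    for th in angles:
        n = np.array([np.cos(th), np.sin(th)])     # DoA unit vector
        delays = mic_pos @ n / c                   # relative delays
        w = np.exp(2j * np.pi * f0 * delays)       # steering (matched) weights
        powers.append(np.abs(w.conj() @ s) ** 2)
    return angles[int(np.argmax(powers))]
```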
In practical application scenarios, signals are mostly wideband. In the wideband case, the existing sound source orientation methods decompose the input spectrum into a plurality of narrowband channels, apply the narrowband beamforming method to each narrowband channel, and combine the results to obtain wideband beamforming and the DoA estimate. The input signals may also be delayed on different channels to synchronize their time information, and the delayed channels may be further combined to estimate the DoAs of the input signal.
Therefore, existing sound source orientation methods are computationally complex and place high demands on the computing performance of the device; the complex computation consumes a large amount of storage resources and power, degrades real-time performance, and makes implementation in low-power hardware all the more difficult.
On this basis, in order to provide a sound source orientation scheme with low power consumption, low cost, and strong real-time performance that is easy to implement in hardware, embodiments of the invention provide a sound source orientation method, device, chip, and electronic apparatus. Relative delay information in the sound source data to be processed is captured through pulse coding, sound source direction estimation is performed based on this relative delay information, and the accuracy of sound source estimation is improved. Specifically, the zero-crossing-based pulse coding method captures the delay information required for sound source direction estimation, thereby ensuring the accuracy of the estimate; direction estimation is then performed with a pulse neural network to obtain the target direction of the sound source data to be processed, which reduces power consumption while preserving accuracy and offers better robustness and faster processing.
To facilitate understanding of the technical scheme of the invention, the sound source orientation method, device, chip, and electronic equipment provided by the invention are introduced below in combination with practical application scenarios.
In order to improve the real-time performance of sound source orientation and reduce its power consumption and complexity, so that the method can easily be applied to low-power hardware, and to further improve orientation performance when the sound source switches, changes, and/or moves, the invention performs sound source direction estimation with a spiking neural network (SNN), converting DoA estimation into a classification task for the SNN. The method comprises at least the steps shown in fig. 2, a flow diagram of the sound source orientation method provided by an embodiment of the invention.
Step S100: perform zero-crossing pulse coding on the sound source data to be processed to obtain its pulse signals. Because the SNN communicates through pulses, the sound source data to be processed must first be converted into a set of pulse features.
In some types of embodiments, the sound source data to be processed may be a speech signal in the time domain or may be an audio signal in the frequency domain.
In some types of embodiments, the sound source data to be processed may be a collected speech signal in the current environment, including a speech signal of the sound source or/and noise present in the environment.
Preferably, the sound source data to be processed is a speech signal collected in real time in the current environment; alternatively, it may be a speech signal in the current environment acquired over a past period of time, such as the past 1 s or the past 1 min, which the embodiments of the invention do not specifically limit.
Optionally, the speech signal in the environment is collected by a microphone array, which may be a circular array as shown in fig. 1, a linear array, a distributed array, or a cross-shaped array. The microphone array includes at least one microphone, which may be a noise-reduction microphone; the invention is not limited in this respect. In addition, the sound source data to be processed may be obtained by filtering, noise reduction, time-frequency analysis, or similar processing of the collected speech signal, to which the invention is likewise not limited.
Because the sound source data acquired in practical applications is a wideband signal, applying zero-crossing coding to it directly may introduce interference and reduce the accuracy of the direction estimate. To improve that accuracy, in some embodiments the sound source data to be processed is preprocessed before zero-crossing pulse coding. The preprocessing includes channel decomposition: the wideband signal is decomposed into multiple frequency channels, each carrying a narrowband signal with a different frequency range. Specifically, channel decomposition is performed on the sound source data received by each microphone in the array, the wideband data is decomposed into multiple narrowband signals, and zero-crossing coding is applied to the narrowband signal of each frequency channel.
Alternatively, the sound source data received by each microphone may be channel-decomposed by a filter bank, including but not limited to a band-pass filter bank or a narrowband filter bank. The channel decomposition stage may also include other modules, such as a low-noise amplifier (LNA) coupled to the filter bank; the invention is not limited in this respect. A sketch of such a filter-bank decomposition follows.
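As a hedged illustration of such a filter-bank decomposition (assuming SciPy Butterworth band-pass sections and an illustrative Q factor; neither is prescribed by the text):

```python
import numpy as np
from scipy.signal import butter, sosfilt

def make_filter_bank(center_freqs_hz, fs, q=8):
    """One 4th-order band-pass section per frequency channel.

    q is an illustrative quality factor; band edges must stay below fs / 2.
    """
    bank = []
    for fc in center_freqs_hz:
        bw = fc / q
        sos = butter(4, [fc - bw / 2, fc + bw / 2],
                     btype="bandpass", fs=fs, output="sos")
        bank.append(sos)
    return bank

def channel_decompose(x, bank):
    """Split one microphone's wideband signal into (F, T) narrowband channels."""
    return np.stack([sosfilt(sos, x) for sos in bank])
```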
If zero-crossing coding were applied to every narrowband signal obtained after channel decomposition, the data volume would grow, increasing the processing time and reducing the real-time performance of sound source orientation. Each decomposed narrowband signal covers a different frequency range, while the sound corresponding to the source direction has higher energy at specific frequencies; that is, the energy of the sound source data received by each microphone is concentrated mainly in one or a few frequency ranges. Through activity detection, the target frequency channel with the largest energy among the decomposed narrowband signals, or the target frequency channels whose energy is greater than or equal to a preset energy threshold, can therefore be selected, and zero-crossing coding applied only to the narrowband signals of the target frequency channels. Based on this, to achieve fast sound source orientation, in some embodiments the preprocessing includes channel decomposition and activity detection: the sound source data received by each microphone is decomposed into multiple frequency channels; activity detection is performed on the energy of the resulting narrowband signals to determine the target frequency channel; and zero-crossing coding is applied to the narrowband signal of the target frequency channel.
The activity detection may select the target frequency channel with the maximum energy among the decomposed narrowband signals, select the several (two or more) highest-energy channels as target frequency channels, or select the target frequency channels whose energy is greater than or equal to a preset energy threshold. The preset energy threshold may be any one of the average, median, mode, second-largest value, or third-largest value of the energies of the narrowband signals across the frequency channels; the embodiments of the invention do not specifically limit this. A sketch of these selection rules follows.
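A small sketch of these selection rules, assuming NumPy arrays of shape (channels, samples); the mode names and the median default are illustrative choices, not mandated by the text:

```python
import numpy as np

def select_target_channels(narrowband, mode="max", k=2, threshold=None):
    """Pick target frequency channels from (F, T) narrowband signals by energy.

    mode="max"    : single highest-energy channel;
    mode="topk"   : k highest-energy channels;
    mode="thresh" : all channels with energy >= threshold.
    """
    energy = np.sum(narrowband ** 2, axis=1)          # per-channel energy
    if mode == "max":
        return [int(np.argmax(energy))]
    if mode == "topk":
        return list(np.argsort(energy)[::-1][:k])
    if mode == "thresh":
        if threshold is None:
            threshold = np.median(energy)             # one threshold choice listed above
        return [int(i) for i in np.nonzero(energy >= threshold)[0]]
    raise ValueError(mode)
```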
Pulse coding includes rate coding, temporal coding, burst coding, population coding, and the like. Pulse coding can be seen as a feature extractor whose output features are processed by the SNN. However, some pulse codes extract non-coherent features of the input sound source signal, such as the short-time Fourier transform (STFT) magnitude of the input signal. Because non-coherent features cannot capture phase information, they are not sensitive enough to the relative arrival times of the signals at the different microphones; moreover, in practical scenarios such as conference rooms, reflective (reverberant) propagation introduces large frequency-domain distortions into the input sound source signal and disturbs the extracted features. Non-coherent features of the sound source signal therefore cannot be used to estimate the sound source direction. In addition, in the case of pure noise, zero-crossing processing yields random Poisson pulses independent of the array geometry, so pure noise has no effect on the DoA estimate.
Because of these problems, the invention adopts a different type of pulse coding, namely zero-crossing pulse coding, and encodes the sound source data to be processed with it to obtain the pulse signals. Zero-crossing pulse codes are sensitive to the relative delays of the incoming signals at the different microphones and can capture this information quickly, without large disturbance, even in reflective propagation environments, improving both the real-time performance and the accuracy of sound source orientation.
Specifically, the pulse coding method based on zero-crossing coding comprises at least steps SA111 to SA113:
Step SA111: perform zero-crossing detection on the sound source data to be processed, obtaining the zero crossings in the data and the time information corresponding to each zero crossing.

In addition, the sound source data to be processed received by each microphone may first be preprocessed as described in step S100, yielding the preprocessed sound source data of each microphone.

Step SA112: perform zero-crossing detection on the sound source data to be processed, or on the preprocessed sound source data, of each microphone, obtaining the zero crossings in each microphone's data and the time information corresponding to each zero crossing.
A zero crossing may be a signal point whose value is 0, a signal point where the value changes sign (for example, from positive to negative and/or from negative to positive), or a signal point where the product of the values of adjacent signal points is less than 0.
In some types of embodiments of the invention, a zero crossing may be located by multiplying the signal value at each time point by the signal value at the adjacent next time point.
Optionally, after the sound source data received by each microphone has been preprocessed, the product of the value at the current time point and the value at the adjacent next time point is computed in sequence, starting from the first time point of each microphone's preprocessed data: if the product is less than 0, at least one zero crossing lies between the two time points; if the product is greater than 0, no zero crossing lies between them; and if the product equals 0 while the value at the current time point is not 0, the next time point itself is a zero crossing of that microphone.
Alternatively, if the product of the values at the current time point and the adjacent next time point is less than 0, a point with value 0 can be located between the two time points by bisection, and the time of that point taken as the time information of the zero crossing. A sketch of this sign-product test follows.
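A sketch of the sign-product test on sampled data; since bisection needs the continuous signal, the sub-sample refinement here uses linear interpolation instead, which is an assumption of this sketch:

```python
import numpy as np

def detect_down_crossings(x, t):
    """Times of down-going zero crossings in the sampled signal x over times t.

    Uses the sign-product test from the text: a crossing lies between two
    adjacent samples whose product is negative. The sub-sample time is then
    refined by linear interpolation rather than bisection.
    """
    crossings = []
    for n in range(len(x) - 1):
        prod = x[n] * x[n + 1]
        if prod < 0 and x[n] > 0:                  # positive -> negative
            frac = x[n] / (x[n] - x[n + 1])        # where the chord hits zero
            crossings.append(t[n] + frac * (t[n + 1] - t[n]))
        elif prod == 0 and x[n] > 0:               # the next sample sits exactly on zero
            crossings.append(t[n + 1])
    return np.asarray(crossings)
```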
In some embodiments, to sparsify the data and improve real-time performance, only downward zero crossings are detected, i.e., signal points where the value changes from positive to negative; alternatively, only upward zero crossings, where the value changes from negative to positive. A downward zero crossing may be a zero crossing within a period in which the signal value decreases, or a signal point where the value changes from positive to negative; an upward zero crossing may be a zero crossing within a period in which the signal value increases, or a signal point where the value changes from negative to positive.
To improve the accuracy of the zero crossings in the sound source data, and hence of the pulse signal, in some embodiments of the invention the zero crossings are determined by detecting the upward and/or downward zero crossings. Taking downward zero crossings as an example: candidate sound source data in which the rate of change of the signal value is less than or equal to 0 are selected from the narrowband signal of the target frequency channel; from the signal value at each time point in the candidate data, the running sum of values up to that point is computed; the target time point with the maximum running sum is selected; and the zero crossing and its time information are determined from that target time point. Here the running sum at a time point is the sum of the values at that point and at all preceding points in the candidate data; for the first time point of the candidate data, the running sum is simply its own value.
Optionally, there may be at least one set of candidate sound source data, each comprising a monotonically decreasing set of time points and the signal value at each time point.

Alternatively, the signal point with the maximum running sum in the candidate data may be taken as the zero crossing, with the corresponding target time point as its time information.

Alternatively, the zero crossing and its time information may be determined within the time period formed by the target time point with the maximum running sum, the time point before it, and the time point after it.
Specifically, for each set of candidate sound source data, the candidate time period containing the zero crossing can be determined from the target time point with the maximum running sum together with the time points immediately before and after it; this period is divided by a preset time window into multiple candidate times, and the zero crossing and its time information are determined from the value at each candidate time. It is understood that the length of the preset time window is smaller than the interval between the target time point and the time point immediately before it.
For example, the running sum at each candidate time within that period can be computed from the values at the candidate times; the signal point at the candidate time with the maximum running sum is taken as the zero crossing, and that candidate time as its time information.
Alternatively, to detect upward zero crossings with this method, multiple monotonically increasing sets of time points are selected from the preprocessed data; for each set, the running sum at each of its time points is computed, the target time point with the minimum running sum is selected, and the zero crossing and its time information are determined from that target time point. A sketch of the running-sum rule for downward crossings follows.
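The following sketch illustrates the running-sum idea for downward crossings under simplifying assumptions (sample-index output, no preset-time-window refinement):

```python
import numpy as np

def cumsum_down_crossings(x):
    """Locate down-going zero crossings via the running-sum rule above.

    Within a monotonically decreasing run, the running sum of sample values
    grows while samples are positive and shrinks once they turn negative, so
    its maximum marks the crossing. Returns sample indices.
    """
    x = np.asarray(x, dtype=float)
    crossings, start = [], 0
    for n in range(1, len(x)):
        if x[n] > x[n - 1] or n == len(x) - 1:   # a decreasing run ends here
            end = n if x[n] > x[n - 1] else n + 1
            seg = x[start:end]                   # one monotonically decreasing run
            if len(seg) > 1 and seg[0] > 0 and seg[-1] < 0:
                crossings.append(start + int(np.argmax(np.cumsum(seg))))
            start = n
    return crossings
```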
Step SA113: perform pulse coding based on the zero crossings in the sound source data and their time information, obtaining the pulse signal of each microphone's sound source data.
In some embodiments, after the zero crossings of each microphone's sound source data and their time information have been determined, a pulse can be generated at the time of each zero crossing, yielding that microphone's pulse signal.
Optionally, after determining the zero crossings of each microphone's preprocessed sound source data and their times, the pulse value at each zero crossing is set to 1 and at each non-zero-crossing point to 0, with the zero-crossing time taken as the pulse time; arranging the zero crossings of the preprocessed data on the same time axis according to their time information then yields the pulse signal of each microphone's preprocessed sound source data.
Optionally, after determining the zero crossing point and the time information corresponding to each zero crossing point in the preprocessed sound source data of each microphone, determining a pulse triggering time point according to the time information corresponding to each zero crossing point, and generating a pulse according to the pulse triggering time point to obtain a pulse signal of the preprocessed sound source data of each microphone.
Optionally, after determining the zero crossings in each microphone's preprocessed sound source data and their times, the frequency value corresponding to each zero crossing may be compared with a preset threshold: if it is greater than the threshold, the pulse value at that zero crossing is set to 1; otherwise it is set to 0. Arranging the zero crossings in time order on the same time axis in this way yields the pulse signal of each microphone's preprocessed sound source data. A sketch of pulse-train construction on a common time axis follows.
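A minimal sketch of placing unit pulses at zero-crossing times on a common time axis (the nearest-sample rounding is an assumption of this sketch):

```python
import numpy as np

def pulse_train(crossing_times, t_axis):
    """Place a unit pulse at the sample nearest each zero-crossing time.

    crossing_times : zero-crossing times of one microphone, in seconds.
    t_axis         : common discrete time axis shared by all microphones.
    Returns a 0/1 vector: 1 at zero crossings, 0 elsewhere, as described above.
    """
    pulses = np.zeros(len(t_axis), dtype=np.uint8)
    idx = np.searchsorted(t_axis, crossing_times)
    pulses[np.clip(idx, 0, len(t_axis) - 1)] = 1
    return pulses

# One pulse signal per microphone, all on the same time axis:
# spikes = np.stack([pulse_train(zc_times[m], t_axis) for m in range(n_mics)])
```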
In a multi-microphone array, the zero crossings of each microphone's preprocessed sound source data are detected and zero-crossing pulse coding is applied, yielding a pulse sequence per microphone. Because the sound corresponding to the source reaches the different microphones at different times, the relative delays between the microphones differ; these differing delays shift the zero-crossing times of the preprocessed data from microphone to microphone, so the pulse signals from the different microphones carry different delays. In other words, the inter-microphone delays are captured by the zero-crossing pulse coding.
For example, taking a circular array of 8 microphones, as shown in fig. 3 and fig. 4: fig. 3 is a schematic diagram of the pulse sequence of a low-frequency channel after zero-crossing pulse coding according to an embodiment of the invention, and fig. 4 the corresponding diagram for a high-frequency channel. As can be seen from figs. 3 and 4, the zero-crossing pulse from Mic 1 is generated at time 12 and that from Mic 2 at time 20; the pulse from Mic 2 is thus delayed by 8 time units relative to Mic 1.
Step S200: based on the spiking neural network, perform direction estimation on the pulse signals obtained by zero-crossing pulse coding, obtaining the target sound source direction of the sound source data to be processed.
Fig. 5 is a schematic structural diagram of a spiking neural network comprising an input layer, an intermediate layer, and an output layer, each containing neurons; the intermediate layer comprises at least one hidden layer, each with multiple neurons.
The input layer fires pulses according to the number of pulses of the input pulse signal within a preset clock period, obtaining a pulse sequence that it transmits to the intermediate layer; the intermediate layer fires pulses according to the number of pulses it receives within a preset time period, obtaining a new pulse sequence that it transmits to the output layer; and the output layer generates a target pulse signal from the sequence transmitted by the intermediate layer and determines the corresponding sound source direction by a direction decision on the target pulse signal. A minimal sketch of such a feed-forward spiking network follows.
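For illustration, a minimal feed-forward spiking network with leaky integrate-and-fire neurons; the LIF model, time constants, and thresholds here are generic textbook choices, not the patent's hardware neuron model:

```python
import numpy as np

def lif_layer(spikes_in, w, tau=20.0, v_th=1.0):
    """One layer of leaky integrate-and-fire neurons.

    spikes_in : (T, N_in) binary input pulse trains.
    w         : (N_in, N_out) synaptic weights.
    Returns (T, N_out) binary output spike trains.
    """
    T = spikes_in.shape[0]
    v = np.zeros(w.shape[1])
    out = np.zeros((T, w.shape[1]), dtype=np.uint8)
    decay = np.exp(-1.0 / tau)                 # membrane leak per time step
    for t in range(T):
        v = decay * v + spikes_in[t] @ w       # leak, then integrate synaptic input
        fired = v >= v_th
        out[t] = fired
        v[fired] = 0.0                         # reset after firing
    return out

def snn_forward(spikes, w_hidden, w_out):
    """Input -> hidden -> output; the DoA class is the output neuron that
    fires the most, i.e. the direction decision on the target pulse signal."""
    hidden = lif_layer(spikes, w_hidden)
    output = lif_layer(hidden, w_out)
    return int(np.argmax(output.sum(axis=0)))
```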
A spiking neural network on a chip or in other hardware cannot work directly (that is, perform accurate inference on the input environmental signal): its neuron and synapse modules/units are realized only as hardware circuits, and the network's neuron and synapse modules/units must first be partitioned and assembled, their connection relations defined, and the weights stored in the synapse circuits, the corresponding time constants, and so on configured. Corresponding parameters must therefore be obtained by training in advance, whether supervised or unsupervised, on-chip or off-chip. The invention takes supervised training as an example, without being limited to it: the network configuration parameters obtained by training are mapped to hardware such as a chip, and once the chip receives signals collected from the environment it runs its internal spiking neural network to complete the inference automatically.
In a preferred embodiment of the invention, the spiking neural network is trained on sample sound source data: for example, a pulse code sequence at specific frequencies is used as the network input and the SNN is trained with the DoA direction as target, yielding a sound source localization model. The specific frequency is at least one frequency value, e.g. f1 and/or f2 and/or f3, with f1, f2, and f3 mutually unequal. The sample sound source data comprise audio signals for each microphone in the array from different DoA directions.
For each sample, the audio signals at the different microphones are channel-decomposed; zero-crossing pulse coding is applied to the audio components (sound source components) on each frequency channel or on the target frequency channel (obtained, e.g., by activity detection as described above), yielding a sample pulse signal or pulse sequence; the sample pulse signal is fed into the input layer of the spiking neural network; and after processing by the network, the output layer performs sound source direction estimation and outputs the predicted sector for the sample.
The positioning loss of the spiking neural network is then determined from the sector label of the sample and the predicted sector; the configuration parameters of the network are adjusted according to the loss; and iteration continues until the network satisfies a preset convergence condition, yielding the sound source localization model.
The configuration parameters of the spiking neural network include one or more of the following: the synapse weights, firing times, thresholds, decay time constants, etc. of the neurons of its input, intermediate, and output layers; the invention is not limited to these. The preset convergence condition may be that the positioning loss is less than or equal to a preset loss threshold, or that the iteration count reaches a preset count threshold. A simplified readout-training sketch follows.
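As a deliberately simplified stand-in for the training loop (only a linear readout on hidden-layer spike counts is fitted, with MSE loss and the two convergence checks named above; full SNN training would adjust all configuration parameters):

```python
import numpy as np

def train_readout(hidden_counts, labels, n_classes, lr=0.01,
                  max_iters=1000, loss_threshold=1e-4):
    """Fit output weights on hidden-layer spike counts with an MSE delta rule.

    hidden_counts : (N, H) spike counts from the hidden layer per sample.
    labels        : (N,) sector labels.
    Stops when the positioning loss falls below loss_threshold or the
    iteration budget is exhausted.
    """
    n = len(labels)
    w = np.zeros((hidden_counts.shape[1], n_classes))
    targets = np.eye(n_classes)[labels]          # one-hot sector targets
    for _ in range(max_iters):
        err = hidden_counts @ w - targets
        loss = float(np.mean(err ** 2))          # positioning loss
        if loss <= loss_threshold:
            break
        w -= lr * hidden_counts.T @ err / n      # gradient step on the readout
    return w
```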
In a preferred embodiment, taking a circular microphone array with M microphones, let η be an angular oversampling parameter, η = 1, 2, 3, …. The whole range of directions of arrival is quantized into ηM sectors, and these quantized DoAs serve as classification candidates. The ηM labels correspond to the ηM angular sectors of the DoA quantization, so the DoA resolution is approximately 360°/(ηM).
In a preferred embodiment, again taking a circular array as an example, a spatial coordinate system can be established with the geometric center of the microphone array as the origin; within the circular region of preset radius around the origin, one position is selected per step of angular resolution while rotating clockwise or counterclockwise, and the azimuth of each selected position is set as a sample sound source direction. The angular resolution may be determined by the type of microphone array and the number of microphones in it. For example, for a circular array with 8 microphones and η = 1, the DoA may be quantized into 8 sectors, each with an angular resolution of 45 degrees. The microphone array may also take other shapes; the invention is not limited in this regard.
Optionally, to improve the accuracy of the target sound source direction and reduce grating-lobe effects, the angular resolution may be determined from the type of microphone array, the number of microphones, and a preset oversampling parameter. Preferably, for a circular array with 8 microphones and oversampling parameter η = 2, the DoA can be quantized into 2×8 = 16 sectors, each with an angular resolution of 22.5 degrees, as in the sketch below.
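A small sketch of this sector quantization (the helper names are illustrative):

```python
def doa_to_sector(azimuth_deg, n_mics=8, eta=2):
    """Quantize a DoA azimuth into one of eta * M sector labels.

    With M = 8 microphones and eta = 2 this gives 16 sectors of 22.5 degrees,
    matching the example above.
    """
    width = 360.0 / (eta * n_mics)
    return int((azimuth_deg % 360.0) // width)

def sector_center(label, n_mics=8, eta=2):
    """Azimuth in degrees at the center of a sector label."""
    width = 360.0 / (eta * n_mics)
    return (label + 0.5) * width
```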
Direction estimation of the sound source data to be processed is then performed with the sound source localization model obtained by training the spiking neural network, or with spiking-neural-network hardware configured with the same or similar parameters as the trained network, yielding the target sound source direction. The network configuration parameters obtained by training may be deployed to the hardware directly, or after processing such as quantization.
For example, taking a circular microphone array: the directions of arrival may be divided into sectors according to a preset angular resolution and a preset oversampling parameter, the resulting sectors used as labels, and sample sound source data collected for each sector, so that each sample corresponds to a sector label. For a circular array with 8 microphones and oversampling parameter 2, the DoAs can be quantized into 2×8 = 16 sectors of 22.5 degrees each.
In an alternative embodiment, the positioning loss of the spiking neural network is determined from the sector label of the sample and the predicted sector, using either a mean-squared-error function or an MSE surround loss. The plain mean-squared-error function measures distance by sector label, so class-0 is closer to class-1 than to class-2 but far from class-15 despite their geometric adjacency. The MSE surround loss distinguishes geometrically adjacent sectors: for a circular microphone array, class-0 and class-1 are at the same distance as class-0 and class-15 (accounting for the circularity of the array), while the distance between class-0 and class-2 is larger. A sketch of this wrap-around distance follows.
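A sketch of the wrap-around sector distance underlying such a surround loss (the exact loss formulation in the patent is not given, so the squared circular distance here is an assumption):

```python
def mse_surround_loss(pred_sector, true_sector, n_sectors=16):
    """Squared wrap-around sector distance: on a 16-sector ring, class-0 is as
    close to class-15 as to class-1, unlike the plain mean-squared error."""
    d = abs(pred_sector - true_sector) % n_sectors
    d = min(d, n_sectors - d)          # circular distance on the ring
    return float(d ** 2)
```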
In some embodiments of the invention, considering that training the SNN directly is relatively difficult, an ANN may be built from the parameters of the spiking neural network, the target parameters obtained by training the ANN, and the parameters of the SNN in the electronic device adjusted to those target parameters, yielding the sound source localization model.
Optionally, during ANN training, to keep the parameters of the ANN and of the SNN in the electronic device consistent, the ANN parameters may be synchronized to the SNN after each training iteration: a sample pulse signal is input to the device's spiking neural network, which after processing outputs the predicted sector for the sample; the positioning loss is determined from the predicted sector and the sample's sector label; the ANN parameters are adjusted according to the loss; and the adjusted parameters are synchronized to the SNN in the device again.
Alternatively, training may be performed directly using the training method described in the applicant's prior application (application number CN202210789977.0), the content of which is incorporated by reference in its entirety.
With the sound source orientation method provided by the embodiments of the invention, direction estimation is performed on the pulse signals of the sound source data by a spiking neural network, yielding the target direction of the sound source data to be processed; power consumption is reduced while the accuracy of the direction estimate is guaranteed, robustness and processing speed improve, and the method remains easy to implement in low-power hardware.
The geometry of the microphone array affects its array resolution, which in turn affects the accuracy of direction-of-arrival estimation. A linear array can attain good resolution, but because all of its microphones lie on one line that resolution is asymmetric, as shown in fig. 6, a schematic of the array resolution of a linear microphone array according to an embodiment of the invention: when the sound source is in front of the linear array the resolution is high, but when the source is to its side the resolution is low; that is, the position of the source relative to the linear array affects its resolution. Similarly, distributed and cross-shaped arrays achieve good resolution at some angles but low resolution at others, which limits the applicability of linear, distributed, and cross-shaped arrays. For a circular array, by contrast, the resolution depends only on the number of microphones and the oversampling parameter, as described above, and not on the position of the source relative to the array. Therefore, to ensure the accuracy of sound source orientation, in certain types of embodiments the speech signals in the environment are collected by a circular microphone array as shown in fig. 1.
Furthermore, in practical scenarios such as a voice conference in a closed room, the microphone array cannot rely on a pre-designed waveform for localization, so the speaker's speech signal must be used to find the position or direction of arrival of the sound source. Speech, however, is highly non-stationary, with abrupt jumps in the time-frequency domain; the speaker's signal is not a harmonic signal of fixed frequency, and the dominant frequency of the source varies over time, as shown in the time-frequency diagram of a speech signal in fig. 7.
In a preferred embodiment of the invention, an adaptive wideband orientation/localization technique is therefore applied: the active frequency components in the signal are identified during each time interval and used for localization.
Specifically, fig. 8 is a schematic diagram of sound source orientation in an embodiment of the invention. The apparatus comprises a preprocessing module, a pulse coding module, and a spiking neural network processor, coupled in sequence. The preprocessing module preprocesses the signals received by the multiple microphones, e.g., by time-domain and/or frequency-domain analysis, and may further identify the active frequency components, also called target frequency components (referred to elsewhere herein as target frequency channels). The pulse coding module, coupled to the preprocessing module, performs zero-crossing pulse coding to obtain the pulse signals. The spiking neural network processor, on which a spiking neural network is deployed, performs sound source orientation from the multiple pulse signals.
In an embodiment, the preprocessing module includes a channel decomposition module comprising multiple (two or more) filter banks, each coupled to a different microphone and used for time-frequency analysis of the audio signal received by that microphone.
Optionally, the number of filter banks is less than or equal to the number of microphones. Optionally, the number of pulse signals is less than or equal to the number of filter banks. In the embodiments herein, the number of filter banks is equal to the number of microphones, but the invention is not limited thereto.
Fig. 9 is a schematic diagram of sound source orientation with preprocessing of the microphone signals according to an embodiment of the invention. As shown in fig. 9, the preprocessing module includes multiple (two or more) filter banks and an activity detection module (activity detector). The activity detection module is coupled to each filter bank and detects activity across all microphones, i.e., the activity of the signals received by all microphones at the different frequencies, in order to identify the target frequency component (also called the target frequency or active frequency component): the one or more frequency channels with the strongest activity in the signals.
In this embodiment, each filter bank corresponds to one microphone, and the filter banks perform channel decomposition on signals received by the corresponding microphone, and decompose sound source data to be processed into a plurality of frequency channels. The activity detection module performs activity detection based on the energy of the narrowband signal after channel decomposition to determine a target frequency channel.
The pulse coding module is coupled to the activity detection module and performs zero-crossing pulse coding on the signals filtered by the filter banks at the target frequency, obtaining a pulse signal or pulse sequence for the signal received by each filter bank's microphone; in other words, the pulse coding module zero-crossing-codes the audio component of each microphone's received signal at the target frequency. Specifically, the pulse coding module includes at least one zero-crossing pulse coding unit; the input of each unit is coupled through the corresponding activity detection unit to the output of the corresponding filter bank, and each unit zero-crossing-codes the component output by its filter bank on the target frequency channel to generate the corresponding pulse signal or pulse sequence.
In some types of embodiments of the invention, other circuitry may be coupled between the filter bank and the microphone, such as a low-noise amplifier (LNA) for low-noise amplification of the input audio. In addition, each parallel channel may include a rectifier coupled after the channel filter for rectifying its output.
Further, the filter may be a band-pass filter, a band-stop filter, a narrowband filter, or the like. Optionally, channel decomposition may divide the sound source data to be processed into different frequency intervals with a band-pass filter bank, each interval corresponding to a frequency channel; alternatively, a narrowband filter bank may divide the data into multiple narrow bands according to its bandwidth, each narrow band corresponding to a frequency channel.
For example, with as many filter banks as microphones, and each filter bank comprising 16 parallel band-pass filters (BPFs), each BPF corresponds to one channel, and each filter bank channel-decomposes the audio signal received by its microphone to obtain multiple frequency channels. The parallel channels are filtered by frequency band and the time-varying signal activity detected in the different bands; the BPF of each channel retains only the signal matched to its center frequency. The filter bank thus performs time-frequency analysis on its microphone's received audio signal, yielding the audio components on the 16 channels: for example, the signal received by microphone Mic 1 passes through BPF0 (center frequency f0), BPF1 (center frequency f1), …, yielding components N1_0, N1_1, ….
Optionally, the activity detection module selects, from the signals output by the filter banks on the frequency channels of the different frequency ranges, the target frequency channels satisfying a preset energy threshold. Specifically, each filter bank in the preprocessing module channel-decomposes the sound source data received by its microphone into multiple frequency channels; activity detection is performed on the energy of the decomposed narrowband signals to determine the target frequency channel; zero-crossing coding is applied to the narrowband signal of the target frequency channel; and the audio component in the target frequency channel of each microphone's data is taken as that microphone's preprocessed sound source data.
In certain types of embodiments, the activity detection module includes multiple (more than one) activity detection units, one per filter bank, each coupled to its corresponding filter bank. Each unit performs independent activity detection on the signals decomposed by its filter bank and determines the target frequency channel of the sound source data received by its microphone.
Fig. 10 is a schematic diagram of independent activity detection according to an embodiment of the invention. Each activity detection unit detects the activity of its microphone's audio components across the multiple frequency channels and selects the one or several (two or more) highest-energy channels as that microphone's initial target frequency channels.
In some embodiments, once each microphone's initial target frequency channel has been obtained, pulse coding is applied to the channel-decomposed audio component of each microphone on the target frequency channel, improving data sparsity and real-time performance without reducing the accuracy of sound source orientation.
In some embodiments, the initial target frequency channel of each microphone may be obtained from its activity detection unit; the active frequency (i.e., the target frequency) is then determined by screening the frequency ranges of the initial target channels across the microphones, and the target frequency channel of each microphone determined from the active frequency, so that all microphones' target channels share the same center frequency. The active frequency may be the center frequency of the signal in the frequency channel.
Alternatively, the occurrence count of each center frequency among the microphones' initial target channels may be tallied, and the one or several most frequent center frequencies determined as the active frequency; or the center frequency or frequencies with the largest frequency values may be determined as the active frequencies.

Optionally, from the center frequencies of the initial target channels of the microphones, the one or several with the highest frequency values are selected as the active frequency.

Optionally, the center frequencies of the microphones' initial target channels may be clustered into at least one cluster; the target cluster with the most elements is selected, and its corresponding center frequency determined as the active frequency. A voting sketch follows.
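A voting sketch for the counting rule above (exact equality of center frequencies is assumed; a real implementation might bin or cluster them):

```python
from collections import Counter

def active_frequency_by_vote(initial_center_freqs, top=1):
    """Most common per-microphone center frequency, per the counting rule above.

    initial_center_freqs : iterable of each microphone's initial center frequency.
    """
    counts = Counter(initial_center_freqs)
    return [f for f, _ in counts.most_common(top)]
```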
In some types of embodiments, after the active frequency has been determined, the center frequency of each microphone's initial target channel is compared with it: if they match, the initial channel is taken as that microphone's target frequency channel; if not, the channel whose center frequency matches the active frequency is selected from that microphone's frequency channels. Matching may mean that the center frequency equals the active frequency, or that their difference is at most a preset frequency difference.
Because independent activity detection over each microphone's frequency channels cannot guarantee that the initial target channels of all microphones share the same center frequency, a subsequent processing step is needed to align them, adding to the preprocessing. Based on this, to ensure the accuracy of pulse coding while further simplifying the operation, in some preferred embodiments the activity detection module performs joint activity detection after the multiple filter banks.
Fig. 11 is a schematic diagram of joint activity detection according to an embodiment of the invention. The filter banks channel-decompose the sound source data received by their corresponding microphones, and the activity detection module performs joint activity detection across the decomposed frequency channels of all microphone signals to determine the target frequency; that is, it detects activity jointly across all microphones corresponding to the filter banks, based on the activity of all received audio signals at the different frequencies. The target frequency is the one or several frequencies, obtained by joint detection, that satisfy a preset condition, and the target frequency channel of each filter bank, or of its corresponding microphone, is determined from the active frequency.
Alternatively, joint activity detection may tally the accumulated energy of all microphones per frequency channel, in other words compute the energy sum of the audio signals received by all microphones for each frequency. The preset condition may be that the energy sum is greater than or equal to a preset energy-sum threshold, or that it is the maximum energy sum; the preset threshold may be any one of the average, median, mode, second-largest, third-largest, or fourth-largest of the energy sums across the frequency channels.
Alternatively, joint activity detection may average the signal energy of all microphones per frequency channel; the preset condition may then be that the average is greater than or equal to a preset average threshold, or that it is the maximum average.
Optionally, taking energy sums as the example: the activity detection module accumulates, for the frequency of each of the multiple frequency channels of the sound source data, the energy of all microphones at that frequency, obtaining the energy sum per frequency; from these energy sums it determines the active frequency as the one whose sum satisfies the preset condition; and it determines the target frequency channel of the sound source data from the active frequency.
The frequency of each channel of the sound source data received by each microphone can then be compared with the active frequency, and the channel whose frequency matches it determined as that microphone's target frequency channel.
For example, taking a circular microphone array of 8 microphones: channel decomposition is performed on the sound source data received by each microphone according to its corresponding filter bank, yielding the audio components of that microphone's data on multiple frequency channels. The activity detection module performs joint activity detection, selects an active frequency satisfying the preset condition according to the energy sum of the audio components received by all microphones on each frequency channel, and obtains each microphone's target frequency channel from the active frequency.
For example, for each frequency channel f, the cumulative energy of all microphones over a window of size w is computed as

E_f(t) = Σ_{m=1…M} Σ_{τ=t−w+1…t} S_{m,f}(τ)²,

where M is the number of microphones in the array, t is the discrete time, and S_{m,f}(t) denotes the signal component with center frequency f received by the m-th microphone at time t. After the energy sums of all microphones under the different frequency channels have been computed, the active frequencies are selected as the k channels with the largest sums,

f_{a1}(t), f_{a2}(t), …, f_{ak}(t) = argmax_{f ∈ {0, 1, …, F−1}} E_f(t) (taking the k largest),

where k is a positive integer with 1 ≤ k ≤ M.
Optionally, when k = 1, the single active frequency with the largest energy sum across all microphones' frequency channels is selected; the channel of each microphone whose center frequency matches it is determined as that microphone's target frequency channel, and zero-crossing pulse coding of the signals in those channels yields the pulse sequence. The number of pulse signals in the sequence then equals the number of microphones in the array: for a circular array of 8 microphones with k = 1, 8 pulse signals with different delays are obtained.
Optionally, when k ≥ 2, the at least two active frequencies with the largest energy sums are selected; the channels of each microphone whose center frequencies match them are determined as that microphone's target frequency channels, and zero-crossing pulse coding of the signals in those channels yields the pulse sequence. The number of pulse signals then equals k times the number of microphones in the array: for example, with k = 2, 8×2 = 16 pulse signals are obtained, 8 with different delays associated with the first active frequency component and another 8 associated with the second. A sketch of the top-k selection follows.
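A sketch of the windowed energy sum E_f(t) and top-k selection, assuming the decomposed components are stored as an (M, F, T) array:

```python
import numpy as np

def joint_activity_topk(S, w, t, k=1):
    """Windowed energy per frequency channel, summed over all microphones.

    S : (M, F, T) array of narrowband components S_{m,f}(t).
    w : window length in samples; t : current discrete time.
    Returns the indices of the k channels with the largest E_f(t), plus E.
    """
    window = S[:, :, max(0, t - w + 1):t + 1]
    E = np.sum(window ** 2, axis=(0, 2))   # E_f(t): sum over mics and window
    topk = list(np.argsort(E)[::-1][:k])   # the k active frequency channels
    return topk, E
```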
When the number of filter banks, or the number of filters in each bank, is large, performing joint activity detection over all microphones' frequency channels increases the data throughput of the activity detection module and lengthens the preprocessing, which affects the real-time performance of the sound source orientation method.
To preserve real-time performance, in some embodiments the activity detection module includes at least two activity detection units, each coupled to at least one filter bank. Each unit performs local joint activity detection on the frequency channels output by its filter bank(s), obtaining the active frequency among those channels; the target frequency is then determined from the active frequencies across all filter banks, and each microphone's target frequency channel obtained from the target frequency.
Optionally, the microphones may be grouped according to the number of activity detection units into at least two microphone combinations, each corresponding to one unit; each unit determines, by the joint activity detection method above, the target frequency of each microphone in its combination from the center frequencies of the channels into which that combination's microphones were decomposed. Each combination includes at least one microphone, and the combinations may contain equal or unequal numbers of microphones.
Optionally, once each microphone's target frequency has been obtained, the target frequencies may be clustered; the target cluster with the most elements is selected and its corresponding target frequency determined as the active frequency.

Alternatively, the occurrence count of each target frequency may be tallied and the most frequent one determined as the active frequency.

Alternatively, the largest target frequency may be determined as the active frequency. In some embodiments of the invention, a filter bank channel-decomposes the sound source data received by each microphone into audio components on multiple frequency channels; the active frequency components are identified by joint activity detection; and zero-crossing pulse coding is applied to each microphone's audio components on the active frequency channels, improving the speed and accuracy of sound source orientation while reducing power consumption and cost.
Illustratively, taking one microphone of the microphone array as an example, the processing includes at least the following steps. Channel decomposition is performed on the sound source data received by the microphone, yielding audio components (or sound source components) on a plurality of channels. According to the target frequency identified by activity detection (a frequency satisfying a preset condition), the channel corresponding to the active frequency, which may be called the target frequency channel, is selected from the plurality of channels. Zero-crossing pulse coding is then performed on the audio component on the target frequency channel of each microphone, giving a pulse signal or pulse sequence for that microphone's received signal. Optionally, the active frequency is identified by independent, joint, or local joint activity detection.
For example, suppose activity detection selects one active frequency. At time t1, f4 is identified as the active frequency, so for every microphone the audio component on its BPF4 channel (center frequency f4) produces a pulse signal. At a later time t2, f5 is identified as the active frequency, and the audio component on each microphone's BPF5 channel (center frequency f5) produces a pulse signal. As time passes, these pulses are concatenated, forming a pulse train.
Alternatively, the preset condition may be that a channel's energy accumulation sum is among the largest one or several (two or more), or that it exceeds a preset energy threshold. It can be understood that a frequency channel satisfying the preset condition is a target frequency channel. As described above, whether the preset condition is satisfied may be determined based on the first threshold and the second threshold. For the target frequency channel, the audio component of each microphone's received signal on that channel is an effective frequency component. A compact sketch of this per-microphone pipeline follows.
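The sketch below wires the steps together for one microphone, assuming a filter-bank function and a zero-crossing encoder are supplied (both illustrative names, not interfaces defined by the patent):

```python
def encode_microphone(x, fs, filter_bank, target_channel, zc_encode):
    """Channel decomposition -> select target frequency channel ->
    zero-crossing pulse coding, for a single microphone signal x."""
    channels = filter_bank(x, fs)          # (num_channels, num_samples)
    component = channels[target_channel]   # effective frequency component
    return zc_encode(component, fs)        # pulse times for this microphone
```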
When the sound source direction is estimated with a pulse neural network (a spiking neural network), the sound source data to be processed must first be converted into a pulse feature set, i.e., into a pulse signal; direction estimation is then performed on the pulse signal by the pulse neural network to obtain the target sound source direction of the sound source data to be processed. To improve the accuracy of the sound source direction, in some embodiments the preprocessed sound source data may also be zero-cross coded based on local maxima. Specifically, the zero-crossing coding method based on local maxima includes at least steps SB111 to SB115:
Step SB111: determine a plurality of target signal point sets for each microphone's preprocessed sound source data according to the signal values of the signal points in that data. Optionally, a target signal point set is a group of signal points whose signal values decrease continuously over a preset period.
In some embodiments of the present invention, to reduce the data throughput, downward zero crossings may be extracted from the preprocessed sound source data; illustratively, signal points with continuously decreasing signal values are selected from the preprocessed sound source data to obtain the target signal point sets. Specifically, determining the target signal point sets based on downward zero crossings includes:

comparing the signal values of the signal points in the microphone's preprocessed sound source data, and identifying the signal points whose signal values decrease continuously; and

grouping those continuously decreasing signal points according to their corresponding time information, thereby obtaining a plurality of target signal point sets.
In some embodiments, channel decomposition and activity detection may first be applied to the sound source data to be processed received by each microphone, following the preprocessing method above; the target signal point sets are then determined from the signal values of the signal points in the audio component on each microphone's target frequency channel.
Optionally, a target signal point set may instead be a group of signal points whose signal values increase continuously over a preset period.
In some embodiments of the present invention, to reduce the data throughput, upward zero crossings may be extracted from the preprocessed sound source data; illustratively, signal points with continuously increasing signal values are selected from the preprocessed sound source data to obtain the target signal point sets. Specifically, determining the target signal point sets based on upward zero crossings includes:

comparing the signal values of the signal points in the microphone's preprocessed sound source data, and identifying the signal points whose signal values increase continuously; and

grouping those continuously increasing signal points according to their corresponding time information, thereby obtaining a plurality of target signal point sets.
Step SB112: for each target signal point set, compute the sum of the signal values of its signal points, using the signed (not absolute) values, which may be positive, negative, or zero.
Step SB113: compare the sums of signal values of the target signal point sets, and determine the target signal point having a local maximum in the target signal point sets, together with its corresponding time information.
In some embodiments, the absolute values of the sums may be compared, the signal point whose sum has the largest absolute value is determined as the target signal point having the local maximum, and the time information of that target signal point is obtained from its corresponding time point.

In some embodiments, the signed sums may be compared, the signal point with the largest sum is determined as the target signal point having the local maximum, and its time information is obtained from the corresponding time point.

In some embodiments, the signed sums may be compared, the signal point with the smallest sum is determined as the target signal point having the local maximum (in magnitude), and its time information is obtained from the corresponding time point.
Illustratively, taking the case where the signal point with the largest sum is determined as the target signal point having the local maximum, step SB113 includes:

for each target signal point set, comparing the sums of signal values of the signal points in the set and determining a candidate target signal point with the initially largest sum;

determining, from the time information of the candidate target signal point, a candidate time period within which the set's local maximum lies; and

comparing the sums of signal values of the signal points whose time information falls within the candidate time period, finding the signal point with the largest sum, and determining it as the target signal point having the local maximum in that set.
Step SB114: repeat the comparison of step SB113 for every target signal point set, obtaining for each set the target signal point having a local maximum and its corresponding time information.
Step SB115: determine the zero crossings in the sound source data to be processed received by the microphone, together with the time information of each zero crossing, from the target signal points having local maxima in the target signal point sets and their corresponding time information.
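A hedged, self-contained sketch of steps SB111 to SB115 for downward zero crossings follows; the run grouping, the local-maximum test over neighbouring runs, and the choice of the first non-positive sample as the crossing are illustrative simplifications rather than the exact patented formulation:

```python
import numpy as np

def downward_zero_crossing_times(x, fs):
    """x: preprocessed (band-limited) signal, fs: sample rate in Hz.
    Returns the times (s) of noise-robust downward zero crossings."""
    # SB111: collect runs of consecutively decreasing samples.
    runs, s = [], 0
    for i in range(1, len(x)):
        if x[i] >= x[i - 1]:
            if i - s > 1:
                runs.append((s, i))
            s = i
    if len(x) - s > 1:
        runs.append((s, len(x)))
    if not runs:
        return np.array([])
    # SB112: score each run by the signed sum of its sample values.
    sums = np.array([x[a:b].sum() for a, b in runs])
    # SB113/SB114: keep runs whose |sum| is a local maximum among
    # neighbouring runs, suppressing small noise-induced runs.
    mags = np.abs(sums)
    keep = [i for i in range(len(runs))
            if mags[i] >= mags[max(i - 1, 0)]
            and mags[i] >= mags[min(i + 1, len(runs) - 1)]]
    # SB115: within each kept run that spans positive to non-positive
    # values, take the first non-positive sample as the zero crossing.
    times = []
    for i in keep:
        a, b = runs[i]
        seg = x[a:b]
        neg = np.flatnonzero(seg <= 0)
        if neg.size and seg[0] > 0:
            times.append((a + neg[0]) / fs)
    return np.asarray(times)
```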
In some embodiments of the present invention, pulse encoding may then be performed according to the zero crossings in the preprocessed sound source data and the time information of each zero crossing, and the pulse signal of the sound source data to be processed is obtained as in step SA113 above.
By performing zero-crossing pulse coding based on local maxima, the embodiment of the invention can eliminate spurious zero crossings caused by noise, improving the quality of the pulse signal and thereby ensuring the accuracy of the target sound source direction.
To improve the performance of the pulse neural network and ensure the accuracy of sound source direction estimation, in some embodiments the pulse signal produced by zero-crossing pulse coding may be further corrected with a long short-term memory (LSTM) network; the corrected pulse sequence is then input into the pulse neural network for direction estimation, yielding the target sound source direction of the sound source data to be processed.
Fig. 12 is a schematic diagram of a structure for performing sound source localization based on a long short-term memory network and a pulse neural network according to an embodiment of the present invention. Based on this embodiment, obtaining the target sound source direction of the sound source data to be processed may further include:
inputting the pulse signal of the sound source data to be processed into a preset feature extraction module, which performs feature extraction on the pulse signal to obtain a pulse feature sequence; and
inputting the pulse feature sequence into the pulse neural network for direction estimation, obtaining the target direction of the sound source data to be processed.
The preset feature extraction module is constructed based on a long short-term memory network. The pulse signal of the sound source data to be processed is input into the feature extraction module; hidden states are extracted from the pulse signal via the input gates, output gates, and forget gates of the module, and the pulse feature sequence is generated by correcting the pulse signal based on these hidden states. In some embodiments of the invention, the feature extraction module may be trained in a supervised manner.
After the pulse feature sequence is obtained, it can be input into the pulse neural network, where direction estimation is performed through the input layer, intermediate layer, and output layer, yielding the target direction of the sound source data to be processed.
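A minimal PyTorch sketch of such an LSTM-based correction stage, under the assumption that pulse trains are presented as a (batch, time, microphones) tensor; the layer sizes and the sigmoid read-out are illustrative, and the downstream pulse (spiking) network is not shown:

```python
import torch
import torch.nn as nn

class PulseFeatureExtractor(nn.Module):
    """LSTM that corrects zero-crossing pulse trains before they are
    fed to the pulse neural network for direction estimation."""
    def __init__(self, num_mics: int = 8, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(num_mics, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, num_mics)

    def forward(self, pulses: torch.Tensor) -> torch.Tensor:
        # pulses: (batch, time, num_mics) binary pulse trains.
        h, _ = self.lstm(pulses)            # gated hidden states per step
        return torch.sigmoid(self.proj(h))  # corrected pulse feature sequence

# Example: a batch of 4 one-second pulse trains at 1 kHz from 8 microphones.
features = PulseFeatureExtractor()(torch.rand(4, 1000, 8).round())
```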
To better explain the sound source orientation technology provided by the embodiments of the invention, a preferred embodiment is applied to an audio conference scenario in a closed room, and the target direction of the sound source data to be processed in that scenario is obtained, as shown in Figs. 13 to 16.
Fig. 13 shows simulation test results of sound source orientation on a low-frequency channel, illustrating the effect of the sound source orientation method provided by the embodiment of the invention when the DoA of the sound source signal changes abruptly from 90 degrees to -90 degrees; directions are expressed in radians (1 radian ≈ 57.3°). From bottom to top, Fig. 13 shows the sound source signals received by each microphone, the target sound source direction detected by the algorithm, and the zero-crossing pulse-coded pulse signal of each microphone.
Fig. 14 compares sound source orientation on a low-frequency channel between a brain-like chip implemented in low-power hardware and a simulation model. From bottom to top, Fig. 14 shows the result of sound source orientation by the sound source orientation model running on a computer using the low-frequency channel, and the result obtained by the low-power hardware chip using the same channel; directions are expressed in radians (1 radian ≈ 57.3°). As can be seen from Fig. 14, in a real environment the sound source orientation technology provided by the embodiment of the invention achieves a good orientation result on the chip, and the difference from the computer simulation result is negligibly small.
Fig. 15 shows simulation test results of sound source orientation on a high-frequency channel, illustrating the effect of the method when the DoA of the sound source signal changes abruptly from 90 degrees to -90 degrees; directions are expressed in radians (1 radian ≈ 57.3°). From bottom to top, Fig. 15 shows the sound source signals received by each microphone, the target sound source direction detected by the algorithm, and the zero-crossing pulse-coded pulse signal of each microphone.
Fig. 16 compares sound source orientation on a high-frequency channel between the brain-like chip implemented in low-power hardware and the simulation model. From bottom to top, Fig. 16 shows the result of sound source orientation by the model running on a computer using the high-frequency channel, and the result obtained by the low-power hardware chip using the same channel; directions are expressed in radians (1 radian ≈ 57.3°). As can be seen from Fig. 16, in a real environment the sound source orientation technology achieves a good orientation result on the chip, and the difference from the computer simulation result is negligibly small.
As Figs. 13 to 16 show, the low-power hardware implementation of the sound source orientation technique of the present invention responds quickly to abrupt DoA changes on both high-frequency and low-frequency channels.
In the sound source orientation method provided by the embodiment of the invention, direction estimation is performed on the pulse signal of the sound source data to be processed by a pulse neural network, yielding the target direction of the sound source data. This ensures the accuracy of the direction estimate, reduces power consumption, and achieves better robustness and faster processing. Relative time-delay information in the sound source data is captured through pulse coding and used for direction estimation, improving accuracy; in particular, the zero-crossing-based pulse coding method effectively captures the phase information required for sound source direction estimation.
To better implement the sound source orientation method provided by the embodiments of the invention, a sound source orientation device is provided on the basis of that method. The sound source orientation device comprises:
the coding module is used for carrying out zero-crossing coding on sound source data to be processed to obtain pulse signals of the sound source data to be processed;
And the estimation module is used for estimating the direction of the pulse signal based on a sound source positioning model obtained through training of the pulse neural network to obtain the target sound source direction of the sound source data to be processed.
In some embodiments of the present invention, the encoding module includes:
the zero crossing point detection unit is used for detecting zero crossing points of the sound source data to be processed according to the signal values of all signal points in the sound source data to be processed, so as to obtain the zero crossing points in the sound source data to be processed and the time information corresponding to the zero crossing points;
and the encoding unit is used for carrying out pulse encoding according to the zero crossing points in the sound source data to be processed and the time information corresponding to each zero crossing point to obtain the pulse signal of the sound source data to be processed.
In some embodiments of the present invention, a zero-crossing point detection unit is configured to determine, according to signal values of signal points in sound source data to be processed, a plurality of target signal point sets of the sound source data to be processed; obtaining the sum of signal values corresponding to the signal points in the target signal point sets according to the signal values of the signal points in the target signal point sets; comparing the sum of signal values corresponding to the signal points in the target signal point sets, and determining target signal points with local maxima in the target signal point sets and time information corresponding to the target signal points; and determining zero crossing points in the sound source data to be processed and time information corresponding to the zero crossing points according to the target signal points with local maxima in the target signal point sets and the time information corresponding to the target signal points.
In some embodiments of the present invention, the zero-crossing point detection unit is configured to compare signal values of signal points in sound source data to be processed, and determine signal points with continuously decreasing signal values in the sound source data to be processed; and grouping the signal points with continuously decreasing signal values in the sound source data to be processed according to the time information corresponding to the signal points with continuously decreasing signal values in the sound source data to be processed, so as to obtain a plurality of target signal point sets.
In some embodiments of the present invention, the zero-crossing point detection unit is configured to compare, for each of the target signal point sets, a sum of signal values corresponding to signal points in the target signal point set, and determine a candidate target signal point having a sum of initial maximum signal values in the target signal point set; according to the time information corresponding to the candidate target signal points, determining a candidate time period with a local maximum value in the target signal point set; and comparing the sum of the signal values corresponding to the signal points corresponding to the time information in the candidate time period, determining the signal point with the sum of the maximum signal value, and determining the signal point with the sum of the maximum signal value as the target signal point with the local maximum value in the target signal point set.
In some embodiments of the present invention, the sound source orientation device further comprises:
the preprocessing module is used for carrying out channel decomposition on the initial sound source data to obtain a plurality of frequency channels; determining a target frequency channel with the frequency meeting the preset frequency requirement from the plurality of frequency channels according to the frequency corresponding to each frequency channel; and determining the sound source data of the target frequency channel as the sound source data to be processed.
In some embodiments of the present invention, the preprocessing module is configured to perform filtering processing on initial sound source data through a band-pass filter bank, and divide the initial sound source data into a plurality of frequency channels.
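A minimal sketch of such a band-pass filter bank using SciPy; the Butterworth order, the Q factor, and the function name are illustrative assumptions rather than parameters specified by the patent:

```python
import numpy as np
from scipy.signal import butter, sosfilt

def bandpass_filter_bank(x, fs, center_freqs, q=4.0):
    """Split one microphone signal x (sampled at fs Hz) into one
    band-limited channel per center frequency."""
    channels = []
    for fc in center_freqs:
        bw = fc / q                       # bandwidth from the Q factor
        sos = butter(2, [fc - bw / 2, fc + bw / 2],
                     btype="bandpass", fs=fs, output="sos")
        channels.append(sosfilt(sos, x))
    return np.stack(channels)             # (num_channels, num_samples)

# Example: 16 kHz audio split into four channels.
chans = bandpass_filter_bank(np.random.randn(16000), 16000,
                             [250.0, 500.0, 1000.0, 2000.0])
```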
In some embodiments of the present invention, the estimation module is configured to input the pulse signal of the sound source data to be processed to the feature extraction module, which performs feature extraction on the pulse signal to obtain a pulse feature sequence; the feature extraction module is constructed based on a long short-term memory network. The pulse feature sequence is then input into the sound source positioning model for direction estimation, obtaining the target direction of the sound source data to be processed.
In the sound source orientation device provided by the embodiment of the invention, direction estimation is performed on the pulse signal of the sound source data to be processed by a pulse neural network, yielding the target direction of the sound source data. This ensures the accuracy of the direction estimate, reduces power consumption, and achieves better robustness and faster processing. Relative time-delay information in the sound source data is captured through pulse coding and used for direction estimation, improving accuracy; in particular, the zero-crossing-based pulse coding method effectively captures the phase information required for sound source direction estimation.
To better apply the sound source orientation method provided by the embodiments of the invention, a sound source signal separation method is provided on its basis. Specifically, as shown in Fig. 17, which is a schematic flow chart of the sound source signal separation method provided by an embodiment of the present invention, the method includes at least steps S300 to S500:
step S300, estimating the sound source direction of the sound source data to be separated, and determining the candidate sound sources corresponding to the sound source data to be separated and the target sound source direction of each candidate sound source.
In some embodiments of the present invention, the sound source direction of the sound source data to be separated may be estimated according to the sound source direction method in any of the above embodiments, and the candidate sound sources corresponding to the sound source data to be separated and the target sound source direction of each candidate sound source may be determined.
Step S400: perform sound source separation according to the positions of the channels used to collect the sound source data to be separated and the target sound source directions of the candidate sound sources, obtaining the sound signal of each candidate sound source.
Candidate sound sources are the sound sources estimated from the sound source data to be separated that may exist in the current environment; they include the target sound source.
In some embodiments of the present invention, there are various ways to perform sound source separation on the sound source data to be separated; exemplary ways include:
(1) Separation based on independent subspace analysis: sound source separation is performed according to the positions of the collection channels and the target sound source directions of the candidate sound sources, obtaining the sound signal of each candidate sound source.

(2) Separation based on non-negative matrix factorization: sound source separation is performed according to the positions of the collection channels and the target sound source directions of the candidate sound sources, obtaining the sound signal of each candidate sound source.

(3) Separation based on independent vector analysis with auxiliary-function optimization: sound source separation is performed according to the positions of the collection channels and the target sound source directions of the candidate sound sources, obtaining the sound signal of each candidate sound source.
It should be noted that the above separation methods are only illustrative and do not limit the sound signal processing method provided by the embodiments of the invention; for example, separation may also be performed with an overdetermined independent vector analysis method optimized by an auxiliary function, again according to the positions of the collection channels and the target sound source directions of the candidate sound sources.
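As one possible realization of alternative (3), the sketch below uses the AuxIVA implementation in the pyroomacoustics library; the STFT parameters are illustrative, and the exact call signatures are assumptions about that library rather than part of the patented method:

```python
import numpy as np
import pyroomacoustics as pra

def separate_candidates(mic_signals, n_src=2, nfft=1024):
    """mic_signals: list of equally long 1-D arrays, one per microphone.
    Returns one estimated time-domain signal per candidate source."""
    hop = nfft // 2
    # STFT of each channel, stacked as (frames, frequencies, channels).
    X = np.stack([pra.transform.stft.analysis(s, nfft, hop)
                  for s in mic_signals], axis=-1)
    # Auxiliary-function-based independent vector analysis (AuxIVA).
    Y = pra.bss.auxiva(X, n_src=n_src, n_iter=20)
    return [pra.transform.stft.synthesis(Y[:, :, k], nfft, hop)
            for k in range(n_src)]
```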
Step S500, determining a target sound source from a plurality of candidate sound sources according to the sound signals of the candidate sound sources.
In some embodiments of the present invention, the target sound source may be determined from the candidate sound sources by evaluating the sound quality of each candidate's sound signal and selecting according to the resulting sound quality scores. For example, the candidate sound source with the highest sound quality score may be selected as the target sound source.
Optionally, there are various ways to evaluate the quality of the sound signal of each candidate sound source, exemplary ones including:
(1) computing the signal-to-interference ratio of each candidate's sound signal;

(2) computing the signal-to-distortion ratio of each candidate's sound signal;

(3) computing the maximum likelihood ratio of each candidate's sound signal;

(4) computing a cepstrum-based distance measure of each candidate's sound signal;

(5) computing the frequency-weighted segmental signal-to-noise ratio of each candidate's sound signal;

(6) computing a perceptual evaluation of speech quality (PESQ) score of each candidate's sound signal;

(7) computing the kurtosis of each candidate's sound signal;

(8) computing a probability score for the speech feature vector of each candidate's sound signal, where the probability score characterizes the probability that the candidate's sound signal is the speech signal of the target sound source.

In each case, the computed quantity serves as, or determines, the sound quality score of the candidate's sound signal.
It should be noted that these quality evaluation methods are merely exemplary and do not limit the sound signal processing method provided by the embodiments of the invention. In practice, the sound quality scoring method can be chosen according to the computational capability of the electronic device in the actual application scenario.
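A minimal sketch of step S500 using kurtosis (method (7) above) as the score; the metric choice and function name are illustrative, and any of the other listed measures could be substituted:

```python
import numpy as np
from scipy.stats import kurtosis

def pick_target_source(separated_signals):
    """Score each separated candidate signal and return the index of the
    best one. Speech tends to be more leptokurtic than diffuse noise."""
    scores = [float(kurtosis(s)) for s in separated_signals]
    return int(np.argmax(scores)), scores
```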
In the sound source signal separation method provided by the embodiment of the invention, estimating the sound source directions of the data to be separated improves the accuracy of the candidate source directions, and the final target sound source is determined from the quality score of each candidate. This further improves the accuracy of the separated sound source signals and addresses the low stability of signal separation.
To further apply the sound source orientation method provided by the embodiments of the invention to the audio conference scenario in a closed room, a sound source tracking method is provided. Specifically, as shown in Fig. 18, which is a schematic flow chart of the sound source tracking method provided by an embodiment of the present invention, the method includes at least steps S600 to S800:
in step S600, sound source data in a conference scene is received through a microphone array.
In step S700, sound source direction estimation is performed on the sound source data, and a target sound source direction corresponding to the sound source data is determined.
In some embodiments of the present invention, the sound source direction of the sound source data may be estimated according to the sound source direction method in any of the above embodiments, to determine the target sound source direction corresponding to the sound source data.
Step S800, adjusting the direction parameters of the sound source tracking device according to the target sound source direction.
The sound source tracking device is used to track a target sound source in an in-room audio conference and includes, but is not limited to, a camera, a microphone, and the like.
It will be appreciated that the sound source tracking device is in the same plane as the microphone array receiving the sound source.
In some embodiments of the present invention, the current direction parameters of the sound source tracking device may be obtained, target direction parameters may be derived from the target sound source direction corresponding to the sound source data, and the device's direction parameters are then adjusted according to the target and current direction parameters. The direction parameters include, but are not limited to, azimuth and pitch. Illustratively, Fig. 19 shows test results of sound source tracking by the brain-like chip implemented in low-power hardware of the present invention, with directions expressed in radians (1 radian ≈ 57.3°). The first plot from top to bottom in Fig. 19 shows the target direction of the real sound source at different times; the second and third plots show the tracking output before and after smoothing, respectively. As can be seen from Fig. 19, the sound source tracking method provided by the embodiment of the invention responds quickly to the sound source direction, realizing fast tracking of the sound source.
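A minimal sketch of the smoothing-and-adjustment step S800, assuming the device accepts an azimuth set-point in radians; the exponential smoothing factor and the class interface are illustrative assumptions:

```python
class SmoothedAzimuthTracker:
    """Exponentially smooth successive DoA estimates and emit an azimuth
    command for the tracking device (assumed coplanar with the array)."""
    def __init__(self, alpha: float = 0.2):
        self.alpha = alpha      # smoothing factor, 0 < alpha <= 1
        self.azimuth = 0.0      # current set-point in radians

    def update(self, doa_estimate: float) -> float:
        # Moving average suppresses jitter in the raw direction estimate.
        self.azimuth += self.alpha * (doa_estimate - self.azimuth)
        return self.azimuth     # new azimuth command for the device

tracker = SmoothedAzimuthTracker()
command = tracker.update(1.57)  # DoA jumps to roughly 90 degrees
```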
In the sound source tracking method provided by the embodiment of the invention, sound source direction estimation on the sound source data improves the accuracy of the source direction, and the direction parameters of the tracking device are adjusted quickly based on the target sound source direction, realizing fast tracking of the sound source.
An embodiment of the invention further provides a chip that implements any of the sound source orientation methods or the sound source signal separation method described above.
The chip is a neuromorphic (brain-like) chip, i.e., a chip developed by emulating the working mode of biological neurons; it is typically event-driven and is characterized by low power consumption, low-latency response, and no privacy leakage. Existing neuromorphic chips include Intel's Loihi, IBM's TrueNorth, and SynSense's DYNAP-CNN, among others, and the invention is not limited to these.
An embodiment of the invention further provides an electronic device that implements any of the sound source orientation or sound source signal separation methods described above, or that incorporates the chip described above.
Although the present invention has been described with reference to specific features and embodiments, various modifications, combinations, and substitutions can be made without departing from the invention. The scope of this application is not limited to the particular embodiments of the process, machine, manufacture, composition of matter, means, methods, and steps described in the specification; the methods and modules may be practiced in one or more associated, interdependent, or cooperating products, methods, and systems, or in preceding or subsequent stages.

The specification and drawings should accordingly be regarded as an introduction to some embodiments of the technical solutions defined by the appended claims, to be construed under the doctrine of the broadest reasonable interpretation, covering as far as possible all modifications, changes, combinations, and equivalents within the scope of the disclosure while avoiding unreasonable interpretations.

Those skilled in the art may further improve the technical solutions on the basis of the present invention, whether to achieve better technical results or to meet the needs of particular applications. However, even where such a partial improvement or design is inventive or progressive, any technical solution that relies on the technical idea of the present invention and covers the technical features defined in the claims falls within the protection scope of the present invention.

The features recited in the appended claims may be presented as alternative features, and the order of process steps or the organization of materials may be combined or rearranged. Those skilled in the art, having understood the present invention, can readily change the sequence of process steps or the organization of materials and then employ substantially the same means to solve substantially the same technical problem and achieve substantially the same technical effect; therefore, such modifications, changes, and substitutions shall be covered by the equivalents of the claims even if not specifically defined therein.

The steps and components of the embodiments have been described above generally in terms of their functions to clearly illustrate the interchangeability of hardware and software. The steps or modules described in connection with the embodiments disclosed herein may be implemented in hardware, software, or a combination of both; whether a given function is implemented as hardware or software depends on the particular application and the design constraints imposed on the solution. Those of ordinary skill in the art may implement the described functionality in different ways for each particular application, but such implementations do not depart from the scope of the claimed invention.

Claims (10)

1. A preprocessing device, characterized in that:
the preprocessing device is used for preprocessing the sound source data to be processed received by microphones to obtain preprocessed sound source data for each microphone; the preprocessing device comprises:
the channel decomposition module is used for carrying out channel decomposition on the sound source data to be processed received by each microphone;
the activity detection module is coupled with the channel decomposition module and is used for performing activity detection based on the channel-decomposed components of the sound source data to be processed received by each microphone, so as to obtain a target frequency; the target frequency is one or more frequencies;
and determining the sound source components of the sound source data to be processed received by each microphone in the target frequency channel as the preprocessed sound source data of each microphone.
2. The preprocessing device according to claim 1, wherein:
the activity detection module performs independent activity detection or joint activity detection or local joint activity detection;
the independent activity detection is as follows: counting energy or energy average values of sound source data to be processed received by each microphone under different frequency channels after channel decomposition, and obtaining initial active frequency meeting a first preset condition; obtaining the target frequency based on the initial active frequency corresponding to each microphone;
the joint activity detection is as follows: the energy sum of all the microphones under different frequency channels is counted, or the average value of the signal energy of all the microphones under different frequency channels is counted, so that the target frequency meeting the second preset condition is obtained;
the local joint activity detection is as follows: grouping the sound source data to be processed received by all microphones to obtain at least two sound source data combinations; the sound source data combination comprises at least one sound source data to be processed; the energy sum of all the sound source data in each sound source data combination under different frequency channels is counted, or the signal energy average value of all the sound source data in each sound source data combination under different frequency channels is counted, so that the active frequency of each sound source data combination is obtained; and determining a target frequency based on the active frequency of each sound source data combination.
3. The preprocessing device according to claim 2, wherein:
the first preset condition of the independent activity detection is that the energy or the energy average value is the maximum, or that the energy or the energy average value is greater than or equal to a second threshold;

the second preset condition of the joint activity detection is that the energy sum or the energy average value is the maximum, or that the energy sum or the energy average value is greater than or equal to a second threshold.
4. A preprocessing method, characterized in that it is used for preprocessing the sound source data to be processed received by microphones to obtain preprocessed sound source data for each microphone;
the preprocessing method comprises the following steps:
performing channel decomposition on the sound source data to be processed received by each microphone, so that the sound source data to be processed received by each microphone is decomposed into a plurality of frequency channels;
performing activity detection on the channel-decomposed components of the sound source data to be processed received by each microphone to obtain a target frequency; the target frequency is one or more frequencies;
and determining the sound source components of the sound source data to be processed received by each microphone in the target frequency channel as the preprocessed sound source data of each microphone.
5. The preprocessing method according to claim 4, wherein performing channel decomposition on the sound source data to be processed received by each microphone comprises:
and filtering the sound source data to be processed received by each microphone through a band-pass filter bank, and dividing the sound source data to be processed received by the microphone into a plurality of frequency channels.
6. A sound source orientation device, comprising:
a pretreatment device according to any one of claims 1 to 3;
the encoding module is coupled with the preprocessing device and is used for carrying out pulse encoding on the sound source data preprocessed by each microphone to obtain pulse signals corresponding to the sound source data to be processed received by each microphone;
and the estimation module is coupled with the encoding module, and is used for estimating the direction of the pulse signal obtained by pulse encoding based on the pulse neural network to obtain the target sound source direction of the sound source data to be processed.
7. The sound source orientation device according to claim 6, wherein:
the coding module performs zero-crossing pulse coding.
8. A sound source orientation method, characterized in that the sound source orientation method comprises the preprocessing method according to claim 4 or 5, obtaining the preprocessed sound source data of each microphone;
performing pulse coding on the preprocessed sound source data of each microphone to obtain pulse signals corresponding to the sound source data to be processed received by each microphone;
and based on a pulse neural network, estimating the direction of a pulse signal obtained by pulse coding to obtain the target sound source direction of the sound source data to be processed.
9. A chip comprising the preprocessing device of any one of claims 1 to 3 or the sound source orientation device of any one of claims 6 to 7.
10. An electronic device comprising the preprocessing device of any one of claims 1 to 3, or the sound source orientation device of any one of claims 6 to 7, or the chip of claim 9.