WO2020024816A1 - Audio signal processing method, apparatus, device and storage medium - Google Patents

Audio signal processing method, apparatus, device and storage medium

Info

Publication number: WO2020024816A1
Application number: PCT/CN2019/096813
Authority: WIPO (PCT)
Prior art keywords: audio, audio signal, signal, target, correlation
Other languages: English (en), French (fr)
Inventors: 田彪, 银鞍, 余涛
Original assignee: 阿里巴巴集团控股有限公司 (Alibaba Group Holding Limited)
Application filed by 阿里巴巴集团控股有限公司
Publication of WO2020024816A1

Classifications

    • G PHYSICS › G10 MUSICAL INSTRUMENTS; ACOUSTICS › G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise (via G10L21/00 › G10L21/02 › G10L21/0208)
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech (via G10L15/00)
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation (via G10L21/00)
    • G10L25/84 Detection of presence or absence of voice signals for discriminating voice from noise (via G10L25/00 › G10L25/78)
    • G10L2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 Microphone arrays; Beamforming

Definitions

  • The present invention relates to the technical field of data processing, and in particular to a method, an apparatus, a device, and a storage medium for processing audio signals.
  • With its continuous development, audio recognition technology has been adopted rapidly in areas such as car driving, smart homes, and smart business systems, where it can quickly and accurately trigger the corresponding functions.
  • Existing audio recognition technology can detect specific information about multiple interference sources and thereby separate the interference signals of the detected interference sources to obtain the target audio signal.
  • In practice, however, interference sources are complex and changeable: the quality of the audio signal collected by this signal processing method is poor, the signal-to-noise ratio for audio recognition is low, and practicability is limited.
  • Embodiments of the present invention provide an audio signal processing method, apparatus, device, and storage medium, which can implement adaptive audio enhancement in a noisy environment with multiple interference sources and improve the signal-to-noise ratio of the audio signal.
  • an audio signal processing method including:
  • receiving audio signals using multiple sound collection devices of a microphone array and determining whether the audio signals include a target audio signal; if the audio signals include the target audio signal, determining the correlation of the multiple sound collection devices corresponding to the audio signals; and using the correlation of the multiple sound collection devices corresponding to the audio signals to perform audio enhancement processing on the audio signals to obtain an enhanced audio signal.
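The three claimed steps (detect the target, estimate inter-channel correlation, enhance) can be sketched as a minimal pipeline. This is an illustrative sketch, not the patented implementation: the function names and the toy stand-ins for the detector, correlation estimator, and enhancer are assumptions made for the example.

```python
import numpy as np

def process_frame(channels, detect_target, estimate_correlation, enhance):
    """Sketch of the claimed pipeline: detect -> correlate -> enhance.

    channels: (n_mics, n_samples) array holding one multi-channel frame.
    The three callables are placeholders for the target-audio detector,
    the inter-channel correlation estimator, and the enhancer.
    """
    if not detect_target(channels):          # step 1: is target audio present?
        return None                          # skip frames without target audio
    corr = estimate_correlation(channels)    # step 2: inter-channel correlation
    return enhance(channels, corr)           # step 3: correlation-driven enhancement

# toy stand-ins, only to show the control flow
frame = np.random.randn(4, 160)              # 4 microphones, 160 samples
out = process_frame(
    frame,
    detect_target=lambda x: np.mean(x**2) > 0.0,     # trivial energy gate
    estimate_correlation=lambda x: np.corrcoef(x),   # channel correlation matrix
    enhance=lambda x, c: x.mean(axis=0),             # naive average "beamformer"
)
```

The stand-in enhancer simply averages channels; the patent's actual enhancement uses the estimated correlation, which the averaging lambda ignores for brevity.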
  • an audio signal processing apparatus including:
  • An audio signal detection module is used to receive audio signals using the multiple sound collection devices of a microphone array and determine whether a target audio signal is included in the audio signals; a correlation determination module is used to determine, if the audio signals include the target audio signal, the correlation of the multiple sound collection devices corresponding to the audio signals; and an audio signal enhancement module is used to perform audio enhancement processing on the audio signals using that correlation, to obtain an enhanced audio signal.
  • an audio signal processing device including a memory and a processor, where the memory is used to store a program and the processor is used to read executable program code stored in the memory to execute the foregoing audio signal processing method.
  • a computer-readable storage medium stores instructions that, when run on a computer, cause the computer to execute the audio signal processing method of the foregoing aspects.
  • an audio interactive device including:
  • An audio signal detector is used to receive audio signals using the multiple sound collection devices of a microphone array and determine whether a target audio signal is included in the audio signals; a target audio separator is used to determine, if the audio signals include the target audio signal, the correlation of the multiple sound collection devices corresponding to the audio signals; and a target audio enhancer is used to perform audio enhancement processing on the audio signals using that correlation, to obtain an enhanced audio signal.
  • With the audio signal processing method, apparatus, device, and storage medium of the embodiments of the present invention, it is possible to detect whether a target audio signal exists in the audio signals collected in a noisy environment with multiple interference sources, determine from the detection result the correlation of the multiple sound collection devices corresponding to the audio signals, and use that correlation to perform audio enhancement processing on the audio signals to obtain an enhanced audio signal.
  • The entire audio signal processing process does not need to detect specific information about the multiple interference sources; that is, the audio enhancement of the audio signal is independent of the specific interference sources, so the method can adapt to complex and changeable interference environments and has stronger practicability.
  • FIG. 1 is a schematic diagram illustrating an application scenario of an audio signal processing method according to an exemplary embodiment of the present invention.
  • FIG. 2 is a schematic diagram illustrating a scene in which a microphone array performs sound source localization on a target area according to an embodiment of the present invention.
  • FIG. 3 is a flowchart illustrating an audio signal processing method according to an embodiment of the present invention.
  • FIG. 4 is a flowchart illustrating an audio signal processing method according to another embodiment of the present invention.
  • FIG. 5 is a schematic structural diagram of an audio signal processing apparatus according to an embodiment of the present invention.
  • FIG. 6 is a structural diagram illustrating an exemplary hardware architecture of a computing device that can implement the audio signal processing method and apparatus according to an embodiment of the present invention.
  • FIG. 7 is a schematic structural diagram of an audio interaction device according to an embodiment of the present invention.
  • Audio interaction systems, such as smart audio equipment, smart audio shopping machines, smart audio ticket machines, and smart audio elevators, usually need to perform audio signal acquisition and audio signal processing in noisy environments with multiple interference sources, such as shopping malls, subway stations, and social venues.
  • a microphone array may be used to perform signal sampling and signal processing on audio signals from different directions in space in a noisy environment with multiple interference sources.
  • Each acoustic sensor such as a microphone in the microphone array may be referred to as an array element, and each microphone array includes at least two array elements.
  • Each array element can be regarded as a sound collection channel, and a multi-channel sound signal can be obtained by using a microphone array containing multiple array elements.
  • the microphone array in the embodiment of the present invention may be an array formed by a group of acoustic sensors located at different positions in a space according to a certain shape rule, and is a device for spatially sampling a sound signal propagating in space.
  • the shape and arrangement of the acoustic sensors in the microphone array can be called the topology of the microphone array.
  • the microphone array can be divided into a linear microphone array, a planar microphone array, and a stereo microphone array.
  • a linear microphone array may indicate that the array element centers of the microphone array are located on the same straight line, such as a horizontal array;
  • a planar microphone array may indicate that the array element centers of the microphone array are distributed on a plane, such as a triangular array, a circular array, a T-shaped array, an L-shaped array, a square array, etc.;
  • a stereo microphone array may indicate that the array element centers of the microphone array are distributed in three-dimensional space, such as a polyhedron array, a spherical array, and the like.
  • the audio signal processing method in the embodiment of the present invention does not specifically limit the specific form of the microphone array used.
  • the microphone array may be a horizontal array, a T-shaped array, an L-shaped array, or a cube array.
  • the multiple embodiments described below use L-shaped arrays as examples to illustrate the acquisition of multi-channel audio signals.
  • This description should not be interpreted as limiting the scope of the solution or other possible implementations.
  • The processing method for microphone arrays with topologies other than the L-shaped array is consistent with that for L-shaped arrays.
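For illustration, the element positions of a small L-shaped array can be generated as follows. The element count (three per arm) and the 4 cm spacing are assumed values for the sketch, not taken from the patent.

```python
import numpy as np

def l_array_positions(n_per_arm=3, spacing=0.04):
    """Element centers of an L-shaped array in the XOY plane (meters).

    One arm lies along +X and the other along +Y, sharing the corner
    element at the origin. Arm length and spacing are illustrative.
    """
    x_arm = [(i * spacing, 0.0, 0.0) for i in range(n_per_arm)]
    y_arm = [(0.0, j * spacing, 0.0) for j in range(1, n_per_arm)]
    return np.array(x_arm + y_arm)

pos = l_array_positions()   # 5 elements: the corner plus 2 on each arm
```

Each row of `pos` is one array element, i.e. one sound collection channel of the multi-channel signal.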
  • The actual application scenario of audio signal processing usually includes various interference sources, such as environmental noise, interference from other audio signals (for example, human voice interference), reverberation, and echo.
  • Reverberation can be understood as an acoustic phenomenon in which a sound signal is repeatedly reflected and absorbed by obstacles during propagation, forming a superposition of sound waves. Echo, also referred to as acoustic echo, can be understood as the repeated sound signal formed when sound played by the speaker of the audio interactive device propagates and reflects in the space and is transmitted back to the microphone, where it forms noise interference.
  • The above-mentioned interference sources, such as environmental noise, other audio signal interference, reverberation, and echo, constitute a strongly interfering, complex, and changeable acoustic environment, which impairs the quality of the audio signals collected by the audio signal processing system.
  • the target audio signal may represent an audio signal from a target area that can drive an audio interactive device to interact.
  • the target audio signal may be a voice signal or a meaningful audio signal played by a machine, as long as the audio signal can drive an audio interaction device to perform audio interaction.
  • FIG. 1 is a schematic diagram of an application scenario of an audio signal processing method according to an exemplary embodiment of the present invention.
  • the audio ticketing environment of the subway station may include a ticket purchaser 10 and an audio ticketing system 20.
  • the audio ticketing system 20 may include a human-computer interactive display device 21 and an audio interactive device 22.
  • the audio ticket purchase system 20 can enable the ticket purchaser 10 to use the form of audio interaction to implement functions such as buying tickets by a specified station name, buying tickets at a specified fare, or fuzzy searching for a destination.
  • The audio interactive device 22 may include a microphone array (not shown in the figure), and may use the multiple sound collection channels provided by the multiple array elements in the microphone array to collect acoustic signals in the ticketing environment in real time.
  • The human-computer interaction display device 21 may display suggested audio interaction instructions, which serve as normative, guiding examples for the audio interaction between the ticket purchaser 10 and the audio interactive device 22, for example, "I want to go to station B", "Buy two tickets to station C", and "Two tickets for fare A". Based on the destination in the audio interaction instruction issued by the ticket purchaser 10, after processing by the audio interactive device 22, the display device 21 may invoke a map service to display the recommended subway lines and the stations closest to the destination. The display device 21 may also display payment information so that, after the ticket purchaser 10 pays based on the displayed information, the audio ticket purchase system 20 completes ticket issuance.
  • the sound source of the ticket purchaser 10 can be used as the target sound source.
  • The audio signal collected by the audio interactive device 22 using the microphone array can include not only the target audio signal from the target sound source, but also non-target audio signals within the microphone array's pickup range, including environmental noise, human voice interference, reverberation, and echo.
  • the environmental noise may include, for example, the running noise of subway trains and the noise generated from the operation of ventilation and air conditioning equipment;
  • the human voice interference may be, for example, an audio signal sent by someone other than the ticket purchaser 10.
  • a non-target audio signal may also be referred to as an interference signal or a noise signal.
  • Embodiments of the present invention provide an audio signal processing method, device, system, and storage medium that can perform audio enhancement on the target audio in public places and other environments with multiple interference sources, improving the quality of the audio signal and the signal-to-noise ratio.
  • FIG. 2 is a schematic diagram of a scene where a microphone array performs sound source localization on a target area according to an embodiment of the present invention.
  • Sound source localization refers to determining, based on the audio signals collected by the microphone array in an actual application scenario, the direction or spatial position of the sound source, that is, performing position or direction detection on the sound source to determine the spatial relationship between the microphone array and the sound source.
  • the sound source of audio may come from a person having a conversation with the audio interaction device 22 through a microphone array.
  • The microphone array can thus be used to detect the position and direction of the ticket purchaser 10 relative to the array.
  • the microphone array may be used to receive audio signals from different directions.
  • The received audio signals may include the audio signal from the ticket purchaser 10 as well as noise signals from noise sources such as noise 1, noise 2, and noise 3.
  • a method for locating a sound source by using a microphone array may generally include the following two solutions.
  • The first solution determines the position of the sound source from the strength of the sound signals that two array elements in the microphone array receive from the same source, calculating the distance between each array element and the source; the second solution uses the time difference with which two array elements in the microphone array receive the sound signal from the same source, and uses this time difference to locate the source.
  • The embodiment of the present invention may use the time delay estimation (TDE) method to locate the sound source.
  • a method for locating a sound source by using a time delay estimation method may include the following two steps.
  • In step S01, a time delay estimation algorithm is used to calculate the time difference of arrival (TDOA) of the same sound source at different array elements in the microphone array.
  • In step S02, the position of the sound source is estimated from the time differences obtained in step S01 and the geometric position relationship between the array elements in the microphone array.
  • the delay estimation method commonly used in step S01 may include a generalized cross-correlation delay estimation method, a weighted generalized cross-correlation estimation method, an adaptive delay estimation method, and the like.
  • the following uses the calculation of the time difference between the sound signals of the same sound source reaching two different array elements in the microphone array as an example to introduce the time delay estimation method in the embodiment of the present invention.
  • The generalized cross-correlation delay estimation method may treat the sound signals that two array elements in a microphone array receive from the same sound source as two microphone signals and calculate their cross-correlation function. The cross-correlation function describes the mutual relationship between the two microphone signals, measuring their degree of similarity and their difference in position on the time axis. A peak search is performed on the cross-correlation function, and the time corresponding to the peak of the cross-correlation function is the time difference between the two microphone signals.
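A minimal sketch of this peak-search idea, assuming two clean microphone signals where one is a pure sample delay of the other (the sampling rate and delay are illustrative values):

```python
import numpy as np

def tdoa_by_xcorr(sig_a, sig_b, fs):
    """Estimate the arrival-time difference of the same source at two
    microphones by locating the peak of their cross-correlation.

    Returns the delay of sig_b relative to sig_a, in seconds.
    """
    corr = np.correlate(sig_b, sig_a, mode="full")
    lag = np.argmax(corr) - (len(sig_a) - 1)   # peak index -> sample lag
    return lag / fs

# synthetic check: delay a noise burst by 5 samples
fs = 16000
rng = np.random.default_rng(0)
s = rng.standard_normal(1024)
delayed = np.concatenate([np.zeros(5), s[:-5]])   # sig_b arrives 5 samples late
tau = tdoa_by_xcorr(s, delayed, fs)               # about 5 / 16000 seconds
```

With real microphone signals the peak is blurred by noise and reverberation, which is why the weighted variants described below are used in practice.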
  • The sound signal is a periodic signal, and time-domain analysis and frequency-domain analysis are two different ways of analyzing it.
  • the time domain can be used to describe the relationship between audio signals and time, that is, using time as a variable to analyze the dynamic change of audio signals with time
  • the frequency domain can be used to describe the relationship between audio signals and frequencies, that is, to Frequency is used as a variable to analyze the characteristics of the audio signal at different frequencies.
  • When the audio signal is analyzed in the time domain, information such as its period and amplitude can be obtained relatively intuitively; by converting the audio signal from a time-domain signal to a frequency-domain signal and analyzing its spectral characteristics, processing in the frequency domain can achieve higher efficiency and performance.
  • the time domain signal and the frequency domain signal of the sound signal can be converted to each other.
  • an audio signal can be transformed from a time domain signal to a frequency domain signal by a Fourier transform algorithm, and a frequency domain signal can be transformed into a time domain signal by an inverse Fourier transform algorithm.
  • The basic principle of the Fourier transform algorithm can be understood as follows: a continuously measured sound signal is expressed as an infinite superposition of sine wave signals of different frequencies. The Fourier transform therefore takes the directly measured sound signal as the original signal and calculates the frequency, amplitude, and phase of the different sine wave components in this superposition, thereby transforming the time-domain signal into a frequency-domain signal.
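As a small illustration of the time-domain/frequency-domain conversion described above (the sampling rate and tone frequency are arbitrary example values):

```python
import numpy as np

fs = 8000                       # sampling rate in Hz (illustrative)
t = np.arange(fs) / fs          # one second of sample times
x = np.sin(2 * np.pi * 440 * t)           # time domain: a 440 Hz pure tone

X = np.fft.rfft(x)                         # time domain -> frequency domain
freqs = np.fft.rfftfreq(len(x), 1 / fs)    # frequency of each spectral bin
peak_hz = freqs[np.argmax(np.abs(X))]      # dominant sine component: 440 Hz

x_back = np.fft.irfft(X, n=len(x))         # frequency domain -> time domain
```

The inverse transform recovers the original time-domain samples, showing that the two representations carry the same information.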
  • A power spectral density function of the sound signal (hereinafter referred to as the power spectrum) may be used to describe how the power of the sound signal varies with frequency.
  • The cross-power spectral density function of the two microphone signals (hereinafter referred to as the cross-spectrum or cross-power spectrum) can describe the correlation between the two microphone signals in the frequency domain. Each spectral line in the cross-power spectral density function can be understood as an impulse function with amplitude weighting.
  • cross-spectrum and cross-correlation functions are two different representations describing the correlation between the two microphone signals in the above embodiments from the frequency domain and the time domain, respectively.
  • The cross-power spectrum of the two microphone signals can first be calculated in the frequency domain and multiplied by a corresponding weight function to enhance the high signal-to-noise-ratio frequency portions of the two microphone signals, thereby suppressing the influence of interference sources; finally, the weighted cross-power spectrum is inverse-Fourier-transformed to the time domain to obtain the cross-correlation function.
  • the time corresponding to the peak of the cross-correlation function is the time difference between the two microphone signals.
  • the performance of the generalized cross-correlation delay estimation method depends on the selected weight function.
  • The most representative weighted generalized cross-correlation estimation methods are the cross-correlation algorithm with maximum likelihood (ML) weighting and the cross-correlation algorithm with phase transformation (PHAT) weighting; users can choose according to the actual situation and computing needs.
  • Maximum likelihood weighting requires the power spectrum of the sound source signal and the power spectrum of the interference source to be known. Ideally, the generalized cross-correlation delay estimation method with maximum likelihood weighting has high accuracy and can achieve the best estimate; however, considering how complex and changeable the interference sources are in strongly interfering practical environments, the maximum-likelihood-weighted cross-correlation estimation method has high complexity and a large amount of calculation when computing the power spectrum of the interference source.
  • Phase transformation weighting is a weighting method in which different weighting functions are selected according to prior information about the sound source signal and the different interference source signals.
  • Phase transformation weighting does not need to calculate the power spectrum of the sound source signal or the power spectrum of the interference source signals.
  • With phase transformation weighting, the peak of the cross-correlation function of the two microphone signals remains prominent and easy to distinguish under noise interference, showing relatively good robustness.
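The frequency-domain procedure described above, with the PHAT weight (dividing the cross-power spectrum by its magnitude so that only phase information remains), can be sketched as follows. This is a generic GCC-PHAT sketch under clean-signal assumptions, not the patent's exact algorithm:

```python
import numpy as np

def gcc_phat(sig_a, sig_b, fs):
    """GCC-PHAT delay estimate: whiten the cross-power spectrum with the
    PHAT weight 1/|cross-spectrum|, inverse-FFT it back to the time
    domain, and take the peak. Returns the delay of sig_b relative to
    sig_a, in seconds."""
    n = len(sig_a) + len(sig_b)              # zero-pad to avoid circular wrap
    A = np.fft.rfft(sig_a, n=n)
    B = np.fft.rfft(sig_b, n=n)
    cross = B * np.conj(A)                   # cross-power spectrum
    cross /= np.abs(cross) + 1e-12           # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)            # back to the time domain
    max_lag = n // 2
    cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))  # center lag 0
    lag = np.argmax(np.abs(cc)) - max_lag
    return lag / fs

fs = 16000
rng = np.random.default_rng(1)
s = rng.standard_normal(2048)
delayed = np.concatenate([np.zeros(7), s[:-7]])   # 7-sample arrival delay
tau = gcc_phat(s, delayed, fs)                    # about 7 / 16000 seconds
```

Because the PHAT weight discards magnitude, no power spectra of the source or the interference need to be estimated, which matches the robustness argument made in the text.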
  • The adaptive delay estimation method does not rely on prior information about the sound signal; instead, it continuously adjusts the function parameters and function structure of the delay calculation according to changes in the sound signal in the actual application scenario, and on that basis estimates the time difference with which the same sound source arrives at two different array elements in the microphone array. The adaptive delay estimation method is therefore suitable for tracking dynamic, time-varying audio input environments.
  • The audio signals received by each array element in the microphone array may be superimposed with multiple reverberation signals, causing the cross-correlation function of the two microphone signals to have multiple peak points.
  • the phase transformation weighted cross-correlation algorithm described in the above embodiment can be used to locate the sound source.
  • a three-dimensional spatial coordinate system corresponding to the microphone array may be established in advance when performing sound source localization.
  • The coordinate origin M0 of the three-dimensional coordinate system may be the center of the microphone array in the audio interactive device 22, the position of any one array element in the microphone array, or another designated position.
  • The offset of each array element from the coordinate origin M0 may be determined according to the arrangement order of the elements in the microphone array and the spacing between them, thereby determining the three-dimensional coordinates of each array element Mi with respect to the coordinate origin M0.
  • Suppose the ticket purchaser 10, as the target sound source, is located at a spatial position point S, whose three-dimensional coordinates are expressed as S(x0, y0, z0), where x0, y0, and z0 are the coordinate values of point S on the X, Y, and Z axes of the coordinate system.
  • The three-dimensional coordinates and the coordinate vector of the spatial position point S satisfy: r0 = √(x0² + y0² + z0²), θ0 = arccos(z0 / r0), φ0 = arctan(y0 / x0), where:
  • r0 represents the distance between the spatial position point S(x0, y0, z0) of the ticket purchaser 10 and the coordinate origin M0 of the three-dimensional coordinate system;
  • the elevation angle θ0 represents the angle between the line formed by the spatial point S and the coordinate origin M0 and the positive direction of the Z axis;
  • the horizontal angle φ0 represents the angle between the line formed by the projection S′ of the spatial point S on the XOY plane and the coordinate origin M0 and the positive direction of the X axis.
  • The elevation angle may take values in the range 0° ≤ θ0 ≤ 180°, and the horizontal angle in the range 0° ≤ φ0 ≤ 360°.
  • r0 may be referred to as the distance between the spatial position point S and the microphone array, θ0 as the pitch angle between S and the microphone array, and φ0 as the horizontal angle between S and the microphone array.
  • The time difference can be used to calculate the difference between the distances traveled by the audio signal of the same sound source to two different array elements; the distance difference equals the speed of sound multiplied by the time difference.
  • Using these distance differences and the three-dimensional coordinates of each array element in the microphone array, the three-dimensional coordinates of the sound source, that is, its position or orientation relative to the microphone array, can be calculated using the principle of geometric analysis.
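Assuming the spherical-coordinate conventions defined above (elevation measured from the positive Z axis, horizontal angle from the positive X axis in the XOY plane), the conversion from a source's Cartesian coordinates can be sketched as:

```python
import numpy as np

def to_spherical(x, y, z):
    """Cartesian coordinates of point S relative to origin M0 ->
    (distance r, elevation angle theta from +Z, horizontal angle phi
    from +X in the XOY plane). Angles are returned in degrees."""
    r = np.sqrt(x**2 + y**2 + z**2)
    theta = np.degrees(np.arccos(z / r))           # 0 .. 180 degrees
    phi = np.degrees(np.arctan2(y, x)) % 360.0     # 0 .. 360 degrees
    return r, theta, phi

# a point 45 degrees up and 45 degrees around: r = 2, theta = 45, phi = 45
r, theta, phi = to_spherical(1.0, 1.0, np.sqrt(2.0))
```

`arctan2` (rather than a plain arctan of y0/x0) resolves the quadrant of the horizontal angle, which the prose formula leaves implicit.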
  • the accuracy of the sound source localization is related to the angle (elevation angle and / or horizontal angle) and distance between the sound source and the microphone array.
  • A target area for the sound source position of the audio signal can be set in advance, and only sound sources within the target area are detected, thereby reducing the sound source acquisition range of the audio interactive device 22 and improving the accuracy and calculation efficiency of tracking the sound source target.
  • the spatial region range of the sound source position of the target audio signal may be determined according to an actual application scenario.
  • The ticket purchaser 10 is usually located in a relatively fixed area near the audio ticketing system 20, and audio signals from that area are more likely to include the target sound source position. Therefore, in one embodiment, the set target area satisfies the following conditions: the coordinate vector of any spatial point R(xi, yi, zi) within the area satisfies ri ≤ rmax, θi ≤ θmax, and φi ≤ φmax; that is,
  • the distance between the spatial point R and the microphone array is less than or equal to a preset maximum distance rmax;
  • the horizontal angle between the spatial point R and the microphone array is less than or equal to a preset maximum horizontal angle φmax; and
  • the pitch angle between the spatial point R and the microphone array is less than or equal to a preset maximum pitch angle θmax.
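A sketch of the target-area test applied to localized sources; the threshold values and source coordinates below are illustrative, not taken from the patent:

```python
def in_target_area(r, theta, phi, r_max=2.0, theta_max=60.0, phi_max=90.0):
    """Check whether a localized source (distance r, pitch angle theta,
    horizontal angle phi, angles in degrees) falls inside the preset
    target area. Thresholds here are assumed example values."""
    return r <= r_max and theta <= theta_max and phi <= phi_max

# two localized sources: one nearby purchaser, one distant noise source
sources = {"purchaser": (0.8, 30.0, 45.0), "far noise": (5.0, 30.0, 45.0)}
kept = {name for name, s in sources.items() if in_target_area(*s)}
```

Sources outside the area are dropped before any further processing, which is exactly the filtering effect described in the text.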
  • Audio signals from the target area can thus be obtained, such as those from noise 1, noise 2, and the ticket purchaser 10 shown in FIG. 2.
  • This setting can also effectively filter out some interference sources, especially audio interference from people other than the ticket purchaser, thereby reducing the detection range of the sound source target and improving the efficiency and accuracy of tracking it.
  • Audio activity detection (Voice Activity Detection, VAD) can be used to determine whether the audio signal from the target area includes the target audio signal.
  • The audio activity detection component may be used to detect the start time and end time of the target audio signal within the audio signal, thereby extracting the target audio signal, eliminating silent segments and interference from non-target audio signals, reducing the computing load of the audio recognition system, and improving its response speed.
  • the audio activity detection component may be obtained by training in advance using positive samples labeled as target audio and negative samples labeled as non-target audio.
  • For example, a neural network model can be constructed and trained with positive samples labeled as target audio and negative samples labeled as non-target audio to obtain an audio activity detection model capable of performing audio activity detection; the trained model then serves as the audio activity detection component.
  • the neural network model may be, for example, any one of a neural network such as a deep neural network, a recurrent neural network, or a convolutional neural network.
  • the positive samples labeled as the target audio may be audio signals containing acoustic characteristics of the voice signal
  • the negative samples labeled as non-target audio may be audio signals not containing the acoustic characteristics of the voice signal.
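The patent describes a trained neural-network detection component; as a much simpler stand-in for illustration only, the sketch below flags frames by energy. The threshold, names, and the energy criterion itself are assumptions, not the patent's method.

```python
import numpy as np

def energy_vad(frames, threshold):
    """Toy stand-in for a trained audio activity detection model: flag a
    frame as target audio when its mean energy exceeds a threshold.
    frames: array of shape (n_frames, n_samples)."""
    energy = np.mean(frames ** 2, axis=1)
    return (energy > threshold).astype(int)

rng = np.random.default_rng(0)
silence = rng.normal(0, 0.01, (3, 160))   # low-energy background frames
speech = rng.normal(0, 0.5, (3, 160))     # high-energy "speech" frames
flags = energy_vad(np.vstack([silence, speech]), threshold=0.01)
print(flags)  # → [0 0 0 1 1 1]
```

A real component would replace `energy_vad` with inference over a model trained on the labeled positive and negative samples described above.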
  • based on the target audio activity detection results, the audio signals under different detection results may be statistically analyzed separately, and audio enhancement processing may be performed on the statistical results to obtain an audio enhanced audio signal.
  • the following describes a specific process of how to perform statistical analysis on the target audio signal and the non-target audio signal by using the audio activity detection result of the audio signal, and use the statistical result to perform audio enhancement processing.
  • FIG. 3 shows a flowchart of an audio signal processing method according to an embodiment of the present invention.
  • the audio signal processing method in the embodiment of the present invention may include:
  • Step S310 Use a plurality of sound collection devices of a microphone array to receive audio signals, and determine whether a target audio signal is included in the audio signal.
  • the target audio signal is an audio signal from the target area that can drive the audio interactive device to interact.
  • the target audio signal may be a voice signal or a meaningful sound played by a machine, as long as it can drive an audio interactive device to interact.
  • the sound source localization method described in the above embodiment may be used to locate the sound source of the collected audio signal and obtain the audio signal coming from the target area; audio activity detection is then performed on that audio signal to determine whether it includes the target audio signal.
  • by detecting the audio activity of the audio signal, it is possible to determine whether a target audio signal exists in each specified time period of the audio signal, so that the target audio signals and the non-target audio signals within each specified time period can be counted separately.
  • the following embodiments in this document use the time period occupied by each audio frame in the audio signal as a specified time period to sequentially process the audio frames in the audio signal.
  • the description cannot be interpreted as limiting the scope or implementation possibility of this solution.
  • the processing method of audio signals in other customized time periods is consistent with the processing method of audio signals of each audio frame.
  • the sampling rate of the audio signal represents the number of times that the amplitude samples of the sound wave are sampled per second when the waveform of the audio signal is processed.
  • the unit of measurement for the sampling frequency may be Hertz. As an example, when the sampling rate of the audio signal is 16 kHz, it means that the audio signal is sampled 16,000 times per second.
  • the number of audio frames included in a unit time may be determined according to the duration of each audio frame. As an example, when the duration of each audio frame is 0.01 seconds or 10 ms, it means that every 10 ms of audio sample data constitutes an audio frame, that is, includes 100 audio frames per second.
  • the sampling times corresponding to the sampling signal in each audio frame can be calculated according to the sampling rate of the audio signal and the duration of each audio frame.
  • for example, when the sampling rate of the audio signal is 16 kHz and the duration of each audio frame is 0.01 seconds, the audio signal is sampled 16,000 times per second and includes 100 audio frames per second, so each audio frame includes 160 sampling points.
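The frame arithmetic above can be written out directly (the variable names are illustrative):

```python
sample_rate = 16_000     # samples per second (16 kHz)
frame_duration = 0.01    # 10 ms per audio frame

# round() rather than int() guards against floating-point representation
# of 0.01 (e.g. 1 / 0.01 may evaluate to 99.999...).
frames_per_second = round(1 / frame_duration)           # 100 frames per second
samples_per_frame = round(sample_rate * frame_duration)  # 160 sampling points
print(frames_per_second, samples_per_frame)  # → 100 160
```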
  • each sound collecting device can be regarded as collecting audio signals in one dimension in space; within an audio frame, the sampling data of each sound collecting device of the microphone array is stored in one row, and the sample matrix of each audio frame can thus be obtained.
  • Each row of the sample matrix of the audio frame may represent the audio characteristics of the audio signal in the dimensions corresponding to different sound collection devices, and each column of the sample matrix of the audio frame may represent a different audio sampling time point.
  • the sample matrix of each audio frame can be regarded as the multi-dimensional normal distribution of the audio characteristics of the audio signal within the sampling time of one frame, and each sound collecting device can be regarded as collecting audio signals in one dimension in space.
  • one principal dimension among the audio frame samples best represents the audio characteristics of the target audio data undisturbed by noise. Since this principal dimension is interfered with by the other dimensions during audio signal acquisition, it is necessary to determine the correlation between the dimensions of the audio signal in space and time.
  • since the covariance matrix can characterize the correlation between different dimensions of the audio signal, the audio signal can be statistically characterized through its covariance matrix.
  • the rows of the covariance matrix represent the sorted sound collection devices; the columns of the covariance matrix represent the sound collection devices in the same order as the rows; the value of an element of the covariance matrix indicates the correlation between the sound collection device corresponding to the element's row and the sound collection device corresponding to the element's column.
  • determining the covariance matrix corresponding to the audio frame may include the following steps:
  • determine the audio sampling matrix corresponding to the audio frame, where:
  • the rows of the audio sampling matrix represent the sorted sound collection devices;
  • the columns of the audio sampling matrix represent multiple audio sampling time points;
  • each element of the audio sampling matrix represents the audio characteristic collected by the sound collection device corresponding to the element's row at the sampling time corresponding to the element's column;
  • the audio sampling matrix corresponding to the audio frame is then used to determine the covariance matrix corresponding to the audio frame.
  • the value of the element in the i-th row and j-th column represents the correlation between the i-th and j-th sound collecting devices of the microphone array during the time period of the current audio frame: the larger the absolute value of that element, the greater the correlation between the i-th and j-th sound collecting devices.
  • the sample matrix of the audio frame can be swapped in rows and columns to obtain a transposed matrix of the sample matrix of the audio frame.
  • the relationship between rows and columns in the transposed sample matrix of the audio frame can be regarded as the mapping between audio sampling time points and sound collecting devices within the sampling time of the audio frame; through the conversion of this mapping, the correlation between sound collecting devices over the duration of the audio frame can be determined.
  • the mapping relationship can be transformed by matrix multiplication.
  • multiplying the sample matrix of the audio frame by the transpose of that sample matrix yields a new matrix.
  • the new matrix can represent the corresponding relationship between the sound collecting devices.
  • the matrix is the covariance matrix of the audio frame.
  • the covariance matrix corresponding to the frame of audio signals can be calculated.
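The sample-matrix-times-transpose construction above can be sketched as follows (normalization by the number of samples is a common convention and an assumption here):

```python
import numpy as np

def frame_covariance(frame):
    """frame: (n_devices, n_samples) sample matrix of one audio frame.
    Multiplying the sample matrix by its transpose gives the covariance
    matrix; element (i, j) reflects the correlation of devices i and j."""
    return frame @ frame.T / frame.shape[1]

rng = np.random.default_rng(1)
frame = rng.normal(size=(4, 160))   # 4 pickup devices, 160 samples (16 kHz, 10 ms)
cov = frame_covariance(frame)
print(cov.shape)  # → (4, 4)
```

The result is a square matrix whose side equals the number of sound collection devices, symmetric because the correlation of device i with j equals that of j with i.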
  • Step S320 if the audio signal includes the target audio signal, determine the correlation between multiple sound collection devices corresponding to the audio signal.
  • whether the target audio signal is included in the audio frame can be determined through audio activity detection.
  • the step of determining the correlation between multiple sound collection devices corresponding to the audio signal may specifically include:
  • a correlation matrix of multiple sound collection devices corresponding to the audio signal is established, and the value of the element in the correlation matrix represents the correlation between any two sound collection devices of the multiple sound collection devices corresponding to the audio signal.
  • the step of determining the correlation between multiple sound collection devices corresponding to the audio signal may specifically include:
  • each audio frame including the target audio signal in the audio signal determines the covariance matrix corresponding to the audio frame, and the value of the element in the covariance matrix corresponding to the audio frame represents any two sets of multiple sound collection devices corresponding to the audio frame Correlation between audio equipment;
  • the covariance matrix corresponding to the audio signal is determined, and the covariance matrix corresponding to the audio signal is used as the correlation matrix of multiple sound collecting devices corresponding to the audio signal.
  • the covariance matrix corresponding to the current audio frame and the covariance matrix accumulated from the previous audio frames are combined iteratively: the covariance matrix of the audio signal is obtained by a matrix addition operation, so that the covariance matrix of the audio signal is updated incrementally.
  • since each frame of the audio signal changes dynamically, different weight values may be set for audio frames from different sampling time periods when performing the matrix addition of the covariance matrices of two audio frames.
  • the weight value of the covariance matrix of an audio frame from an earlier sampling period may be set smaller than that of an audio frame from a later sampling period.
  • for example, comparing the first sampling time period corresponding to a first audio frame with the second sampling time period corresponding to a second audio frame, if the first sampling time period is earlier than the second, the weight value of the covariance matrix of the first audio frame may be set smaller than that of the second audio frame.
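The weighted incremental update can be sketched as an exponentially weighted average; the decay factor `alpha` is an illustrative choice, not a value from the patent.

```python
import numpy as np

def update_covariance(running_cov, frame_cov, alpha=0.9):
    """Incremental (exponentially weighted) covariance update: the matrices
    of earlier frames receive ever-smaller weight than the newest frame."""
    return alpha * running_cov + (1 - alpha) * frame_cov

# Feeding 50 identical frame covariances drives the running estimate
# toward that covariance, with older frames' influence decaying.
running = np.zeros((4, 4))
for _ in range(50):
    running = update_covariance(running, np.eye(4))
print(np.allclose(running, np.eye(4), atol=0.01))  # → True
```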
  • Step S330 Perform audio enhancement processing on the audio signal by using the correlation between multiple sound collection devices corresponding to the audio signal to obtain an audio enhanced audio signal.
  • the noise reduction process should reduce the correlation between the sound collecting devices corresponding to the audio signal containing the target audio as much as possible.
  • the element on the main diagonal position of the covariance matrix of the audio frame signal is the variance of the microphone array in each dimension, which can be used to measure the energy or weight of the audio signal in each dimension; and the covariance Elements outside the main diagonal of the matrix can be used to measure the correlation between any two sound collecting devices among multiple sound collecting devices.
  • the elements outside the diagonal of the covariance matrix of the audio signal should therefore be made as small as possible; for example, the elements at the off-diagonal positions of the covariance matrix of the audio signal should have the value 0.
  • the covariance matrix of the audio signal can be diagonalized, and a new matrix can be obtained after the diagonalization.
  • the elements at the main diagonal positions of the new matrix are the eigenvalues of the covariance matrix of the audio signal, and the elements at non-diagonal positions can be set to 0; that is, the diagonalization process reduces the correlation between the audio collection devices to the lowest level, so as to avoid noise interference caused by that correlation.
  • step S330 may include:
  • Step S331 Use the feature vectors of the correlation matrix to construct a feature vector matrix of the audio signal.
  • Each column of the feature vector matrix is one of the feature vectors of the correlation matrix, and the feature vector matrix represents a feature space in which the multiple sound collecting devices are mutually uncorrelated.
  • Step S332 using the feature vector matrix of the audio signal to perform feature space transformation on the audio signal to obtain an audio signal from which correlation between the sound collecting devices is removed, and using the audio signal from which correlation between the sound collecting devices is removed as an audio enhanced audio signal .
  • the covariance matrix of the audio signal can be diagonalized to remove the audio signals that are correlated between the sound collecting devices, which can specifically include the following steps:
  • Step S11 Perform matrix feature decomposition on the correlation matrix corresponding to the audio signal to obtain the feature value and the feature vector corresponding to the feature value of the correlation matrix corresponding to the audio signal.
  • a feature vector matrix is formed by using a feature vector corresponding to the feature value, and the feature vector matrix may also be referred to as a projection matrix.
  • step S13 the projection matrix is used to perform feature space transformation on the sample matrix of the audio signal to obtain a sample matrix of a new audio signal corresponding to the audio signal after the feature space transformation.
  • the sample matrix of the new audio signal is an audio enhanced audio signal.
  • the correlation between the sound collection devices corresponding to the audio signal after the feature space transformation has been reduced to a minimum, thereby achieving noise reduction on the audio signal.
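Steps S11–S13 can be sketched with a standard eigendecomposition. This is a minimal illustration under the assumption that all eigenvectors are kept and applied uniformly; names are hypothetical.

```python
import numpy as np

def decorrelate(frame, cov):
    """Project a (n_devices, n_samples) frame onto the eigenvectors of the
    covariance matrix (the 'projection matrix'), removing correlation
    between the pickup channels."""
    eigvals, eigvecs = np.linalg.eigh(cov)   # cov is symmetric
    return eigvecs.T @ frame                 # feature space transformation

rng = np.random.default_rng(2)
frame = rng.normal(size=(4, 160))
cov = frame @ frame.T / frame.shape[1]
enhanced = decorrelate(frame, cov)

# The covariance of the transformed signal is diagonal: its off-diagonal
# (cross-device correlation) entries are numerically zero.
new_cov = enhanced @ enhanced.T / enhanced.shape[1]
print(np.allclose(new_cov - np.diag(np.diag(new_cov)), 0))  # → True
```

The diagonal of `new_cov` holds the eigenvalues of the original covariance matrix, matching the description of the diagonalized matrix above.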
  • the audio signal processing method further includes:
  • Step S340 if the audio signal includes the target audio signal, use the audio signal as the target signal to determine the signal frequency of the target signal; if the audio signal includes the non-target audio signal, use the audio signal as the noise signal, and determine the signal frequency of the noise signal.
  • Step S350 Perform audio enhancement processing on the target signal based on the signal frequency of the target signal and the signal frequency of the noise signal to obtain an audio enhanced audio signal.
  • the audio signal with a specific frequency is retained, and the audio signal with a frequency other than the specific frequency is filtered, thereby removing interference noise other than the signal frequency of the target audio signal.
  • the audio signal may be filtered by an audio filter.
  • An audio filter can be regarded as a frequency selection device for an audio signal.
  • An audio filter can pass an audio signal of a specific frequency in the audio signal, attenuate an audio signal of a frequency other than the specific frequency, and thereby filter out interference noise in the audio signal.
  • the specific frequency may be, for example, a signal frequency of a target signal, and frequencies other than the specific frequency may be, for example, a signal frequency of a noise signal.
  • the filtering process includes retaining audio signals of a specific frequency and removing audio signals within a threshold range of the filtering frequency.
  • a target signal may be used as an audio observation signal, and a noise signal is used as an audio reference signal.
  • the audio observation signal and the audio reference signal are input into an audio filter.
  • the filtering frequency threshold range of the audio filter need not be a fixed frequency range; instead, the audio reference signal can make the filtering frequency threshold range of the audio filter follow the frequency of the audio reference signal, thereby eliminating noise interference in the audio signal, making the filtering of the audio signal more targeted, achieving adaptive filtering of the audio signal, and realizing audio enhancement.
  • the correlation between the audio collection devices corresponding to the audio signal may be removed first to obtain a de-correlated audio signal; next, the frequency range of the non-target audio signal is obtained and the filtering frequency threshold range is determined according to it; finally, for the de-correlated audio signal, the audio signal whose signal frequency falls within the filtering frequency threshold range is removed, obtaining an enhanced audio signal with a better audio enhancement effect.
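A common concrete realization of such an adaptive filter is the least-mean-squares (LMS) algorithm. The patent does not name a specific algorithm, so the sketch below is only one plausible instantiation, with illustrative tap count and step size.

```python
import numpy as np

def lms_filter(observed, reference, taps=8, mu=0.005):
    """Adaptive noise cancellation (LMS): the reference (noise) signal drives
    a filter whose output estimates the noise in the observation, which is
    then subtracted to yield the enhanced signal."""
    w = np.zeros(taps)
    out = np.zeros_like(observed)
    for n in range(taps, len(observed)):
        x = reference[n - taps:n][::-1]     # most recent reference samples
        noise_est = w @ x                   # current noise estimate
        out[n] = observed[n] - noise_est    # enhanced output sample
        w += 2 * mu * out[n] * x            # steepest-descent weight update
    return out

rng = np.random.default_rng(3)
n = np.arange(4000)
target = np.sin(2 * np.pi * 0.01 * n)      # desired audio (observation signal)
reference = rng.normal(size=n.size)        # measured noise source (reference)
noise = 0.5 * np.roll(reference, 1)        # noise leaking into the observation
enhanced = lms_filter(target + noise, reference)
```

Because the filter weights track the reference, the cancelled band "follows" the noise rather than being fixed, which is the adaptive behavior described above.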
  • with the audio signal processing method of the embodiment of the present invention, it is possible to detect in real time whether a target audio signal exists in the audio signal, and if so, determine the correlation characteristics between the sound collecting devices for the audio signal and enhance the audio signal according to those correlation characteristics to obtain an enhanced target audio signal.
  • the entire audio signal processing process does not need to detect specific information about multiple interference sources, so it can adapt to complex and changing interference environments and improve the signal-to-noise ratio of the audio signal, making the method more practical.
  • FIG. 4 is a schematic flowchart of an audio signal processing method according to an embodiment of the present invention. As shown in FIG. 4, in one embodiment, the audio signal processing method 400 in the embodiment of the present invention includes the following steps:
  • Step S410 Receive audio signals by using multiple sound collection devices of the microphone array, and determine whether the audio signals include a target audio signal.
  • step S410 the step of determining whether the audio signal includes a target audio signal may include:
  • Step S411 localize the sound source of the audio signal to determine the position information of the sound source in the audio signal
  • Step S412 Obtain a target location area according to the position information of the sound source, and determine whether the audio signal from the target location area includes the target audio signal.
  • Step S420 If the audio signal includes the target audio signal, determine the correlation between multiple sound collection devices corresponding to the audio signal.
  • the step of determining the correlation between multiple sound collection devices corresponding to the audio signal may specifically include: establishing a correlation matrix of the multiple sound collection devices corresponding to the audio signal, and the value of the elements in the correlation matrix indicates Correlation between any two sound collecting devices of a plurality of sound collecting devices corresponding to the audio signal.
  • the step of determining the correlation between multiple sound collection devices corresponding to the audio signal may specifically include:
  • a target position area is acquired, and it is determined whether the audio signal from the target position area includes the target audio signal.
  • the step of determining the correlation of multiple sound collection devices corresponding to the audio signal may specifically include: for one audio frame in the audio signal, determining whether the audio frame includes the target audio signal through audio activity detection.
  • the step of audio activity detection may specifically include:
  • Audio activity detection is performed on the audio signal using the trained audio activity detection model.
  • the samples used for training the audio activity detection model may include the acoustic characteristics of the speech signal.
  • the step of determining the correlation between multiple sound collection devices corresponding to the audio signal may specifically include:
  • Step S421 Obtain each audio frame including the target audio signal in the audio signal, determine a covariance matrix corresponding to the audio frame, and the value of the element in the covariance matrix corresponding to the audio frame represents any of multiple sound collection devices corresponding to the audio frame Correlation between two sound collection devices;
  • step S422 the covariance matrix corresponding to the audio signal is determined according to the covariance matrix corresponding to the audio frame, and the covariance matrix corresponding to the audio signal is used as the correlation matrix of the multiple sound collecting devices corresponding to the audio signal.
  • the rows of the covariance matrix represent the sorted sound collection devices; the columns of the covariance matrix represent the sound collection devices in the same order as the rows; the value of an element of the covariance matrix indicates the correlation between the sound collection device corresponding to the element's row and the sound collection device corresponding to the element's column.
  • the step of determining a covariance matrix corresponding to the audio frame may include: determining an audio sampling matrix corresponding to the audio frame, wherein rows of the audio sampling matrix represent sorted sound collection devices, and columns of the audio sampling matrix represent multiple At the audio sampling time point, the element of the audio sampling matrix represents the sound collecting device corresponding to the position of the element row, and the audio characteristics of the audio signal collected at the sampling time corresponding to the position of the element column; Determine the covariance matrix corresponding to the audio frame.
  • step S430 audio correlation processing is performed on the audio signal by using the correlation between multiple sound collection devices corresponding to the audio signal to obtain an audio enhanced audio signal.
  • the correlation matrix of multiple sound collection devices corresponding to the audio signal may be used to perform audio enhancement processing on the audio signal to obtain an audio enhanced audio signal.
  • step S430 may specifically include:
  • Step S431 the feature vector matrix of the audio signal is constructed by using the feature vectors of the correlation matrix.
  • Each column of the feature vector matrix is one of the feature vectors of the correlation matrix, and the feature vector matrix represents a feature space in which the multiple sound collecting devices are mutually uncorrelated.
  • Step S432 Use the feature vector matrix of the audio signal to perform a feature space transformation on the audio signal to obtain an audio signal from which the correlation between the sound collecting devices is removed, and use that audio signal as the audio enhanced audio signal.
  • the audio signal processing method 400 may further include:
  • Step S440 if the audio signal includes the target audio signal, use the audio signal as the target signal, and determine the signal frequency of the target signal.
  • Step S450 if the audio signal includes a non-target audio signal, use the audio signal as a noise signal, and determine the signal frequency of the noise signal.
  • Step S460 Perform audio enhancement processing on the target signal based on the signal frequency of the target signal and the signal frequency of the noise signal to obtain an audio enhanced audio signal.
  • step S460 may specifically include:
  • Step S21 Acquire a value range of the signal frequency of the noise, and use the value range of the signal frequency of the noise as the noise frequency range;
  • step S22 the audio signal with the signal frequency in the noise frequency range in the target signal is removed to obtain an audio enhanced audio signal.
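Steps S21–S22 can be sketched with simple FFT-domain masking; the band edges, signal frequencies, and function name below are illustrative assumptions.

```python
import numpy as np

def remove_noise_band(signal, sample_rate, band):
    """Zero out frequency bins inside the estimated noise band [lo, hi] Hz
    (step S22), keeping the target signal's frequencies intact."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1 / sample_rate)
    lo, hi = band
    spectrum[(freqs >= lo) & (freqs <= hi)] = 0
    return np.fft.irfft(spectrum, n=len(signal))

fs = 16_000
t = np.arange(fs) / fs
target = np.sin(2 * np.pi * 300 * t)        # 300 Hz "speech" tone
noise = 0.5 * np.sin(2 * np.pi * 3000 * t)  # 3 kHz interference
# Step S21: the noise frequency range (here assumed estimated as 2.5-3.5 kHz)
clean = remove_noise_band(target + noise, fs, band=(2500, 3500))
```

Since the noise energy lies entirely inside the removed band while the target tone lies outside it, `clean` recovers the target signal.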
  • with the audio signal processing method of the embodiment of the present invention, it is possible to perform sound source localization and audio activity detection on audio signals and perform audio enhancement processing accordingly.
  • the entire audio signal processing process does not need to detect specific information about multiple interference sources, so it can adapt to complex and changeable interference environments and improve the signal-to-noise ratio of audio signals.
  • FIG. 5 shows a schematic block diagram of an audio signal processing apparatus according to an embodiment of the present invention.
  • the audio signal processing apparatus 500 may include:
  • An audio signal detection module 510 configured to receive audio signals using multiple sound collection devices of a microphone array, and determine whether the audio signals include a target audio signal;
  • a correlation determining module 520 configured to determine the correlation of multiple sound collection devices corresponding to the audio signal if the audio signal includes the target audio signal;
  • the audio signal enhancement module 530 is configured to perform audio enhancement processing on the audio signal by using the correlation between multiple sound collection devices corresponding to the audio signal to obtain an audio enhanced audio signal.
  • the audio signal detection module 510 may include:
  • the sound source localization unit is configured to locate the sound source of the audio signal and determine position information of the sound source in the audio signal.
  • the target audio acquisition unit is configured to acquire a target location area according to the position information of the sound source, and determine whether the audio signal from the target location area includes the target audio signal.
  • the correlation determining module 520 may specifically include:
  • a correlation matrix of multiple sound collection devices corresponding to the audio signal is established, and the value of the element in the correlation matrix represents the correlation between any two sound collection devices of the multiple sound collection devices corresponding to the audio signal.
  • the correlation determining module 520 may specifically include:
  • For one audio frame in the audio signal, it is determined whether the target audio signal is included in the audio frame through audio activity detection.
  • the correlation determination module 520 may specifically include:
  • An audio frame correlation determining unit configured to obtain each audio frame including a target audio signal in the audio signal and determine a covariance matrix corresponding to the audio frame, where the value of an element in that covariance matrix indicates the correlation between any two of the multiple sound collection devices corresponding to the audio frame;
  • the audio signal correlation determining unit is configured to determine a covariance matrix corresponding to the audio signal according to a covariance matrix corresponding to the audio frame, and use the covariance matrix corresponding to the audio signal as a correlation matrix of a plurality of sound collecting devices corresponding to the audio signal.
  • the rows of the covariance matrix represent the sorted sound collection devices; the columns of the covariance matrix represent the sound collection devices in the same order as the rows; the value of an element of the covariance matrix indicates the correlation between the sound collection device corresponding to the element's row and the sound collection device corresponding to the element's column.
  • the audio frame correlation determining unit is specifically configured to:
  • the rows of the audio sampling matrix represent the sorted sound collection devices;
  • the columns of the audio sampling matrix represent multiple audio sampling time points;
  • each element of the audio sampling matrix represents the audio characteristic collected by the sound collection device corresponding to the element's row at the sampling time corresponding to the element's column; the audio sampling matrix corresponding to the audio frame is then used to determine the covariance matrix corresponding to the audio frame.
  • the audio signal enhancement module 530 may specifically include:
  • a feature vector determining unit configured to use the feature vectors of the correlation matrix to construct a feature vector matrix of the audio signal, where each column of the feature vector matrix is one of the feature vectors of the correlation matrix, and the feature vector matrix represents a feature space in which the multiple sound collecting devices are mutually uncorrelated.
  • the audio signal enhancement module is also used to use the feature vector matrix of the audio signal to perform a feature space transformation on the audio signal to obtain an audio signal that removes the correlation between the sound collecting devices, and uses the audio signal that removes the correlation between the sound collecting devices as Audio enhanced audio signal.
  • the audio signal processing apparatus 500 may further include:
  • the target signal frequency determining unit is configured to determine a signal frequency of the target signal by using the audio signal as a target signal if the audio signal includes the target audio signal.
  • the noise signal frequency determining unit is configured to determine a signal frequency of the noise signal by using the audio signal as a noise signal if the audio signal includes a non-target audio signal.
  • the audio signal enhancement module 530 may be further configured to perform audio enhancement processing on the target signal based on the signal frequency of the target signal and the signal frequency of the noise signal to obtain an audio enhanced audio signal.
  • the audio signal enhancement module 530 may be further configured to:
  • filtering methods such as adaptive filtering are used to filter non-target audio signals in the audio signals to obtain enhanced audio signals.
  • audio enhancement processing may be performed on the audio signal by removing correlation interference between the sound collecting devices corresponding to the audio signal and adaptive signal frequency filtering to obtain an enhanced audio signal.
  • with the audio signal processing device of the embodiment of the invention, it is possible to detect in real time, in a noisy environment with multiple interference sources, whether the audio signal contains a target audio signal from the target area, determine the signal characteristics of the target audio signal according to the detection result, and perform audio enhancement processing on the audio signal to obtain an enhanced audio signal, improving the signal-to-noise ratio of the audio signal with better practicability.
  • FIG. 6 is a structural diagram showing an exemplary hardware architecture of a computing device capable of implementing an audio signal processing method and apparatus according to an embodiment of the present invention.
  • the computing device 600 includes an input device 601, an input interface 602, a central processing unit 603, a memory 604, an output interface 605, and an output device 606.
  • the input interface 602, the central processing unit 603, the memory 604, and the output interface 605 are connected to each other through a bus 610.
  • the input device 601 and the output device 606 are connected to the bus 610 through the input interface 602 and the output interface 605, respectively, and thereby to the other components of the computing device 600.
  • the input device 601 receives input information from the outside (for example, a microphone array), and transmits the input information to the central processing unit 603 through the input interface 602.
  • the central processing unit 603 processes the input information based on computer-executable instructions stored in the memory 604 to generate output information, stores the output information temporarily or permanently in the memory 604, and then transmits it to the output device 606 through the output interface 605; the output device 606 outputs the output information to the outside of the computing device 600 for the user.
  • the computing device shown in FIG. 6 may also be implemented to include a memory storing computer-executable instructions and a processor that, when executing those instructions, implements the audio signal processing method and apparatus described with reference to FIGS. 1 to 5.
  • the processor may communicate with a microphone array used by the audio interaction device and execute the computer-executable instructions based on related information from the audio interaction device, thereby implementing the audio signal processing method and apparatus described in conjunction with FIGS. 1 to 5.
  • the computing device 600 shown in FIG. 6 may be implemented as an audio signal processing device including a memory and a processor; the memory is used to store executable program code, and the processor is used to read the executable program code stored in the memory to perform the audio signal processing method described above in connection with FIGS. 1 to 5.
  • FIG. 7 is a schematic structural diagram of an audio interaction device according to an embodiment of the present invention. As shown in FIG. 7, an embodiment of the present invention provides an audio interactive device.
  • the audio interactive device 700 includes:
  • An audio signal detector 710 configured to receive audio signals using multiple sound collection devices of a microphone array, and determine whether a target audio signal is included in the audio signal;
  • a target audio separator 720 configured to determine the correlation between multiple sound collection devices corresponding to the audio signal if the audio signal includes the target audio signal;
  • the target audio enhancer 730 is configured to perform audio enhancement processing on the audio signal by using the correlation of multiple sound collection devices corresponding to the audio signal to obtain an audio enhanced audio signal.
  • the audio interactive device 700 further includes:
  • the target audio separator is also used to, if the audio signal includes the target audio signal, treat the audio signal as a target signal and determine the signal frequency of the target signal;
  • the interference audio separator 740 is configured to determine a signal frequency of the noise signal by using the audio signal as a noise signal if the audio signal includes a non-target audio signal;
  • the target audio enhancer 730 is further configured to perform audio enhancement processing on the target signal based on the signal frequency of the target signal and the signal frequency of the noise signal to obtain an audio enhanced audio signal.
  • adaptive audio enhancement can be realized in a noisy environment with multiple interference sources, thereby improving the signal-to-noise ratio of the audio signal and having better practicability.
  • the above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof.
  • when implemented in software, they may be implemented in whole or in part in the form of a computer program product or a computer-readable storage medium.
  • the computer program product or computer-readable storage medium includes one or more computer instructions.
  • when the computer program instructions are loaded and executed on a computer, the processes or functions according to the embodiments of the present invention are produced in whole or in part.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another by wire (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wirelessly (for example, infrared, radio, or microwave).
  • the computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or data center that integrates one or more available media.
  • the available medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid-state disk (SSD)).

Abstract

An audio signal processing method, apparatus, device, and storage medium. The method includes: receiving audio signals using multiple sound-collecting devices of a microphone array, and determining whether the audio signals include a target audio signal (S410); if the audio signals include a target audio signal, determining the correlation among the multiple sound-collecting devices corresponding to the audio signals (S420); and performing audio enhancement processing on the audio signals using that correlation, to obtain audio-enhanced audio signals (S430). The method achieves adaptive audio enhancement in a noisy environment with multiple interference sources and improves the signal-to-noise ratio of the audio signals.

Description

Audio signal processing method, apparatus, device, and storage medium
This application claims priority to Chinese patent application No. 201810882538.8, filed on July 30, 2018 and entitled "Audio signal processing method, apparatus, device, and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present invention relates to the field of data processing technology, and in particular to an audio signal processing method, apparatus, device, and storage medium.
Background
With the continuous development of audio recognition technology, it has advanced rapidly in fields such as automobile driving, smart homes, and intelligent business systems; by recognizing audio, corresponding functions can be executed quickly and accurately.
To keep an audio recognition system usable in a noisy environment with multiple interference sources, existing audio recognition techniques detect specific information about the interference sources and separate out the interference signals emitted by the detected sources to obtain the target audio signal. In a strongly interfered environment, however, the interference sources are complex and changeable; the audio signals collected by this kind of processing are of poor quality, the recognition signal-to-noise ratio is low, and practicality is limited.
Summary of the Invention
Embodiments of the present invention provide an audio signal processing method, apparatus, device, and storage medium, which can achieve adaptive audio enhancement in a noisy environment with multiple interference sources and improve the signal-to-noise ratio of audio signals.
According to one aspect of the embodiments of the present invention, an audio signal processing method is provided, including:
receiving audio signals using multiple sound-collecting devices of a microphone array and determining whether the audio signals include a target audio signal; if the audio signals include a target audio signal, determining the correlation among the multiple sound-collecting devices corresponding to the audio signals; and performing audio enhancement processing on the audio signals using that correlation, to obtain audio-enhanced audio signals.
According to another aspect of the embodiments of the present invention, an audio signal processing apparatus is provided, including:
an audio signal detection module, configured to receive audio signals using multiple sound-collecting devices of a microphone array and determine whether the audio signals include a target audio signal; a correlation determination module, configured to determine, if the audio signals include a target audio signal, the correlation among the multiple sound-collecting devices corresponding to the audio signals; and an audio signal enhancement module, configured to perform audio enhancement processing on the audio signals using that correlation, to obtain audio-enhanced audio signals.
According to a further aspect of the embodiments of the present invention, an audio signal processing device is provided, including a memory and a processor; the memory is used to store a program, and the processor is used to read the executable program code stored in the memory to perform the audio signal processing method described above.
According to yet another aspect of the embodiments of the present invention, a computer-readable storage medium is provided, which stores instructions that, when run on a computer, cause the computer to perform the audio signal processing method of the above aspects.
According to still another aspect of the embodiments of the present invention, an audio interaction device is provided, including:
an audio signal detector, configured to receive audio signals using multiple sound-collecting devices of a microphone array and determine whether the audio signals include a target audio signal; a target audio separator, configured to determine, if the audio signals include a target audio signal, the correlation among the multiple sound-collecting devices corresponding to the audio signals; and a target audio enhancer, configured to perform audio enhancement processing on the audio signals using that correlation, to obtain audio-enhanced audio signals.
According to the audio signal processing method, apparatus, device, and storage medium of the embodiments of the present invention, it is possible, in a noisy environment with multiple interference sources, to detect whether a target audio signal is present in the audio signals, determine from the detection result the correlation among the sound-collecting devices corresponding to the audio signals, and use that correlation to perform audio enhancement processing and obtain enhanced audio signals. The whole processing requires no detection of specific information about the individual interference sources; because the enhancement is independent of any particular interference source, it adapts to complex and changing interference environments and is therefore more practical.
Brief Description of the Drawings
To explain the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly introduced below; persons of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an application scenario of an audio signal processing method according to an exemplary embodiment of the present invention;
FIG. 2 is a schematic diagram of a microphone array performing sound source localization on a target area according to an embodiment of the present invention;
FIG. 3 is a flowchart of an audio signal processing method according to an embodiment of the present invention;
FIG. 4 is a flowchart of an audio signal processing method according to another embodiment of the present invention;
FIG. 5 is a schematic structural diagram of an audio signal processing apparatus according to an embodiment of the present invention;
FIG. 6 is a structural diagram of an exemplary hardware architecture of a computing device capable of implementing the audio signal processing method and apparatus according to embodiments of the present invention;
FIG. 7 is a schematic structural diagram of an audio interaction device according to an embodiment of the present invention.
Detailed Description
Features and exemplary embodiments of various aspects of the present invention are described in detail below. To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is further described with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are intended only to explain the present invention, not to limit it; those skilled in the art may practice the invention without some of these specific details. The following description of the embodiments is provided only to give a better understanding of the present invention by showing examples of it.
It should be noted that relational terms such as "first" and "second" are used here only to distinguish one entity or operation from another and do not necessarily require or imply any actual relationship or order between them. Moreover, the terms "comprise", "include", and their variants are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a list of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to it. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes it.
In embodiments of the present invention, audio interaction systems such as smart speakers, intelligent audio shopping machines, intelligent audio ticket machines, and intelligent audio elevators usually need to collect and process audio signals in noisy environments with multiple interference sources, such as shopping malls, subway stations, and social venues.
In the description of the following embodiments, a microphone array may be used to sample and process audio signals arriving from different spatial directions in such environments. Each acoustic sensor in the array, e.g. a microphone, may be called an array element, and each array includes at least two elements. Each element can be regarded as one sound-collection channel, so an array with multiple elements yields a multi-channel sound signal.
The microphone array in embodiments of the present invention may be a set of acoustic sensors at different spatial positions arranged according to a certain geometric rule; it is a device for spatially sampling sound signals propagating through space. The arrangement of the sensors is called the array's topology; by topology, microphone arrays can be divided into linear, planar, and volumetric (3D) arrays.
As an example, a linear array has its element centers on one straight line, e.g. a horizontal array; a planar array has its element centers distributed in a plane, e.g. triangular, circular, T-shaped, L-shaped, or square arrays; a volumetric array has its element centers distributed in 3D space, e.g. polyhedral or spherical arrays.
The audio signal processing method of the embodiments does not limit the specific form of the microphone array used. As an example, the array may be a horizontal, T-shaped, L-shaped, or cubic array. For simplicity, several embodiments below use an L-shaped array to describe the collection of multi-channel audio signals; this should not be read as limiting the scope or applicability of the solution, and arrays of other topologies are processed in the same way as the L-shaped array.
In embodiments of the present invention, practical audio processing scenarios usually contain multiple interference sources such as environmental noise, other interfering audio signals (e.g. competing speech), reverberation, and echo. Reverberation is the acoustic phenomenon in which a sound signal is superimposed with the sound waves formed by its own repeated reflection and absorption at obstacles during propagation. Echo, also called acoustic echo, is the repeated sound signal formed when the sound played by the audio interaction device's own loudspeaker propagates and reflects within the space and returns to the microphones, causing noise interference. Together, environmental noise, other audio interference, reverberation, and echo form a strongly interfering, complex, and changeable acoustic environment that degrades the quality of the audio signals collected by the audio signal processing system.
In embodiments of the present invention, the target audio signal is an audio signal from the target area that can drive the audio interaction device to interact. As an example, it may be a speech signal or a meaningful audio signal played by a machine, as long as it can drive the audio interaction device to perform audio interaction.
Taking audio ticket purchasing at a subway station as an example, a practical application scenario of the audio signal processing method of the embodiments is described below. FIG. 1 shows such a scenario according to an exemplary embodiment of the present invention.
As shown in FIG. 1, the audio ticketing environment of a subway station may include a ticket buyer 10 and an audio ticketing system 20, which may include a human-machine interaction display device 21 and an audio interaction device 22. The system 20 lets the buyer 10 purchase tickets through audio interaction, e.g. by naming a station, specifying a fare, or fuzzily searching a destination.
In one embodiment, the audio interaction device 22 may include a microphone array (not shown) and may use the multiple sound-collection channels provided by the array's elements to collect, in real time, sound signals from the actual ticketing environment.
Still referring to FIG. 1, in one embodiment the display device 21 may show suggested audio interaction commands, i.e. example commands that guide the buyer 10's interaction with the device 22, such as "I want to go to station B", "Buy two tickets to station C", or "Two tickets at fare A". Based on the destination in the buyer's command, after processing by the device 22, the display device 21 may call a map service to show the recommended subway lines and stations nearest that destination; it may also display payment information so that, after the buyer 10 pays according to it, the ticketing system 20 issues the ticket.
In the actual ticketing environment, the sound source of the buyer 10 may be taken as the target source. The audio signals the device 22 collects with its microphone array may include not only the target audio signal from that source but also non-target audio signals from multiple interference sources within the array's pickup range, including environmental noise, competing speech, reverberation, and echo.
As an example, environmental noise may include the running noise of subway trains and of ventilation and air-conditioning equipment; speech interference may be audio signals from people other than the buyer 10. In the following embodiments, non-target audio signals may also be called interference signals or noise signals.
To pick up valid audio signals in a noisy environment with multiple interference sources and provide robust audio recognition, embodiments of the present invention provide an audio signal processing method, apparatus, system, and storage medium that enhance the target audio in noisy public places with multiple interference sources, improving audio quality and signal-to-noise ratio.
For better understanding, the audio signal processing method according to embodiments of the present invention is described in detail below with reference to the drawings, using the subway-station audio ticketing environment as an example; these embodiments do not limit the disclosed scope of the present invention.
FIG. 2 shows a microphone array performing sound source localization on a target area according to an embodiment of the present invention.
In embodiments of the present invention, sound source localization means determining, from the audio signals collected by the microphone array in a practical scenario, the direction or spatial position of the signals' sources, thereby detecting a source's position or direction and establishing the spatial relationship between the array and the source. As an example, the source may be a person talking to the audio interaction device 22 through the array. In the subway ticketing environment, the array can detect the position and direction of the buyer 10 and determine the buyer's position or direction relative to the array.
As shown in FIG. 2, within the microphone array's pickup range the received audio signals may include signals from the buyer 10 as well as noise signals from interference sources such as noise 1, noise 2, and noise 3.
In embodiments of the present invention, localization with a microphone array commonly follows one of two approaches. The first computes the distance from each element to the source from the strength of the signal the two elements receive from the same source, and so judges the source's position; the second uses the time difference with which two elements receive the signal from the same source to localize it.
Because a strongly interfering environment with multiple sources greatly affects signal strength but has little effect on the time differences at the array elements, yielding more accurate detection, embodiments of the present invention may use time delay estimation (TDE) for sound source localization.
In one embodiment, TDE-based localization includes two steps. Step S01: use a time delay estimation algorithm to compute the time difference of arrival (TDOA) of the same source at different array elements. Step S02: estimate the source's position from those time differences and the geometric relationships between the array elements.
In embodiments of the present invention, common time delay estimation methods for step S01 include the generalized cross-correlation (GCC) method, the weighted generalized cross-correlation method, and adaptive time delay estimation. For ease of understanding, the methods are introduced below by computing the time difference with which a single source's signal arrives at two different array elements.
In one embodiment, the generalized cross-correlation method treats the signals two array elements receive from the same source as two microphone signals and computes their cross-correlation function, which describes the mutual relationship between them and thus measures their similarity and relative position on the time axis; a peak search on the cross-correlation function gives the instant of the peak, which is the delay difference between the two microphone signals.
In embodiments of the present invention, a sound signal is a periodic signal, and time-domain and frequency-domain analysis are two different ways of analyzing its periodicity. Simply put, the time domain describes the audio signal as a function of time, i.e. analyzing its dynamic changes with time as the variable, while the frequency domain describes the signal as a function of frequency, i.e. analyzing its characteristics at different frequencies.
In this embodiment, time-domain analysis gives fairly direct access to information such as the signal's period and amplitude; converting the signal to a frequency-domain representation and processing it there, by analyzing its spectral characteristics, can give higher processing efficiency and performance.
In one embodiment, the time-domain and frequency-domain representations of a sound signal can be converted into each other; for example, the Fourier transform converts a time-domain signal to the frequency domain, and the inverse Fourier transform converts a frequency-domain signal back to the time domain.
Specifically, the basic principle of the Fourier transform can be understood as representing a continuously measured sound signal as an infinite superposition of sinusoids of different frequencies. The transform therefore takes the directly measured signal as the original signal and computes, under this superposition, the frequencies, amplitudes, and phases of the different sinusoids within it, converting the time-domain signal to the frequency domain.
In one embodiment, in frequency-domain analysis the power spectral density function of a sound signal (hereafter the power spectrum) describes how the signal's power varies with frequency. The cross-power spectral density function of two microphone signals (hereafter the cross-spectrum) describes their mutual relationship in the frequency domain; each spectral line in it can be understood as an amplitude-weighted impulse function.
Hence the cross-spectrum and the cross-correlation function are two representations, in the frequency domain and the time domain respectively, of the mutual relationship between the two microphone signals in the above embodiment.
Therefore, as an example, when the generalized cross-correlation method is used for localization in embodiments of the present invention, the cross-power spectrum of the two microphone signals is first computed in the frequency domain and multiplied by an appropriate weighting function, which emphasizes the frequency components where the two signals have a high signal-to-noise ratio and suppresses the influence of interference sources. The weighted cross-power spectrum is then transformed back to the time domain by the inverse Fourier transform, yielding the cross-correlation function of the two microphone signals; the instant of its peak is the delay difference between the two signals.
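The frequency-domain procedure just described (cross-power spectrum, phase weighting, inverse transform, peak search) can be sketched in a few lines of NumPy. This is an illustrative PHAT-weighted sketch, not the patent's implementation; the function name and parameters are our own:

```python
import numpy as np

def gcc_phat(sig, ref):
    """Estimate the delay (in samples) of `sig` relative to `ref` using
    PHAT-weighted generalized cross-correlation."""
    n = len(sig) + len(ref)
    # Cross-power spectrum of the two microphone signals.
    cross = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    # PHAT weighting: keep only the phase by normalizing out the magnitude.
    cross /= np.abs(cross) + 1e-12
    # Inverse transform back to the time domain: the cross-correlation function.
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    # Re-center so negative lags come before positive lags, then peak-search.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return int(np.argmax(np.abs(cc))) - max_shift
```

For a pair of array elements, the returned sample delay divided by the sampling rate gives the TDOA that step S02 would combine with the array geometry.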
In one embodiment, the performance of the generalized cross-correlation method depends on the chosen weighting function. The most representative weighted variants are the maximum likelihood (ML) weighted cross-correlation algorithm and the phase transform (PHAT) weighted cross-correlation algorithm; users can choose according to the actual situation and computational requirements.
In this embodiment, ML weighting requires the power spectra of the source signal and of the interference sources to be known, so in the ideal case the ML-weighted generalized cross-correlation method has high accuracy and achieves the optimal estimate. However, since interference sources in real strongly interfering environments are complex and changeable, computing their power spectra is complicated and computationally expensive.
In this embodiment, phase-transform weighting uses prior information about the source and interference signals and chooses different weighting functions according to the source signal and the different interference signals. It does not require computing the power spectrum of either the source or the interference; under noise interference, the peak of the cross-correlation function of the two microphone signals remains prominent and easy to distinguish, showing relatively good robustness.
In one embodiment, adaptive time delay estimation does not rely on prior information about the sound signal; instead it continuously adjusts the parameters and structure of the functions involved in the delay computation according to changes in the signal in the actual scenario, and so estimates the time difference of arrival of the same source at two different array elements. Adaptive time delay estimation is therefore suitable for tracking dynamic, time-varying audio input environments.
In the multi-interference audio environment of embodiments of the present invention, the signal each array element receives may be superimposed with multiple reverberant components, so the cross-correlation function of two microphone signals may have multiple peaks; for this reason, the PHAT-weighted cross-correlation algorithm described above can be used for sound source localization.
Still referring to FIG. 2, a three-dimensional coordinate system corresponding to the microphone array can be established in advance for localization. As an example, the coordinate origin M0 of that system may be the center of the microphone array in the audio interaction device 22, the position of any one array element, or another designated position.
In one embodiment, the offset of each array element relative to the origin M0 can be determined from the ordering of and spacing between the elements in the array, giving each element Mi a three-dimensional coordinate relative to M0.
In one embodiment, suppose the buyer 10, as the target source, is at a spatial point S whose coordinates are S(x0, y0, z0), where x0, y0, and z0 are the coordinates of S on the X, Y, and Z axes of the three-dimensional coordinate system.
In this embodiment, the Cartesian coordinates of the point S and its coordinate vector (r0, θ0, φ0) satisfy:
x0 = r0 sin θ0 cos φ0, y0 = r0 sin θ0 sin φ0, z0 = r0 cos θ0,
or equivalently r0 = sqrt(x0² + y0² + z0²), θ0 = arccos(z0 / r0), φ0 = arctan(y0 / x0) (taken in the correct quadrant),
where r0 is the distance between the buyer 10's position S(x0, y0, z0) and the coordinate origin M0, the pitch angle θ0 is the angle between the line connecting S and M0 and the positive Z axis, and the horizontal angle φ0 is the angle between the positive X axis and the line connecting M0 and the projection S′ of S onto the XOY plane. The value range of the pitch angle θ0 may be 0° ≤ θ0 ≤ 360°.
In one embodiment, r0 may be called the distance, θ0 the pitch angle, and φ0 the horizontal angle of the spatial point S relative to the microphone array.
In embodiments of the present invention, after the time difference with which an audio signal arrives at two different array elements has been detected, the corresponding difference in distance from the same source to the two elements can be computed from it. In step S02 above, using these distance differences, the three-dimensional coordinates of each array element, and the source's three-dimensional coordinates, the source's position or direction relative to the array can be computed by geometric analysis.
In embodiments of the present invention, localization accuracy is related to the angles (pitch and/or horizontal) and the distance between the source and the array. To improve the array's localization accuracy and the processing efficiency of the audio interaction device 22, a target area for audio signal position information can be set in advance and sources detected within it, narrowing the device 22's sound-collection range and improving the accuracy and efficiency of tracking the source target.
In embodiments of the present invention, the spatial extent of the target audio signal's source positions can be determined from the actual application scenario. In the audio ticketing scenario, the buyer 10 is usually within a relatively fixed area close to the ticketing system 20, and audio signals from that area are more likely to contain the source position. Therefore, in one embodiment, the target area is set so that the coordinate vector of any spatial point R(xi, yi, zi) within it satisfies ri ≤ r_max, θi ≤ θ_max, and φi ≤ φ_max.
That is, within the spatial extent of the source positions, a point R is at a distance from the microphone array no greater than a preset maximum distance r_max, at a horizontal angle no greater than a preset maximum horizontal angle φ_max, and at a pitch angle no greater than a preset maximum pitch angle θ_max.
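Under the coordinate conventions above, checking whether a localized point lies inside the preset target region reduces to a Cartesian-to-spherical conversion and three comparisons. A minimal sketch (function names and thresholds are illustrative, not from the patent):

```python
import math

def to_spherical(x, y, z):
    """Cartesian point -> (r, theta, phi): distance to the origin M0,
    pitch angle from the positive Z axis, horizontal angle in the XOY plane."""
    r = math.sqrt(x * x + y * y + z * z)
    theta = math.degrees(math.acos(z / r)) if r > 0 else 0.0
    phi = math.degrees(math.atan2(y, x)) % 360.0
    return r, theta, phi

def in_target_region(point, r_max, theta_max, phi_max):
    """True if the point satisfies r <= r_max, theta <= theta_max, phi <= phi_max."""
    r, theta, phi = to_spherical(*point)
    return r <= r_max and theta <= theta_max and phi <= phi_max
```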
According to the audio signal processing method of the embodiments of the present invention, after localizing the sources of the audio signals, the audio signals from the target area, e.g. noise 1, noise 2, and the buyer 10 shown in FIG. 2, can be obtained. Setting a target area also effectively filters out some interference sources, especially audio interference from people who are not buying tickets, narrowing the detection range of source targets and improving the efficiency and precision of tracking them.
In one embodiment, voice activity detection (VAD) can be used to determine whether the audio signals from the target area include the target audio signal.
In embodiments of the present invention, a voice activity detection component can detect the start and end instants of the target audio signal within the audio signal and extract it, excluding silent segments and interference from non-target audio, reducing the computational load of the audio recognition system and improving its response speed.
In one embodiment, the voice activity detection component can be obtained by training in advance with positive samples labeled as target audio and negative samples labeled as non-target audio.
As an example, a neural network model can be built and trained with positive samples labeled as target audio and negative samples labeled as non-target audio, yielding a voice activity detection model from which the detection component is generated. It should be understood that the embodiments of the present invention do not limit the specific form of the neural network model, which may be, for example, any of a deep neural network, a recurrent neural network, or a convolutional neural network.
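The patent trains a neural-network detector; as a much simpler stand-in, per-frame activity can be decided by an energy threshold. This is only an illustrative placeholder for the detection component, not the patent's model:

```python
import numpy as np

def energy_vad(frame, threshold=0.01):
    """Return True if the frame's mean energy exceeds the threshold.
    `frame` is a 1-D array of samples; a trained detector would replace this."""
    return float(np.mean(frame ** 2)) > threshold
```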
As an example, when the target audio signal is a speech signal, the positive samples labeled as target audio may be audio signals containing the acoustic features of speech, and the negative samples labeled as non-target audio may be audio signals that do not contain them.
In one embodiment, to suppress the noise formed by multiple interference sources in a complex and changeable acoustic environment and improve the signal-to-noise ratio, the audio signals under the different detection outcomes can be accumulated separately according to the target audio activity detection results, and audio enhancement processing performed on the accumulated audio signals to obtain the enhanced audio signals.
With reference to FIG. 3, the specific process of accumulating the target and non-target audio signals separately according to the voice activity detection results of the audio signal, and of performing audio enhancement with those statistics, is described below.
FIG. 3 shows a flowchart of an audio signal processing method according to an embodiment of the present invention. As shown in FIG. 3, the method may include:
Step S310: receive audio signals using multiple sound-collecting devices of a microphone array, and determine whether the audio signals include a target audio signal.
In this step, the target audio signal is an audio signal from the target area that can drive the audio interaction device to interact. As an example, it may be a speech signal or a meaningful sound played by a machine, as long as it can drive the audio interaction device to interact.
In one embodiment, the collected audio signals can be localized using the sound source localization method described in the above embodiments, and the audio signals from the target area among the localized sources obtained; voice activity detection can then be performed on those signals to determine whether they include the target audio signal.
Specifically, voice activity detection on the audio signal can determine whether a target audio signal is present within each designated time segment of the signal, so that the target audio signal or the non-target audio signal in each designated segment can be accumulated separately.
For simplicity, several embodiments below take the time span of one audio frame as a designated time segment and process the frames of the audio signal in turn. This should not be read as limiting the scope or applicability of the solution; audio within other user-defined time segments is processed in the same way as a single frame.
In one embodiment, the sampling rate of an audio signal is the number of amplitude samples taken from the waveform per second, measured in hertz. As an example, a sampling rate of 16 kHz means the signal is sampled 16,000 times per second.
In one embodiment, the number of audio frames per unit time can be determined from the duration of each frame. As an example, a frame duration of 0.01 s, i.e. 10 ms, means every 10 ms of sampled audio forms one frame, i.e. 100 frames per second.
In this embodiment, if each frame is the processing unit of the audio signal, the number of samples per frame can be computed from the sampling rate and the frame duration. As an example, with a sampling rate of 16 kHz and a frame duration of 0.01 s, i.e. 16,000 samples per second and 100 frames per second, each frame contains 160 samples.
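The frame arithmetic in the example above is direct:

```python
sample_rate = 16000                                  # 16 kHz sampling
frame_ms = 10                                        # 10 ms per audio frame
samples_per_frame = sample_rate * frame_ms // 1000   # samples in one frame
frames_per_second = 1000 // frame_ms                 # frames per second

print(samples_per_frame, frames_per_second)          # 160 100
```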
For ease of understanding, the specific steps for accumulating the voice activity detection results of the audio signal are described below using the per-frame sample data as an example.
In embodiments of the present invention, if each sound-collecting device is regarded as collecting the audio signal along one spatial dimension, then within one audio frame, placing each device's samples in one row yields the frame's sample matrix. Each row of the sample matrix represents the audio features of the signal in the dimension corresponding to one device, and each column represents a different audio sampling time point.
The sample matrix of each frame can therefore be viewed as a multi-dimensional normal distribution of the signal's audio features over the frame's sampling interval, with each sound-collecting device one spatial dimension. Suppose some principal dimension of the frame samples best represents the audio features of the target audio data undisturbed by noise; since during collection that principal dimension is disturbed by the other dimensions, the correlation of the signal's dimensions across space and time must be determined.
Further, since a covariance matrix characterizes the correlation between different dimensions of the audio signal, the signal can be accumulated through its covariance matrix.
The specific steps for determining the covariance matrix of each frame of the audio signal are described below through concrete embodiments.
In one embodiment, the rows of the covariance matrix represent the ordered sound-collecting devices; its columns represent the devices in the same order as the rows; and the correlation between any two of the multiple devices corresponding to the frame is expressed as: the correlation between the device corresponding to an element's row position and the device corresponding to that element's column position.
In one embodiment, determining the covariance matrix corresponding to an audio frame may include the following steps:
determine the audio sample matrix corresponding to the frame, whose rows represent the ordered sound-collecting devices, whose columns represent multiple audio sampling time points, and whose elements represent the audio feature of the signal collected by the device at the element's row position at the sampling time point at the element's column position;
use the frame's audio sample matrix to determine the frame's covariance matrix.
For example, in the covariance matrix of each frame, the value of the element in row i and column j represents the correlation between the i-th and j-th sound-collecting devices of the array over the frame's time span; the larger the absolute value of that element, the stronger the correlation between the i-th and j-th devices.
For example, swapping the rows and columns of the frame's sample matrix yields its transpose, whose row/column relationship can be viewed as the mapping from audio sampling time points to sound-collecting devices over the frame's sampling duration.
In this embodiment, given the mapping from devices to sampling time points and the mapping from sampling time points to devices, the correlation between devices over the frame's sampling duration can be determined by composing the two mappings.
In one embodiment, this composition can be done by matrix multiplication. That is, multiplying the frame's sample matrix by its transpose yields a new matrix that characterizes the correspondence between the sound-collecting devices; this new matrix is the frame's covariance matrix.
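The per-frame computation just described (sample matrix times its transpose) can be sketched as follows; the per-channel mean removal is a standard addition of ours, not stated in the source:

```python
import numpy as np

def frame_covariance(frame):
    """frame: (n_channels, n_samples) sample matrix of one audio frame,
    one row per sound-collecting device. Returns the n_channels x n_channels
    covariance matrix; entry (i, j) measures the correlation between
    device i and device j over this frame."""
    centered = frame - frame.mean(axis=1, keepdims=True)
    return centered @ centered.T / frame.shape[1]
```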
From the above embodiments it follows that a covariance matrix can be computed for every frame of the audio signal.
Step S320: if the audio signal includes a target audio signal, determine the correlation among the multiple sound-collecting devices corresponding to the audio signal.
In one embodiment, for one frame of the audio signal, voice activity detection determines whether the frame includes the target audio signal.
In one embodiment, the step of determining the correlation among the multiple sound-collecting devices corresponding to the audio signal may specifically include:
building a correlation matrix of the multiple sound-collecting devices corresponding to the audio signal, where the value of each element represents the correlation between any two of those devices.
In one embodiment, the step of determining the correlation among the multiple sound-collecting devices corresponding to the audio signal may specifically include:
obtaining each frame of the audio signal that includes the target audio signal and determining the frame's covariance matrix, where each element's value represents the correlation between any two of the devices corresponding to the frame;
determining the audio signal's covariance matrix from the frames' covariance matrices, and using the signal's covariance matrix as the correlation matrix of the devices corresponding to the signal.
In this embodiment, following the per-frame covariance computation of the above embodiments, each frame's covariance matrix is combined with the previous frame's by matrix addition, iteratively computing the covariance matrix over the frames and yielding the audio signal's covariance matrix, so that the signal's covariance matrix is updated incrementally.
In this embodiment, the frames change dynamically; when a frame is detected to include the target audio signal, there is no need to first collect all target frames and then compute their joint covariance matrix. It suffices to compute the covariance matrix of the audio containing the target signal and incrementally update, by matrix operations, the correlation features of the sound-collecting devices corresponding to the signal, improving the statistical efficiency and computational performance for the frames' signal features.
In one embodiment, when performing the matrix addition of two frames' covariance matrices, different weight values can be set for frames from different sampling periods.
In embodiments of the present invention, to weaken the influence of audio from sampling periods much earlier in time on the audio of the current sampling period, and to improve the accuracy of the feature analysis, the weight of an earlier frame's covariance matrix can be set smaller than that of a later frame's when performing the matrix operation.
As an example, for a first and a second audio frame, if the first frame's sampling period is earlier than the second frame's, the weight of the first frame's covariance matrix can be set smaller than that of the second frame's.
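One way to realize the weighted incremental update is an exponential forgetting factor, so that earlier frames carry less weight than later ones. The specific weighting below is our assumption; the source only requires earlier weights to be smaller:

```python
import numpy as np

def update_covariance(running_cov, frame_cov, forget=0.95):
    """One incremental step: blend the running signal covariance with the
    newest frame's covariance; `forget` < 1 down-weights older frames."""
    return forget * running_cov + (1.0 - forget) * frame_cov
```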
Step S330: use the correlation among the multiple sound-collecting devices corresponding to the audio signal to perform audio enhancement processing on the signal and obtain an audio-enhanced audio signal.
From the above embodiments, the sound-collecting devices corresponding to the audio signal are correlated, and this correlation interferes with the target audio. Noise reduction should therefore weaken, as far as possible, the correlation between the devices corresponding to the signal containing the target audio.
From the description of the above embodiments, the elements on the main diagonal of a frame's covariance matrix are the variances of the microphone array in each dimension and can measure the signal's energy or weight in each dimension, while the off-diagonal elements can measure the correlation between any two of the multiple devices. To make the remaining inter-device correlation as small as possible, the off-diagonal elements of the signal's covariance matrix should be made as small as possible, for example set to 0.
In one embodiment, the signal's covariance matrix can be diagonalized, yielding a new matrix. After diagonalization, the main-diagonal elements of the new matrix are the eigenvalues of the signal's covariance matrix and the off-diagonal elements can all take the value 0; that is, through diagonalization, the remaining correlation between the devices has been reduced to its weakest, avoiding the noise interference caused by that correlation.
In one embodiment, step S330 may include:
Step S331: use the eigenvectors of the correlation matrix to build the signal's eigenvector matrix, each column of which is one of the correlation matrix's eigenvectors and which represents a feature space in which the multiple devices are mutually uncorrelated;
Step S332: use the signal's eigenvector matrix to perform a feature-space transformation on the signal, obtaining a signal from which the inter-device correlation has been removed, and use that decorrelated signal as the audio-enhanced signal.
That is, the inter-device correlation can be removed by diagonalizing the signal's covariance matrix, which may specifically include the following steps:
Step S11: perform eigendecomposition on the correlation matrix corresponding to the audio signal to obtain its eigenvalues and the corresponding eigenvectors.
Step S12: form the eigenvector matrix from the eigenvectors corresponding to the eigenvalues; this eigenvector matrix may also be called the projection matrix.
Step S13: use the projection matrix to perform a feature-space transformation on the signal's sample matrix, obtaining the sample matrix of the new, transformed audio signal.
In this step, the new sample matrix is the audio-enhanced audio signal; the correlation between the devices corresponding to the transformed signal has been reduced to its weakest, achieving noise reduction of the audio signal.
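Steps S11 to S13 amount to an eigendecomposition of the correlation matrix followed by a projection (a Karhunen-Loève-style transform). A minimal sketch:

```python
import numpy as np

def decorrelate(frame, cov):
    """Project the (n_channels, n_samples) frame onto the eigenvectors of the
    channel covariance matrix `cov`; in the transformed space the channel
    covariance is diagonal, i.e. inter-device correlation is removed."""
    _, eigvecs = np.linalg.eigh(cov)   # S11/S12: cov is symmetric, so eigh applies
    return eigvecs.T @ frame           # S13: feature-space transformation
```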
In one embodiment, the audio signal processing method further includes:
Step S340: if the audio signal includes a target audio signal, treat the signal as a target signal and determine the target signal's frequency; if the audio signal includes a non-target audio signal, treat the signal as a noise signal and determine the noise signal's frequency.
Step S350: based on the target signal's frequency and the noise signal's frequency, perform audio enhancement processing on the target signal to obtain an audio-enhanced audio signal.
In this embodiment, filtering the signal's frequencies keeps the audio at the specific frequencies and filters out audio at frequencies other than them, removing interference noise outside the frequency of the target audio signal.
In one embodiment, the signal can be filtered with an audio filter, which can be viewed as a frequency-selection device for the audio signal: it passes the audio at specific frequencies and attenuates audio at other frequencies, filtering out the interference noise in the signal. The specific frequencies may be, for example, the target signal's frequencies, and the other frequencies may be, for example, the noise signal's frequencies.
In this embodiment, the filtering keeps the audio at the specific frequencies and removes the audio within a filtering-frequency threshold range.
In embodiments of the present invention, the target signal can be used as the audio observation signal and the noise signal as the audio reference signal, both fed into the audio filter. The filter's filtering-frequency threshold range need not be a fixed frequency range; instead the reference signal can be used to make the threshold range follow the reference signal's frequency, eliminating the noise interference in the signal, making the filtering more targeted, and achieving adaptive filtering and audio enhancement of the signal.
In embodiments of the present invention, for a better enhancement effect, if the audio signal includes the target audio signal, the inter-device correlation of the signal can first be removed to obtain a decorrelated signal; next, the frequency range of the non-target audio is obtained and the filtering-frequency threshold range determined from it; finally, audio whose frequency lies within that threshold range is removed from the decorrelated signal, yielding the enhanced audio signal and a better enhancement effect.
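The observation/reference structure described here is that of a classic adaptive noise canceller. A normalized-LMS sketch follows; it is one of several possible adaptive filters, and the source does not specify the algorithm:

```python
import numpy as np

def nlms_cancel(observed, reference, taps=4, mu=0.1):
    """Adaptive noise cancellation: estimate the noise component of
    `observed` from the noise-only `reference` and subtract it.
    Returns the enhanced signal (the filter's error sequence)."""
    w = np.zeros(taps)
    out = np.zeros_like(observed)
    for n in range(taps - 1, len(observed)):
        x = reference[n - taps + 1:n + 1][::-1]   # current + recent reference
        e = observed[n] - w @ x                   # enhanced sample = error
        w += mu * e * x / (x @ x + 1e-8)          # normalized LMS update
        out[n] = e
    return out
```

The filter's pass behavior follows the reference: as the reference's spectrum changes, the weights adapt, which matches the non-fixed threshold range the text describes.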
According to the audio signal processing method of the embodiments of the present invention, it is possible to detect in real time whether a target audio signal is present in the audio signal; if so, the correlation features between the sound-collecting devices corresponding to the signal are determined and used to perform enhancement processing, yielding an enhanced target audio signal. The whole processing requires no detection of specific information about the multiple interference sources, so it adapts to complex and changing interference environments, improves the signal-to-noise ratio of the audio signal, and is more practical.
FIG. 4 shows a schematic flowchart of an audio signal processing method according to an embodiment of the present invention. As shown in FIG. 4, in one embodiment the audio signal processing method 400 includes the following steps:
Step S410: receive audio signals using multiple sound-collecting devices of a microphone array, and determine whether the audio signals include a target audio signal.
In one embodiment, the step in S410 of determining whether the audio signal includes a target audio signal may specifically include:
Step S411: perform sound source localization on the audio signal and determine the position information of the sound sources in it;
Step S412: obtain the target position area according to the sources' position information, and determine whether the audio signals from the target position area include a target audio signal.
Step S420: if the audio signal includes a target audio signal, determine the correlation among the multiple sound-collecting devices corresponding to it.
In one embodiment, the step of determining the correlation among the devices may specifically include: building a correlation matrix of the devices corresponding to the signal, where each element's value represents the correlation between any two of those devices.
In one embodiment, the step of determining whether the audio signal includes a target audio signal may specifically include:
performing sound source localization on the signal and determining the position information of its sources;
obtaining the target position area from that information and determining whether the signals from the target position area include a target audio signal.
In one embodiment, the step of determining whether the audio signal includes a target audio signal may specifically include: for one frame of the signal, determining by voice activity detection whether the frame includes the target audio signal.
In this step, the voice activity detection may specifically include: performing voice activity detection on the signal with a trained voice activity detection model, where the samples used to train the model may include acoustic features of speech signals.
In one embodiment, the step of determining the correlation among the devices may specifically include:
Step S421: obtain each frame of the signal that includes the target audio signal and determine the frame's covariance matrix, where each element's value represents the correlation between any two of the devices corresponding to the frame;
Step S422: determine the signal's covariance matrix from the frames' covariance matrices and use it as the devices' correlation matrix for the signal.
In one embodiment, the rows of the covariance matrix represent the ordered devices; the columns represent the devices in the same order as the rows; and the correlation between any two of the devices corresponding to the frame is the correlation between the device at an element's row position and the device at that element's column position.
In one embodiment, determining the frame's covariance matrix may include: determining the frame's audio sample matrix, whose rows represent the ordered devices, whose columns represent multiple sampling time points, and whose elements represent the audio feature collected by the device at the element's row position at the sampling time point at the element's column position; and determining the frame's covariance matrix from the frame's sample matrix.
Step S430: use the devices' correlation for the signal to perform audio enhancement processing on it, obtaining an audio-enhanced audio signal.
In one embodiment, the devices' correlation matrix for the signal can be used to perform the enhancement processing and obtain the audio-enhanced signal.
In one embodiment, step S430 may specifically include:
Step S431: use the eigenvectors of the correlation matrix to build the signal's eigenvector matrix, each column of which is one of the correlation matrix's eigenvectors and which represents a feature space in which the devices are mutually uncorrelated;
Step S432: use the signal's eigenvector matrix to perform a feature-space transformation on the signal, obtaining a signal with inter-device correlation removed, and use that decorrelated signal as the audio-enhanced signal.
In one embodiment, the method 400 may further include:
Step S440: if the signal includes a target audio signal, treat it as the target signal and determine the target signal's frequency.
Step S450: if the signal includes a non-target audio signal, treat it as a noise signal and determine the noise signal's frequency.
Step S460: based on the target signal's frequency and the noise signal's frequency, perform audio enhancement processing on the target signal to obtain an audio-enhanced signal.
In one embodiment, step S460 may specifically include:
Step S21: obtain the value range of the noise's signal frequency and use it as the noise frequency range;
Step S22: remove from the target signal the audio whose frequency lies within the noise frequency range, obtaining the audio-enhanced signal.
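Steps S21 and S22 can be realized, for example, as an FFT-domain band-stop: zero the bins inside the noise frequency range and reconstruct. This is one simple realization of ours, not a filter prescribed by the source:

```python
import numpy as np

def remove_noise_band(signal, fs, noise_band):
    """Remove from `signal` (1-D, sampled at `fs` Hz) the components whose
    frequency lies inside `noise_band` = (low_hz, high_hz)."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    low, high = noise_band
    spectrum[(freqs >= low) & (freqs <= high)] = 0.0  # S22: drop noise bins
    return np.fft.irfft(spectrum, n=len(signal))
```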
According to the audio signal processing method of the embodiments of the present invention, sound source localization and voice activity detection can be performed on the audio signal and audio enhancement processing applied incrementally; the whole processing requires no detection of specific information about the multiple interference sources, so it adapts to complex and changing interference environments and improves the signal-to-noise ratio of the audio signal.
FIG. 5 shows a schematic block diagram of an audio signal processing apparatus according to an embodiment of the present invention. As shown in FIG. 5, the audio signal processing apparatus 500 may include:
an audio signal detection module 510, configured to receive audio signals using multiple sound-collecting devices of a microphone array and determine whether the audio signals include a target audio signal;
a correlation determination module 520, configured to determine, if the audio signal includes a target audio signal, the correlation among the multiple sound-collecting devices corresponding to it;
an audio signal enhancement module 530, configured to use the devices' correlation for the signal to perform audio enhancement processing on it and obtain an audio-enhanced audio signal.
In one embodiment, the audio signal detection module 510 may include:
a sound source localization unit, configured to perform sound source localization on the signal and determine the position information of the sources in it;
a target audio acquisition unit, configured to obtain the target position area from the sources' position information and determine whether the signals from the target position area include a target audio signal.
In one embodiment, the correlation determination module 520 may specifically be configured to:
build a correlation matrix of the devices corresponding to the signal, where each element's value represents the correlation between any two of those devices.
In one embodiment, the correlation determination module 520 may specifically be configured to:
determine, for one frame of the signal, whether the frame includes the target audio signal by voice activity detection.
In one embodiment, the correlation determination module 520 may include:
an audio frame correlation determination unit, configured to obtain each frame of the signal that includes the target audio signal and determine the frame's covariance matrix, where each element's value represents the correlation between any two of the devices corresponding to the frame;
an audio signal correlation determination unit, configured to determine the signal's covariance matrix from the frames' covariance matrices and use it as the devices' correlation matrix for the signal.
In this embodiment, the rows of the covariance matrix represent the ordered devices; the columns represent the devices in the same order as the rows; and the correlation between any two of the devices corresponding to the frame is the correlation between the device at an element's row position and the device at that element's column position.
In this embodiment, the audio frame correlation determination unit is specifically configured to:
determine the frame's audio sample matrix, whose rows represent the ordered devices, whose columns represent multiple sampling time points, and whose elements represent the audio feature collected by the device at the element's row position at the sampling time point at the element's column position; and determine the frame's covariance matrix from the frame's sample matrix.
In one embodiment, the audio signal enhancement module 530 may specifically include:
an eigenvector determination unit, configured to use the eigenvectors of the correlation matrix to build the signal's eigenvector matrix, each column of which is one of the correlation matrix's eigenvectors and which represents a feature space in which the devices are mutually uncorrelated;
the audio signal enhancement module, further configured to use the signal's eigenvector matrix to perform a feature-space transformation on the signal, obtaining a signal with inter-device correlation removed, and use that decorrelated signal as the audio-enhanced signal.
In one embodiment, the audio signal processing apparatus 500 may further include:
a target signal frequency determination unit, configured to treat the signal as the target signal if it includes a target audio signal, and determine the target signal's frequency;
a noise signal frequency determination unit, configured to treat the signal as a noise signal if it includes a non-target audio signal, and determine the noise signal's frequency;
the audio signal enhancement module 530 may further be configured to perform, based on the target signal's frequency and the noise signal's frequency, audio enhancement processing on the target signal to obtain an audio-enhanced signal.
In one embodiment, the audio signal enhancement module 530 may further be configured to:
obtain the value range of the noise's signal frequency and use it as the noise frequency range; and remove from the target signal the audio whose frequency lies within that range, obtaining the audio-enhanced signal.
In this embodiment, filtering methods such as adaptive filtering are used to filter out the non-target audio in the signal, obtaining the enhanced signal.
In this embodiment, the signal can be enhanced by removing the correlation interference between the devices corresponding to it and by adaptive signal-frequency filtering, obtaining the enhanced signal.
According to the audio signal processing apparatus of the embodiments of the invention, it is possible to detect in real time, in a noisy environment with multiple interference sources, whether a target audio signal from the target area is present in the signal, determine the target audio's signal features from the detection result, and perform audio enhancement processing on the signal to obtain the enhanced signal, improving the signal-to-noise ratio; the apparatus is therefore more practical.
For specific details of the audio signal processing apparatus of the embodiments, refer to the corresponding processes in the foregoing method embodiments; they are not repeated here.
FIG. 6 is a structural diagram of an exemplary hardware architecture of a computing device capable of implementing the audio signal processing method and apparatus according to embodiments of the present invention.
As shown in FIG. 6, the computing device 600 includes an input device 601, an input interface 602, a central processing unit 603, a memory 604, an output interface 605, and an output device 606. The input interface 602, central processing unit 603, memory 604, and output interface 605 are interconnected through a bus 610, and the input device 601 and output device 606 are connected to the bus 610 through the input interface 602 and output interface 605 respectively, and thereby to the other components of the computing device 600. Specifically, the input device 601 receives input information from outside (for example, a microphone array) and transmits it through the input interface 602 to the central processing unit 603; the central processing unit 603 processes the input information based on computer-executable instructions stored in the memory 604 to generate output information, stores the output information temporarily or permanently in the memory 604, and then transmits it through the output interface 605 to the output device 606; the output device 606 outputs the information to the outside of the computing device 600 for the user.
That is, the computing device shown in FIG. 6 may also be implemented to include a memory storing computer-executable instructions and a processor that, when executing those instructions, implements the audio signal processing method and apparatus described with reference to FIGS. 1 to 5. Here the processor may communicate with the microphone array used by the audio interaction device and execute the computer-executable instructions based on related information from that device, thereby implementing the method and apparatus described with reference to FIGS. 1 to 5.
In one embodiment, the computing device 600 shown in FIG. 6 may be implemented as an audio signal processing device including a memory and a processor; the memory is used to store executable program code, and the processor is used to read the executable program code stored in the memory to perform the audio signal processing method described above with reference to FIGS. 1 to 5.
For specific details of the computing device of the embodiments, refer to the corresponding processes in the foregoing method embodiments; they are not repeated here.
FIG. 7 shows a schematic structural diagram of an audio interaction device according to an embodiment of the present invention. As shown in FIG. 7, an embodiment of the present invention provides an audio interaction device; in one embodiment the audio interaction device 700 includes:
an audio signal detector 710, configured to receive audio signals using multiple sound-collecting devices of a microphone array and determine whether the audio signals include a target audio signal;
a target audio separator 720, configured to determine, if the audio signal includes a target audio signal, the correlation among the multiple sound-collecting devices corresponding to it;
a target audio enhancer 730, configured to use that correlation to perform audio enhancement processing on the signal and obtain an audio-enhanced audio signal.
In one embodiment, the audio interaction device 700 further includes:
the target audio separator, further configured to treat the signal as the target signal if it includes a target audio signal, and determine the target signal's frequency;
an interference audio separator 740, configured to treat the signal as a noise signal if it includes a non-target audio signal, and determine the noise signal's frequency;
the target audio enhancer 730, further configured to perform, based on the target signal's frequency and the noise signal's frequency, audio enhancement processing on the target signal to obtain an audio-enhanced signal.
For specific details of the audio interaction device of the embodiments, refer to the corresponding processes in the foregoing method embodiments; they are not repeated here.
According to the audio interaction device of the embodiments of the invention, adaptive audio enhancement can be achieved in a noisy environment with multiple interference sources, improving the signal-to-noise ratio of the audio signal; the device is therefore more practical.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product or a computer-readable storage medium, which includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another by wire (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wirelessly (for example, infrared, radio, or microwave). The computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid-state disk (SSD)).
It should be made clear that the present invention is not limited to the specific configurations and processing described above and shown in the figures. For brevity, detailed descriptions of known methods are omitted here. In the above embodiments, several specific steps are described and shown as examples, but the method of the present invention is not limited to the specific steps described and shown; those skilled in the art, having grasped the spirit of the invention, may make various changes, modifications, and additions, or change the order of the steps.
The above are only specific embodiments of the present invention. Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, modules, and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here. It should be understood that the protection scope of the present invention is not limited thereto; anyone familiar with the technical field can easily conceive of various equivalent modifications or replacements within the technical scope disclosed by the present invention, and such modifications or replacements shall fall within the protection scope of the present invention.

Claims (24)

  1. An audio signal processing method, comprising:
    receiving audio signals using multiple sound-collecting devices of a microphone array, and determining whether the audio signals include a target audio signal;
    if the audio signals include a target audio signal, determining the correlation among the multiple sound-collecting devices corresponding to the audio signals;
    performing audio enhancement processing on the audio signals by using the correlation among the multiple sound-collecting devices corresponding to the audio signals, to obtain audio-enhanced audio signals.
  2. The audio signal processing method according to claim 1, wherein the determining the correlation among the multiple sound-collecting devices corresponding to the audio signals comprises:
    building a correlation matrix of the multiple sound-collecting devices corresponding to the audio signals, wherein the value of an element in the correlation matrix represents the correlation between any two sound-collecting devices among the multiple sound-collecting devices corresponding to the audio signals.
  3. The audio signal processing method according to claim 1, wherein the determining whether the audio signals include a target audio signal comprises:
    performing sound source localization on the audio signals to determine position information of the sound sources in the audio signals;
    obtaining a target position area according to the position information of the sound sources, and determining whether the audio signals from the target position area include a target audio signal.
  4. The audio signal processing method according to claim 1, wherein the determining whether the audio signals include a target audio signal comprises:
    for one audio frame of the audio signals, determining, by voice activity detection, whether the audio frame includes the target audio signal.
  5. The audio signal processing method according to claim 1, wherein the determining the correlation among the multiple sound-collecting devices corresponding to the audio signals comprises:
    obtaining each audio frame of the audio signals that includes the target audio signal, and determining a covariance matrix corresponding to the audio frame, wherein the value of an element in the audio frame's covariance matrix represents the correlation between any two sound-collecting devices among the multiple sound-collecting devices corresponding to the audio frame;
    determining a covariance matrix corresponding to the audio signals according to the audio frames' covariance matrices, and using the audio signals' covariance matrix as the correlation matrix of the multiple sound-collecting devices corresponding to the audio signals.
  6. The audio signal processing method according to claim 5, wherein
    the rows of the covariance matrix represent the sound-collecting devices after ordering;
    the columns of the covariance matrix represent the sound-collecting devices in the same order as the rows of the covariance matrix;
    the correlation between any two of the multiple sound-collecting devices corresponding to the audio frame means: the correlation between the sound-collecting device corresponding to the row position of an element of the covariance matrix and the sound-collecting device corresponding to the column position of that element.
  7. The audio signal processing method according to claim 5, wherein the determining the covariance matrix corresponding to the audio frame comprises:
    determining an audio sample matrix corresponding to the audio frame, wherein the rows of the audio sample matrix represent the ordered sound-collecting devices, the columns of the audio sample matrix represent multiple audio sampling time points, and an element of the audio sample matrix represents the audio feature of the audio signal collected, by the sound-collecting device corresponding to the element's row position, at the sampling time point corresponding to the element's column position;
    determining the covariance matrix corresponding to the audio frame by using the audio sample matrix corresponding to the audio frame.
  8. The audio signal processing method according to claim 2, wherein the performing audio enhancement processing on the audio signals by using the correlation matrix of the multiple sound-collecting devices corresponding to the audio signals, to obtain audio-enhanced audio signals, comprises:
    building an eigenvector matrix of the audio signals by using the eigenvectors of the correlation matrix, wherein each column of the eigenvector matrix is one of the eigenvectors of the correlation matrix, and the eigenvector matrix represents a feature space in which the multiple sound-collecting devices are mutually uncorrelated;
    performing a feature-space transformation on the audio signals by using the eigenvector matrix of the audio signals, to obtain audio signals from which the correlation between the sound-collecting devices has been removed, and using those decorrelated audio signals as the audio-enhanced audio signals.
  9. The audio signal processing method according to claim 1, further comprising:
    if the audio signals include a target audio signal, treating the audio signals as a target signal and determining the signal frequency of the target signal;
    if the audio signals include a non-target audio signal, treating the audio signals as a noise signal and determining the signal frequency of the noise signal;
    performing audio enhancement processing on the target signal based on the signal frequency of the target signal and the signal frequency of the noise signal, to obtain an audio-enhanced audio signal.
  10. The audio signal processing method according to claim 9, wherein the performing audio enhancement processing on the target signal based on the signal frequency of the target signal and the signal frequency of the noise, to obtain an audio-enhanced audio signal, comprises:
    obtaining the value range of the noise's signal frequency, and using the value range of the noise's signal frequency as a noise frequency range;
    removing from the target signal the audio signals whose signal frequency lies within the noise frequency range, to obtain the audio-enhanced audio signal.
  11. An audio signal processing apparatus, comprising:
    an audio signal detection module, configured to receive audio signals using multiple sound-collecting devices of a microphone array and determine whether the audio signals include a target audio signal;
    a correlation determination module, configured to determine, if the audio signals include a target audio signal, the correlation among the multiple sound-collecting devices corresponding to the audio signals;
    an audio signal enhancement module, configured to perform audio enhancement processing on the audio signals by using the correlation among the multiple sound-collecting devices corresponding to the audio signals, to obtain audio-enhanced audio signals.
  12. The audio signal processing apparatus according to claim 11, wherein the correlation determination module is specifically configured to:
    build a correlation matrix of the multiple sound-collecting devices corresponding to the audio signals, wherein the value of an element in the correlation matrix represents the correlation between any two sound-collecting devices among the multiple sound-collecting devices corresponding to the audio signals.
  13. The audio signal processing apparatus according to claim 11, wherein the audio signal detection module comprises:
    a sound source localization unit, configured to perform sound source localization on the audio signals to determine position information of the sound sources in the audio signals;
    a target area determination unit, configured to obtain a target position area according to the position information of the sound sources and determine whether the audio signals from the target position area include a target audio signal.
  14. The audio signal processing apparatus according to claim 11, wherein the correlation determination module is specifically configured to:
    for one audio frame of the audio signals, determine, by voice activity detection, whether the audio frame includes the target audio signal.
  15. The audio signal processing apparatus according to claim 11, wherein the correlation determination module comprises:
    an audio frame correlation determination unit, configured to obtain each audio frame of the audio signals that includes the target audio signal and determine a covariance matrix corresponding to the audio frame, wherein the value of an element in the audio frame's covariance matrix represents the correlation between any two sound-collecting devices among the multiple sound-collecting devices corresponding to the audio frame;
    an audio signal correlation determination unit, configured to determine a covariance matrix corresponding to the audio signals according to the audio frames' covariance matrices, and use the audio signals' covariance matrix as the correlation matrix of the multiple sound-collecting devices corresponding to the audio signals.
  16. The audio signal processing apparatus according to claim 15, wherein
    the rows of the covariance matrix represent the sound-collecting devices after ordering;
    the columns of the covariance matrix represent the sound-collecting devices in the same order as the rows of the covariance matrix;
    the correlation between any two of the multiple sound-collecting devices corresponding to the audio frame means: the correlation between the sound-collecting device corresponding to the row position of an element of the covariance matrix and the sound-collecting device corresponding to the column position of that element.
  17. The audio signal processing apparatus according to claim 15, wherein the audio frame correlation determination unit is specifically configured to:
    determine an audio sample matrix corresponding to the audio frame, wherein the rows of the audio sample matrix represent the ordered sound-collecting devices, the columns represent multiple audio sampling time points, and an element of the audio sample matrix represents the audio feature of the audio signal collected, by the sound-collecting device corresponding to the element's row position, at the sampling time point corresponding to the element's column position;
    determine the covariance matrix corresponding to the audio frame by using the audio sample matrix corresponding to the audio frame.
  18. The audio signal processing apparatus according to claim 12, wherein the audio signal enhancement module comprises:
    an eigenvector determination unit, configured to build an eigenvector matrix of the audio signals by using the eigenvectors of the correlation matrix, wherein each column of the eigenvector matrix is one of the eigenvectors of the correlation matrix, and the eigenvector matrix represents a feature space in which the multiple sound-collecting devices are mutually uncorrelated;
    the audio signal enhancement module being further configured to perform a feature-space transformation on the audio signals by using the eigenvector matrix of the audio signals, to obtain audio signals from which the correlation between the sound-collecting devices has been removed, and use those decorrelated audio signals as the audio-enhanced audio signals.
  19. The audio signal processing apparatus according to claim 11, further comprising:
    a target signal frequency determination unit, configured to, if the audio signals include a target audio signal, treat the audio signals as a target signal and determine the signal frequency of the target signal;
    a noise signal frequency determination unit, configured to, if the audio signals include a non-target audio signal, treat the audio signals as a noise signal and determine the signal frequency of the noise signal;
    the audio signal enhancement module being further configured to perform audio enhancement processing on the target signal based on the signal frequency of the target signal and the signal frequency of the noise signal, to obtain an audio-enhanced audio signal.
  20. The audio signal processing apparatus according to claim 19, wherein the audio signal enhancement module is further configured to:
    obtain the value range of the noise's signal frequency, and use the value range of the noise's signal frequency as a noise frequency range;
    remove from the target signal the audio signals whose signal frequency lies within the noise frequency range, to obtain the audio-enhanced audio signal.
  21. An audio signal processing device, comprising a memory and a processor;
    the memory is configured to store executable program code;
    the processor is configured to read the executable program code stored in the memory to perform the audio signal processing method according to any one of claims 1 to 10.
  22. A computer-readable storage medium, wherein the computer-readable storage medium comprises instructions that, when run on a computer, cause the computer to perform the audio signal processing method according to any one of claims 1 to 10.
  23. An audio interaction device, comprising:
    an audio signal detector, configured to receive audio signals using multiple sound-collecting devices of a microphone array and determine whether the audio signals include a target audio signal;
    a target audio separator, configured to determine, if the audio signals include a target audio signal, the correlation among the multiple sound-collecting devices corresponding to the audio signals;
    a target audio enhancer, configured to perform audio enhancement processing on the audio signals by using the correlation among the multiple sound-collecting devices corresponding to the audio signals, to obtain audio-enhanced audio signals.
  24. The audio interaction device according to claim 23, further comprising:
    the target audio separator being further configured to, if the audio signals include a target audio signal, treat the audio signals as a target signal and determine the signal frequency of the target signal;
    an interference audio separator, configured to, if the audio signals include a non-target audio signal, treat the audio signals as a noise signal and determine the signal frequency of the noise signal;
    the target audio enhancer being further configured to perform audio enhancement processing on the target signal based on the signal frequency of the target signal and the signal frequency of the noise signal, to obtain an audio-enhanced audio signal.
PCT/CN2019/096813 2018-07-30 2019-07-19 Audio signal processing method, apparatus, device and storage medium WO2020024816A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810882538.8 2018-07-30
CN201810882538.8A CN110782911A (zh) Audio signal processing method, apparatus, device and storage medium

Publications (1)

Publication Number Publication Date
WO2020024816A1 true WO2020024816A1 (zh) 2020-02-06

Family

ID=69231022

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/096813 WO2020024816A1 (zh) 2018-07-30 2019-07-19 音频信号处理方法、装置、设备和存储介质

Country Status (2)

Country Link
CN (1) CN110782911A (zh)
WO (1) WO2020024816A1 (zh)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113516989A (zh) * 2020-03-27 2021-10-19 浙江宇视科技有限公司 Sound source audio management method, apparatus, device and storage medium
CN115552519A (zh) * 2020-05-11 2022-12-30 三菱电机楼宇解决方案株式会社 Sound source determination device, sound source determination method, and sound source determination program
WO2022000174A1 (zh) * 2020-06-29 2022-01-06 深圳市大疆创新科技有限公司 Audio processing method, audio processing apparatus, and electronic device
CN112837703A (zh) * 2020-12-30 2021-05-25 深圳市联影高端医疗装备创新研究院 Method, apparatus, device and medium for acquiring speech signals in medical imaging equipment
CN113311391A (zh) * 2021-04-25 2021-08-27 普联国际有限公司 Microphone-array-based sound source localization method, apparatus, device and storage medium
CN113744750B (zh) * 2021-07-27 2022-07-05 北京荣耀终端有限公司 Audio processing method and electronic device
CN114355289B (zh) * 2022-03-19 2022-06-10 深圳市烽火宏声科技有限公司 Sound source localization method, apparatus, storage medium and computer device
CN117537918A (zh) * 2023-11-30 2024-02-09 广东普和检测技术有限公司 Indoor noise detection method and related apparatus

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101432805A * 2006-05-02 2009-05-13 Qualcomm Incorporated Enhancement techniques for blind source separation (BSS)
CN101515197A * 2008-02-19 2009-08-26 Hitachi, Ltd. Acoustic pointing device, method for indicating a sound source position, and computer system
CN102831898A * 2012-08-31 2012-12-19 Xiamen University Microphone array speech enhancement apparatus with sound source direction tracking and method thereof
CN102969000A * 2012-12-04 2013-03-13 Institute of Automation, Chinese Academy of Sciences Multi-channel speech enhancement method
CN105989851A * 2015-02-15 2016-10-05 Dolby Laboratories Licensing Corporation Audio source separation

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3654470B2 * 1996-09-13 2005-06-02 Nippon Telegraph and Telephone Corporation Echo cancellation method for subband multi-channel audio conferencing
EP2560161A1 * 2011-08-17 2013-02-20 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Optimal mixing matrices and usage of decorrelators in spatial audio processing
JP2016042132A * 2014-08-18 2016-03-31 Sony Corporation Speech processing apparatus, speech processing method, and program
CN107564539B * 2017-08-29 2021-12-28 Suzhou Qimengzhe Network Technology Co., Ltd. Acoustic echo cancellation method and apparatus for microphone arrays
CN108122563B * 2017-12-19 2021-03-30 Beijing SoundAI Technology Co., Ltd. Method for improving voice wake-up rate and correcting DOA

Also Published As

Publication number Publication date
CN110782911A (zh) 2020-02-11

Similar Documents

Publication Publication Date Title
WO2020024816A1 (zh) Audio signal processing method, apparatus, device, and storage medium
US11398235B2 (en) Methods, apparatuses, systems, devices, and computer-readable storage media for processing speech signals based on horizontal and pitch angles and distance of a sound source relative to a microphone array
CN110556103B (zh) Audio signal processing method, apparatus, system, device, and storage medium
Vu et al. Blind speech separation employing directional statistics in an expectation maximization framework
EP2530484B1 (en) Sound source localization apparatus and method
CN109599124A (zh) Audio data processing method, apparatus, and storage medium
TWI647961B (zh) Method and apparatus for determining directions of uncorrelated sound sources in a higher-order Ambisonics representation of a sound field
Liu et al. Continuous sound source localization based on microphone array for mobile robots
CN110491403A (zh) Audio signal processing method, apparatus, medium, and audio interaction device
CN102147458B (zh) Direction-of-arrival estimation method for a wideband sound source and apparatus thereof
CN102103200A (zh) Sound source spatial localization method for distributed asynchronous acoustic sensors
CN111239687A (zh) Sound source localization method and system based on a deep neural network
CN103278801A (zh) Substation noise imaging detection device and detection calculation method
Traa et al. Multichannel source separation and tracking with RANSAC and directional statistics
Velasco et al. Novel GCC-PHAT model in diffuse sound field for microphone array pairwise distance based calibration
CN112394324A (zh) Method and system for long-distance sound source localization based on a microphone array
Hu et al. Decoupled direction-of-arrival estimations using relative harmonic coefficients
Hosseini et al. Time difference of arrival estimation of sound source using cross correlation and modified maximum likelihood weighting function
Jia et al. Multi-source DOA estimation in reverberant environments using potential single-source points enhancement
Tourbabin et al. Speaker localization by humanoid robots in reverberant environments
CN112363112A (zh) Sound source localization method and apparatus based on a linear microphone array
Hu et al. Evaluation and comparison of three source direction-of-arrival estimators using relative harmonic coefficients
CN116106826A (zh) Sound source localization method, related apparatus, and medium
Oualil et al. A probabilistic framework for multiple speaker localization
ÇATALBAŞ et al. 3D moving sound source localization via conventional microphones

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19845228

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19845228

Country of ref document: EP

Kind code of ref document: A1