WO2021103672A1 - Audio data processing method and apparatus, and electronic device and storage medium

Info

Publication number
WO2021103672A1
WO2021103672A1 (PCT/CN2020/110038)
Authority
WO
WIPO (PCT)
Prior art keywords: audio data, channel audio, time, channel, mask
Application number: PCT/CN2020/110038
Other languages: French (fr), Chinese (zh)
Inventor: 罗大为 (Luo Dawei)
Original assignee: 北京搜狗科技发展有限公司 (Beijing Sogou Technology Development Co., Ltd.)
Application filed by 北京搜狗科技发展有限公司
Publication of WO2021103672A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0224 Processing in the time domain
    • G10L21/0232 Processing in the frequency domain
    • G10L21/0264 Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L2021/02082 Noise filtering the noise being echo, reverberation of the speech
    • G10L2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 Microphone arrays; Beamforming

Definitions

  • This application relates to the field of audio data processing, and in particular to an audio data processing method and device, electronic equipment, and storage medium.
  • Microphone array technology usually relies on a unified array system with synchronous acquisition, and such a unified, synchronously acquiring array system places high requirements on hardware design, manufacturing, and deployment.
  • In order to overcome the above problems, or at least partially solve them, an audio data processing method and device, electronic equipment, and a storage medium are proposed, including:
  • a method for audio data processing comprising:
  • acquiring first multi-channel audio data, wherein the first multi-channel audio data is composed of audio data collected by one or more microphone arrays;
  • the audio signal output is performed by using the first single-channel audio data.
  • the step of performing beamforming processing on the second multi-channel audio data according to the time-frequency mask to obtain the first single-channel audio data includes:
  • the beam weight is used to perform beamforming processing on the second multi-channel audio data to obtain the first single-channel audio data.
  • the time-frequency mask includes a target speech mask and an interference noise mask
  • the step of determining a channel transfer function and an interference noise covariance matrix according to the time-frequency mask includes:
  • the interference noise covariance matrix is calculated.
  • the step of generating a time-frequency mask for the second multi-channel audio data includes:
  • a time-frequency mask for the second multi-channel audio data is determined.
  • the step of determining a time-frequency mask for the second multi-channel audio data according to the first time-frequency mask includes:
  • obtaining class-target voice data corresponding to the first time-frequency mask, and combining the class-target voice data to generate a second time-frequency mask for target voice data in the second multi-channel audio data, wherein the class-target voice data includes the target voice data;
  • the step of using the first single-channel audio data to output an audio signal includes:
  • the second single-channel audio data is used for audio signal output.
  • the step of using the second single-channel audio data to output an audio signal includes:
  • the third single-channel audio data is used for audio signal output.
  • the step of performing de-reverberation processing on the first multi-channel audio data to obtain second multi-channel audio data includes:
  • the de-reverberation parameter is used to perform de-reverberation processing on the first multi-channel audio data to obtain the second multi-channel audio data.
  • the method also includes:
  • the first single-channel audio data and/or the second single-channel audio data and/or the third single-channel audio data are used to iteratively update the dereverberation parameters.
  • the method before the step of performing de-reverberation processing on the first multi-channel audio data to obtain second multi-channel audio data, the method further includes:
  • the audio data in the first multi-channel audio data is aligned.
  • An audio data processing device comprising:
  • the first multi-channel audio data acquisition module is configured to acquire first multi-channel audio data; wherein, the first multi-channel audio data is composed of audio data collected by one or more microphone arrays;
  • a de-reverberation processing module configured to perform de-reverberation processing on the first multi-channel audio data to obtain second multi-channel audio data
  • a time-frequency mask generating module configured to generate a time-frequency mask for the second multi-channel audio data
  • a beamforming processing module configured to perform beamforming processing on the second multi-channel audio data according to the time-frequency mask to obtain first single-channel audio data
  • the audio signal output module is configured to use the first single-channel audio data to output audio signals.
  • the beamforming processing module includes:
  • the function and matrix determination sub-module is used to determine the channel transfer function and the interference noise covariance matrix according to the time-frequency mask
  • the beam weight determination sub-module is configured to use the channel transfer function and the interference and noise covariance matrix to determine the beam weight
  • the first single-channel audio data obtaining submodule is configured to use the beam weight to perform beamforming processing on the second multi-channel audio data to obtain the first single-channel audio data.
  • the time-frequency mask includes a target speech mask and an interference noise mask
  • the function and matrix determination submodule includes:
  • a target speech covariance matrix generating unit configured to use the target speech mask to generate a target speech covariance matrix
  • a channel transfer function obtaining unit configured to use the target voice covariance matrix to calculate the channel transfer function
  • the interference noise covariance matrix obtaining unit is configured to use the interference noise mask to calculate the interference noise covariance matrix.
  • the time-frequency mask generation module includes:
  • the first time-frequency mask generation sub-module is configured to generate a first time-frequency mask for class-target voice data in the second multi-channel audio data
  • the time-frequency mask determination sub-module is configured to determine a time-frequency mask for the second multi-channel audio data according to the first time-frequency mask.
  • the time-frequency mask determination sub-module includes:
  • the class target voice data obtaining unit is configured to obtain class target voice data corresponding to the first time-frequency mask
  • the second time-frequency mask generating unit is configured to combine the class-target voice data to generate a second time-frequency mask for the target voice data in the second multi-channel audio data; wherein the class-target voice data includes the target voice data;
  • the time-frequency mask combining unit is configured to combine the first time-frequency mask and the second time-frequency mask to generate a time-frequency mask for the second multi-channel audio data.
  • the first audio signal output module includes:
  • An adaptive filter processing sub-module configured to perform adaptive filter processing on the first single-channel audio data to obtain second single-channel audio data
  • the second audio signal output sub-module is configured to use the second single-channel audio data to output audio signals.
  • the second audio signal output submodule includes:
  • the current application type determining unit is used to determine the current application type
  • a third single-channel audio data obtaining unit configured to adopt a single-channel noise reduction strategy corresponding to the current application type to perform noise reduction processing on the second single-channel audio data to obtain third single-channel audio data;
  • the third audio signal output unit is configured to use the third single-channel audio data to output audio signals.
  • the de-reverberation processing module includes:
  • a de-reverberation parameter acquisition sub-module, configured to acquire de-reverberation parameters
  • the second multi-channel audio data obtaining sub-module is configured to use the de-reverberation parameter to perform de-reverberation processing on the first multi-channel audio data to obtain second multi-channel audio data;
  • the device also includes:
  • An iterative update module is configured to use the first single-channel audio data and/or the second single-channel audio data and/or the third single-channel audio data to iteratively update the dereverberation parameters.
  • the device further includes:
  • a correlation degree determination module configured to determine the correlation degree of audio data in the first multi-channel audio data
  • the alignment processing module is configured to perform alignment processing on the audio data in the first multi-channel audio data according to the correlation degree.
  • An electronic device comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for performing the following operations:
  • acquiring first multi-channel audio data, wherein the first multi-channel audio data is composed of audio data collected by one or more microphone arrays;
  • the audio signal output is performed by using the first single-channel audio data.
  • the step of performing beamforming processing on the second multi-channel audio data according to the time-frequency mask to obtain the first single-channel audio data includes:
  • the beam weight is used to perform beamforming processing on the second multi-channel audio data to obtain the first single-channel audio data.
  • the time-frequency mask includes a target speech mask and an interference noise mask
  • the step of determining a channel transfer function and an interference noise covariance matrix according to the time-frequency mask includes:
  • the interference noise covariance matrix is calculated.
  • the step of generating a time-frequency mask for the second multi-channel audio data includes:
  • a time-frequency mask for the second multi-channel audio data is determined.
  • the step of determining a time-frequency mask for the second multi-channel audio data according to the first time-frequency mask includes:
  • the step of using the first single-channel audio data to output an audio signal includes:
  • the second single-channel audio data is used for audio signal output.
  • the step of using the second single-channel audio data to output an audio signal includes:
  • the third single-channel audio data is used for audio signal output.
  • the step of performing de-reverberation processing on the first multi-channel audio data to obtain second multi-channel audio data includes:
  • the de-reverberation parameter is used to perform de-reverberation processing on the first multi-channel audio data to obtain the second multi-channel audio data.
  • the electronic device also includes instructions for performing the following operations:
  • the first single-channel audio data and/or the second single-channel audio data and/or the third single-channel audio data are used to iteratively update the dereverberation parameters.
  • the electronic device further includes instructions for performing the following operations:
  • the audio data in the first multi-channel audio data is aligned.
  • A readable storage medium, wherein, when instructions in the storage medium are executed by a processor of an electronic device, the electronic device is enabled to execute the audio data processing method described above.
  • In the embodiments of this application, first multi-channel audio data composed of audio data collected by one or more microphone arrays is acquired; the first multi-channel audio data is de-reverberated to obtain second multi-channel audio data; a time-frequency mask is generated for the second multi-channel audio data; and beamforming is performed on the second multi-channel audio data according to the time-frequency mask to obtain first single-channel audio data, which is used for audio signal output. This realizes audio processing over multiple asynchronously collecting microphone arrays, avoids the high cost of relying solely on a unified, synchronously collecting array, expands the pickup range, and improves robustness; and, because a time-frequency mask is adopted, the processing does not need to rely on the position information of the microphone arrays, which improves the noise reduction and anti-interference capabilities.
  • FIG. 1 is a flow chart of the steps of a method for processing audio data according to an embodiment of the present application
  • FIG. 2 is a flow chart of the steps of another audio data processing method provided by an embodiment of the present application.
  • FIG. 3 is a flow chart of the steps of another audio data processing method provided by an embodiment of the present application.
  • FIG. 4 is a flowchart of the steps of another audio data processing method provided by an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of an audio data processing apparatus provided by an embodiment of the present application.
  • FIG. 6 is a structural block diagram of an electronic device for audio data processing provided by an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of another electronic device for audio data processing provided by an embodiment of the present application.
  • Step 101 Acquire first multi-channel audio data; wherein, the first multi-channel audio data is composed of audio data collected by one or more microphone arrays;
  • One or more microphone arrays can form a non-synchronized acquisition array system; specifically, the obtained multi-channel signals may not be completely synchronized in time due to inconsistent synchronization clocks or transmission delays, while collection within a single microphone array can be synchronous. If a single microphone array contains microphones that are not collected synchronously, those microphones can likewise each be treated as a separate microphone array. The sampling rate of the audio data collected by each microphone array is the same.
  • a control module can control the working state of one or more microphone arrays, and then can control one or more microphone arrays to perform synchronous start and data transmission.
  • the control module can control one or more microphone arrays to start and start recording, and the one or more microphone arrays will send the collected data to the transmission module.
  • the transmission module can adopt a preset packetization strategy to synchronously transmit the data collected by each microphone array to the processing module; the data transmission can be wired or wireless, and the processing module can then obtain the first multi-channel audio data composed of audio data collected by the one or more microphone arrays.
  • the missing data can be marked with zeros and transmitted to the processing module.
  • Step 102 Perform de-reverberation processing on the first multi-channel audio data to obtain second multi-channel audio data;
  • the processing module can use linear prediction, Kalman filtering, or other filtering methods to de-reverberate the first multi-channel audio data, suppressing the reverberation in the original signal to obtain the second multi-channel audio data; the de-reverberation processing can ensure that the phase relationships of the data do not change and subsequent processing is not affected.
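  • The patent names linear prediction or Kalman filtering for this step. As a rough single-channel illustration of the linear-prediction idea only (the multi-channel and Kalman variants are not shown, and the function name and parameters below are illustrative), a delayed linear predictor can estimate the late reverberation from past samples and subtract it:

```python
import numpy as np

def dlp_dereverb(x, order=8, delay=3):
    """Delayed linear prediction: predict x[n] from the delayed past
    samples x[n-delay-order+1 .. n-delay] via least squares and subtract
    the prediction, suppressing late reverberation while leaving the
    direct sound (within `delay` samples) untouched."""
    n_samples = len(x)
    rows = n_samples - (order - 1) - delay
    # data matrix of delayed past samples, one row per predicted sample
    A = np.stack([x[i:i + order] for i in range(rows)])
    y = x[order - 1 + delay:]                     # samples to be cleaned
    g, *_ = np.linalg.lstsq(A, y, rcond=None)     # prediction coefficients
    out = x.copy()
    out[order - 1 + delay:] = y - A @ g           # subtract predicted reverb
    return out
```

On a synthetic signal with an artificial echo, the output is measurably closer to the dry signal than the input; a production system would work per frequency band and per channel as the patent describes.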
  • the method may further include the following steps:
  • the audio data collected by each microphone array may have an offset; for example, if there is a clock offset of 20 milliseconds, the degree of correlation of the audio data in the first multi-channel audio data can be determined and alignment processing performed according to it, ensuring that the data offset stays within one frame and does not affect subsequent processing.
  • a reference frequency band and a reference channel can be selected; the cross-correlation coefficient (that is, the degree of correlation) of the first multi-channel audio data in the reference frequency band is then calculated within the preset maximum offset range, with a search granularity finer than the frame length used in subsequent processing; the offset corresponding to the maximum cross-correlation coefficient between channels is determined, and alignment is performed with respect to the reference channel.
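  • A sketch of this offset search (time-domain, full-band, integer-sample granularity for simplicity; the function names are illustrative): the lag maximizing the normalized cross-correlation against the reference channel is found, and the channel is shifted accordingly:

```python
import numpy as np

def estimate_offset(ref, ch, max_offset):
    """Estimate the lag (in samples) of `ch` relative to `ref` by
    maximizing the normalized cross-correlation over +/- max_offset."""
    best_lag, best_corr = 0, -np.inf
    for lag in range(-max_offset, max_offset + 1):
        if lag >= 0:
            a, b = ref[lag:], ch[:len(ch) - lag]
        else:
            a, b = ref[:lag], ch[-lag:]
        n = min(len(a), len(b))
        corr = np.dot(a[:n], b[:n]) / (
            np.linalg.norm(a[:n]) * np.linalg.norm(b[:n]) + 1e-12)
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return best_lag

def align(ref, ch, max_offset):
    """Shift `ch` onto the reference timeline (circular shift: the few
    edge samples wrap around)."""
    return np.roll(ch, estimate_offset(ref, ch, max_offset))
```

A real implementation would restrict the correlation to the chosen reference frequency band and could search at sub-sample granularity.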
  • Step 103 Generate a time-frequency mask for the second multi-channel audio data
  • the time-frequency mask can generate a corresponding masking coefficient according to the relative magnitudes of the different components at each time-frequency point, which can be used for tasks such as the separation of speech and noise.
  • a classifier can be used to separate the target voice signal and other interference and noise signals in the second multi-channel audio data in the time-frequency domain, such as separating human voice and environmental noise, and then the target voice signal can be obtained.
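  • The masking idea can be sketched on a toy time-frequency grid (all magnitudes below are made up for illustration); each bin receives a coefficient determined by the relative size of its components:

```python
import numpy as np

# toy |speech| and |noise| magnitudes on a 2x2 (frequency x time) grid
speech_mag = np.array([[3.0, 0.1],
                       [0.2, 2.0]])
noise_mag = np.array([[0.5, 1.0],
                      [1.0, 0.4]])

# ratio-mask coefficient per time-frequency point: close to 1 where the
# target speech dominates, close to 0 where interference/noise dominates
mask = speech_mag ** 2 / (speech_mag ** 2 + noise_mag ** 2)

# applying the mask to the mixture keeps the speech-dominated bins
mixture_mag = speech_mag + noise_mag
separated_mag = mask * mixture_mag
```

In practice the mask is of course estimated from the mixture alone (by the classifier or preset model described next), not computed from known speech and noise components as in this toy.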
  • step 103 may include the following sub-steps:
  • Sub-step 11 Generate a first time-frequency mask for the class-target voice data in the second multi-channel audio data
  • the second multi-channel audio data can be input into a first preset model, which outputs the first time-frequency mask for the class-target voice data in the second multi-channel audio data. For example, the second multi-channel audio data can include audio data corresponding to human voice and audio data corresponding to environmental noise; the class-target audio data is the audio data corresponding to human voice, and the first time-frequency mask for the human-voice audio data can be obtained.
  • the first preset model can adopt a generative model, such as a complex Gaussian mixture model, or a discriminative model, such as a neural network structure like DNN (deep neural network), TDNN (time-delay neural network), LSTM (long short-term memory), CNN (convolutional neural network), or TCNN.
  • Sub-step 12 Determine a time-frequency mask for the second multi-channel audio data according to the first time-frequency mask.
  • the first time-frequency mask can be directly used as the time-frequency mask for the second multi-channel audio data, or further optimization can be performed according to the first time-frequency mask to achieve a masking effect for specified target audio data within the class-target audio data.
  • sub-step 12 may include the following sub-steps:
  • Sub-step 121 Acquire target-like voice data corresponding to the first time-frequency mask
  • the first time-frequency mask can be used to process the second multi-channel audio data, and the class-target voice data corresponding to the first time-frequency mask can then be obtained from the second multi-channel audio data.
  • Sub-step 122 Combine the class-target voice data to generate a second time-frequency mask for target voice data in the second multi-channel audio data, wherein the class-target voice data includes the target voice data;
  • the class-target voice data can be input into a second preset model, which can generate a second time-frequency mask for the target voice data in the second multi-channel audio data. For example, the second multi-channel audio data may include audio data corresponding to human voice and audio data corresponding to environmental noise, and the human-voice audio data may include audio data corresponding to user A and audio data corresponding to user B; if the target audio data is the audio data corresponding to user A, the second time-frequency mask for user A's audio data can be obtained, realizing the masking effect of a designated person, which can be applied to scenarios such as home human-computer interaction.
  • the second preset model may be a model such as SpeakerBeam or iVector+DeepCluster.
  • Sub-step 123 Combine the first time-frequency mask and the second time-frequency mask to generate a time-frequency mask for the second multi-channel audio data.
  • the first time-frequency mask and the second time-frequency mask can be multiplied elementwise (dot-multiplied) to obtain the time-frequency mask for the second multi-channel audio data.
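  • A minimal sketch of this combination (all values illustrative): the first mask passes human voice versus environmental noise, the second passes the designated speaker within the voice; the elementwise product keeps only the bins that both masks pass:

```python
import numpy as np

# first time-frequency mask: human voice vs. environmental noise
mask1 = np.array([[0.9, 0.1],
                  [0.8, 0.2]])
# second time-frequency mask: designated speaker within the voice
mask2 = np.array([[1.0, 0.5],
                  [0.1, 0.9]])

# dot (elementwise) product gives the final time-frequency mask
tf_mask = mask1 * mask2
```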
  • Step 104 Perform beamforming processing on the second multi-channel audio data according to the time-frequency mask to obtain first single-channel audio data;
  • beamforming is a technology that uses the spatial spectrum characteristics of the signal received by the array to spatially filter the signal and achieve directional reception.
  • the time-frequency mask can be used to perform beamforming processing on the second multi-channel audio data to obtain the first single-channel audio data.
  • Step 105 Use the first single-channel audio data to output an audio signal.
  • the first single-channel audio data can be used for audio signal output, thereby achieving enhancement of the voice signal and reducing the influence of interference noise.
  • In the embodiments of this application, first multi-channel audio data composed of audio data collected by one or more microphone arrays is acquired; the first multi-channel audio data is de-reverberated to obtain second multi-channel audio data; a time-frequency mask is generated for the second multi-channel audio data; and beamforming is performed on the second multi-channel audio data according to the time-frequency mask to obtain first single-channel audio data, which is used for audio signal output. This realizes audio processing over multiple asynchronously collecting microphone arrays, avoids the high cost of relying solely on a unified, synchronously collecting array, expands the pickup range, and improves robustness; and, because a time-frequency mask is adopted, the processing does not need to rely on the position information of the microphone arrays, which improves the noise reduction and anti-interference capabilities.
  • FIG. 2 there is shown a step flowchart of another audio data processing method provided by an embodiment of the present application, which may specifically include the following steps:
  • Step 201 Acquire first multi-channel audio data; wherein, the first multi-channel audio data is composed of audio data collected by one or more microphone arrays;
  • Step 202 Perform de-reverberation processing on the first multi-channel audio data to obtain second multi-channel audio data;
  • Step 203 Generate a time-frequency mask for the second multi-channel audio data
  • Step 204 Determine the channel transfer function and the interference noise covariance matrix according to the time-frequency mask
  • the channel transfer function and the interference noise covariance matrix can be determined according to the time-frequency mask.
  • the time-frequency mask may include a target speech mask and an interference noise mask
  • the sum of the target speech mask and the interference noise mask may be a fixed value; for example, the sum of the target speech mask and the interference noise mask can be 1. Step 204 can include the following sub-steps:
  • Sub-step 21 Use the target voice mask to generate a target voice covariance matrix, and use the target voice covariance matrix to calculate a channel transfer function;
  • the target voice mask can be used to generate the target voice covariance matrix, and then the target voice covariance matrix can be used to calculate the channel transfer function, as follows:
  • the signal model of the microphone array can be expressed as:
  • x_i(t) = f_i(t) * s(t) + n_i(t)
  • where x_i(t) is the signal received by the i-th microphone, s(t) is the target voice signal, f_i(t) is the channel transfer function (impulse response) through which the i-th microphone receives the signal, * denotes convolution, and n_i(t) is the noise and interference signal received by the i-th microphone.
  • in the short-time frequency domain, each frequency point can be expressed as:
  • x_{f,t} = D_f s_{f,t} + n_{f,t}
  • where x_{f,t} and n_{f,t} are, respectively, the multi-channel data vector (i.e. the second multi-channel audio data) and the noise/interference signal received at frequency f at time t, s_{f,t} is the target voice signal at that time-frequency point, and D_f is the corresponding channel transfer function vector.
  • since the noise and interference are uncorrelated with the target speech signal, it can be further derived that:
  • Φ_{s,f} = ( Σ_t m_{f,t} x_{f,t} x_{f,t}^H ) / ( Σ_t m_{f,t} ) ≈ σ²_{s,f} D_f D_f^H
  • where Φ_{s,f} is the target speech covariance matrix estimate at the current frequency, m_{f,t} is the target speech mask at frequency f and time t, and D_f and σ²_{s,f} are the estimates of the channel transfer function vector and the target variance, respectively; that is, eigendecomposition is performed on Φ_{s,f}, and the principal eigenvalue and principal eigenvector give the target variance and the channel transfer function vector.
  • the multi-frame accumulation can be changed to an accumulation with a fading coefficient, which is convenient for real-time processing.
  • Sub-step 22 Use the interference noise mask to calculate the interference noise covariance matrix.
  • the interference noise mask can also be used to calculate the interference noise covariance matrix in the same way, as follows:
  • Φ_{n,f} = ( Σ_t (1 − m_{f,t}) x_{f,t} x_{f,t}^H ) / ( Σ_t (1 − m_{f,t}) )
  • where 1 − m_{f,t} is the interference noise mask (the two masks summing to 1, as above).
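  • A sketch of the mask-weighted covariance accumulation (here in the recursive, fading-coefficient form mentioned above for real-time use) and of the eigendecomposition step; `alpha` and the function names are illustrative:

```python
import numpy as np

def update_cov(phi, x_ft, m_ft, alpha=0.95):
    """One-frame recursive update of a mask-weighted spatial covariance
    at one frequency: phi <- alpha*phi + (1-alpha)*m*x*x^H.  Feeding the
    target speech mask gives Phi_s; feeding the interference noise mask
    (1 - m) gives Phi_n."""
    return alpha * phi + (1 - alpha) * m_ft * np.outer(x_ft, x_ft.conj())

def channel_transfer_function(phi_s):
    """Eigendecomposition of the target speech covariance: the principal
    eigenvector estimates the channel transfer function vector D_f, and
    the principal eigenvalue estimates the target variance."""
    eigvals, eigvecs = np.linalg.eigh(phi_s)   # Hermitian input, ascending order
    return eigvecs[:, -1], eigvals[-1]
```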
  • Step 205 Use the channel transfer function and the interference noise covariance matrix to determine beam weights
  • using the channel transfer function and the interference noise covariance matrix, the beam weight w_f can be calculated; the minimum variance distortionless response (MVDR) beamforming method can be used, as follows:
  • w_f = Φ_{n,f}^{-1} D_f / ( D_f^H Φ_{n,f}^{-1} D_f )
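  • A minimal sketch of the MVDR weight computation and its application at one frequency (function names illustrative):

```python
import numpy as np

def mvdr_weights(phi_n, d):
    """w_f = Phi_n^{-1} d / (d^H Phi_n^{-1} d): minimizes the
    interference-plus-noise output power subject to a distortionless
    (unit-gain) constraint toward the channel transfer function d."""
    num = np.linalg.solve(phi_n, d)     # Phi_n^{-1} d without an explicit inverse
    return num / (d.conj() @ num)

def beamform(w, x):
    """First single-channel output at one frequency: s_hat = w^H x."""
    return w.conj() @ x                 # x: (channels,) or (channels, frames)
```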
  • Step 206 Perform beamforming processing on the second multi-channel audio data by using the beam weight to obtain first single-channel audio data
  • the beam weight may be used to perform beamforming processing on the second multi-channel audio data to obtain the first single-channel audio data.
  • Step 207 Use the first single-channel audio data to output an audio signal.
  • In the embodiment of this application, the channel transfer function and the interference noise covariance matrix are determined according to the time-frequency mask; the channel transfer function and the interference noise covariance matrix are then used to determine the beam weights, and the beam weights are used to perform beamforming processing on the second multi-channel audio data to obtain the first single-channel audio data. Estimating the channel transfer function and the interference noise covariance matrix from the time-frequency mask before beamforming reduces the voice distortion caused by beamforming, does not need to rely on the position information of the microphone arrays, can obtain processing performance similar to that of a synchronous array, and improves the noise reduction and anti-interference capabilities.
  • Referring to FIG. 3, there is shown a step flow chart of another audio data processing method provided by an embodiment of the present application, which may specifically include the following steps:
  • Step 301 Acquire first multi-channel audio data; wherein, the first multi-channel audio data is composed of audio data collected by one or more microphone arrays;
  • Step 302 Perform de-reverberation processing on the first multi-channel audio data to obtain second multi-channel audio data
  • Step 303 Generate a time-frequency mask for the second multi-channel audio data
  • Step 304 Perform beamforming processing on the second multi-channel audio data according to the time-frequency mask to obtain first single-channel audio data;
  • Step 305 Perform adaptive filtering processing on the first single-channel audio data to obtain second single-channel audio data
  • the first single-channel audio data after beamforming may still have some noise and interference
  • the first single-channel audio data can be adaptively filtered to obtain the second single-channel audio data, for example, using a Generalized Sidelobe Canceller (GSC) structure.
  • the interference noise time-frequency mask can be used to produce the output of the blocking branch, and whether the current segment is target speech is judged to gate the adaptive filter coefficient update: the filter is updated in noise segments, and the filter coefficients are fixed in speech segments.
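A much-simplified single-bin sketch of the mask-gated adaptation described above: an NLMS canceller that subtracts the blocking-branch interference estimate from the beamformer output, adapting only when the target speech mask indicates a noise segment. Names, tap count, and step size are illustrative assumptions, not the patent's specification.

```python
import numpy as np

def gsc_adaptive_cancel(main, blocked, speech_mask, mu=0.1, taps=4, thresh=0.5):
    """Per-frequency-bin NLMS interference canceller (one bin shown).

    main:        (T,) beamformer output (fixed branch)
    blocked:     (T,) blocking-branch reference (interference estimate)
    speech_mask: (T,) target-speech mask; updates are frozen when mask > thresh
    """
    w = np.zeros(taps, dtype=complex)
    buf = np.zeros(taps, dtype=complex)
    out = np.empty_like(main)
    for t in range(len(main)):
        buf = np.roll(buf, 1)
        buf[0] = blocked[t]
        y = w.conj() @ buf            # interference estimate
        e = main[t] - y               # enhanced output
        out[t] = e
        if speech_mask[t] <= thresh:  # adapt only in noise segments
            norm = (buf.conj() @ buf).real + 1e-10
            w = w + mu * np.conj(e) * buf / norm
    return out
```

Freezing the update in speech segments prevents the canceller from subtracting target speech that leaks into the blocking branch.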
  • Step 306 Use the second single-channel audio data to output an audio signal.
  • the second single-channel audio data can be used to output the audio signal, thereby achieving enhancement of the voice signal and reducing the influence of interference noise.
  • the second single-channel audio data is obtained by performing adaptive filtering processing on the first single-channel audio data, and then the second single-channel audio data is used to output the audio signal, which realizes adaptive filtering of the audio data and improves the purity of the output voice.
  • Referring to FIG. 4, there is shown a step flow chart of another audio data processing method provided by an embodiment of the present application, which may specifically include the following steps:
  • Step 401 Acquire first multi-channel audio data; wherein, the first multi-channel audio data is composed of audio data collected by one or more microphone arrays;
  • Step 402 Obtain dereverberation parameters
  • a de-reverberation parameter can be obtained, and the de-reverberation parameter can be related to the voice variance of the target voice data, and it can be used as a filter coefficient of a filter for de-reverberation processing.
  • Step 403 Using the de-reverberation parameters, perform de-reverberation processing on the first multi-channel audio data to obtain second multi-channel audio data;
  • the de-reverberation parameter can be used to perform de-reverberation processing on the first multi-channel audio data to obtain the second multi-channel audio data.
  • Step 404 Generate a time-frequency mask for the second multi-channel audio data
  • Step 405 Perform beamforming processing on the second multi-channel audio data according to the time-frequency mask to obtain first single-channel audio data;
  • Step 406 Perform adaptive filtering processing on the first single-channel audio data to obtain second single-channel audio data
  • Step 407 Determine the current application type
  • the current application type can be determined.
  • Step 408 Use the single-channel noise reduction strategy corresponding to the current application type to perform noise reduction processing on the second single-channel audio data to obtain third single-channel audio data;
  • the single-channel noise reduction strategy corresponding to the current application type can be used to perform noise reduction processing on the second single-channel audio data to obtain the third single-channel audio data, for example, noise reduction schemes based on signal statistics such as log-MMSE (log Minimum Mean Square Error), IMCRA (Improved Minima Controlled Recursive Averaging) and OMLSA (Optimally Modified Log-Spectral Amplitude Estimator), or a noise reduction network composed of structures such as DNN, LSTM, TDNN, CNN and TCNN.
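As a drastically simplified stand-in for the statistics-based schemes named above (this is not log-MMSE, IMCRA, or OMLSA themselves, only the shared skeleton): a decision-directed a priori SNR estimate driving a floored Wiener gain. All names and parameter values are illustrative assumptions.

```python
import numpy as np

def wiener_noise_reduction(S, noise_psd, alpha=0.98, gain_floor=0.1):
    """S: (T, F) complex STFT of the noisy single-channel audio.
    noise_psd: (F,) noise power estimate. Returns the enhanced STFT."""
    T, F = S.shape
    out = np.empty_like(S)
    prev_clean = np.zeros(F)
    for t in range(T):
        gamma = np.abs(S[t]) ** 2 / (noise_psd + 1e-12)       # a posteriori SNR
        xi = alpha * prev_clean / (noise_psd + 1e-12) \
             + (1 - alpha) * np.maximum(gamma - 1, 0)          # decision-directed a priori SNR
        gain = np.maximum(xi / (1 + xi), gain_floor)           # Wiener gain with floor
        out[t] = gain * S[t]
        prev_clean = np.abs(out[t]) ** 2
    return out
```

Application-specific strategies would differ mainly in how aggressive the gain is: a communication application might tolerate a lower gain floor than a recognition front end.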
  • Step 409 Use the third single-channel audio data to output an audio signal.
  • the third single-channel audio data can be used to output the audio signal, thereby achieving enhancement of the voice signal and reducing the influence of interference noise.
  • the method may further include the following steps:
  • the first single-channel audio data and/or the second single-channel audio data and/or the third single-channel audio data are used to iteratively update the dereverberation parameters.
  • the first single-channel audio data, the second single-channel audio data, or the third single-channel audio data can be used to iteratively update the de-reverberation parameters, thereby obtaining more accurate de-reverberation parameters and improving the de-reverberation effect.
  • the single-channel noise reduction strategy corresponding to the current application type is used to perform noise reduction processing on the second single-channel audio data to obtain the third single-channel audio data, which is then used for audio signal output. This realizes the use of different noise reduction strategies for different application requirements, so that the output voice better matches the application requirements.
  • the de-reverberation parameters are updated iteratively, realizing positive feedback within the system and iteratively improving the system performance and the de-reverberation effect.
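The text ties the de-reverberation parameters to the target voice variance and updates them from the enhanced output. A WPE-style (weighted prediction error) single-channel, single-bin sketch of such an iteration is shown below; this is one plausible realization, not necessarily the patent's method, and tap counts, delay, and iteration count are illustrative.

```python
import numpy as np

def wpe_iteration(X, taps=8, delay=2, iters=3):
    """Single-channel, single-frequency-bin WPE-style de-reverberation.
    X: (T,) complex STFT sequence. Returns the dereverberated sequence."""
    T = len(X)
    # delayed-tap matrix: Y[t, k] = X[t - delay - k]
    Y = np.zeros((T, taps), dtype=complex)
    for k in range(taps):
        d = delay + k
        Y[d:, k] = X[:T - d]
    D = X.copy()
    for _ in range(iters):
        var = np.maximum(np.abs(D) ** 2, 1e-8)  # target-speech variance estimate
        Yw = Y.conj().T / var                   # variance-weighted taps
        R = Yw @ Y                              # weighted correlation matrix
        r = Yw @ X
        g = np.linalg.solve(R + 1e-6 * np.eye(taps), r)  # de-reverberation filter
        D = X - Y @ g                           # subtract predicted late reverberation
    return D
```

Each pass re-estimates the target variance from the enhanced output and re-solves for the filter, which is the positive-feedback loop the paragraph above describes.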
  • Referring to FIG. 5, there is shown a schematic structural diagram of an audio data processing apparatus provided by an embodiment of the present application, which may specifically include the following modules:
  • the first multi-channel audio data acquisition module 501 is configured to acquire first multi-channel audio data; wherein, the first multi-channel audio data is composed of audio data collected by one or more microphone arrays;
  • the de-reverberation processing module 502 is configured to perform de-reverberation processing on the first multi-channel audio data to obtain second multi-channel audio data;
  • the time-frequency mask generation module 503 is configured to generate a time-frequency mask for the second multi-channel audio data;
  • the beamforming processing module 504 is configured to perform beamforming processing on the second multi-channel audio data according to the time-frequency mask to obtain first single-channel audio data;
  • the first audio signal output module 505 is configured to use the first single-channel audio data to output audio signals.
  • the beamforming processing module 504 includes:
  • the function and matrix determination sub-module is used to determine the channel transfer function and the interference noise covariance matrix according to the time-frequency mask
  • the beam weight determination sub-module is configured to use the channel transfer function and the interference and noise covariance matrix to determine the beam weight
  • the first single-channel audio data obtaining submodule is configured to use the beam weight to perform beamforming processing on the second multi-channel audio data to obtain the first single-channel audio data.
  • the time-frequency mask includes a target speech mask and an interference noise mask
  • the function and matrix determination submodule includes:
  • a target speech covariance matrix generating unit configured to use the target speech mask to generate a target speech covariance matrix
  • a channel transfer function obtaining unit configured to use the target voice covariance matrix to calculate the channel transfer function
  • the interference noise covariance matrix obtaining unit is configured to use the interference noise mask to calculate the interference noise covariance matrix.
  • the time-frequency mask generation module 503 includes:
  • the first time-frequency mask generation sub-module is configured to generate a first time-frequency mask for target-like voice data in the second multi-channel audio data
  • the time-frequency mask determination sub-module is configured to determine a time-frequency mask for the second multi-channel audio data according to the first time-frequency mask.
  • the time-frequency mask determination sub-module includes:
  • the class target voice data obtaining unit is configured to obtain class target voice data corresponding to the first time-frequency mask
  • the second time-frequency mask generating unit is configured to combine the class-target voice data to generate a second time-frequency mask for the target voice data in the second multi-channel audio data; wherein, the class-target voice data includes the target voice data;
  • the combined time-frequency mask unit is used to combine the first time-frequency mask and the second time-frequency mask to generate a time-frequency mask for the second multi-channel audio data.
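The exact combination rule for the two masks is not spelled out here. One plausible (assumed) rule is an elementwise product for the target-speech mask, with its complement serving as the interference noise mask:

```python
import numpy as np

def combine_masks(mask_class_target, mask_target):
    """Combine the first (class-target) and second (target) time-frequency
    masks into a final target-speech / interference-noise mask pair.
    Elementwise product and complement are assumed rules, for illustration."""
    speech_mask = mask_class_target * mask_target
    noise_mask = 1.0 - speech_mask
    return speech_mask, noise_mask
```

The product keeps only time-frequency points that both branches attribute to the target, which makes the downstream covariance estimates conservative.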
  • the first audio signal output module 505 includes:
  • An adaptive filter processing sub-module configured to perform adaptive filter processing on the first single-channel audio data to obtain second single-channel audio data
  • the second audio signal output sub-module is configured to use the second single-channel audio data to output audio signals.
  • the second audio signal output submodule includes:
  • the current application type determining unit is used to determine the current application type
  • a third single-channel audio data obtaining unit configured to adopt a single-channel noise reduction strategy corresponding to the current application type to perform noise reduction processing on the second single-channel audio data to obtain third single-channel audio data;
  • the third audio signal output unit is configured to use the third single-channel audio data to output audio signals.
  • the de-reverberation processing module 502 includes:
  • De-reverberation parameter acquisition sub-module for acquiring de-reverberation parameters
  • the second multi-channel audio data obtaining sub-module is configured to use the de-reverberation parameter to perform de-reverberation processing on the first multi-channel audio data to obtain second multi-channel audio data;
  • the device also includes:
  • An iterative update module is configured to use the first single-channel audio data and/or the second single-channel audio data and/or the third single-channel audio data to iteratively update the dereverberation parameters.
  • the device further includes:
  • a correlation degree determination module configured to determine the correlation degree of audio data in the first multi-channel audio data
  • the alignment processing module is configured to perform alignment processing on the audio data in the first multi-channel audio data according to the correlation degree.
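The correlation-based alignment of the two modules above can be sketched as follows: find the lag that maximizes the cross-correlation between a reference channel and another channel, then shift. This is a hedged illustration for asynchronous arrays; the circular shift and the `max_lag` window are simplifying assumptions.

```python
import numpy as np

def align_channels(ref, sig, max_lag=1600):
    """Align sig to ref by the lag maximizing their cross-correlation.

    ref, sig: equal-length 1-D signals; max_lag: search window in samples.
    Returns the (circularly) shifted signal and the detected lag.
    """
    lags = np.arange(-max_lag, max_lag + 1)
    corr = [np.dot(ref[max(0, -l):len(ref) - max(0, l)],
                   sig[max(0, l):len(sig) - max(0, -l)]) for l in lags]
    best = lags[int(np.argmax(corr))]
    return np.roll(sig, -best), best
```

In practice the lag search would run per block so that clock drift between independent arrays can be tracked over time.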
  • the first multi-channel audio data is composed of audio data collected by one or more microphone arrays; the first multi-channel audio data is de-reverberated to obtain the second multi-channel audio data; a time-frequency mask is generated for the second multi-channel audio data; and the time-frequency mask is used to perform beamforming processing on the second multi-channel audio data to obtain the first single-channel audio data, which is used for audio signal output. This realizes audio processing with multiple asynchronously-collecting microphone arrays, avoids the high cost of relying solely on a synchronously-collecting unified array, expands the pickup range, and improves robustness. By adopting the time-frequency mask, the processing does not need to rely on the position information of the microphone array, which improves the noise reduction and anti-interference capabilities.
  • the device embodiments described above are merely illustrative, where the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the objectives of the solutions of the embodiments. Those of ordinary skill in the art can understand and implement this without creative work.
  • as for the device embodiment, since it is basically similar to the method embodiment, the description is relatively simple; for related parts, please refer to the description of the method embodiment.
  • Fig. 6 is a block diagram showing an electronic device 600 for audio data processing according to an exemplary embodiment.
  • the electronic device 600 may be a mobile phone, a computer, a digital broadcasting terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, etc.
  • the electronic device 600 may include one or more of the following components: a processing component 602, a memory 604, a power supply component 606, a multimedia component 608, an audio component 610, an input/output (I/O) interface 612, a sensor component 614, and a communication component 616.
  • the processing component 602 generally controls the overall operations of the electronic device 600, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations.
  • the processing element 602 may include one or more processors 620 to execute instructions to complete all or part of the steps of the foregoing method.
  • the processing component 602 may include one or more modules to facilitate the interaction between the processing component 602 and other components.
  • the processing component 602 may include a multimedia module to facilitate the interaction between the multimedia component 608 and the processing component 602.
  • the memory 604 is configured to store various types of data to support operations in the electronic device 600. Examples of these data include instructions for any application or method operating on the electronic device 600, contact data, phone book data, messages, pictures, videos, etc.
  • the memory 604 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk.
  • the power supply component 606 provides power for various components of the electronic device 600.
  • the power supply component 606 may include a power management system, one or more power supplies, and other components associated with the generation, management, and distribution of power for the electronic device 600.
  • the multimedia component 608 includes a screen that provides an output interface between the electronic device 600 and the user.
  • the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user.
  • the touch panel includes one or more touch sensors to sense touch, sliding, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure related to the touch or slide operation.
  • the multimedia component 608 includes a front camera and/or a rear camera. When the device 600 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera can be a fixed optical lens system or have focal length and optical zoom capabilities.
  • the audio component 610 is configured to output and/or input audio signals.
  • the audio component 610 includes a microphone (MIC), and when the electronic device 600 is in an operation mode, such as a call mode, a recording mode, and a voice recognition mode, the microphone is configured to receive an external audio signal.
  • the received audio signal can be further stored in the memory 604 or sent via the communication component 616.
  • the audio component 610 further includes a speaker for outputting audio signals.
  • the I/O interface 612 provides an interface between the processing component 602 and a peripheral interface module.
  • the above-mentioned peripheral interface module may be a keyboard, a click wheel, a button, and the like. These buttons may include, but are not limited to: home button, volume button, start button, and lock button.
  • the sensor component 614 includes one or more sensors for providing the electronic device 600 with various aspects of state evaluation.
  • the sensor component 614 can detect the on/off status of the device 600 and the relative positioning of components, for example, the display and keypad of the electronic device 600.
  • the sensor component 614 can also detect a position change of the electronic device 600 or a component of the electronic device 600, the presence or absence of contact between the user and the electronic device 600, the orientation or acceleration/deceleration of the electronic device 600, and temperature changes of the electronic device 600.
  • the sensor component 614 may include a proximity sensor configured to detect the presence of nearby objects when there is no physical contact.
  • the sensor component 614 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications.
  • the sensor component 614 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
  • the communication component 616 is configured to facilitate wired or wireless communication between the electronic device 600 and other devices.
  • the electronic device 600 can access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof.
  • the communication component 616 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel.
  • the communication component 616 further includes a near field communication (NFC) module to facilitate short-range communication.
  • the NFC module can be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology and other technologies.
  • the electronic device 600 may be implemented by one or more application specific integrated circuits (ASIC), digital signal processors (DSP), digital signal processing devices (DSPD), programmable logic devices (PLD), field programmable gate arrays (FPGA), controllers, microcontrollers, microprocessors, or other electronic components to implement the above methods.
  • non-transitory computer-readable storage medium including instructions, such as the memory 604 including instructions, and the foregoing instructions may be executed by the processor 620 of the electronic device 600 to complete the foregoing method.
  • the non-transitory computer-readable storage medium may be ROM, random access memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.
  • a non-transitory computer-readable storage medium is provided; when instructions in the storage medium are executed by a processor of a mobile terminal, the mobile terminal can execute an audio data processing method.
  • the method includes:
  • acquiring first multi-channel audio data; wherein the first multi-channel audio data is composed of audio data collected by one or more microphone arrays;
  • the audio signal output is performed by using the first single-channel audio data.
  • the step of performing beamforming processing on the second multi-channel audio data according to the time-frequency mask to obtain the first single-channel audio data includes:
  • the beam weight is used to perform beamforming processing on the second multi-channel audio data to obtain the first single-channel audio data.
  • the time-frequency mask includes a target speech mask and an interference noise mask
  • the step of determining a channel transfer function and an interference noise covariance matrix according to the time-frequency mask includes:
  • the interference noise covariance matrix is calculated.
  • the step of generating a time-frequency mask for the second multi-channel audio data includes:
  • a time-frequency mask for the second multi-channel audio data is determined.
  • the step of determining a time-frequency mask for the second multi-channel audio data according to the first time-frequency mask includes:
  • the step of using the first single-channel audio data to output an audio signal includes:
  • the second single-channel audio data is used for audio signal output.
  • the step of using the second single-channel audio data to output an audio signal includes:
  • the third single-channel audio data is used for audio signal output.
  • the step of performing de-reverberation processing on the first multi-channel audio data to obtain second multi-channel audio data includes:
  • de-reverberation parameter to perform de-reverberation processing on the first multi-channel audio data to obtain second multi-channel audio data
  • the method also includes:
  • the first single-channel audio data and/or the second single-channel audio data and/or the third single-channel audio data are used to iteratively update the dereverberation parameters.
  • before the step of performing de-reverberation processing on the first multi-channel audio data to obtain second multi-channel audio data, the method further includes:
  • the audio data in the first multi-channel audio data is aligned.
  • FIG. 7 is a schematic structural diagram of an electronic device 700 for audio data processing according to an embodiment of the present application.
  • the electronic device 700 may be a server, and the server 700 may vary considerably due to different configurations or performance. It may include one or more central processing units (CPU) 722 (for example, one or more processors) and memory 732, and one or more storage media 730 (for example, one or more mass storage devices) storing application programs 742 or data 744. The memory 732 and the storage medium 730 may be short-term storage or persistent storage.
  • the program stored in the storage medium 730 may include one or more modules (not shown in the figure), and each module may include a series of command operations on the server.
  • the central processing unit 722 may be configured to communicate with the storage medium 730, and execute a series of instruction operations in the storage medium 730 on the server 700.
  • the server 700 may also include one or more power supplies 726, one or more wired or wireless network interfaces 750, one or more input and output interfaces 758, one or more keyboards 756, and/or one or more operating systems 741, such as Windows ServerTM, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM, and so on.
  • the embodiments of the present application may be provided as methods, devices, or computer program products. Therefore, the embodiments of the present application may adopt the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware. Moreover, the embodiments of the present application may adopt the form of computer program products implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program codes.
  • These computer program instructions can also be stored in a computer-readable memory that can guide a computer or other programmable data processing terminal equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including the instruction device.
  • the instruction device implements the functions specified in one process or multiple processes in the flowchart and/or one block or multiple blocks in the block diagram.
  • These computer program instructions can also be loaded on a computer or other programmable data processing terminal equipment, so that a series of operation steps are executed on the computer or other programmable terminal equipment to produce computer-implemented processing, so that the instructions executed on the computer or other programmable terminal equipment provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

Abstract

An audio data processing method and apparatus, and an electronic device (600, 700) and a storage medium (730). The method comprises: acquiring first multi-channel audio data, wherein the first multi-channel audio data is composed of audio data collected by one or more microphone arrays (101); performing de-reverberation processing on the first multi-channel audio data to obtain second multi-channel audio data (102); generating a time-frequency mask for the second multi-channel audio data (103); performing beamforming processing on the second multi-channel audio data according to the time-frequency mask to obtain first single-channel audio data (104); and outputting an audio signal by using the first single-channel audio data (105). The method realizes audio processing of a plurality of microphone arrays used for non-synchronous collection, thereby preventing the high cost caused by the fact that only unified arrays used for synchronous collection can be used for audio processing, enlarging the pickup range, and improving the robustness.

Description

一种音频数据处理的方法及装置、电子设备、存储介质Method and device for audio data processing, electronic equipment and storage medium
本申请要求在2019年11月29日提交中国专利局、申请号为201911207689.4、发明名称为“一种音频数据处理的方法及装置、电子设备、存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This application claims the priority of a Chinese patent application filed with the Chinese Patent Office, the application number is 201911207689.4, and the invention title is "a method and device for audio data processing, electronic equipment, and storage medium" on November 29, 2019, all of which The content is incorporated in this application by reference.
技术领域Technical field
本申请涉及音频数据处理领域,特别是涉及一种音频数据处理的方法及装置、电子设备、存储介质。This application relates to the field of audio data processing, and in particular to an audio data processing method and device, electronic equipment, and storage medium.
背景技术Background technique
目前,麦克风阵列技术通常集中于同步采集的统一阵列系统,而同步采集的统一阵列系统对硬件设计、制造及部署均有较高的要求。At present, the microphone array technology usually focuses on a unified array system for synchronous acquisition, and the unified array system for synchronous acquisition has higher requirements for hardware design, manufacturing, and deployment.
而且,由于只能单点部署,若要覆盖更大的范围,则需要部署大孔径且数量较多的麦克风,而随着阵列系统中麦克风数量的增强,成本会快速上升,空间部署难度也会增加,且鲁棒性会显著下降。Moreover, because it can only be deployed at a single point, if you want to cover a larger range, you need to deploy a large aperture and a large number of microphones. As the number of microphones in the array system increases, the cost will rise rapidly and the space deployment will be difficult Increase, and the robustness will decrease significantly.
发明内容Summary of the invention
鉴于上述问题,提出了以便提供克服上述问题或者至少部分地解决上述问题的一种音频数据处理的方法及装置、电子设备、存储介质,包括:In view of the above problems, an audio data processing method and device, electronic equipment, and storage medium are proposed in order to overcome the above problems or at least partially solve the above problems, including:
一种音频数据处理的方法,所述方法包括:A method for audio data processing, the method comprising:
获取第一多通道音频数据;其中,所述第一多通道音频数据由一个或多个麦克风阵列采集的音频数据组成;Acquiring first multi-channel audio data; wherein the first multi-channel audio data is composed of audio data collected by one or more microphone arrays;
对所述第一多通道音频数据进行解混响处理,得到第二多通道音频数据;Performing de-reverberation processing on the first multi-channel audio data to obtain second multi-channel audio data;
生成针对所述第二多通道音频数据的时频掩码;Generating a time-frequency mask for the second multi-channel audio data;
根据所述时频掩码,对所述第二多通道音频数据进行波束形成处理,得到第一单通道音频数据;Performing beamforming processing on the second multi-channel audio data according to the time-frequency mask to obtain first single-channel audio data;
采用所述第一单通道音频数据,进行音频信号输出。The audio signal output is performed by using the first single-channel audio data.
可选地,所述根据所述时频掩码,对所述第二多通道音频数据进行波束形成处理,得到第一单通道音频数据的步骤包括:Optionally, the step of performing beamforming processing on the second multi-channel audio data according to the time-frequency mask to obtain the first single-channel audio data includes:
根据所述时频掩码,确定信道传递函数和干扰噪声协方差矩阵;Determine the channel transfer function and the interference noise covariance matrix according to the time-frequency mask;
采用所述信道传递函数和所述干扰噪声协方差矩阵,确定波束权值;Using the channel transfer function and the interference noise covariance matrix to determine beam weights;
采用所述波束权值,对所述第二多通道音频数据进行波束形成处理,得到第一单通道音频数据。The beam weight is used to perform beamforming processing on the second multi-channel audio data to obtain the first single-channel audio data.
可选地,所述时频掩码包括目标语音掩码和干扰噪声掩码,所述根据所述时频掩码,确定信道传递函数和干扰噪声协方差矩阵的步骤包括:Optionally, the time-frequency mask includes a target speech mask and an interference noise mask, and the step of determining a channel transfer function and an interference noise covariance matrix according to the time-frequency mask includes:
采用所述目标语音掩码,生成目标语音协方差矩阵;Using the target voice mask to generate a target voice covariance matrix;
采用所述目标语音协方差矩阵,计算得到信道传递函数;Using the target speech covariance matrix to calculate the channel transfer function;
采用所述干扰噪声掩码,计算得到干扰噪声协方差矩阵。Using the interference noise mask, the interference noise covariance matrix is calculated.
可选地,所述生成针对所述第二多通道音频数据的时频掩码的步骤包括:Optionally, the step of generating a time-frequency mask for the second multi-channel audio data includes:
生成针对所述第二多通道音频数据中类目标语音数据的第一时频掩码;Generating a first time-frequency mask for the target voice data in the second multi-channel audio data;
根据所述第一时频掩码,确定针对所述第二多通道音频数据的时频掩码。According to the first time-frequency mask, a time-frequency mask for the second multi-channel audio data is determined.
Optionally, the step of determining the time-frequency mask for the second multi-channel audio data according to the first time-frequency mask includes:
acquiring the target-like speech data corresponding to the first time-frequency mask;
generating, in combination with the target-like speech data, a second time-frequency mask for target speech data in the second multi-channel audio data, wherein the target-like speech data contains the target speech data;
generating the time-frequency mask for the second multi-channel audio data by combining the first time-frequency mask and the second time-frequency mask.
Optionally, the step of using the first single-channel audio data for audio signal output includes:
performing adaptive filtering processing on the first single-channel audio data to obtain second single-channel audio data;
using the second single-channel audio data for audio signal output.
Optionally, the step of using the second single-channel audio data for audio signal output includes:
determining a current application type;
performing noise reduction processing on the second single-channel audio data using a single-channel noise reduction strategy corresponding to the current application type to obtain third single-channel audio data;
using the third single-channel audio data for audio signal output.
Optionally, the step of performing de-reverberation processing on the first multi-channel audio data to obtain second multi-channel audio data includes:
acquiring de-reverberation parameters;
performing de-reverberation processing on the first multi-channel audio data using the de-reverberation parameters to obtain the second multi-channel audio data;
The method further includes:
iteratively updating the de-reverberation parameters using the first single-channel audio data, and/or the second single-channel audio data, and/or the third single-channel audio data.
Optionally, before the step of performing de-reverberation processing on the first multi-channel audio data to obtain second multi-channel audio data, the method further includes:
determining a degree of correlation of the audio data in the first multi-channel audio data;
performing alignment processing on the audio data in the first multi-channel audio data according to the degree of correlation.
An audio data processing apparatus, the apparatus including:
a first multi-channel audio data acquisition module, configured to acquire first multi-channel audio data, wherein the first multi-channel audio data is composed of audio data collected by one or more microphone arrays;
a de-reverberation processing module, configured to perform de-reverberation processing on the first multi-channel audio data to obtain second multi-channel audio data;
a time-frequency mask generation module, configured to generate a time-frequency mask for the second multi-channel audio data;
a beamforming processing module, configured to perform beamforming processing on the second multi-channel audio data according to the time-frequency mask to obtain first single-channel audio data;
an audio signal output module, configured to use the first single-channel audio data for audio signal output.
Optionally, the beamforming processing module includes:
a function and matrix determination sub-module, configured to determine a channel transfer function and an interference-noise covariance matrix according to the time-frequency mask;
a beam weight determination sub-module, configured to determine beam weights using the channel transfer function and the interference-noise covariance matrix;
a first single-channel audio data obtaining sub-module, configured to perform beamforming processing on the second multi-channel audio data using the beam weights to obtain the first single-channel audio data.
Optionally, the time-frequency mask includes a target speech mask and an interference-noise mask, and the function and matrix determination sub-module includes:
a target speech covariance matrix generation unit, configured to generate a target speech covariance matrix using the target speech mask;
a channel transfer function obtaining unit, configured to calculate the channel transfer function using the target speech covariance matrix;
an interference-noise covariance matrix obtaining unit, configured to calculate the interference-noise covariance matrix using the interference-noise mask.
Optionally, the time-frequency mask generation module includes:
a first time-frequency mask generation sub-module, configured to generate a first time-frequency mask for target-like speech data in the second multi-channel audio data;
a time-frequency mask determination sub-module, configured to determine the time-frequency mask for the second multi-channel audio data according to the first time-frequency mask.
Optionally, the time-frequency mask determination sub-module includes:
a target-like speech data acquisition unit, configured to acquire the target-like speech data corresponding to the first time-frequency mask;
a second time-frequency mask generation unit, configured to generate, in combination with the target-like speech data, a second time-frequency mask for target speech data in the second multi-channel audio data, wherein the target-like speech data contains the target speech data;
a combined time-frequency mask determination unit, configured to generate the time-frequency mask for the second multi-channel audio data by combining the first time-frequency mask and the second time-frequency mask.
Optionally, the audio signal output module includes:
an adaptive filtering processing sub-module, configured to perform adaptive filtering processing on the first single-channel audio data to obtain second single-channel audio data;
a second audio signal output sub-module, configured to use the second single-channel audio data for audio signal output.
Optionally, the second audio signal output sub-module includes:
a current application type determination unit, configured to determine a current application type;
a third single-channel audio data obtaining unit, configured to perform noise reduction processing on the second single-channel audio data using a single-channel noise reduction strategy corresponding to the current application type to obtain third single-channel audio data;
a third audio signal output unit, configured to use the third single-channel audio data for audio signal output.
Optionally, the de-reverberation processing module includes:
a de-reverberation parameter acquisition sub-module, configured to acquire de-reverberation parameters;
a second multi-channel audio data obtaining sub-module, configured to perform de-reverberation processing on the first multi-channel audio data using the de-reverberation parameters to obtain the second multi-channel audio data;
The apparatus further includes:
an iterative update module, configured to iteratively update the de-reverberation parameters using the first single-channel audio data, and/or the second single-channel audio data, and/or the third single-channel audio data.
Optionally, the apparatus further includes:
a correlation degree determination module, configured to determine a degree of correlation of the audio data in the first multi-channel audio data;
an alignment processing module, configured to perform alignment processing on the audio data in the first multi-channel audio data according to the degree of correlation.
An electronic device, including a memory and one or more programs, wherein the one or more programs are stored in the memory and are configured to be executed by one or more processors, the one or more programs including instructions for performing the following operations:
acquiring first multi-channel audio data, wherein the first multi-channel audio data is composed of audio data collected by one or more microphone arrays;
performing de-reverberation processing on the first multi-channel audio data to obtain second multi-channel audio data;
generating a time-frequency mask for the second multi-channel audio data;
performing beamforming processing on the second multi-channel audio data according to the time-frequency mask to obtain first single-channel audio data;
using the first single-channel audio data for audio signal output.
Optionally, the step of performing beamforming processing on the second multi-channel audio data according to the time-frequency mask to obtain first single-channel audio data includes:
determining a channel transfer function and an interference-noise covariance matrix according to the time-frequency mask;
determining beam weights using the channel transfer function and the interference-noise covariance matrix;
performing beamforming processing on the second multi-channel audio data using the beam weights to obtain the first single-channel audio data.
Optionally, the time-frequency mask includes a target speech mask and an interference-noise mask, and the step of determining a channel transfer function and an interference-noise covariance matrix according to the time-frequency mask includes:
generating a target speech covariance matrix using the target speech mask;
calculating the channel transfer function using the target speech covariance matrix;
calculating the interference-noise covariance matrix using the interference-noise mask.
Optionally, the step of generating a time-frequency mask for the second multi-channel audio data includes:
generating a first time-frequency mask for target-like speech data in the second multi-channel audio data;
determining the time-frequency mask for the second multi-channel audio data according to the first time-frequency mask.
Optionally, the step of determining the time-frequency mask for the second multi-channel audio data according to the first time-frequency mask includes:
acquiring the target-like speech data corresponding to the first time-frequency mask;
generating, in combination with the target-like speech data, a second time-frequency mask for target speech data in the second multi-channel audio data, wherein the target-like speech data contains the target speech data;
generating the time-frequency mask for the second multi-channel audio data by combining the first time-frequency mask and the second time-frequency mask.
Optionally, the step of using the first single-channel audio data for audio signal output includes:
performing adaptive filtering processing on the first single-channel audio data to obtain second single-channel audio data;
using the second single-channel audio data for audio signal output.
Optionally, the step of using the second single-channel audio data for audio signal output includes:
determining a current application type;
performing noise reduction processing on the second single-channel audio data using a single-channel noise reduction strategy corresponding to the current application type to obtain third single-channel audio data;
using the third single-channel audio data for audio signal output.
Optionally, the step of performing de-reverberation processing on the first multi-channel audio data to obtain second multi-channel audio data includes:
acquiring de-reverberation parameters;
performing de-reverberation processing on the first multi-channel audio data using the de-reverberation parameters to obtain the second multi-channel audio data;
The electronic device further includes instructions for performing the following operation:
iteratively updating the de-reverberation parameters using the first single-channel audio data, and/or the second single-channel audio data, and/or the third single-channel audio data.
Optionally, the electronic device further includes instructions for performing the following operations:
determining a degree of correlation of the audio data in the first multi-channel audio data;
performing alignment processing on the audio data in the first multi-channel audio data according to the degree of correlation.
A readable storage medium, wherein when instructions in the storage medium are executed by a processor of an electronic device, the electronic device is enabled to execute the audio data processing method described above.
The embodiments of the present application have the following advantages:
In the embodiments of the present application, first multi-channel audio data composed of audio data collected by one or more microphone arrays is acquired; de-reverberation processing is performed on the first multi-channel audio data to obtain second multi-channel audio data; a time-frequency mask is generated for the second multi-channel audio data; beamforming processing is performed on the second multi-channel audio data according to the time-frequency mask to obtain first single-channel audio data; and the first single-channel audio data is used for audio signal output. This enables audio processing across multiple asynchronously sampling microphone arrays, avoids the high cost of audio processing that could otherwise only use a single synchronously sampled unified array, expands the sound pickup range, and improves robustness. In addition, because a time-frequency mask is used, audio processing does not depend on the position information of the microphone arrays, which improves noise reduction and anti-interference capability.
The above description is only an overview of the technical solutions of the present invention. In order that the technical means of the present invention can be understood more clearly and implemented in accordance with the content of the specification, and in order to make the above and other objectives, features, and advantages of the present invention more apparent and understandable, specific embodiments of the present invention are set forth below.
Description of the Drawings
In order to explain the technical solutions of the present application more clearly, the drawings required in the description of the present application are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a flowchart of the steps of an audio data processing method provided by an embodiment of the present application;
FIG. 2 is a flowchart of the steps of another audio data processing method provided by an embodiment of the present application;
FIG. 3 is a flowchart of the steps of another audio data processing method provided by an embodiment of the present application;
FIG. 4 is a flowchart of the steps of another audio data processing method provided by an embodiment of the present application;
FIG. 5 is a schematic structural diagram of an audio data processing apparatus provided by an embodiment of the present application;
FIG. 6 is a structural block diagram of an electronic device for audio data processing provided by an embodiment of the present application;
FIG. 7 is a schematic structural diagram of another electronic device for audio data processing provided by an embodiment of the present application.
Specific Embodiments
In order to make the above objectives, features, and advantages of the present application more apparent and understandable, the present application is described in further detail below with reference to the accompanying drawings and specific implementations. Obviously, the described embodiments are only some of the embodiments of the present application, rather than all of them. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present application.
Referring to FIG. 1, a flowchart of the steps of an audio data processing method provided by an embodiment of the present application is shown, which may specifically include the following steps:
Step 101: acquire first multi-channel audio data, where the first multi-channel audio data is composed of audio data collected by one or more microphone arrays.
One or more microphone arrays may form an asynchronously sampling array system. Specifically, because of inconsistent synchronization clocks, transmission delays, and the like, the resulting multi-channel signals are not completely synchronized in time, while sampling within a single microphone array can be synchronous. If a single microphone array contains microphones that are not sampled synchronously, each such microphone may also be treated as a separate microphone array. The audio data collected by all microphone arrays share the same sampling rate.
In practical applications, a control module, a transmission module, and a processing module may be provided. The control module can control the working state of the one or more microphone arrays, and can thus control the one or more microphone arrays to start and transmit data synchronously.
When signals are collected, the control module can control the one or more microphone arrays to start up and begin recording. The one or more microphone arrays send the collected data to the transmission module, which can use a preset packetization strategy to transmit the data collected by each microphone array synchronously to the processing module, over either a wired or a wireless connection. The processing module can then obtain the first multi-channel audio data composed of the audio data collected by the one or more microphone arrays.
In one example, when some data packets are not transmitted in time, the system can wait for a preset duration; if a packet has still not been received when the wait times out, the missing data can be zero-filled, marked, and then transmitted to the processing module.
Step 102: perform de-reverberation processing on the first multi-channel audio data to obtain second multi-channel audio data.
Sound undergoes multipath propagation through reflection and refraction, so the audio signal received by a microphone contains multipath signals in addition to the direct signal. These multipath signals that closely follow the direct wave are called reverberation, and they often adversely affect human-computer interaction functions such as voice wake-up and speech recognition.
After obtaining the first multi-channel audio data, the processing module performs de-reverberation processing on the first multi-channel audio data using filtering methods such as linear prediction or Kalman filtering, thereby suppressing the reverberation in the original signal and obtaining the second multi-channel audio data. The de-reverberation processing can ensure that the phase relationship of the data is unchanged and does not affect subsequent processing.
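As a toy illustration of the linear-prediction idea (not the actual multi-channel filter of the embodiment), the sketch below fits a single one-tap delayed linear predictor on one channel and subtracts the predicted late echo. The delay `D`, echo gain `a`, and pseudo-random test signal are all hypothetical values chosen for demonstration.

```python
def dereverb_1tap(x, delay):
    """Subtract the best one-tap linear prediction of x[n] from x[n - delay].

    Late reverberation is modeled as a scaled, delayed copy of the signal;
    the least-squares tap is g = sum(x[n]*x[n-D]) / sum(x[n-D]**2).
    """
    num = sum(x[n] * x[n - delay] for n in range(delay, len(x)))
    den = sum(x[n - delay] ** 2 for n in range(delay, len(x)))
    g = num / den if den else 0.0
    out = list(x[:delay])  # first samples pass through unchanged
    out += [x[n] - g * x[n - delay] for n in range(delay, len(x))]
    return out, g

def lcg(seed, count):
    """Deterministic pseudo-random 'dry' source, roughly uniform in [-1, 1)."""
    vals, s = [], seed
    for _ in range(count):
        s = (1103515245 * s + 12345) % (1 << 31)
        vals.append(s / (1 << 30) - 1.0)
    return vals

# Hypothetical reverberant signal: dry source plus a delayed, attenuated echo.
dry = lcg(7, 2000)
D, a = 40, 0.5
wet = [dry[n] + (a * dry[n - D] if n >= D else 0.0) for n in range(len(dry))]
dereverbed, g = dereverb_1tap(wet, D)
```

Because the least-squares residual never has more energy than the input, the de-reverberated signal's energy does not exceed that of the reverberant one; a practical system would use a multi-tap, multi-channel predictor and estimate the delay rather than assume it.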
In an embodiment of the present application, before step 102, the method may further include the following steps:
determining a degree of correlation of the audio data in the first multi-channel audio data; and performing alignment processing on the audio data in the first multi-channel audio data according to the degree of correlation.
Because the audio data collected by the different microphone arrays may be offset from one another, for example by a 20-millisecond clock offset, the degree of correlation of the audio data in the first multi-channel audio data can be determined, and alignment processing can then be performed according to the degree of correlation, so as to keep the data offset within one frame and leave subsequent processing unaffected.
Specifically, a reference frequency band and a reference channel can be selected, and the cross-correlation coefficients (i.e., the degree of correlation) of the first multi-channel audio data in the reference frequency band can be computed within a preset maximum offset range, with a search precision finer than the subsequent processing frame length. The offset corresponding to the maximum of the inter-channel cross-correlation coefficient is determined, and the channels are then aligned with respect to the reference channel.
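The alignment described above can be sketched as follows. This is an illustrative sketch only: it searches raw time-domain samples rather than a reference frequency band, and the `max_offset` search range and test signal are hypothetical.

```python
def best_lag(ref, ch, max_offset):
    """Return the lag in [-max_offset, max_offset] that maximizes the
    normalized cross-correlation between ref[n] and ch[n + lag]."""
    def corr_at(lag):
        pairs = [(ref[n], ch[n + lag])
                 for n in range(len(ref))
                 if 0 <= n + lag < len(ch)]
        num = sum(r * c for r, c in pairs)
        den = (sum(r * r for r, _ in pairs) *
               sum(c * c for _, c in pairs)) ** 0.5
        return num / den if den else 0.0
    return max(range(-max_offset, max_offset + 1), key=corr_at)

def align(ref, ch, max_offset):
    """Shift ch so it lines up with the reference channel, zero-padding."""
    lag = best_lag(ref, ch, max_offset)
    if lag > 0:  # ch is delayed relative to ref: drop its first samples
        return ch[lag:] + [0.0] * lag
    return [0.0] * (-lag) + ch[:len(ch) + lag]  # ch is ahead: delay it
```

A production implementation would search at sub-frame precision and over a frequency band, as the embodiment describes, but the maximize-the-cross-correlation principle is the same.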
Step 103: generate a time-frequency mask for the second multi-channel audio data.
A time-frequency mask generates a corresponding masking coefficient for each time-frequency bin according to the relative magnitudes of the different components in that bin, and can be used for tasks such as separating speech from noise.
After the second multi-channel audio data is obtained, a classifier can be used to separate, in the time-frequency domain, the target speech signal from the other interference and noise signals in the second multi-channel audio data, for example separating human voices from environmental noise, so that a time-frequency mask for the second multi-channel audio data can be obtained.
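As a toy illustration of per-bin masking coefficients (in the embodiment the mask would come from a trained classifier or mixture model, not from oracle knowledge), the sketch below computes an "ideal ratio mask" from known speech and noise powers at each time-frequency bin; the spectrogram values are hypothetical.

```python
def ideal_ratio_mask(speech_pow, noise_pow):
    """Per-bin masking coefficient in [0, 1]: speech / (speech + noise)."""
    mask = []
    for s_row, n_row in zip(speech_pow, noise_pow):
        mask.append([s / (s + n) if (s + n) > 0 else 0.0
                     for s, n in zip(s_row, n_row)])
    return mask

# Toy 2-frame x 3-bin power spectrograms (hypothetical values).
speech = [[4.0, 1.0, 0.0],
          [9.0, 0.0, 1.0]]
noise = [[1.0, 1.0, 2.0],
         [1.0, 3.0, 3.0]]
mask = ideal_ratio_mask(speech, noise)  # e.g. mask[0][0] == 0.8
```

Bins dominated by speech get coefficients near 1 and noise-dominated bins near 0, which is exactly the per-bin "size relationship" the mask encodes.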
In an embodiment of the present application, step 103 may include the following sub-steps:
Sub-step 11: generate a first time-frequency mask for target-like speech data in the second multi-channel audio data.
In a specific implementation, the second multi-channel audio data can be input into a first preset model, and the first preset model can output the first time-frequency mask for the target-like speech data in the second multi-channel audio data. For example, the second multi-channel audio data may include audio data corresponding to human voices and audio data corresponding to environmental noise; if the target-like audio data is the audio data corresponding to human voices, a first time-frequency mask for the human-voice audio data can be obtained.
In an example, the first preset model can be a generative model, such as a complex Gaussian mixture model, or a discriminative model composed of neural network structures such as DNN (Deep Neural Networks), TDNN (Time-Delay Neural Networks), LSTM (Long Short-Term Memory), CNN (Convolutional Neural Networks), or TCNN.
Sub-step 12: determine the time-frequency mask for the second multi-channel audio data according to the first time-frequency mask.
After the first time-frequency mask is obtained, it can be used directly as the time-frequency mask for the second multi-channel audio data, or it can be further refined to achieve a masking effect for specified target audio data within the target-like audio data.
In an embodiment of the present application, sub-step 12 may include the following sub-steps:
Sub-step 121: acquire the target-like speech data corresponding to the first time-frequency mask.
In a specific implementation, the first time-frequency mask can be used to process the second multi-channel audio data, so that the target-like speech data corresponding to the first time-frequency mask can be obtained from the second multi-channel audio data.
Sub-step 122: in combination with the target-like speech data, generate a second time-frequency mask for target speech data in the second multi-channel audio data, where the target-like speech data contains the target speech data.
After the target-like speech data is obtained, it can be input into a second preset model, and the second preset model can generate the second time-frequency mask for the target speech data in the second multi-channel audio data. For example, the second multi-channel audio data may include audio data corresponding to human voices and audio data corresponding to environmental noise, and the audio data corresponding to human voices may include audio data corresponding to user A and audio data corresponding to user B. If the target audio data is the audio data corresponding to user A, a second time-frequency mask for the audio data corresponding to user A can be obtained, thereby achieving a masking effect for a designated speaker, which is applicable to scenarios such as home human-computer interaction.
In an example, the second preset model can be a model such as SpeakerBeam or iVector+DeepCluster.
Sub-step 123: generate the time-frequency mask for the second multi-channel audio data by combining the first time-frequency mask and the second time-frequency mask.
After the first time-frequency mask and the second time-frequency mask are obtained, they can be point-multiplied (multiplied element-wise) to obtain the time-frequency mask for the second multi-channel audio data.
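The point-multiplication of the two masks can be sketched as follows; the mask values below are hypothetical.

```python
def combine_masks(mask_a, mask_b):
    """Element-wise (point) product of two equally shaped time-frequency masks."""
    return [[a * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(mask_a, mask_b)]

first_mask = [[0.9, 0.2], [0.5, 1.0]]   # e.g. "speech vs. noise" mask
second_mask = [[1.0, 0.5], [0.2, 0.8]]  # e.g. "target speaker" mask
combined = combine_masks(first_mask, second_mask)
```

A bin is kept only where both masks are large, so the product mask selects time-frequency bins belonging to the designated speaker's speech.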
Step 104: perform beamforming processing on the second multi-channel audio data according to the time-frequency mask to obtain first single-channel audio data.
Beamforming is a technique that achieves directional reception by spatially filtering signals according to the spatial spectral characteristics of the signals received by an array.
After the time-frequency mask is obtained, it can be used to perform beamforming processing on the second multi-channel audio data to obtain the first single-channel audio data.
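A minimal sketch of the final filter-and-sum step, assuming the per-frequency beam weights have already been determined: the single-channel output at each frame is y(t, f) = sum over channels c of conj(w_c(f)) * X_c(t, f). The weights and STFT values below are hypothetical.

```python
def apply_beam(weights, stft_frames):
    """weights: per-channel complex weights for one frequency bin.
    stft_frames: list of frames, each a list of per-channel complex bins."""
    return [sum(w.conjugate() * x for w, x in zip(weights, frame))
            for frame in stft_frames]

w = [0.5 + 0.0j, 0.5 + 0.0j]         # two channels, equal weighting
frames = [[1.0 + 1.0j, 1.0 + 1.0j],  # both channels see the same signal
          [2.0 + 0.0j, 0.0 + 0.0j]]  # signal present on one channel only
out = apply_beam(w, frames)
```

With coherent in-phase channels the output preserves the signal, while uncorrelated components on individual channels are attenuated, which is the directional-reception effect described above.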
Step 105: use the first single-channel audio data for audio signal output.
After the first single-channel audio data is obtained, it can be used for audio signal output, thereby enhancing the speech signal and reducing the influence of interference and noise.
In the embodiment of the present application, first multi-channel audio data composed of audio data collected by one or more microphone arrays is acquired; de-reverberation processing is performed on the first multi-channel audio data to obtain second multi-channel audio data; a time-frequency mask is generated for the second multi-channel audio data; beamforming processing is performed on the second multi-channel audio data according to the time-frequency mask to obtain first single-channel audio data; and the first single-channel audio data is used for audio signal output. This enables audio processing across multiple asynchronously sampling microphone arrays, avoids the high cost of audio processing that could otherwise only use a single synchronously sampled unified array, expands the sound pickup range, and improves robustness. In addition, because a time-frequency mask is used, audio processing does not depend on the position information of the microphone arrays, which improves noise reduction and anti-interference capability.
参照图2,示出了本申请一实施例提供的另一种音频数据处理的方法的步骤流程图,具体可以包括如下步骤:Referring to FIG. 2, there is shown a step flowchart of another audio data processing method provided by an embodiment of the present application, which may specifically include the following steps:
步骤201,获取第一多通道音频数据;其中,所述第一多通道音频数据由一个或多个麦克风阵列采集的音频数据组成;Step 201: Acquire first multi-channel audio data; wherein, the first multi-channel audio data is composed of audio data collected by one or more microphone arrays;
步骤202,对所述第一多通道音频数据进行解混响处理,得到第二多通道音频数据;Step 202: Perform de-reverberation processing on the first multi-channel audio data to obtain second multi-channel audio data;
步骤203,生成针对所述第二多通道音频数据的时频掩码;Step 203: Generate a time-frequency mask for the second multi-channel audio data;
步骤204,根据所述时频掩码,确定信道传递函数和干扰噪声协方差矩阵;Step 204: Determine the channel transfer function and the interference noise covariance matrix according to the time-frequency mask;
在获得时频掩码后,对于每个频点,可以根据时频掩码,确定信道传递 函数和干扰噪声协方差矩阵。After obtaining the time-frequency mask, for each frequency point, the channel transfer function and the interference noise covariance matrix can be determined according to the time-frequency mask.
In an embodiment of the present application, the time-frequency mask may include a target speech mask and an interference noise mask, and the sum of the target speech mask and the interference noise mask may be a fixed value (for example, 1). In that case, step 204 may include the following sub-steps:
子步骤21,采用所述目标语音掩码,生成目标语音协方差矩阵,并采用所述目标语音协方差矩阵,计算得到信道传递函数;Sub-step 21, using the target voice mask to generate a target voice covariance matrix, and using the target voice covariance matrix to calculate a channel transfer function;
在具体实现中,可以采用目标语音掩码,生成目标语音协方差矩阵,然后可以采用目标语音协方差矩阵,计算得到信道传递函数,具体如下:In specific implementation, the target voice mask can be used to generate the target voice covariance matrix, and then the target voice covariance matrix can be used to calculate the channel transfer function, as follows:
The signal model of the microphone array can be expressed as:

x_i(t) = f_i(t) * s(t) + n_i(t)

where x_i(t) is the signal received by the i-th microphone, s(t) is the target speech signal, f_i(t) is the channel transfer function (impulse response) from the source to the i-th microphone, * denotes convolution, and n_i(t) is the noise and interference received by the i-th microphone.
Applying a time-frequency transform to the above equation, each frequency bin can be expressed as:

x_{f,t} = d_f · s_{f,t} + n_{f,t}

where x_{f,t} and n_{f,t} are, respectively, the multi-channel data vector (i.e., the second multi-channel audio data) and the noise/interference signal received at frequency f and time t, s_{f,t} is the target speech signal at that time, and d_f is the corresponding channel transfer function vector.
Since the reverberation has been basically suppressed, and assuming that the noise and interference are uncorrelated with the target speech signal, it can further be derived that:

Φ_{x,f} = (1/N) Σ_{t=1}^{N} x_{f,t} x_{f,t}^H = Φ_{s,f} + Φ_{n,f},  with  Φ_{s,f} = σ²_{s,f} d_f d_f^H

where Φ_{x,f}, Φ_{s,f} and Φ_{n,f} are, respectively, the data, target and interference-noise covariance matrices at frequency bin f, σ²_{s,f} is the variance of the target speech signal at that bin, and N is the length of the time window used.
Using the obtained time-frequency mask:

Φ̂_{s,f} = ( Σ_t m^s_{f,t} x_{f,t} x_{f,t}^H ) / ( Σ_t m^s_{f,t} ) ≈ σ̂²_{s,f} d̂_f d̂_f^H

where Φ̂_{s,f} is the target speech covariance matrix estimate at the current frequency, m^s_{f,t} is the target speech mask for this frequency bin at time t, and d̂_f and σ̂²_{s,f} are the estimates of the channel transfer function vector and the target variance, respectively. That is, the channel transfer function vector is obtained by performing an eigendecomposition of Φ̂_{s,f} and taking the principal eigenvalue and its eigenvector. For online estimation, the multi-frame accumulation can be replaced with accumulation using a fading (forgetting) coefficient, which is convenient for real-time processing.
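The mask-weighted covariance estimate and the principal-eigenvector step described above can be sketched as follows. This is a minimal NumPy illustration, not the patent's implementation; the function name, the array shapes, and the synthetic data in the usage are assumptions.

```python
import numpy as np

def estimate_steering_vector(X, speech_mask):
    """Estimate the channel transfer function vector for one frequency bin.

    X:           (T, M) complex STFT frames (T time frames, M microphones).
    speech_mask: (T,) target-speech mask values in [0, 1] for this bin.

    Implements Phi_s = sum_t m_t x_t x_t^H / sum_t m_t, then takes the
    principal eigenvector as the steering estimate d_hat and the principal
    eigenvalue as the target-variance estimate.
    """
    w = speech_mask / (speech_mask.sum() + 1e-12)
    Phi_s = np.einsum('t,tm,tn->mn', w, X, X.conj())   # mask-weighted covariance
    eigvals, eigvecs = np.linalg.eigh(Phi_s)           # Hermitian eigendecomposition
    d_hat = eigvecs[:, -1]                             # principal eigenvector
    sigma2_hat = eigvals[-1].real                      # principal eigenvalue
    return d_hat, sigma2_hat
```

For the online variant mentioned above, the sums over t would be replaced by exponentially weighted running sums with a fading coefficient.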
子步骤22,采用所述干扰噪声掩码,计算得到干扰噪声协方差矩阵。In sub-step 22, the interference noise covariance matrix is calculated by using the interference noise mask.
基于上述说明,也可以采用干扰噪声掩码,计算得到干扰噪声协方差矩阵,具体如下:Based on the above description, the interference noise mask can also be used to calculate the interference noise covariance matrix, as follows:
Φ̂_{n,f} = ( Σ_t m^n_{f,t} x_{f,t} x_{f,t}^H ) / ( Σ_t m^n_{f,t} )

where Φ̂_{n,f} is the interference-noise covariance matrix estimate at the current frequency and m^n_{f,t} is the interference noise mask for this frequency bin at time t.
步骤205,采用所述信道传递函数和所述干扰噪声协方差矩阵,确定波束权值;Step 205, using the channel transfer function and the interference and noise covariance matrix to determine beam weights;
After the channel transfer function and the interference-noise covariance matrix are obtained, the beam weights w_f can be computed, for example with the minimum variance distortionless response (MVDR) beamforming method:

w_f = ( Φ̂_{n,f}^{-1} d̂_f ) / ( d̂_f^H Φ̂_{n,f}^{-1} d̂_f )
步骤206,采用所述波束权值,对所述第二多通道音频数据进行波束形成处理,得到第一单通道音频数据;Step 206: Perform beamforming processing on the second multi-channel audio data by using the beam weight to obtain first single-channel audio data;
After the beam weights are obtained, they can be used to perform beamforming processing on the second multi-channel audio data to obtain the first single-channel audio data.
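The MVDR weight computation and its application to the frames of one frequency bin can be sketched as follows (a minimal illustration; the function names and array shapes are assumptions, and the noise covariance would come from the mask-weighted estimate of step 204):

```python
import numpy as np

def mvdr_weights(d, Phi_n):
    """MVDR beamformer: w = Phi_n^{-1} d / (d^H Phi_n^{-1} d)."""
    num = np.linalg.solve(Phi_n, d)   # Phi_n^{-1} d without forming an explicit inverse
    return num / (d.conj() @ num)

def apply_beamformer(w, X):
    """Beamform frames: y_t = w^H x_t; X has shape (T, M), result shape (T,)."""
    return X @ w.conj()
```

The distortionless constraint means w^H d = 1, so the target direction passes unchanged while power from other directions is minimized.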
步骤207,采用所述第一单通道音频数据,进行音频信号输出。Step 207: Use the first single-channel audio data to output an audio signal.
In this embodiment of the present application, the channel transfer function and the interference-noise covariance matrix are determined from the time-frequency mask; they are then used to determine the beam weights, and the beam weights are used to perform beamforming processing on the second multi-channel audio data to obtain the first single-channel audio data. Estimating the channel transfer function and the interference-noise covariance matrix from the time-frequency mask before beamforming reduces the speech distortion introduced by beamforming and, without relying on the position information of the microphone arrays, achieves processing performance similar to that of a synchronized array, improving noise reduction and interference resistance.
参照图3,示出了本申请一实施例提供的另一种音频数据处理的方法的步骤流程图,具体可以包括如下步骤:Referring to FIG. 3, there is shown a step flow chart of another audio data processing method provided by an embodiment of the present application, which may specifically include the following steps:
步骤301,获取第一多通道音频数据;其中,所述第一多通道音频数据由一个或多个麦克风阵列采集的音频数据组成;Step 301: Acquire first multi-channel audio data; wherein, the first multi-channel audio data is composed of audio data collected by one or more microphone arrays;
步骤302,对所述第一多通道音频数据进行解混响处理,得到第二多通道音频数据;Step 302: Perform de-reverberation processing on the first multi-channel audio data to obtain second multi-channel audio data;
步骤303,生成针对所述第二多通道音频数据的时频掩码;Step 303: Generate a time-frequency mask for the second multi-channel audio data;
步骤304,根据所述时频掩码,对所述第二多通道音频数据进行波束形成处理,得到第一单通道音频数据;Step 304: Perform beamforming processing on the second multi-channel audio data according to the time-frequency mask to obtain first single-channel audio data;
步骤305,对所述第一单通道音频数据进行自适应滤波处理,得到第二单通道音频数据;Step 305: Perform adaptive filtering processing on the first single-channel audio data to obtain second single-channel audio data;
Since some noise and interference may remain in the single-channel audio data after beamforming, the first single-channel audio data can be adaptively filtered after it is obtained, yielding the second single-channel audio data. Specifically, a Generalized Sidelobe Canceller (GSC) can be used: the interference noise time-frequency mask serves as the blocking-branch output, and whether the current segment is target speech determines the adaptive filter coefficient update, i.e., the filter is updated in non-speech segments and its coefficients are frozen in speech segments.
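A minimal sketch of the mask-gated update described above, using a single-reference NLMS canceller as a stand-in for a full GSC (the function name, tap count and step size are assumptions; a real GSC would derive the noise reference from a blocking matrix):

```python
import numpy as np

def mask_gated_nlms(primary, reference, speech_active, taps=8, mu=0.5, eps=1e-8):
    """Adaptive interference canceller with mask-gated coefficient updates.

    primary:       beamformer output samples (target speech + residual noise).
    reference:     noise-reference samples (e.g. blocking-branch output).
    speech_active: boolean per sample; True freezes the filter (speech segment),
                   False runs the NLMS update (noise-only segment).
    """
    w = np.zeros(taps)
    out = np.zeros_like(primary)
    buf = np.zeros(taps)
    for n in range(len(primary)):
        buf = np.roll(buf, 1)
        buf[0] = reference[n]
        y = w @ buf                  # noise estimate from the reference
        e = primary[n] - y           # subtract the estimated residual noise
        out[n] = e
        if not speech_active[n]:     # update only in noise-only segments
            w += mu * e * buf / (buf @ buf + eps)
    return out
```

Freezing the coefficients during speech prevents the canceller from adapting to (and cancelling) the target signal itself.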
步骤306,采用所述第二单通道音频数据,进行音频信号输出。Step 306: Use the second single-channel audio data to output an audio signal.
After the second single-channel audio data is obtained, it can be used for audio signal output, thereby enhancing the speech signal and reducing the influence of interference noise.
In this embodiment of the present application, adaptive filtering is applied to the first single-channel audio data to obtain the second single-channel audio data, which is then used for audio signal output. This realizes adaptive filtering of the audio data and improves the purity of the output speech.
参照图4,示出了本申请一实施例提供的另一种音频数据处理的方法的步骤流程图,具体可以包括如下步骤:Referring to FIG. 4, there is shown a step flow chart of another audio data processing method provided by an embodiment of the present application, which may specifically include the following steps:
步骤401,获取第一多通道音频数据;其中,所述第一多通道音频数据由一个或多个麦克风阵列采集的音频数据组成;Step 401: Acquire first multi-channel audio data; wherein, the first multi-channel audio data is composed of audio data collected by one or more microphone arrays;
步骤402,获取解混响参数;Step 402: Obtain dereverberation parameters;
在具体实现中,可以获取解混响参数,该解混响参数可以与目标语音数据的语音方差相关,其可以作为用于解混响处理的滤波器的滤波器系数。In a specific implementation, a de-reverberation parameter can be obtained, and the de-reverberation parameter can be related to the voice variance of the target voice data, and it can be used as a filter coefficient of a filter for de-reverberation processing.
步骤403,采用所述解混响参数,对所述第一多通道音频数据进行解混响处理,得到第二多通道音频数据;Step 403: Using the de-reverberation parameters, perform de-reverberation processing on the first multi-channel audio data to obtain second multi-channel audio data;
在获得解混响参数后,可以采用该解混响参数,对第一多通道音频数据进行解混响处理,得到第二多通道音频数据。After the de-reverberation parameter is obtained, the de-reverberation parameter can be used to perform de-reverberation processing on the first multi-channel audio data to obtain the second multi-channel audio data.
步骤404,生成针对所述第二多通道音频数据的时频掩码;Step 404: Generate a time-frequency mask for the second multi-channel audio data;
步骤405,根据所述时频掩码,对所述第二多通道音频数据进行波束形成处理,得到第一单通道音频数据;Step 405: Perform beamforming processing on the second multi-channel audio data according to the time-frequency mask to obtain first single-channel audio data;
Step 406: Perform adaptive filtering processing on the first single-channel audio data to obtain second single-channel audio data;
步骤407,确定当前应用类型;Step 407: Determine the current application type;
在具体实现中,为了满足不同的应用需求,如音频通信、语音唤醒和语音识别等应用,可以确定当前应用类型。In specific implementation, in order to meet different application requirements, such as audio communication, voice wake-up, and voice recognition applications, the current application type can be determined.
步骤408,采用所述当前应用类型对应的单通道降噪策略,对所述第二单通道音频数据进行降噪处理,得到第三单通道音频数据;Step 408: Use the single-channel noise reduction strategy corresponding to the current application type to perform noise reduction processing on the second single-channel audio data to obtain third single-channel audio data;
After the current application type is determined, the single-channel noise reduction strategy corresponding to it can be applied to the second single-channel audio data to obtain the third single-channel audio data. For example, noise reduction schemes based on signal statistics can be used, such as log-MMSE (Minimum Mean Square Error), IMCRA (Improved Minima Controlled Recursive Averaging) and OMLSA (Optimally Modified Log-Spectral Amplitude Estimator), or a noise reduction network built from structures such as DNN, LSTM, TDNN, CNN and TCNN.
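As one illustration of the signal-statistics family mentioned above, the following sketch computes Wiener gains with decision-directed a-priori SNR estimation, a simplified relative of log-MMSE/OMLSA. The function name, parameter values and shapes are assumptions, not the patent's implementation.

```python
import numpy as np

def wiener_decision_directed(power_spec, noise_psd, alpha=0.98, gain_floor=0.1):
    """Per-frame Wiener gains with decision-directed a-priori SNR estimation.

    power_spec: (T, F) noisy power spectrogram |Y|^2.
    noise_psd:  (F,) noise power estimate (e.g. from noise-only frames).
    Returns (T, F) spectral gains in [gain_floor, 1].
    """
    T, F = power_spec.shape
    gains = np.empty((T, F))
    prev_clean = np.zeros(F)
    for t in range(T):
        post_snr = power_spec[t] / (noise_psd + 1e-12)
        # decision-directed a-priori SNR: mix of previous clean estimate
        # and the instantaneous (posterior - 1) SNR
        prio_snr = alpha * prev_clean / (noise_psd + 1e-12) \
                   + (1 - alpha) * np.maximum(post_snr - 1.0, 0.0)
        g = np.maximum(prio_snr / (1.0 + prio_snr), gain_floor)  # Wiener gain, floored
        gains[t] = g
        prev_clean = (g ** 2) * power_spec[t]
    return gains
```

A gain floor keeps residual noise audible but natural, avoiding the musical-noise artifacts of hard spectral subtraction; the application type could select different floors or estimators (e.g. more aggressive suppression for recognition, gentler for communication).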
步骤409,采用所述第三单通道音频数据,进行音频信号输出。Step 409: Use the third single-channel audio data to output an audio signal.
After the third single-channel audio data is obtained, it can be used for audio signal output, thereby enhancing the speech signal and reducing the influence of interference noise.
在本申请一实施例中,该方法还可以包括如下步骤:In an embodiment of the present application, the method may further include the following steps:
采用所述第一单通道音频数据和/或,所述第二单通道音频数据和/或,所述第三单通道音频数据,迭代更新所述解混响参数。The first single-channel audio data and/or the second single-channel audio data and/or the third single-channel audio data are used to iteratively update the dereverberation parameters.
In a specific implementation, since the obtained first, second and third single-channel audio data are comparatively pure target speech, any of them can be used to iteratively update the de-reverberation parameters, so that more accurate de-reverberation parameters are obtained and the de-reverberation effect is improved.
In this embodiment of the present application, the current application type is determined, and the single-channel noise reduction strategy corresponding to it is applied to the second single-channel audio data to obtain the third single-channel audio data, which is then used for audio signal output. Different noise reduction strategies are thus applied for different application requirements, so that the output speech better matches the application.
Moreover, by iteratively updating the de-reverberation parameters with the first, second or third single-channel audio data, positive feedback on the internal performance of the whole system is achieved, iteratively improving system performance and effectively enhancing the de-reverberation effect.
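The patent does not name a specific de-reverberation algorithm. Assuming a WPE-style delayed linear-prediction filter whose weighting depends on the target speech variance (consistent with step 402, where the de-reverberation parameter is related to the speech variance), one update iteration for a single channel and frequency bin might look like:

```python
import numpy as np

def wpe_filter_update(X, psd_est, delay=2, taps=4, eps=1e-8):
    """One WPE-style update of the de-reverberation filter for a frequency bin.

    X:       (T,) complex STFT of one channel at one frequency.
    psd_est: (T,) current target-speech variance estimate per frame
             (e.g. fed back from the enhanced output of a later stage).
    Returns (g, X_clean) with X_clean[t] = X[t] - sum_k g[k] * X[t - delay - k].
    """
    T = len(X)
    # delayed tap matrix: row t holds X[t-delay], ..., X[t-delay-taps+1]
    Xtil = np.zeros((T, taps), dtype=complex)
    for k in range(taps):
        idx = delay + k
        Xtil[idx:, k] = X[:T - idx]
    w_var = 1.0 / (psd_est + eps)                     # variance-weighted least squares
    R = (Xtil.conj().T * w_var) @ Xtil
    p = (Xtil.conj().T * w_var) @ X
    g = np.linalg.solve(R + eps * np.eye(taps), p)    # regularized normal equations
    return g, X - Xtil @ g
```

Feeding the enhanced output's power back in as `psd_est` on the next pass is what makes the update iterative, matching the positive-feedback loop described above.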
It should be noted that, for the sake of brevity, the method embodiments are all described as a series of action combinations; however, those skilled in the art should appreciate that the embodiments of this application are not limited by the described sequence of actions, because according to the embodiments of the present application, some steps may be performed in another order or simultaneously. Secondly, those skilled in the art should also appreciate that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by the embodiments of the present application.
参照图5,示出了本申请一实施例提供的一种音频数据处理的装置的结构示意图,具体可以包括如下模块:Referring to FIG. 5, there is shown a schematic structural diagram of an audio data processing apparatus provided by an embodiment of the present application, which may specifically include the following modules:
第一多通道音频数据获取模块501,用于获取第一多通道音频数据;其中,所述第一多通道音频数据由一个或多个麦克风阵列采集的音频数据组成;The first multi-channel audio data acquisition module 501 is configured to acquire first multi-channel audio data; wherein, the first multi-channel audio data is composed of audio data collected by one or more microphone arrays;
解混响处理模块502,用于对所述第一多通道音频数据进行解混响处理,得到第二多通道音频数据;The de-reverberation processing module 502 is configured to perform de-reverberation processing on the first multi-channel audio data to obtain second multi-channel audio data;
时频掩码生成模块503,用于生成针对所述第二多通道音频数据的时频掩码;A time-frequency mask generating module 503, configured to generate a time-frequency mask for the second multi-channel audio data;
波束形成处理模块504,用于根据所述时频掩码,对所述第二多通道音频数据进行波束形成处理,得到第一单通道音频数据;The beamforming processing module 504 is configured to perform beamforming processing on the second multi-channel audio data according to the time-frequency mask to obtain first single-channel audio data;
第一音频信号输出模块505,用于采用所述第一单通道音频数据,进行音频信号输出。The first audio signal output module 505 is configured to use the first single-channel audio data to output audio signals.
在本申请一实施例中,所述波束形成处理模块504包括:In an embodiment of the present application, the beamforming processing module 504 includes:
函数和矩阵确定子模块,用于根据所述时频掩码,确定信道传递函数和干扰噪声协方差矩阵;The function and matrix determination sub-module is used to determine the channel transfer function and the interference noise covariance matrix according to the time-frequency mask;
波束权值确定子模块,用于采用所述信道传递函数和所述干扰噪声协方差矩阵,确定波束权值;The beam weight determination sub-module is configured to use the channel transfer function and the interference and noise covariance matrix to determine the beam weight;
第一单通道音频数据得到子模块,用于采用所述波束权值,对所述第二多通道音频数据进行波束形成处理,得到第一单通道音频数据。The first single-channel audio data obtaining submodule is configured to use the beam weight to perform beamforming processing on the second multi-channel audio data to obtain the first single-channel audio data.
在本申请一实施例中,所述时频掩码包括目标语音掩码和干扰噪声掩码,所述函数和矩阵确定子模块包括:In an embodiment of the present application, the time-frequency mask includes a target speech mask and an interference noise mask, and the function and matrix determination submodule includes:
目标语音协方差矩阵生成单元,用于采用所述目标语音掩码,生成目标语音协方差矩阵;A target speech covariance matrix generating unit, configured to use the target speech mask to generate a target speech covariance matrix;
信道传递函数得到单元,用于采用所述目标语音协方差矩阵,计算得到信道传递函数;A channel transfer function obtaining unit, configured to use the target voice covariance matrix to calculate the channel transfer function;
干扰噪声协方差矩阵得到单元,用于采用所述干扰噪声掩码,计算得到干扰噪声协方差矩阵。The interference noise covariance matrix obtaining unit is configured to use the interference noise mask to calculate the interference noise covariance matrix.
在本申请一实施例中,所述时频掩码生成模块503包括:In an embodiment of the present application, the time-frequency mask generation module 503 includes:
第一时频掩码生成子模块,用于生成针对所述第二多通道音频数据中类目标语音数据的第一时频掩码;The first time-frequency mask generation sub-module is configured to generate a first time-frequency mask for target-like voice data in the second multi-channel audio data;
The time-frequency mask determining sub-module is configured to determine, according to the first time-frequency mask, the time-frequency mask for the second multi-channel audio data.
In an embodiment of the present application, the time-frequency mask determining sub-module includes:
类目标语音数据获取单元,用于获取所述第一时频掩码对应的类目标语音数据;The class target voice data obtaining unit is configured to obtain class target voice data corresponding to the first time-frequency mask;
The second time-frequency mask generating unit is configured to generate, in combination with the class-target voice data, a second time-frequency mask for the target voice data in the second multi-channel audio data, wherein the class-target voice data includes the target voice data;
The combining determination unit is configured to combine the first time-frequency mask and the second time-frequency mask to generate the time-frequency mask for the second multi-channel audio data.
在本申请一实施例中,所述第一音频信号输出模块505包括:In an embodiment of the present application, the first audio signal output module 505 includes:
自适应滤波处理子模块,用于对所述第一单通道音频数据进行自适应滤波处理,得到第二单通道音频数据;An adaptive filter processing sub-module, configured to perform adaptive filter processing on the first single-channel audio data to obtain second single-channel audio data;
第二音频信号输出子模块,用于采用所述第二单通道音频数据,进行音频信号输出。The second audio signal output sub-module is configured to use the second single-channel audio data to output audio signals.
在本申请一实施例中,所述第二音频信号输出子模块包括:In an embodiment of the present application, the second audio signal output submodule includes:
当前应用类型确定单元,用于确定当前应用类型;The current application type determining unit is used to determine the current application type;
第三单通道音频数据得到单元,用于采用所述当前应用类型对应的单通道降噪策略,对所述第二单通道音频数据进行降噪处理,得到第三单通道音频数据;A third single-channel audio data obtaining unit, configured to adopt a single-channel noise reduction strategy corresponding to the current application type to perform noise reduction processing on the second single-channel audio data to obtain third single-channel audio data;
第三音频信号输出单元,用于采用所述第三单通道音频数据,进行音频信号输出。The third audio signal output unit is configured to use the third single-channel audio data to output audio signals.
在本申请一实施例中,所述解混响处理模块502包括:In an embodiment of the present application, the de-reverberation processing module 502 includes:
解混响参数获取子模块,用于获取解混响参数;De-reverberation parameter acquisition sub-module for acquiring de-reverberation parameters;
第二多通道音频数据得到子模块,用于采用所述解混响参数,对所述第一多通道音频数据进行解混响处理,得到第二多通道音频数据;The second multi-channel audio data obtaining sub-module is configured to use the de-reverberation parameter to perform de-reverberation processing on the first multi-channel audio data to obtain second multi-channel audio data;
所述装置还包括:The device also includes:
迭代更新模块,用于采用所述第一单通道音频数据和/或,所述第二单通道音频数据和/或,所述第三单通道音频数据,迭代更新所述解混响参数。An iterative update module is configured to use the first single-channel audio data and/or the second single-channel audio data and/or the third single-channel audio data to iteratively update the dereverberation parameters.
在本申请一实施例中,所述装置还包括:In an embodiment of the present application, the device further includes:
相关程度确定模块,用于确定所述第一多通道音频数据中音频数据的相关程度;A correlation degree determination module, configured to determine the correlation degree of audio data in the first multi-channel audio data;
对齐处理模块,用于按照所述相关程度,对所述第一多通道音频数据中音频数据进行对齐处理。The alignment processing module is configured to perform alignment processing on the audio data in the first multi-channel audio data according to the correlation degree.
In this embodiment of the present application, first multi-channel audio data composed of audio data collected by one or more microphone arrays is acquired; de-reverberation processing is applied to the first multi-channel audio data to obtain second multi-channel audio data; a time-frequency mask is generated for the second multi-channel audio data; the time-frequency mask is used to perform beamforming processing on the second multi-channel audio data to obtain first single-channel audio data; and the first single-channel audio data is used for audio signal output. This enables audio processing over multiple asynchronously sampled microphone arrays, avoids the high cost of being restricted to a single synchronously sampled unified array, enlarges the sound pickup range, and improves robustness. Moreover, because a time-frequency mask is used, the processing does not depend on the position information of the microphone arrays, which improves noise reduction and interference resistance.
以上所描述的装置实施例仅仅是示意性的,其中所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。本领域普通技术人员在不付出创造性的劳动的情况下,即可以理解并实施。The device embodiments described above are merely illustrative, where the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in One place, or it can be distributed to multiple network units. Some or all of the modules can be selected according to actual needs to achieve the objectives of the solutions of the embodiments. Those of ordinary skill in the art can understand and implement it without creative work.
对于装置实施例而言,由于其与方法实施例基本相似,所以描述的比较简单,相关之处参见方法实施例的部分说明即可。As for the device embodiment, since it is basically similar to the method embodiment, the description is relatively simple, and for related parts, please refer to the part of the description of the method embodiment.
图6是根据一示例性实施例示出的一种用于音频数据处理的电子设备600的框图。例如,电子设备600可以是移动电话,计算机,数字广播终端,消息收发设备,游戏控制台,平板设备,医疗设备,健身设备,个人数字助理等。Fig. 6 is a block diagram showing an electronic device 600 for audio data processing according to an exemplary embodiment. For example, the electronic device 600 may be a mobile phone, a computer, a digital broadcasting terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, etc.
参照图6,电子设备600可以包括以下一个或多个组件:处理组件602,存储器604,电源组件606,多媒体组件608,音频组件610,输入/输出(I/O)的接口612,传感器组件614,以及通信组件616。6, the electronic device 600 may include one or more of the following components: a processing component 602, a memory 604, a power supply component 606, a multimedia component 608, an audio component 610, an input/output (I/O) interface 612, and a sensor component 614 , And the communication component 616.
处理组件602通常控制电子设备600的整体操作,诸如与显示,电话呼叫,数据通信,相机操作和记录操作相关联的操作。处理元件602可以包括一个或多个处理器620来执行指令,以完成上述的方法的全部或部分步骤。此外,处理组件602可以包括一个或多个模块,便于处理组件602和其他组件之间的交互。例如,处理部件602可以包括多媒体模块,以方便多媒体组件608和处理组件602之间的交互。The processing component 602 generally controls the overall operations of the electronic device 600, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing element 602 may include one or more processors 620 to execute instructions to complete all or part of the steps of the foregoing method. In addition, the processing component 602 may include one or more modules to facilitate the interaction between the processing component 602 and other components. For example, the processing component 602 may include a multimedia module to facilitate the interaction between the multimedia component 608 and the processing component 602.
存储器604被配置为存储各种类型的数据以支持在电子设备600的操 作。这些数据的示例包括用于在电子设备600上操作的任何应用程序或方法的指令,联系人数据,电话簿数据,消息,图片,视频等。存储器604可以由任何类型的易失性或非易失性存储设备或者它们的组合实现,如静态随机存取存储器(SRAM),电可擦除可编程只读存储器(EEPROM),可擦除可编程只读存储器(EPROM),可编程只读存储器(PROM),只读存储器(ROM),磁存储器,快闪存储器,磁盘或光盘。The memory 604 is configured to store various types of data to support operations in the electronic device 600. Examples of these data include instructions for any application or method operating on the electronic device 600, contact data, phone book data, messages, pictures, videos, etc. The memory 604 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable and Programmable Read Only Memory (EPROM), Programmable Read Only Memory (PROM), Read Only Memory (ROM), Magnetic Memory, Flash Memory, Magnetic Disk or Optical Disk.
电源组件606为电子设备600的各种组件提供电力。电源组件606可以包括电源管理系统,一个或多个电源,及其他与为电子设备600生成、管理和分配电力相关联的组件。The power supply component 606 provides power for various components of the electronic device 600. The power supply component 606 may include a power management system, one or more power supplies, and other components associated with the generation, management, and distribution of power for the electronic device 600.
The multimedia component 608 includes a screen providing an output interface between the electronic device 600 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, it may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may not only sense the boundary of a touch or swipe action, but also detect the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 608 includes a front camera and/or a rear camera. When the device 600 is in an operating mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front or rear camera may be a fixed optical lens system or have focal-length and optical-zoom capability.
The audio component 610 is configured to output and/or input audio signals. For example, the audio component 610 includes a microphone (MIC) that is configured to receive external audio signals when the electronic device 600 is in an operating mode, such as a call mode, a recording mode, or a speech recognition mode. The received audio signal may be further stored in the memory 604 or sent via the communication component 616. In some embodiments, the audio component 610 further includes a speaker for outputting audio signals.
The I/O interface 612 provides an interface between the processing component 602 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, and the like. These buttons may include, but are not limited to, a home button, volume buttons, a start button, and a lock button.
The sensor component 614 includes one or more sensors for providing status assessments of various aspects of the electronic device 600. For example, the sensor component 614 can detect the on/off state of the device 600 and the relative positioning of components (for example, the display and keypad of the electronic device 600), and can also detect a change in position of the electronic device 600 or one of its components, the presence or absence of user contact with the electronic device 600, the orientation or acceleration/deceleration of the electronic device 600, and temperature changes of the electronic device 600. The sensor component 614 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 614 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 614 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 616 is configured to facilitate wired or wireless communication between the electronic device 600 and other devices. The electronic device 600 can access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 616 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 616 further includes a near-field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio-frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 600 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for performing the above methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions is also provided, such as the memory 604 including instructions, which can be executed by the processor 620 of the electronic device 600 to complete the above methods. For example, the non-transitory computer-readable storage medium may be a ROM, a random-access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
A non-transitory computer-readable storage medium is provided. When the instructions in the storage medium are executed by a processor of a mobile terminal, the mobile terminal is enabled to perform an audio data processing method, the method including:
acquiring first multi-channel audio data, where the first multi-channel audio data is composed of audio data collected by one or more microphone arrays;
performing dereverberation processing on the first multi-channel audio data to obtain second multi-channel audio data;
generating a time-frequency mask for the second multi-channel audio data;
performing beamforming processing on the second multi-channel audio data according to the time-frequency mask to obtain first single-channel audio data;
outputting an audio signal using the first single-channel audio data.
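By way of a non-limiting illustration of the mask-generation step above: a time-frequency mask assigns each short-time Fourier transform (STFT) bin a weight reflecting how speech-dominated it is. The application does not fix a particular estimator, so the magnitude-ratio form sketched below is an assumption, and the function name and inputs are hypothetical:

```python
import numpy as np

def ratio_mask(speech_spec, noise_spec, eps=1e-8):
    """Magnitude-ratio time-frequency mask with values in [0, 1].

    speech_spec, noise_spec: (frames, bins) STFT magnitude estimates
    (hypothetical inputs; the application does not specify how they
    are obtained). Values near 1 mark speech-dominated bins, values
    near 0 mark noise-dominated bins.
    """
    s2 = np.abs(speech_spec) ** 2
    n2 = np.abs(noise_spec) ** 2
    # eps avoids division by zero in silent bins
    return s2 / (s2 + n2 + eps)
```

Such a mask is applied per bin; downstream steps (covariance estimation, beamforming) consume it as a per-frame weighting rather than applying it directly to the signal.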
Optionally, the step of performing beamforming processing on the second multi-channel audio data according to the time-frequency mask to obtain the first single-channel audio data includes:
determining a channel transfer function and an interference-noise covariance matrix according to the time-frequency mask;
determining beam weights using the channel transfer function and the interference-noise covariance matrix;
performing beamforming processing on the second multi-channel audio data using the beam weights to obtain the first single-channel audio data.
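The beam-weight computation in these steps can be sketched with a minimum-variance distortionless-response (MVDR) beamformer, a common choice when a channel transfer function and an interference-noise covariance matrix are available. The application does not name a specific beamformer, so MVDR and the diagonal-loading regularization below are assumptions:

```python
import numpy as np

def mvdr_weights(h, phi_n, diag_load=1e-6):
    """MVDR beam weights w = Phi_n^{-1} h / (h^H Phi_n^{-1} h).

    h: (C,) complex channel transfer function (steering vector).
    phi_n: (C, C) interference-noise covariance matrix.
    diag_load regularizes the matrix inversion (an assumed detail).
    """
    C = h.shape[0]
    num = np.linalg.solve(phi_n + diag_load * np.eye(C), h)
    # Normalization enforces the distortionless constraint w^H h = 1
    return num / (h.conj() @ num)

def beamform(Y, w):
    """Apply weights to (frames, C) multi-channel STFT data -> (frames,)."""
    return Y @ w.conj()
```

With the distortionless constraint, the target direction is passed at unit gain while interference-plus-noise power at the output is minimized.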
Optionally, the time-frequency mask includes a target speech mask and an interference-noise mask, and the step of determining a channel transfer function and an interference-noise covariance matrix according to the time-frequency mask includes:
generating a target speech covariance matrix using the target speech mask;
calculating a channel transfer function using the target speech covariance matrix;
calculating an interference-noise covariance matrix using the interference-noise mask.
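A hedged sketch of these three sub-steps for a single frequency bin: both covariance matrices are mask-weighted outer products of the multi-channel STFT frames, and the channel transfer function is often taken as the principal eigenvector of the target speech covariance. Neither estimator is fixed by the application; both are common choices assumed here:

```python
import numpy as np

def masked_covariance(Y, mask, eps=1e-8):
    """Mask-weighted spatial covariance for one frequency bin.

    Y: (frames, C) complex STFT frames; mask: (frames,) weights in [0, 1].
    Using the target speech mask yields the speech covariance; using the
    interference-noise mask yields the interference-noise covariance.
    """
    outer = Y[:, :, None] * Y[:, None, :].conj()      # (frames, C, C)
    return (mask[:, None, None] * outer).sum(axis=0) / (mask.sum() + eps)

def transfer_function(phi_speech):
    """Channel transfer function as the principal eigenvector of the
    target speech covariance (an assumed estimator)."""
    _, vecs = np.linalg.eigh(phi_speech)  # eigenvalues in ascending order
    return vecs[:, -1]
```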
Optionally, the step of generating a time-frequency mask for the second multi-channel audio data includes:
generating a first time-frequency mask for target-like speech data in the second multi-channel audio data;
determining a time-frequency mask for the second multi-channel audio data according to the first time-frequency mask.
Optionally, the step of determining a time-frequency mask for the second multi-channel audio data according to the first time-frequency mask includes:
acquiring target-like speech data corresponding to the first time-frequency mask;
generating, in combination with the target-like speech data, a second time-frequency mask for target speech data in the second multi-channel audio data, where the target-like speech data contains the target speech data;
generating a time-frequency mask for the second multi-channel audio data by combining the first time-frequency mask and the second time-frequency mask.
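The application states only that the first and second time-frequency masks are combined; one simple, assumed reading is an element-wise product, so that a time-frequency bin is retained only to the extent that both the coarse (target-like) mask and the refined (target) mask flag it:

```python
import numpy as np

def combine_masks(first_mask, second_mask):
    """Element-wise combination of two time-frequency masks.

    The product form is an assumption, not specified by the application;
    it keeps a bin only if both the coarse and the refined mask agree.
    """
    combined = first_mask * second_mask
    return np.clip(combined, 0.0, 1.0)
```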
Optionally, the step of outputting an audio signal using the first single-channel audio data includes:
performing adaptive filtering processing on the first single-channel audio data to obtain second single-channel audio data;
outputting an audio signal using the second single-channel audio data.
Optionally, the step of outputting an audio signal using the second single-channel audio data includes:
determining a current application type;
performing noise-reduction processing on the second single-channel audio data using a single-channel noise-reduction strategy corresponding to the current application type, to obtain third single-channel audio data;
outputting an audio signal using the third single-channel audio data.
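Selecting a single-channel noise-reduction strategy per application type can be sketched as a dispatch table. The application types and the strategies below are hypothetical, chosen only to illustrate the selection step, not taken from the application:

```python
# Hypothetical strategies: stronger suppression for calls, milder for recording.
def aggressive_suppress(samples):
    return [0.5 * v for v in samples]

def light_suppress(samples):
    return [0.9 * v for v in samples]

# Hypothetical application-type names mapped to strategies.
STRATEGIES = {"call": aggressive_suppress, "recording": light_suppress}

def denoise_for_app(app_type, audio):
    """Pick the noise-reduction strategy for the current application type
    and apply it; fall back to the mild strategy for unknown types."""
    return STRATEGIES.get(app_type, light_suppress)(audio)
```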
Optionally, the step of performing dereverberation processing on the first multi-channel audio data to obtain second multi-channel audio data includes:
acquiring dereverberation parameters;
performing dereverberation processing on the first multi-channel audio data using the dereverberation parameters to obtain second multi-channel audio data.
The method further includes:
iteratively updating the dereverberation parameters using the first single-channel audio data and/or the second single-channel audio data and/or the third single-channel audio data.
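The application leaves the dereverberation algorithm open. A common family is weighted-prediction-error (WPE)-style delayed linear prediction, in which late reverberation is predicted from delayed past frames and subtracted, and the prediction filter (the dereverberation parameter) is re-estimated iteratively from the enhanced output. The sketch below is a minimal single-channel, single-frequency-bin version under that assumption:

```python
import numpy as np

def delayed_prediction_dereverb(x, taps=3, delay=2, iters=2, eps=1e-8):
    """Minimal WPE-style dereverberation for one STFT bin (single channel).

    x: (frames,) complex sequence for one frequency bin.
    The filter g plays the role of the 'dereverberation parameter' and is
    iteratively re-estimated from the current output's power (an assumed
    update scheme, not specified by the application).
    """
    T = len(x)
    # Regressor of delayed past frames: column k holds x shifted by delay+k.
    X = np.zeros((T, taps), dtype=complex)
    for k in range(taps):
        shift = delay + k
        X[shift:, k] = x[: T - shift]
    d = x.copy()
    for _ in range(iters):
        lam = np.abs(d) ** 2 + eps           # time-varying power weights
        A = (X.conj().T / lam) @ X           # weighted normal equations
        b = (X.conj().T / lam) @ x
        g = np.linalg.solve(A + eps * np.eye(taps), b)
        d = x - X @ g                        # subtract predicted late reverb
    return d
```

The delay keeps the direct sound and early reflections untouched; only the late tail is predicted and removed.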
Optionally, before the step of performing dereverberation processing on the first multi-channel audio data to obtain second multi-channel audio data, the method further includes:
determining the degree of correlation of the audio data in the first multi-channel audio data;
aligning the audio data in the first multi-channel audio data according to the degree of correlation.
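A hedged sketch of this correlation-and-alignment pre-processing: estimate each channel's lag relative to a reference channel from the peak of their cross-correlation, then shift the channel accordingly. The application does not specify the correlation measure; plain cross-correlation and the circular shift below are simplifying assumptions:

```python
import numpy as np

def estimate_lag(ref, ch):
    """Lag (in samples) of `ch` relative to `ref` via the cross-correlation peak."""
    corr = np.correlate(ch, ref, mode="full")
    return int(np.argmax(corr)) - (len(ref) - 1)

def align_channels(channels):
    """Shift every channel so it lines up with the first (reference) channel."""
    ref = channels[0]
    aligned = [ref.copy()]
    for ch in channels[1:]:
        lag = estimate_lag(ref, ch)
        aligned.append(np.roll(ch, -lag))  # circular shift; a simplification
    return aligned
```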
FIG. 7 is a schematic structural diagram of an electronic device 700 for audio data processing according to an embodiment of the present application. The electronic device 700 may be a server. The server 700 may vary considerably depending on configuration or performance, and may include one or more central processing units (CPUs) 722 (for example, one or more processors), memory 732, and one or more storage media 730 (for example, one or more mass storage devices) storing application programs 742 or data 744. The memory 732 and the storage media 730 may provide transient or persistent storage. A program stored on a storage medium 730 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the server. Furthermore, the central processing unit 722 may be configured to communicate with the storage medium 730 and execute, on the server 700, the series of instruction operations in the storage medium 730.
The server 700 may also include one or more power supplies 726, one or more wired or wireless network interfaces 750, one or more input/output interfaces 758, one or more keyboards 756, and/or one or more operating systems 741, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.
The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and for the parts that the embodiments have in common, reference may be made among them.
Those skilled in the art should understand that the embodiments of the present application may be provided as a method, an apparatus, or a computer program product. Therefore, the embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the embodiments of the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so on) containing computer-usable program code.
The embodiments of the present application are described with reference to flowcharts and/or block diagrams of the method, terminal device (system), and computer program product according to the embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data-processing terminal device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data-processing terminal device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data-processing terminal device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data-processing terminal device, such that a series of operational steps are performed on the computer or other programmable terminal device to produce a computer-implemented process, so that the instructions executed on the computer or other programmable terminal device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present application have been described, those skilled in the art can make additional changes and modifications to these embodiments once they grasp the basic inventive concept. Therefore, the appended claims are intended to be construed as including the preferred embodiments and all changes and modifications falling within the scope of the embodiments of the present application.
Finally, it should also be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between those entities or operations. Moreover, the terms "comprise," "include," or any other variant thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or terminal device that includes a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or terminal device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or terminal device that includes the element.
The audio data processing method and apparatus, electronic device, and storage medium provided above have been described in detail. Specific examples are used herein to explain the principles and implementations of the present application, and the descriptions of the above embodiments are only intended to help understand the method of the present application and its core idea. Meanwhile, for those of ordinary skill in the art, there will be changes in the specific implementation and the scope of application in accordance with the idea of the present application. In summary, the contents of this specification should not be construed as limiting the present application.

Claims (28)

  1. An audio data processing method, characterized in that the method comprises:
    acquiring first multi-channel audio data, wherein the first multi-channel audio data is composed of audio data collected by one or more microphone arrays;
    performing dereverberation processing on the first multi-channel audio data to obtain second multi-channel audio data;
    generating a time-frequency mask for the second multi-channel audio data;
    performing beamforming processing on the second multi-channel audio data according to the time-frequency mask to obtain first single-channel audio data;
    outputting an audio signal using the first single-channel audio data.
  2. The method according to claim 1, characterized in that the step of performing beamforming processing on the second multi-channel audio data according to the time-frequency mask to obtain the first single-channel audio data comprises:
    determining a channel transfer function and an interference-noise covariance matrix according to the time-frequency mask;
    determining beam weights using the channel transfer function and the interference-noise covariance matrix;
    performing beamforming processing on the second multi-channel audio data using the beam weights to obtain the first single-channel audio data.
  3. The method according to claim 2, characterized in that the time-frequency mask comprises a target speech mask and an interference-noise mask, and the step of determining a channel transfer function and an interference-noise covariance matrix according to the time-frequency mask comprises:
    generating a target speech covariance matrix using the target speech mask;
    calculating a channel transfer function using the target speech covariance matrix;
    calculating an interference-noise covariance matrix using the interference-noise mask.
  4. The method according to claim 1, 2, or 3, characterized in that the step of generating a time-frequency mask for the second multi-channel audio data comprises:
    generating a first time-frequency mask for target-like speech data in the second multi-channel audio data;
    determining a time-frequency mask for the second multi-channel audio data according to the first time-frequency mask.
  5. The method according to claim 4, characterized in that the step of determining a time-frequency mask for the second multi-channel audio data according to the first time-frequency mask comprises:
    acquiring target-like speech data corresponding to the first time-frequency mask;
    generating, in combination with the target-like speech data, a second time-frequency mask for target speech data in the second multi-channel audio data, wherein the target-like speech data contains the target speech data;
    generating a time-frequency mask for the second multi-channel audio data by combining the first time-frequency mask and the second time-frequency mask.
  6. The method according to claim 1, characterized in that the step of outputting an audio signal using the first single-channel audio data comprises:
    performing adaptive filtering processing on the first single-channel audio data to obtain second single-channel audio data;
    outputting an audio signal using the second single-channel audio data.
  7. The method according to claim 6, characterized in that the step of outputting an audio signal using the second single-channel audio data comprises:
    determining a current application type;
    performing noise-reduction processing on the second single-channel audio data using a single-channel noise-reduction strategy corresponding to the current application type, to obtain third single-channel audio data;
    outputting an audio signal using the third single-channel audio data.
  8. The method according to claim 7, characterized in that the step of performing dereverberation processing on the first multi-channel audio data to obtain second multi-channel audio data comprises:
    acquiring dereverberation parameters;
    performing dereverberation processing on the first multi-channel audio data using the dereverberation parameters to obtain second multi-channel audio data;
    the method further comprising:
    iteratively updating the dereverberation parameters using the first single-channel audio data and/or the second single-channel audio data and/or the third single-channel audio data.
  9. The method according to claim 1, characterized in that, before the step of performing dereverberation processing on the first multi-channel audio data to obtain second multi-channel audio data, the method further comprises:
    determining the degree of correlation of the audio data in the first multi-channel audio data;
    aligning the audio data in the first multi-channel audio data according to the degree of correlation.
  10. An audio data processing apparatus, characterized in that the apparatus comprises:
    a first multi-channel audio data acquisition module, configured to acquire first multi-channel audio data, wherein the first multi-channel audio data is composed of audio data collected by one or more microphone arrays;
    a dereverberation processing module, configured to perform dereverberation processing on the first multi-channel audio data to obtain second multi-channel audio data;
    a time-frequency mask generation module, configured to generate a time-frequency mask for the second multi-channel audio data;
    a beamforming processing module, configured to perform beamforming processing on the second multi-channel audio data according to the time-frequency mask to obtain first single-channel audio data;
    an audio signal output module, configured to output an audio signal using the first single-channel audio data.
  11. The apparatus according to claim 10, characterized in that the beamforming processing module comprises:
    a function and matrix determination submodule, configured to determine a channel transfer function and an interference-noise covariance matrix according to the time-frequency mask;
    a beam weight determination submodule, configured to determine beam weights using the channel transfer function and the interference-noise covariance matrix;
    a first single-channel audio data obtaining submodule, configured to perform beamforming processing on the second multi-channel audio data using the beam weights to obtain the first single-channel audio data.
  12. The apparatus according to claim 11, characterized in that the time-frequency mask comprises a target speech mask and an interference-noise mask, and the function and matrix determination submodule comprises:
    a target speech covariance matrix generation unit, configured to generate a target speech covariance matrix using the target speech mask;
    a channel transfer function obtaining unit, configured to calculate a channel transfer function using the target speech covariance matrix;
    an interference-noise covariance matrix obtaining unit, configured to calculate an interference-noise covariance matrix using the interference-noise mask.
  13. The apparatus according to claim 10, 11, or 12, characterized in that the time-frequency mask generation module comprises:
    a first time-frequency mask generation submodule, configured to generate a first time-frequency mask for target-like speech data in the second multi-channel audio data;
    a time-frequency mask determination submodule, configured to determine, according to the first time-frequency mask, a time-frequency mask for the second multi-channel audio data.
  14. The apparatus according to claim 13, characterized in that the time-frequency mask determination submodule comprises:
    a target-like speech data acquisition unit, configured to acquire target-like speech data corresponding to the first time-frequency mask;
    a second time-frequency mask generation unit, configured to generate, in combination with the target-like speech data, a second time-frequency mask for target speech data in the second multi-channel audio data, wherein the target-like speech data contains the target speech data;
    a combined time-frequency mask determination unit, configured to generate a time-frequency mask for the second multi-channel audio data by combining the first time-frequency mask and the second time-frequency mask.
  15. The apparatus according to claim 10, characterized in that the audio signal output module comprises:
    an adaptive filtering processing submodule, configured to perform adaptive filtering processing on the first single-channel audio data to obtain second single-channel audio data;
    a second audio signal output submodule, configured to output an audio signal using the second single-channel audio data.
  16. The apparatus according to claim 15, characterized in that the second audio signal output submodule comprises:
    a current application type determination unit, configured to determine a current application type;
    a third single-channel audio data obtaining unit, configured to perform noise-reduction processing on the second single-channel audio data using a single-channel noise-reduction strategy corresponding to the current application type, to obtain third single-channel audio data;
    a third audio signal output unit, configured to output an audio signal using the third single-channel audio data.
  17. The apparatus according to claim 16, characterized in that the dereverberation processing module comprises:
    a dereverberation parameter acquisition submodule, configured to acquire dereverberation parameters;
    a second multi-channel audio data obtaining submodule, configured to perform dereverberation processing on the first multi-channel audio data using the dereverberation parameters to obtain second multi-channel audio data;
    the apparatus further comprising:
    an iterative update module, configured to iteratively update the dereverberation parameters using the first single-channel audio data and/or the second single-channel audio data and/or the third single-channel audio data.
  18. The apparatus according to claim 10, characterized in that the apparatus further comprises:
    a correlation degree determination module, configured to determine the degree of correlation of the audio data in the first multi-channel audio data;
    an alignment processing module, configured to align the audio data in the first multi-channel audio data according to the degree of correlation.
  19. 一种电子设备,其特征在于,包括有存储器,以及一个或者一个以上的程序,其中一个或者一个以上程序存储于存储器中,且经配置以由一个或者一个以上处理器执行所述一个或者一个以上程序包含用于进行以下操作的指令:An electronic device characterized by comprising a memory and one or more programs, wherein one or more programs are stored in the memory and configured to be executed by one or more processors. The program contains instructions for the following operations:
    获取第一多通道音频数据;其中,所述第一多通道音频数据由一个或多个麦克风阵列采集的音频数据组成;Acquiring first multi-channel audio data; wherein the first multi-channel audio data is composed of audio data collected by one or more microphone arrays;
    对所述第一多通道音频数据进行解混响处理,得到第二多通道音频数据;Performing de-reverberation processing on the first multi-channel audio data to obtain second multi-channel audio data;
    生成针对所述第二多通道音频数据的时频掩码;Generating a time-frequency mask for the second multi-channel audio data;
    根据所述时频掩码,对所述第二多通道音频数据进行波束形成处理,得到第一单通道音频数据;Performing beamforming processing on the second multi-channel audio data according to the time-frequency mask to obtain first single-channel audio data;
    采用所述第一单通道音频数据,进行音频信号输出。The audio signal output is performed by using the first single-channel audio data.
  20. The electronic device according to claim 19, wherein the step of performing beamforming processing on the second multi-channel audio data according to the time-frequency mask to obtain first single-channel audio data comprises:
    determining a channel transfer function and an interference-noise covariance matrix according to the time-frequency mask;
    determining beam weights by using the channel transfer function and the interference-noise covariance matrix;
    performing beamforming processing on the second multi-channel audio data by using the beam weights to obtain the first single-channel audio data.
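The two-step weight computation recited above (transfer function plus interference-noise covariance, then beam weights) matches the shape of a minimum-variance distortionless-response (MVDR) beamformer, a standard choice for this step. The application does not name a specific beamformer, so the following numpy sketch for one frequency bin is an illustrative assumption; all names are hypothetical:

```python
import numpy as np

def mvdr_weights(d, Rn, diag_load=1e-6):
    """MVDR beam weights w = Rn^{-1} d / (d^H Rn^{-1} d) for one frequency bin.

    d  : (M,) complex channel transfer function (steering vector)
    Rn : (M, M) interference-plus-noise covariance matrix
    """
    M = Rn.shape[0]
    # Diagonal loading keeps the solve stable when Rn is near-singular.
    Rn = Rn + diag_load * np.trace(Rn).real / M * np.eye(M)
    Rn_inv_d = np.linalg.solve(Rn, d)
    return Rn_inv_d / (np.conj(d) @ Rn_inv_d)

def apply_beamformer(w, X):
    """Single-channel output y[t] = w^H x[t] for STFT frames X of shape (T, M)."""
    return X @ np.conj(w)

# Toy example: 4 microphones, identity noise covariance, unit steering vector.
M = 4
d = np.ones(M, dtype=complex)
Rn = np.eye(M, dtype=complex)
w = mvdr_weights(d, Rn)
# MVDR is distortionless toward the steering direction: w^H d = 1.
assert np.isclose(np.conj(w) @ d, 1.0)
```

The distortionless constraint `w^H d = 1` is what makes the beamformer pass the target speech unchanged while minimizing interference-noise power.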
  21. The electronic device according to claim 20, wherein the time-frequency mask comprises a target speech mask and an interference-noise mask, and the step of determining a channel transfer function and an interference-noise covariance matrix according to the time-frequency mask comprises:
    generating a target speech covariance matrix by using the target speech mask;
    calculating the channel transfer function by using the target speech covariance matrix;
    calculating the interference-noise covariance matrix by using the interference-noise mask.
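Mask-based covariance estimation of the kind recited here is commonly implemented as a mask-weighted average of outer products of the multi-microphone STFT vectors, with the channel transfer function then taken as the principal eigenvector of the target-speech covariance. The eigenvector choice and all names below are assumptions for the sketch, not details from this application:

```python
import numpy as np

def masked_covariance(X, mask, eps=1e-8):
    """Mask-weighted spatial covariance for one frequency bin.

    X    : (T, M) complex STFT frames for M microphones
    mask : (T,) real time-frequency mask values in [0, 1]
    """
    # R = sum_t mask[t] * x_t x_t^H / sum_t mask[t]
    num = np.einsum('t,tm,tn->mn', mask, X, np.conj(X))
    return num / (mask.sum() + eps)

def channel_transfer_function(R_speech):
    """Principal eigenvector of the target-speech covariance as steering vector."""
    vals, vecs = np.linalg.eigh(R_speech)
    return vecs[:, -1]  # eigenvector of the largest eigenvalue

# Toy example: a rank-one "speech" field d d^H plus weak sensor noise.
rng = np.random.default_rng(0)
M, T = 3, 200
d = np.array([1.0, 0.5 + 0.5j, -0.3j])
s = rng.standard_normal(T) + 1j * rng.standard_normal(T)
X = np.outer(s, d) + 0.01 * (rng.standard_normal((T, M))
                             + 1j * rng.standard_normal((T, M)))
R = masked_covariance(X, np.ones(T))
d_hat = channel_transfer_function(R)
# The estimate should align with d up to scale and phase.
cos = abs(np.vdot(d_hat, d)) / (np.linalg.norm(d_hat) * np.linalg.norm(d))
assert cos > 0.99
```

The same `masked_covariance` routine serves both masks: applied with the target speech mask it yields the speech covariance, and applied with the interference-noise mask it yields the interference-noise covariance used by the beamformer.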
  22. The electronic device according to claim 19, 20 or 21, wherein the step of generating a time-frequency mask for the second multi-channel audio data comprises:
    generating a first time-frequency mask for quasi-target speech data in the second multi-channel audio data;
    determining the time-frequency mask for the second multi-channel audio data according to the first time-frequency mask.
  23. The electronic device according to claim 22, wherein the step of determining the time-frequency mask for the second multi-channel audio data according to the first time-frequency mask comprises:
    acquiring quasi-target speech data corresponding to the first time-frequency mask;
    generating, in combination with the quasi-target speech data, a second time-frequency mask for target speech data in the second multi-channel audio data, wherein the quasi-target speech data contains the target speech data;
    generating the time-frequency mask for the second multi-channel audio data by combining the first time-frequency mask and the second time-frequency mask.
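The claim leaves open how the two masks are combined. One simple, commonly used choice is an elementwise product, so that a time-frequency point survives only if both the quasi-target (first) mask and the target-refining (second) mask retain it. A hedged sketch under that assumption:

```python
import numpy as np

def combine_masks(first_mask, second_mask):
    """Elementwise product of the two stage masks (illustrative choice):
    a point is kept only when both stages keep it."""
    return first_mask * second_mask

# Toy 2x3 time-frequency grids: stage 1 keeps speech-like points,
# stage 2 narrows them down to the target speaker.
m1 = np.array([[1.0, 1.0, 0.0],
               [0.0, 1.0, 1.0]])
m2 = np.array([[1.0, 0.0, 0.0],
               [0.0, 1.0, 0.0]])
combined = combine_masks(m1, m2)
assert np.array_equal(combined, np.array([[1.0, 0.0, 0.0],
                                          [0.0, 1.0, 0.0]]))
```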
  24. The electronic device according to claim 19, wherein the step of outputting an audio signal by using the first single-channel audio data comprises:
    performing adaptive filtering processing on the first single-channel audio data to obtain second single-channel audio data;
    outputting an audio signal by using the second single-channel audio data.
  25. The electronic device according to claim 24, wherein the step of outputting an audio signal by using the second single-channel audio data comprises:
    determining a current application type;
    performing noise reduction processing on the second single-channel audio data by using a single-channel noise reduction strategy corresponding to the current application type, to obtain third single-channel audio data;
    outputting an audio signal by using the third single-channel audio data.
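Selecting a noise reduction strategy by application type, as recited above, can be illustrated with a simple dispatch table over the parameters of a spectral-subtraction rule: recognition typically tolerates residual noise better than it tolerates speech distortion, while communication favors stronger suppression. The strategy names and parameter values below are assumptions for the sketch, not taken from this application:

```python
import numpy as np

# Illustrative per-application settings (assumed values): recognition keeps
# more of the signal, communication suppresses noise harder.
STRATEGIES = {
    'speech_recognition': {'over_subtract': 1.0, 'floor': 0.2},
    'communication':      {'over_subtract': 2.0, 'floor': 0.05},
}

def spectral_subtraction(power_spec, noise_power, app='speech_recognition'):
    """Power-domain spectral subtraction with an app-dependent spectral floor."""
    p = STRATEGIES[app]
    clean = power_spec - p['over_subtract'] * noise_power
    # Floor at a fraction of the input power to limit musical-noise artifacts.
    return np.maximum(clean, p['floor'] * power_spec)

spec = np.array([1.0, 4.0, 0.5])
noise = np.array([0.5, 0.5, 0.5])
out = spectral_subtraction(spec, noise, 'communication')
assert np.allclose(out, [0.05, 3.0, 0.025])
```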
  26. The electronic device according to claim 25, wherein the step of performing dereverberation processing on the first multi-channel audio data to obtain second multi-channel audio data comprises:
    acquiring dereverberation parameters;
    performing dereverberation processing on the first multi-channel audio data by using the dereverberation parameters, to obtain the second multi-channel audio data;
    the electronic device further comprising instructions for performing the following operation:
    iteratively updating the dereverberation parameters by using the first single-channel audio data, and/or the second single-channel audio data, and/or the third single-channel audio data.
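Iteratively updating dereverberation parameters from the enhanced output, as recited above, resembles the weighted prediction error (WPE) family of methods, in which each pass re-weights a linear-prediction fit by the power of the current dereverberated estimate. The following is a much-simplified single-channel, single-frequency-bin sketch; all parameter names and values are assumptions, not details from this application:

```python
import numpy as np

def wpe_dereverb(x, taps=8, delay=3, iters=3, floor=1e-3, ridge=1e-6):
    """Iteratively re-weighted linear-prediction dereverberation for one
    single-channel frequency bin (a much-simplified WPE-style update).

    x : (T,) real or complex frame sequence for one bin
    """
    T = len(x)
    # Delayed observation matrix: column k holds x shifted by (delay + k).
    Xt = np.zeros((T, taps), dtype=x.dtype)
    for k in range(taps):
        shift = delay + k
        Xt[shift:, k] = x[:T - shift]
    d = x.copy()
    for _ in range(iters):
        lam = np.maximum(np.abs(d) ** 2, floor)    # per-frame power weights
        A = (Xt.conj().T / lam) @ Xt               # X^H diag(1/lam) X
        b = (Xt.conj().T / lam) @ x
        g = np.linalg.solve(A + ridge * np.eye(taps), b)
        d = x - Xt @ g                             # subtract predicted late reverb
    return d

# Toy example: synthetic "late reverberation" feeds back at lag 4.
rng = np.random.default_rng(2)
s = rng.standard_normal(2000)
x = s.copy()
for t in range(4, len(x)):
    x[t] += 0.6 * x[t - 4]
d = wpe_dereverb(x, taps=4, delay=3)
# Removing the predictable tail lowers the signal power toward that of s.
assert np.var(d) < np.var(x)
```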
  27. The electronic device according to claim 19, wherein the electronic device further comprises instructions for performing the following operations:
    determining the degree of correlation of the audio data in the first multi-channel audio data;
    aligning the audio data in the first multi-channel audio data according to the degree of correlation.
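Correlation-based alignment of the kind recited here is often realized by estimating the inter-channel lag that maximizes the cross-correlation with a reference channel, then shifting each channel to compensate. The application does not specify the correlation measure, so the following numpy sketch is an illustrative assumption:

```python
import numpy as np

def estimate_lag(ref, sig):
    """Lag (in samples) that maximizes the cross-correlation of sig with ref."""
    corr = np.correlate(sig, ref, mode='full')
    return np.argmax(corr) - (len(ref) - 1)

def align_channels(channels):
    """Shift each channel so it lines up with the first (reference) channel."""
    ref = channels[0]
    aligned = [ref]
    for sig in channels[1:]:
        lag = estimate_lag(ref, sig)
        aligned.append(np.roll(sig, -lag))  # compensate the estimated delay
    return np.stack(aligned)

# Toy example: the second channel is the first delayed by 5 samples.
rng = np.random.default_rng(1)
x = rng.standard_normal(1000)
delayed = np.roll(x, 5)
out = align_channels([x, delayed])
assert estimate_lag(x, delayed) == 5
assert np.allclose(out[1], x)
```

In a practical device the circular `np.roll` would be replaced by a delay line, but the lag estimate itself is the same.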
  28. A readable storage medium, wherein, when instructions in the storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform the audio data processing method according to any one of claims 1 to 9.
PCT/CN2020/110038 2019-11-29 2020-08-19 Audio data processing method and apparatus, and electronic device and storage medium WO2021103672A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911207689.4A CN110970046B (en) 2019-11-29 2019-11-29 Audio data processing method and device, electronic equipment and storage medium
CN201911207689.4 2019-11-29

Publications (1)

Publication Number Publication Date
WO2021103672A1 true WO2021103672A1 (en) 2021-06-03

Family ID: 70032376

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/110038 WO2021103672A1 (en) 2019-11-29 2020-08-19 Audio data processing method and apparatus, and electronic device and storage medium

Country Status (2)

Country Link
CN (1) CN110970046B (en)
WO (1) WO2021103672A1 (en)


Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110970046B (en) * 2019-11-29 2022-03-11 北京搜狗科技发展有限公司 Audio data processing method and device, electronic equipment and storage medium
CN112420073B (en) * 2020-10-12 2024-04-16 北京百度网讯科技有限公司 Voice signal processing method, device, electronic equipment and storage medium
CN113270097B (en) * 2021-05-18 2022-05-17 成都傅立叶电子科技有限公司 Unmanned mechanical control method, radio station voice instruction conversion method and device
CN113643714B (en) * 2021-10-14 2022-02-18 阿里巴巴达摩院(杭州)科技有限公司 Audio processing method, device, storage medium and computer program
CN113644947A (en) * 2021-10-14 2021-11-12 西南交通大学 Adaptive beam forming method, device, equipment and readable storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105788607A (en) * 2016-05-20 2016-07-20 中国科学技术大学 Speech enhancement method applied to dual-microphone array
CN106448722A (en) * 2016-09-14 2017-02-22 科大讯飞股份有限公司 Sound recording method, device and system
CN108335701A (en) * 2018-01-24 2018-07-27 青岛海信移动通信技术股份有限公司 A kind of method and apparatus carrying out noise reduction
CN109166590A (en) * 2018-08-21 2019-01-08 江西理工大学 A kind of two-dimentional time-frequency mask estimation modeling method based on spatial correlation
WO2019049276A1 (en) * 2017-09-07 2019-03-14 三菱電機株式会社 Noise elimination device and noise elimination method
US10249299B1 (en) * 2013-06-27 2019-04-02 Amazon Technologies, Inc. Tailoring beamforming techniques to environments
CN110503971A (en) * 2018-05-18 2019-11-26 英特尔公司 Time-frequency mask neural network based estimation and Wave beam forming for speech processes
CN110970046A (en) * 2019-11-29 2020-04-07 北京搜狗科技发展有限公司 Audio data processing method and device, electronic equipment and storage medium

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105244036A (en) * 2014-06-27 2016-01-13 中兴通讯股份有限公司 Microphone speech enhancement method and microphone speech enhancement device
US10475466B2 (en) * 2014-07-17 2019-11-12 Ford Global Technologies, Llc Adaptive vehicle state-based hands-free phone noise reduction with learning capability
CN204117590U (en) * 2014-09-24 2015-01-21 广东外语外贸大学 Voice collecting denoising device and voice quality assessment system
US11133011B2 (en) * 2017-03-13 2021-09-28 Mitsubishi Electric Research Laboratories, Inc. System and method for multichannel end-to-end speech recognition
CN107316649B (en) * 2017-05-15 2020-11-20 百度在线网络技术(北京)有限公司 Speech recognition method and device based on artificial intelligence
US10192566B1 (en) * 2018-01-17 2019-01-29 Sorenson Ip Holdings, Llc Noise reduction in an audio system
CN108831495B (en) * 2018-06-04 2022-11-29 桂林电子科技大学 Speech enhancement method applied to speech recognition in noise environment
CN108806707B (en) * 2018-06-11 2020-05-12 百度在线网络技术(北京)有限公司 Voice processing method, device, equipment and storage medium
CN109817236A (en) * 2019-02-01 2019-05-28 安克创新科技股份有限公司 Audio defeat method, apparatus, electronic equipment and storage medium based on scene


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113689870A (en) * 2021-07-26 2021-11-23 浙江大华技术股份有限公司 Multi-channel voice enhancement method and device, terminal and readable storage medium
CN114898767A (en) * 2022-04-15 2022-08-12 中国电子科技集团公司第十研究所 Airborne voice noise separation method, device and medium based on U-Net
CN114898767B (en) * 2022-04-15 2023-08-15 中国电子科技集团公司第十研究所 U-Net-based airborne voice noise separation method, equipment and medium

Also Published As

Publication number Publication date
CN110970046B (en) 2022-03-11
CN110970046A (en) 2020-04-07

Similar Documents

Publication Publication Date Title
WO2021103672A1 (en) Audio data processing method and apparatus, and electronic device and storage medium
CN108510987B (en) Voice processing method and device
US11284190B2 (en) Method and device for processing audio signal with frequency-domain estimation, and non-transitory computer-readable storage medium
CN110493690B (en) Sound collection method and device
US9489963B2 (en) Correlation-based two microphone algorithm for noise reduction in reverberation
WO2015184893A1 (en) Mobile terminal call voice noise reduction method and device
KR102497549B1 (en) Audio signal processing method and device, and storage medium
EP3657497B1 (en) Method and device for selecting target beam data from a plurality of beams
US10186278B2 (en) Microphone array noise suppression using noise field isotropy estimation
CN111009257B (en) Audio signal processing method, device, terminal and storage medium
CN111128221A (en) Audio signal processing method and device, terminal and storage medium
CN110634488B (en) Information processing method, device and system and storage medium
CN114363770A (en) Filtering method and device in pass-through mode, earphone and readable storage medium
WO2022062531A1 (en) Multi-channel audio signal acquisition method and apparatus, and system
CN113506582A (en) Sound signal identification method, device and system
CN112447184A (en) Voice signal processing method and device, electronic equipment and storage medium
CN111510846B (en) Sound field adjusting method and device and storage medium
CN105244037B (en) Audio signal processing method and device
US11682412B2 (en) Information processing method, electronic equipment, and storage medium
CN110459236A (en) Noise estimation method, device and the storage medium of audio signal
CN113223553B (en) Method, apparatus and medium for separating voice signal
CN113488066A (en) Audio signal processing method, audio signal processing apparatus, and storage medium
WO2022198820A1 (en) Speech processing method and apparatus, and apparatus for speech processing
CN112785997B (en) Noise estimation method and device, electronic equipment and readable storage medium
CN113362841B (en) Audio signal processing method, device and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20894066

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20894066

Country of ref document: EP

Kind code of ref document: A1