CN111179960A - Audio signal processing method and device and storage medium - Google Patents

Publication number: CN111179960A
Application number: CN202010153357.9A (filed by Beijing Pinecone Electronics Co Ltd)
Granted publication: CN111179960B
Authority: CN (China)
Original language: Chinese (zh)
Inventor: Hou Haining (侯海宁)
Current assignee: Beijing Xiaomi Pinecone Electronic Co Ltd
Original assignee: Beijing Pinecone Electronics Co Ltd
Legal status: Active (granted)

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • G10L21/0216 - Noise filtering characterised by the method used for estimating noise
    • G10L21/0224 - Processing in the time domain
    • G10L21/0232 - Processing in the frequency domain
    • G10L21/0272 - Voice signal separating
    • G10L2021/02161 - Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02165 - Two microphones, one receiving mainly the noise signal and the other one mainly the speech signal
    • G10L2021/02166 - Microphone arrays; Beamforming


Abstract

The disclosure relates to an audio signal processing method and apparatus, and a storage medium. The method comprises: collecting, by at least two microphones, audio signals emitted by at least two sound sources respectively, to obtain original noisy signals of the at least two microphones respectively; for each frame in the time domain, obtaining respective frequency domain estimation signals of the at least two sound sources according to the respective original noisy signals of the at least two microphones; dividing a predetermined frequency band range into a plurality of harmonic subsets, wherein each harmonic subset comprises a plurality of frequency point data; determining the weighting coefficient of each frequency point contained in each harmonic subset according to the frequency domain estimation signals of the frequency points in that harmonic subset; determining a separation matrix of each frequency point according to the weighting coefficients; and obtaining the audio signals emitted by the at least two sound sources based on the separation matrices and the original noisy signals. By this method, the voice quality of the separated audio signals can be improved.

Description

Audio signal processing method and device and storage medium
Technical Field
The present disclosure relates to the field of signal processing, and in particular, to an audio signal processing method and apparatus, and a storage medium.
Background
In the related art, smart devices mostly use a microphone array for sound pickup and apply microphone beamforming technology to improve the processing quality of the speech signal, thereby improving the speech recognition rate in real environments. However, multi-microphone beamforming is sensitive to microphone position errors, which strongly affect its performance, and increasing the number of microphones also increases product cost.
Therefore, more and more smart devices are currently equipped with only two microphones. With two microphones, speech is often enhanced by blind source separation technology, which is completely different from multi-microphone beamforming; how to obtain higher voice quality from the signals separated by blind source separation is a problem that urgently needs to be solved.
Disclosure of Invention
The present disclosure provides an audio signal processing method and apparatus, and a storage medium.
According to a first aspect of embodiments of the present disclosure, there is provided an audio signal processing method, including:
acquiring, by at least two microphones, audio signals emitted by at least two sound sources respectively, to obtain original noisy signals of the at least two microphones respectively;
for each frame in the time domain, acquiring respective frequency domain estimation signals of the at least two sound sources according to the respective original noisy signals of the at least two microphones;
dividing a predetermined frequency band range into a plurality of harmonic subsets, wherein each harmonic subset comprises a plurality of frequency point data;
determining the weighting coefficient of each frequency point contained in each harmonic subset according to the frequency domain estimation signal of each frequency point in each harmonic subset;
determining a separation matrix of each frequency point according to the weighting coefficient;
and obtaining audio signals sent by at least two sound sources respectively based on the separation matrix and the original noisy signals.
In some embodiments, the determining, according to the frequency domain estimation signals of frequency points in each of the harmonic subsets, a weighting coefficient of each frequency point included in the harmonic subset includes:
determining a distribution function of the frequency domain estimation signals according to the frequency domain estimation signals of the frequency points in the harmonic subsets;
and determining the weighting coefficient of each frequency point according to the distribution function.
In some embodiments, the determining a distribution function of the frequency domain estimation signals according to the frequency domain estimation signals of frequency points in each of the harmonic subsets includes:
determining the square of the ratio of the frequency domain estimation signal to the standard deviation for each frequency point in the frequency point set of each harmonic subset;
summing the squared ratios over each frequency point set to determine a first sum;
summing the square roots of the first sums corresponding to the frequency point sets to obtain a second sum;
and determining the distribution function according to an exponential function with the second sum as a variable.
In some embodiments, the determining a distribution function of the frequency domain estimation signals according to the frequency domain estimation signals of frequency points in each of the harmonic subsets includes:
determining the square of the ratio between the frequency domain estimation signal and the standard deviation for each frequency point in the frequency point set of each harmonic subset;
summing the squared ratios over each frequency point set to determine a third sum;
determining a fourth sum by raising the third sum corresponding to each frequency point set to a predetermined power and summing the results;
and determining the distribution function according to an exponential function with the fourth sum as a variable.
In some embodiments, the method further comprises:
determining, for each harmonic subset, a fundamental frequency point, the first M multiple frequency points, and the frequency points within a preset bandwidth around each multiple frequency point;
and determining the frequency point set of each harmonic subset as the set consisting of the fundamental frequency point, the first M multiple frequency points, and the frequency points within the preset bandwidth around each multiple frequency point.
In some embodiments, the determining, for each harmonic subset, the fundamental frequency point, the first M multiple frequency points, and the frequency points within the preset bandwidth around each multiple frequency point includes:
determining the fundamental frequency point of each harmonic subset and the first M multiple frequency points corresponding to each fundamental frequency point according to the preset frequency band range and the preset number of divided harmonic subsets;
and determining the frequency points in the preset bandwidth according to the fundamental frequency points and the first M multiple frequency points of each harmonic subset.
According to a second aspect of the present disclosure, there is provided an audio signal processing apparatus comprising:
a first acquisition module, configured to collect, by at least two microphones, audio signals emitted by at least two sound sources respectively, so as to obtain original noisy signals of the at least two microphones respectively;
a second obtaining module, configured to obtain, for each frame in a time domain, frequency domain estimation signals of the at least two sound sources according to the original noisy signals of the at least two microphones, respectively;
a dividing module, configured to divide a predetermined frequency band range into a plurality of harmonic subsets, wherein each harmonic subset comprises a plurality of frequency point data;
a first determining module, configured to determine, according to the frequency domain estimation signal of each frequency point in each harmonic subset, a weighting coefficient of each frequency point included in the harmonic subset;
the second determining module is used for determining the separation matrix of each frequency point according to the weighting coefficient;
and the third acquisition module is used for acquiring audio signals sent by at least two sound sources respectively based on the separation matrix and the original noisy signals.
In some embodiments, the first determining module comprises:
the first determining submodule is used for determining a distribution function of the frequency domain estimation signals according to the frequency domain estimation signals of all frequency points in all the harmonic subsets;
and the second determining submodule is used for determining the weighting coefficient of each frequency point according to the distribution function.
In some embodiments, the first determining submodule is specifically configured to:
determining the square of the ratio of the frequency domain estimation signal to the standard deviation for each frequency point in the frequency point set of each harmonic subset;
summing the squared ratios over each frequency point set to determine a first sum;
summing the square roots of the first sums corresponding to the frequency point sets to obtain a second sum;
and determining the distribution function according to an exponential function with the second sum as a variable.
In some embodiments, the first determining submodule is specifically configured to:
determining the square of the ratio between the frequency domain estimation signal and the standard deviation for each frequency point in the frequency point set of each harmonic subset;
summing the squared ratios over each frequency point set to determine a third sum;
determining a fourth sum by raising the third sum corresponding to each frequency point set to a predetermined power and summing the results;
and determining the distribution function according to an exponential function with the fourth sum as a variable.
In some embodiments, the apparatus further comprises:
a third determining module, configured to determine, for each harmonic subset, a fundamental frequency point, the first M multiple frequency points, and the frequency points within a preset bandwidth around each multiple frequency point;
and a fourth determining module, configured to determine the frequency point set of each harmonic subset as the set consisting of the fundamental frequency point, the first M multiple frequency points, and the frequency points within the preset bandwidth around each multiple frequency point.
In some embodiments, the third determining module comprises:
the third determining submodule is used for determining the fundamental frequency point of each harmonic subset and the first M multiple frequency points corresponding to each fundamental frequency point according to the preset frequency band range and the preset number of the divided harmonic subsets;
and the fourth determining submodule is used for determining the frequency points in the preset bandwidth according to the fundamental frequency points and the first M multiple frequency points of each harmonic subset.
According to a third aspect provided by the present disclosure, there is provided an audio signal processing apparatus, the apparatus comprising at least: a processor and a memory for storing executable instructions operable on the processor, wherein:
the processor is configured to execute the executable instructions to perform the steps of any of the audio signal processing methods described above.
According to a fourth aspect provided by the present disclosure, there is provided a non-transitory computer-readable storage medium having stored therein computer-executable instructions that, when executed by a processor, implement the steps in any of the audio signal processing methods described above.
The technical solution provided by the embodiments of the disclosure can have the following beneficial effects: the weighting coefficients are determined according to the frequency domain estimation signals corresponding to the frequency points in each harmonic subset. Compared with the related-art approach of determining the weighting coefficients directly from individual frequency points, the embodiments of the disclosure divide the band into harmonic subsets whose frequency points, being multiples of a common fundamental, form a harmonic structure, so the frequency points within each harmonic subset are strongly dependent on one another. Processing each harmonic subset separately thus processes the frequency points of the whole band according to their different dependencies. This enhances the accuracy of signal separation at each frequency point, improves recognition performance, and reduces speech damage after separation.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a first flowchart illustrating an audio signal processing method according to an exemplary embodiment;
FIG. 2 is a second flowchart illustrating an audio signal processing method according to an exemplary embodiment;
FIG. 3 is a block diagram illustrating an application scenario of an audio signal processing method according to an exemplary embodiment;
FIG. 4 is a third flowchart illustrating an audio signal processing method according to an exemplary embodiment;
FIG. 5 is a block diagram illustrating the structure of an audio signal processing apparatus according to an exemplary embodiment;
FIG. 6 is a block diagram illustrating the physical structure of an audio signal processing apparatus according to an exemplary embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
Fig. 1 is a flowchart illustrating an audio signal processing method according to an exemplary embodiment, as shown in fig. 1, including the steps of:
step S101, acquiring audio signals sent by at least two sound sources by at least two microphones respectively to obtain original noisy signals of the at least two microphones respectively;
step S102, for each frame in a time domain, acquiring respective frequency domain estimation signals of the at least two sound sources according to the respective original noisy signals of the at least two microphones;
step S103, dividing a preset frequency band range into a plurality of harmonic subsets, wherein each harmonic subset comprises a plurality of frequency point data;
step S104, determining the weighting coefficient of each frequency point contained in each harmonic subset according to the frequency domain estimation signal of each frequency point in each harmonic subset;
s105, determining a separation matrix of each frequency point according to the weighting coefficient;
and S106, obtaining audio signals sent by at least two sound sources respectively based on the separation matrix and the original noisy signals.
The method disclosed by the embodiment of the disclosure is applied to the terminal. Here, the terminal is an electronic device into which two or more microphones are integrated. For example, the terminal may be a vehicle-mounted terminal, a computer, a server, or the like.
In an embodiment, the terminal may further be: an electronic device connected to a predetermined device in which two or more microphones are integrated; and the electronic equipment receives the audio signal collected by the predetermined equipment based on the connection and sends the processed audio signal to the predetermined equipment based on the connection. For example, the predetermined device is a sound box or the like.
In practical application, the terminal includes at least two microphones, and the at least two microphones simultaneously detect audio signals emitted by at least two sound sources respectively, so as to obtain original noisy signals of the at least two microphones respectively. Here, it is understood that in the present embodiment, the at least two microphones detect the audio signals emitted by the two sound sources synchronously.
In the embodiment of the present disclosure, the number of the microphones is 2 or more, and the number of the sound sources is 2 or more.
In the embodiment of the present disclosure, the original noisy signal is a mixed signal comprising the sounds emitted by the at least two sound sources. For example, the number of the microphones is 2, namely a microphone 1 and a microphone 2; the number of the sound sources is 2, namely a sound source 1 and a sound source 2; the original noisy signal of the microphone 1 is an audio signal comprising both the sound source 1 and the sound source 2; the original noisy signal of the microphone 2 is likewise an audio signal comprising both the sound source 1 and the sound source 2.
For example, the number of the microphones is 3, namely a microphone 1, a microphone 2 and a microphone 3; the number of the sound sources is 3, namely a sound source 1, a sound source 2 and a sound source 3; the original noisy signal of the microphone 1 is an audio signal comprising a sound source 1, a sound source 2 and a sound source 3; the original noisy signals of said microphone 2 and said microphone 3 are likewise audio signals each comprising a sound source 1, a sound source 2 and a sound source 3.
It will be appreciated that, for a given microphone, the sound emitted by one sound source is the desired audio signal, while the signals from the other sound sources in that microphone are noise signals. The disclosed embodiments aim to recover the audio signals emitted by the at least two sound sources from the original noisy signals of the at least two microphones.
It will be appreciated that the number of sound sources is generally the same as the number of microphones. If the number of microphones is smaller than the number of sound sources in some embodiments, the number of sound sources may be reduced to a dimension equal to the number of microphones.
It will be understood that the microphones may collect the audio signals of the sound sources over at least one audio frame, and the collected audio signals constitute the original noisy signal of each microphone. The original noisy signal may be either a time domain signal or a frequency domain signal. If the original noisy signal is a time domain signal, it can be converted into a frequency domain signal by a time-frequency conversion operation.
Here, the time domain signal may be frequency domain transformed based on Fast Fourier Transform (FFT). Alternatively, the time-domain signal may be frequency-domain transformed based on a short-time Fourier transform (STFT). Alternatively, the time domain signal may also be frequency domain transformed based on other fourier transforms.
For example, if the time domain signal of the p-th microphone in the n-th frame is

$$\tilde{x}_p^{n}(m), \quad m = 1, \ldots, \mathrm{Nfft},$$

transforming the time domain signal of the n-th frame into a frequency domain signal gives the original noisy signal of the n-th frame as:

$$X_p(k,n) = \mathrm{STFT}\big(\tilde{x}_p^{n}(m)\big)$$

where m indexes the discrete time points of the n-th frame time domain signal and k is a frequency point. Thus, the present embodiment can obtain the original noisy signal of each frame through the time domain to frequency domain transform. Of course, the original noisy signal of each frame may also be obtained based on other fast Fourier transform equations, which is not limited herein.
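For illustration, a minimal NumPy sketch of this time-to-frequency step; the Hann window, 50% overlap, and frame length are assumptions of this example, not requirements of the disclosure:

    import numpy as np

    def stft_frames(x, nfft=1024, hop=512):
        """Per-frame frequency-domain signals X_p(k, n) of one microphone:
        window each frame, then take an Nfft-point real FFT, giving
        K = nfft // 2 + 1 frequency points per frame (one column per frame)."""
        window = np.hanning(nfft)
        n_frames = 1 + (len(x) - nfft) // hop
        X = np.empty((nfft // 2 + 1, n_frames), dtype=complex)
        for n in range(n_frames):
            frame = x[n * hop : n * hop + nfft] * window
            X[:, n] = np.fft.rfft(frame)
        return X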
According to the original noisy signal of the frequency domain, an initial frequency domain estimation signal can be obtained in a priori estimation mode.
Illustratively, the original noisy signal may be separated based on an initialized separation matrix, such as an identity matrix, or according to the separation matrix obtained for the previous frame, to obtain the frequency domain estimation signal of each frame for each sound source. This provides a basis for separating the audio signals of the sound sources based on the frequency domain estimation signals and the separation matrix.
In the disclosed embodiment, the predetermined frequency band range is divided into a plurality of harmonic subsets. Here, the predetermined frequency band range may be a common range of audio signals, or a frequency band range determined according to audio processing requirements. For example, the entire frequency band is divided into L harmonic subsets according to the pitch frequency range. Illustratively, the predetermined frequency band ranges from 55 Hz to 880 Hz and L = 49; the fundamental frequency of the l-th harmonic subset is then $F_l = F_1 \cdot 2^{(l-1)/12}$, where $F_1 = 55$ Hz.
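For illustration, the fundamental frequencies of the harmonic subsets under these example values can be computed as below; only the semitone spacing from 55 Hz to 880 Hz comes from the disclosure:

    import numpy as np

    # Fundamental frequency of each of the L = 49 harmonic subsets:
    # F_l = F_1 * 2**((l - 1) / 12), F_1 = 55 Hz, so F_49 = 55 * 2**4 = 880 Hz.
    F1, L = 55.0, 49
    F = F1 * 2.0 ** (np.arange(L) / 12.0)  # F[0] = 55.0, ..., F[48] = 880.0 (Hz)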
In the embodiment of the present disclosure, each harmonic subset includes a plurality of frequency point data, and the weighting coefficient of each frequency point included in a harmonic subset may be determined according to the frequency domain estimation signal of each frequency point in that harmonic subset. According to the weighting coefficients, the separation matrix of each frequency point can then be determined. The original noisy signals are separated according to the separation matrix to obtain the a posteriori frequency domain estimation signal of each sound source. Here, because the weighting coefficient of each frequency point is applied to the a priori frequency domain estimation signal, the a posteriori frequency domain estimation signal can be brought closer to the original signal of each sound source.
Here, let $C_l$ denote the set of frequency points contained in the l-th harmonic subset. Illustratively, the set consists of the fundamental frequency $F_l$ and the first M multiple frequency points of the fundamental frequency $F_l$. Alternatively, the set consists of the fundamental frequency $F_l$, its multiple frequency points, and at least some of the frequency points within a bandwidth near each multiple frequency point.
Because the frequency point set of the harmonic subset representing the harmonic structure is determined based on the fundamental frequency and the first M frequency doubling points of the fundamental frequency, the frequency points in the frequency doubling point range have stronger dependency. Therefore, the weighting coefficients are determined according to the frequency domain estimation signals corresponding to the frequency points in each harmonic subset, and compared with a method of determining the weighting coefficients directly according to the frequency points in the related art, the embodiment of the present disclosure processes the frequency points according to different dependencies by dividing the harmonic subsets. Therefore, the accuracy of signal separation of each frequency point is enhanced, the recognition performance is improved, and the voice damage after separation is reduced.
In addition, compared with the related art in which the sound source signals are separated by multi-microphone beamforming, the audio signal processing method provided by the embodiment of the present disclosure does not need to consider the positions of the microphones, so the audio signals of the sounds emitted by the sound sources can be separated with higher accuracy. When the audio signal processing method is applied to a terminal device with two microphones, compared with the related art that improves voice quality by beamforming with at least 3 microphones, the number of microphones is greatly reduced, which reduces the hardware cost of the terminal.
In some embodiments, as shown in fig. 2, in the step S104, the determining a weighting coefficient of each frequency point included in each harmonic subset according to the frequency domain estimation signal of each frequency point in each harmonic subset includes:
step S201, determining a distribution function of the frequency domain estimation signals according to the frequency domain estimation signals of each frequency point in each harmonic subset;
step S202, determining the weighting coefficient of each frequency point according to the distribution function.
In the embodiment of the present disclosure, the separation matrix of each frequency point may be continuously updated based on the weighting coefficients of the frequency points in the harmonic subsets, the frequency domain estimation signal of each frame, and the like, so that the updated separation matrix has better separation performance, further improving the accuracy of the separated audio signals.
Here, a distribution function of the frequency domain estimation signals may be constructed from the frequency domain estimation signals of the frequency points in each harmonic subset. Because the frequency point set of each harmonic subset comprises a fundamental frequency and the first M multiple frequency points of that fundamental frequency, it forms a harmonic structure; in constructing the distribution function, the construction can therefore be carried out per harmonic subset, based on the harmonic structure of the audio signal.
For example, the separation matrix may be determined based on the eigenvalues solved from a covariance matrix. The covariance matrix $V_p(k,n)$ satisfies the following relationship:

$$V_p(k,n) = \beta\, V_p(k,n-1) + (1-\beta)\,\varphi_p(n)\, X(k,n)\, X^H(k,n)$$

where $\beta$ is a smoothing coefficient, $V_p(k,n-1)$ is the updated covariance of the previous frame, $X(k,n)$ is the original noisy signal of the current frame, and $X^H(k,n)$ is the conjugate transpose of the original noisy signal of the current frame.

$$\varphi_p(n) = \frac{G'\big(r_p(n)\big)}{r_p(n)}$$

is the weighting coefficient, in which

$$r_p(n) = \sqrt{\sum_{k=1}^{K} \frac{\left|Y_p(k,n)\right|^2}{\sigma_{kn}^2}}$$

is an auxiliary variable and $G\big(r_p(n)\big) = -\log p\big(\mathbf{Y}_p(n)\big)$ is referred to as a contrast function. Here, $p\big(\mathbf{Y}_p(n)\big)$ represents, for the p-th sound source, a multi-dimensional super-Gaussian prior probability density distribution model based on the whole frequency band, i.e., the distribution function described above. $\mathbf{Y}_p(n) = \big[Y_p(1,n), \ldots, Y_p(K,n)\big]^T$ is the vector of frequency domain estimation signals of the p-th sound source in the n-th frame, and $Y_p(k,n)$ represents the frequency domain estimation signal of the p-th sound source at the k-th frequency point of the n-th frame.
In the embodiment of the present disclosure, the distribution function may be constructed from the frequency domain estimation signals of each harmonic subset. Compared with using the prior probability density of all frequency points of the whole frequency band as in the related art, the weighting coefficient determined in this way only needs to consider the prior probability density of the corresponding frequency points within the harmonic subset. This, on one hand, simplifies calculation and, on the other hand, avoids considering frequency points that are far apart in the whole frequency band. That is, this processing mode recognizes that frequency points at different distances have different dependencies, with closer frequency points being more strongly dependent; this improves the separation performance of the separation matrix and facilitates the subsequent separation of high-quality audio signals based on it.
In some embodiments, the determining a distribution function of the frequency domain estimation signals according to the frequency domain estimation signals of frequency points in each of the harmonic subsets includes:
determining the square of the ratio of the frequency domain estimation signal to the standard deviation for each frequency point in the frequency point set of each harmonic subset;
summing the squared ratios over each frequency point set to determine a first sum;
summing the square roots of the first sums corresponding to the frequency point sets to obtain a second sum;
and determining the distribution function according to an exponential function with the second sum as a variable.
In embodiments of the present disclosure, the distribution function may be constructed from the harmonic subsets. The entire frequency band may be divided into L harmonic subsets, where each harmonic subset includes a number of frequency points. Let $C_l$ denote the set of frequency points contained in the l-th harmonic subset.
Based on this, the above distribution function can be defined according to the following formula (1):

$$p\big(\mathbf{Y}_p(n)\big) = \exp\left(-\alpha \sum_{l=1}^{L} \sqrt{\sum_{k \in C_l} \frac{\left|Y_p(k,n)\right|^2}{\sigma_{kn}^2}}\right) \tag{1}$$

In formula (1), k is a frequency point, $Y_p(k,n)$ is the frequency domain estimation signal of the p-th sound source at frequency point k in the n-th frame, $\sigma_{kn}^2$ is the variance, l indexes the harmonic subsets, and α is a coefficient. Based on formula (1), for the frequency points in each harmonic subset ($k \in C_l$), the ratio of the frequency domain estimation signal of each frequency point to the standard deviation is squared, and the squared values of the frequency points in the harmonic subset are summed, giving the first sum. The square roots of the first sums corresponding to the frequency point sets l = 1 to L are then summed to obtain the second sum, and the distribution function is obtained from the exponential function of the second sum.
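For illustration, a sketch of the per-frequency-point weighting coefficients implied by formula (1), assuming unit variance; the exact normalization constant is our reading of the formula rather than a quoted expression:

    import numpy as np

    def weights_formula1(Y_p, subsets, alpha=1.0, var=1.0, eps=1e-12):
        """Weighting coefficient of each frequency point for one source and one
        frame, following formula (1): within a harmonic subset C_l the weight is
        alpha / (var * sqrt(first_sum)), where first_sum is the sum over C_l of
        |Y_p(k, n)|**2 / var.  If subsets overlap, later subsets overwrite
        earlier ones (a simplification of this sketch).

        Y_p     : complex array (K,) of frequency-domain estimates Y_p(k, n)
        subsets : list of integer index arrays, one per harmonic subset C_l
        """
        phi = np.zeros(len(Y_p))
        for C_l in subsets:
            first_sum = np.sum(np.abs(Y_p[C_l]) ** 2 / var)
            phi[C_l] = alpha / (var * (np.sqrt(first_sum) + eps))
        return phi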
In the disclosed embodiment, the formula above operates first on the frequency points within each harmonic subset and then over the harmonic subsets. Compared with the related art, which operates directly on all frequency points of the whole frequency band, e.g.

$$p\big(\mathbf{Y}_p(n)\big) = \exp\left(-\sqrt{\sum_{k=1}^{K} \frac{\left|Y_p(k,n)\right|^2}{\sigma_{kn}^2}}\right),$$

a processing mode that assumes the same dependency among all frequency points, the dependency among the frequency points within a harmonic structure is strengthened and the dependency among frequency points in different harmonic subsets is weakened. This better matches the signal characteristics of actual audio signals and improves the accuracy of signal separation.
In some embodiments, the determining a distribution function of the frequency domain estimation signals according to the frequency domain estimation signals of frequency points in each of the harmonic subsets includes:
determining the square of the ratio between the frequency domain estimation signal and the standard deviation for each frequency point in the frequency point set of each harmonic subset;
summing the squared ratios over each frequency point set to determine a third sum;
determining a fourth sum by raising the third sum corresponding to each frequency point set to a predetermined power and summing the results;
and determining the distribution function according to an exponential function with the fourth sum as a variable.
In the embodiment of the present disclosure, as in the above embodiments, the whole frequency band may be divided into L harmonic subsets, each including several frequency points. Let $C_l$ denote the set of frequency points contained in the l-th harmonic subset.
Based on this, the distribution function can also be defined according to the following formula (2):

$$p\big(\mathbf{Y}_p(n)\big) = \exp\left(-\alpha \sum_{l=1}^{L} \left(\sum_{k \in C_l} \frac{\left|Y_p(k,n)\right|^2}{\sigma_{kn}^2}\right)^{2/3}\right) \tag{2}$$

In formula (2), k is a frequency point, $Y_p(k,n)$ is the frequency domain estimation signal of the p-th sound source at frequency point k in the n-th frame, and $\sigma_{kn}^2$ is the variance. Based on formula (2), for the frequency points in each harmonic subset, the ratio of the frequency domain estimation signal of each frequency point to the standard deviation is squared, and the squared values of the frequency points in the harmonic subset are summed, giving the third sum. The third sum corresponding to each frequency point set is raised to a predetermined power (formula (2) takes the 2/3 power as an example) and the results are summed to obtain the fourth sum; the distribution function is then obtained from the exponential function of the fourth sum.
Formula (2) is similar to formula (1): the operation is performed first on the frequency points included in each harmonic subset and then over the harmonic subsets. It has the same technical effect relative to the prior art as formula (1) in the above embodiment, and is not described here again.
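A corresponding sketch for formula (2); the 4α/3 factor follows from differentiating the 2/3 power and is our derivation, not a value quoted from the disclosure:

    import numpy as np

    def weights_formula2(Y_p, subsets, alpha=1.0, var=1.0, eps=1e-12):
        """Variant following formula (2): the within-subset sum (the third sum)
        is raised to the 2/3 power instead of the 1/2 power, so differentiation
        yields a weight proportional to the -1/3 power of that sum."""
        phi = np.zeros(len(Y_p))
        for C_l in subsets:
            third_sum = np.sum(np.abs(Y_p[C_l]) ** 2 / var)
            phi[C_l] = (4.0 * alpha / 3.0) / (var * (third_sum + eps) ** (1.0 / 3.0))
        return phi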
In some embodiments, the method further comprises:
determining, for each harmonic subset, a fundamental frequency point, the first M multiple frequency points, and the frequency points within a preset bandwidth around each multiple frequency point;
and determining the frequency point set of each harmonic subset as the set consisting of the fundamental frequency point, the first M multiple frequency points, and the frequency points within the preset bandwidth around each multiple frequency point.
In the embodiment of the present disclosure, the frequency points included in each harmonic subset may be determined according to the fundamental frequency point and the multiple frequency points of that harmonic subset. The first M multiple frequency points in a harmonic subset and the frequency points near each multiple frequency point have stronger dependency, so the frequency point set $C_l$ of the harmonic subset comprises the fundamental frequency point, the first M multiple frequency points, and the frequency points within the preset bandwidth around each multiple frequency point.
In some embodiments, the determining, for each harmonic subset, the fundamental frequency point, the first M multiple frequency points, and the frequency points within the preset bandwidth around each multiple frequency point includes:
determining the fundamental frequency point of each harmonic subset and the first M multiple frequency points corresponding to each fundamental frequency point according to the preset frequency band range and the preset number of divided harmonic subsets;
and determining the frequency points in the preset bandwidth according to the fundamental frequency points and the first M multiple frequency points of each harmonic subset.
The frequency point set can be determined by

$$C_l = \Big\{\, k \;\Big|\; \big|f_k - m F_l\big| \le \delta\, m F_l,\;\; m = 1, \ldots, M \,\Big\}$$

where $f_k$ is the frequency represented by the k-th frequency point, in Hz. The bandwidth around the m-th multiple frequency point $m F_l$ is $2\delta\, m F_l$. Here δ is a parameter controlling the bandwidth, i.e., the above-mentioned preset bandwidth; illustratively, δ = 0.2.
Therefore, the frequency point set of each harmonic subset is determined through the control of the preset bandwidth, the frequency points on the whole frequency band are grouped according to different dependencies based on the harmonic structure, and the accuracy of subsequent processing is improved.
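For illustration, a sketch that builds the frequency point set of each harmonic subset from FFT bin frequencies, using the example values M = 8 and δ = 0.2:

    import numpy as np

    def harmonic_subsets(freqs_hz, L=49, F1=55.0, M=8, delta=0.2):
        """Frequency point set C_l of each harmonic subset: all bins whose
        frequency f_k falls within the band of half-width delta * m * F_l
        around the m-th multiple m * F_l, for m = 1 (the fundamental) up to M.
        freqs_hz: bin frequencies in Hz, e.g. np.fft.rfftfreq(nfft, 1.0 / fs)."""
        subsets = []
        for l in range(L):
            F_l = F1 * 2.0 ** (l / 12.0)
            member = np.zeros(len(freqs_hz), dtype=bool)
            for m in range(1, M + 1):
                member |= np.abs(freqs_hz - m * F_l) <= delta * m * F_l
            subsets.append(np.flatnonzero(member))
        return subsets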
Embodiments of the present disclosure also provide the following examples:
FIG. 4 is a flowchart illustrating an audio signal processing method according to an exemplary embodiment. In this method, as shown in fig. 3, the sound sources include a sound source 1 and a sound source 2, and the microphones include a microphone 1 and a microphone 2. Based on the audio signal processing method, the audio signals of the sound source 1 and the sound source 2 are recovered from the original noisy signals of the microphone 1 and the microphone 2. As shown in fig. 4, the method comprises the following steps:

Step S401: initialize W(k) and $V_p(k)$.

If the system frame length is Nfft, the number of frequency points is K = Nfft/2 + 1. The initialization comprises the following steps:

1) Initialize the separation matrix of each frequency point:

$$W(k) = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$

i.e., the identity matrix, where k is a frequency point and k = 1, ..., K.

2) Initialize the weighted covariance matrix of each sound source at each frequency point:

$$V_p(k) = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix}$$

i.e., the zero matrix, where p is used to represent a microphone and p = 1, 2.
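For illustration, a NumPy sketch of this initialization; the frame length Nfft = 1024 is an assumed example value:

    import numpy as np

    # Step S401 sketch: identity separation matrices and zero weighted
    # covariances for every frequency point k = 1..K.
    nfft = 1024
    K = nfft // 2 + 1                                  # number of frequency points
    W = np.tile(np.eye(2, dtype=complex), (K, 1, 1))   # W(k), shape (K, 2, 2)
    V = np.zeros((2, K, 2, 2), dtype=complex)          # V_p(k) for p = 1, 2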
Step S402: obtain the original noisy signal of the p-th microphone in the n-th frame.

The time domain signal $\tilde{x}_p^{n}(m)$ of the n-th frame of the p-th microphone is windowed, and an Nfft-point transform is applied to obtain the corresponding frequency domain signal:

$$X_p(k,n) = \mathrm{STFT}\big(\tilde{x}_p^{n}(m)\big)$$

where m indexes the points selected for the Fourier transform and STFT denotes the short-time Fourier transform. Here, the time domain signal is the original noisy signal.

The observed signal formed from the $X_p(k,n)$ is then: $X(k,n) = [X_1(k,n), X_2(k,n)]^T$, where $[\,\cdot\,]^T$ denotes the transpose.
Step S403: obtain the a priori frequency domain estimates of the two sound source signals using W(k) of the previous frame.

Let the a priori frequency domain estimates of the two source signals be $Y(k,n) = [Y_1(k,n), Y_2(k,n)]^T$, where $Y_1(k,n)$ and $Y_2(k,n)$ are the estimated values of the sound source 1 and the sound source 2 at the time-frequency point (k,n), respectively.

The observation matrix X(k,n) is separated by the separation matrix to obtain $Y(k,n) = W'(k)\, X(k,n)$, where $W'(k)$ is the separation matrix of the previous frame (i.e., the frame before the current frame).

The a priori frequency domain estimate of the p-th sound source in the n-th frame is then:

$$\mathbf{Y}_p(n) = \big[Y_p(1,n), \ldots, Y_p(K,n)\big]^T$$
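For illustration, a sketch of steps S402-S403 for one frame, given the two microphones' STFT columns:

    import numpy as np

    def prior_estimate(W_prev, X1, X2):
        """Stack the two microphones' STFT columns into the observation X(k, n)
        and apply the previous frame's separation matrices per frequency point,
        Y(k, n) = W'(k) X(k, n).
        W_prev: (K, 2, 2); X1, X2: complex arrays (K,). Returns X (K, 2), Y (K, 2)."""
        X = np.stack([X1, X2], axis=-1)          # X(k, n), shape (K, 2)
        Y = np.einsum("kij,kj->ki", W_prev, X)   # a priori estimates Y_1, Y_2
        return X, Y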
step S404: updating a weighted covariance matrix Vp(k,n);
Calculating an updated weighted covariance matrix:
Figure BDA0002403194680000121
in one embodiment, β is 0.98, wherein the V isp(k, n-1) is the weighted covariance matrix of the previous frame; the above-mentioned
Figure BDA0002403194680000122
The conjugate transpose of (1); the above-mentioned
Figure BDA0002403194680000123
Are weighting coefficients, whereinThe above-mentioned
Figure BDA0002403194680000124
Is an auxiliary variable; the above-mentioned
Figure BDA0002403194680000125
As a comparison function.
Wherein, the
Figure BDA0002403194680000126
A multi-dimensional super-gaussian prior probability density function based on the whole frequency band is represented for the p-th sound source. In one embodiment of the present invention, the substrate is,
Figure BDA0002403194680000127
at this time, if said
Figure BDA0002403194680000128
Then the
Figure BDA0002403194680000129
But this probability density distribution assumes that the same dependency exists between all bins. Actually, the dependence is weak when the distance between the frequency points is far, and the dependence is strong when the distance between the frequency points is near. Therefore, the embodiment of the present disclosure provides a harmonic structure based on voice
Figure BDA00024031946800001210
The construction method imposes strong dependence on frequency points in the harmonic structure.
Specifically, the entire frequency band is divided into L (illustratively, L = 49) harmonic subsets according to the pitch frequency range. The fundamental frequency of the l-th harmonic subset is $F_l = F_1 \cdot 2^{(l-1)/12}$ with $F_1 = 55$ Hz, so $F_l$ ranges from 55 Hz to 880 Hz, covering the entire range of human speech pitch frequencies.

Let $C_l$ denote the set of frequency points contained in the l-th harmonic subset. It consists of the fundamental frequency $F_l$, the first M (illustratively, M = 8) multiple frequency points, and the frequency points within the bandwidth near each multiple frequency point:

$$C_l = \Big\{\, k \;\Big|\; \big|f_k - m F_l\big| \le \delta\, m F_l,\;\; m = 1, \ldots, M \,\Big\}$$

where $f_k$ is the frequency represented by the k-th frequency point, in Hz; the bandwidth around the m-th multiple frequency point $m F_l$ is $2\delta\, m F_l$; δ is a parameter controlling the bandwidth, i.e., the preset bandwidth, and illustratively δ = 0.2.
The distribution model based on the speech harmonic structure then has the following two definitions:

$$p\big(\mathbf{Y}_p(n)\big) = \exp\left(-\alpha \sum_{l=1}^{L} \sqrt{\sum_{k \in C_l} \frac{\left|Y_p(k,n)\right|^2}{\sigma_{kn}^2}}\right)$$

$$p\big(\mathbf{Y}_p(n)\big) = \exp\left(-\alpha \sum_{l=1}^{L} \left(\sum_{k \in C_l} \frac{\left|Y_p(k,n)\right|^2}{\sigma_{kn}^2}\right)^{2/3}\right)$$

where α represents a coefficient and $\sigma_{kn}^2$ represents the variance; illustratively, α = 1. These are the distribution functions in the embodiment of the present disclosure. The weighting coefficients obtained from the two distribution models are then, for each frequency point $k \in C_l$:

$$\varphi_p(k,n) = \frac{\alpha}{\sigma_{kn}^2 \sqrt{\sum_{k' \in C_l} \left|Y_p(k',n)\right|^2 / \sigma_{k'n}^2}}$$

$$\varphi_p(k,n) = \frac{4\alpha}{3\,\sigma_{kn}^2} \left(\sum_{k' \in C_l} \frac{\left|Y_p(k',n)\right|^2}{\sigma_{k'n}^2}\right)^{-1/3}$$
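For illustration, a sketch of the covariance update of step S404 for one source, using weighting coefficients computed as in the sketches above:

    import numpy as np

    def update_covariance(V_p, X, phi_p, beta=0.98):
        """Recursive update V_p(k, n) = beta * V_p(k, n-1)
        + (1 - beta) * phi_p(k, n) * X(k, n) X(k, n)^H per frequency point.
        V_p: (K, 2, 2); X: (K, 2); phi_p: (K,) weights from the harmonic model."""
        outer = X[:, :, None] * X[:, None, :].conj()   # X X^H per frequency point
        return beta * V_p + (1.0 - beta) * phi_p[:, None, None] * outer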
step S405: solving the feature problem to obtain a feature vector ep(k,n);
Here, said epAnd (k, n) is a feature vector corresponding to the p-th microphone.
Wherein, solving the characteristic problem: v2(k,n)ep(k,n)=λp(k,n)V1(k,n)ep(k, n) to obtain,
Figure BDA0002403194680000137
Figure BDA0002403194680000138
Figure BDA0002403194680000139
Figure BDA00024031946800001310
wherein the content of the first and second substances,
Figure BDA00024031946800001311
step S406: obtaining an updated separation matrix W (k) of each frequency point;
based on the characteristic vector of the characteristic problem, the updated separation matrix of the current frame is obtained
Figure BDA0002403194680000141
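For illustration, a sketch of steps S405-S406 for one frequency point; the eigenvector normalization and the eigenvalue-to-source pairing are standard choices assumed here, not expressions quoted from the disclosure:

    import numpy as np

    def update_separation(V1_k, V2_k):
        """Solve the 2x2 generalized eigenproblem V2 e = lambda V1 e via
        A = V1^{-1} V2, then build W(k) from the scaled eigenvectors."""
        A = np.linalg.inv(V1_k) @ V2_k
        lam = np.roots([1.0, -np.trace(A), np.linalg.det(A)])  # eigenvalues of A
        rows = []
        for p, l in enumerate(lam):
            e = np.array([A[0, 1], l - A[0, 0]])               # (A - l I) e = 0
            V_p = (V1_k, V2_k)[p]
            e = e / np.sqrt(np.real(e.conj() @ V_p @ e))       # scale against V_p
            rows.append(e.conj())                              # w_p^H as a row
        return np.stack(rows)                                  # W(k), shape (2, 2)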
Step S407: obtain the a posteriori frequency domain estimates of the two sound source signals using W(k) of the current frame.

The original noisy signals are separated with the current frame's W(k) to obtain the a posteriori frequency domain estimates of the two sound source signals: $Y(k,n) = [Y_1(k,n), Y_2(k,n)]^T = W(k)\, X(k,n)$.
Step S408: perform time-frequency conversion according to the a posteriori frequency domain estimates to obtain the separated time domain signals.

An ISTFT and overlap-add are applied to each $\mathbf{Y}_p(n) = \big[Y_p(1,n), \ldots, Y_p(K,n)\big]^T$ to obtain the separated time domain sound source signals:

$$\tilde{s}_p^{n}(m) = \mathrm{ISTFT}\big(\mathbf{Y}_p(n)\big)$$

where m = 1, ..., Nfft and p = 1, 2.
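For illustration, a sketch of the synthesis in steps S407-S408 for one source; the window and overlap mirror the analysis sketch and are assumptions, not requirements of the disclosure:

    import numpy as np

    def istft_overlap_add(Y_frames, nfft=1024, hop=512):
        """Invert each a-posteriori frame of one source and overlap-add to
        recover the separated time-domain signal.
        Y_frames: complex array (K, n_frames) of Y_p(k, n)."""
        window = np.hanning(nfft)
        n_frames = Y_frames.shape[1]
        out = np.zeros(hop * (n_frames - 1) + nfft)
        for n in range(n_frames):
            out[n * hop : n * hop + nfft] += np.fft.irfft(Y_frames[:, n], nfft) * window
        return out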
By the method provided by the embodiment of the disclosure, the separation performance can be improved, the voice damage degree after separation is reduced, and the recognition performance is improved. Meanwhile, the equivalent interference suppression performance can be achieved by using fewer microphones, and the cost of an intelligent product is reduced.
Fig. 5 is a block diagram illustrating an apparatus for processing an audio signal according to an exemplary embodiment. Referring to fig. 5, the apparatus 500 includes a first obtaining module 501, a second obtaining module 502, a dividing module 503, a first determining module 504, a second determining module 505, and a third obtaining module 506.
A first obtaining module 501, configured to obtain, by at least two microphones, audio signals emitted by at least two sound sources, respectively, so as to obtain original noisy signals of the at least two microphones, respectively;
a second obtaining module 502, configured to, for each frame in a time domain, obtain frequency domain estimation signals of the at least two sound sources according to the original noisy signals of the at least two microphones, respectively;
a dividing module 503, configured to divide a predetermined frequency band range into a plurality of harmonic subsets, where each harmonic subset includes a plurality of frequency point data;
a first determining module 504, configured to determine, according to the frequency domain estimation signal of each frequency point in each harmonic subset, a weighting coefficient of each frequency point included in the harmonic subset;
a second determining module 505, configured to determine a separation matrix of each frequency point according to the weighting coefficient;
a third obtaining module 506, configured to obtain, based on the separation matrix and the original noisy signal, audio signals sent by at least two sound sources respectively.
In some embodiments, the first determining module comprises:
the first determining submodule is used for determining a distribution function of the frequency domain estimation signals according to the frequency domain estimation signals of all frequency points in all the harmonic subsets;
and the second determining submodule is used for determining the weighting coefficient of each frequency point according to the distribution function.
In some embodiments, the first determining submodule is specifically configured to:
determining the square of the ratio of the frequency domain estimation signal to the standard deviation for each frequency point in the frequency point set of each harmonic subset;
summing the squared ratios over each frequency point set to determine a first sum;
summing the square roots of the first sums corresponding to the frequency point sets to obtain a second sum;
and determining the distribution function according to an exponential function with the second sum as a variable.
In some embodiments, the first determining submodule is specifically configured to:
determining the square of the ratio between the frequency domain estimation signal and the standard deviation for each frequency point in the frequency point set of each harmonic subset;
summing the squared ratios over each frequency point set to determine a third sum;
determining a fourth sum by raising the third sum corresponding to each frequency point set to a predetermined power and summing the results;
and determining the distribution function according to an exponential function with the fourth sum as a variable.
In some embodiments, the apparatus further comprises:
a third determining module, configured to determine, for each harmonic subset, a fundamental frequency point, the first M multiple frequency points, and the frequency points within a preset bandwidth around each multiple frequency point;
and a fourth determining module, configured to determine the frequency point set of each harmonic subset as the set consisting of the fundamental frequency point, the first M multiple frequency points, and the frequency points within the preset bandwidth around each multiple frequency point.
In some embodiments, the third determining module comprises:
the third determining submodule is used for determining the fundamental frequency point of each harmonic subset and the first M multiple frequency points corresponding to each fundamental frequency point according to the preset frequency band range and the preset number of the divided harmonic subsets;
and the fourth determining submodule is used for determining the frequency points in the preset bandwidth according to the fundamental frequency points and the first M multiple frequency points of each harmonic subset.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 6 is a block diagram illustrating a physical structure of an audio signal processing apparatus 600 according to an exemplary embodiment. For example, the apparatus 600 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and so forth.
Referring to fig. 6, apparatus 600 may include one or more of the following components: a processing component 601, a memory 602, a power component 603, a multimedia component 604, an audio component 605, an input/output (I/O) interface 606, a sensor component 607, and a communication component 608.
The processing component 601 generally controls the overall operation of the device 600, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 601 may include one or more processors 610 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 601 may also include one or more modules that facilitate interaction between the processing component 601 and other components. For example, the processing component 601 may include a multimedia module to facilitate interaction between the multimedia component 604 and the processing component 601.
The memory 602 is configured to store various types of data to support operations at the apparatus 600. Examples of such data include instructions for any application or method operating on the apparatus 600, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 602 may be implemented by any type or combination of volatile or non-volatile storage devices, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power supply component 603 provides power to the various components of the device 600. The power supply component 603 may include: a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the device 600.
The multimedia component 604 includes a screen that provides an output interface between the device 600 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 604 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the device 600 is in an operating mode, such as a shooting mode or a video mode. Each front camera and/or rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
Audio component 605 is configured to output and/or input audio signals. For example, audio component 605 includes a Microphone (MIC) configured to receive external audio signals when apparatus 600 is in an operational mode, such as a call mode, a recording mode, or a voice recognition mode. The received audio signals may be further stored in the memory 602 or transmitted via the communication component 608. In some embodiments, audio component 605 also includes a speaker for outputting audio signals.
The I/O interface 606 provides an interface between the processing component 601 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor component 607 includes one or more sensors for providing various aspects of status assessment for the apparatus 600. For example, the sensor component 607 may detect the open/closed state of the apparatus 600 and the relative positioning of components, such as the display and keypad of the apparatus 600. The sensor component 607 may also detect a change in the position of the apparatus 600 or of a component of the apparatus 600, the presence or absence of user contact with the apparatus 600, the orientation or acceleration/deceleration of the apparatus 600, and a change in the temperature of the apparatus 600. The sensor component 607 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 607 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 607 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 608 is configured to facilitate wired or wireless communication between the apparatus 600 and other devices. The apparatus 600 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 608 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 608 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, or other technologies.
In an exemplary embodiment, the apparatus 600 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer readable storage medium comprising instructions, such as the memory 602 comprising instructions, executable by the processor 610 of the apparatus 600 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
A non-transitory computer readable storage medium having instructions therein, which when executed by a processor of a mobile terminal, enable the mobile terminal to perform any of the methods provided in the above embodiments.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (14)

1. An audio signal processing method, comprising:
acquiring, with at least two microphones, audio signals respectively emitted by at least two sound sources, to obtain respective original noisy signals of the at least two microphones;
for each frame in the time domain, acquiring respective frequency domain estimation signals of the at least two sound sources according to the respective original noisy signals of the at least two microphones;
dividing a predetermined frequency band range into a plurality of harmonic subsets, wherein each harmonic subset comprises a plurality of frequency point data;
determining the weighting coefficient of each frequency point contained in each harmonic subset according to the frequency domain estimation signal of each frequency point in each harmonic subset;
determining a separation matrix of each frequency point according to the weighting coefficient;
and obtaining the audio signals respectively emitted by the at least two sound sources based on the separation matrix and the original noisy signals.
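By way of non-limiting illustration only, the following Python sketch shows one possible shape of the claimed pipeline: short-time Fourier analysis of each microphone's original noisy signal, iterative per-frequency updates of the separation matrices driven by a weighting term, and frequency-domain source estimates. The weighting below pools all frequency points of a frame (a simplified stand-in for the harmonic-subset weighting of claims 2 to 6), the update rule is an auxiliary-function style IVA step chosen for the sketch, and every function and parameter name is an assumption of this illustration rather than language from the disclosure.

```python
import numpy as np

def separate_sources(mic_signals, n_fft=1024, hop=512, n_iter=30, eps=1e-8):
    # Hypothetical sketch only: the names and the update rule are
    # illustrative assumptions, not the patented algorithm itself.
    win = np.hanning(n_fft)
    # STFT of each microphone's original noisy signal
    specs = []
    for sig in mic_signals:
        frames = [np.fft.rfft(win * sig[i:i + n_fft])
                  for i in range(0, len(sig) - n_fft + 1, hop)]
        specs.append(np.stack(frames))                  # (frames, bins)
    X = np.stack(specs, axis=-1).transpose(1, 0, 2)     # (bins, frames, mics)
    n_bins, n_frames, n_src = X.shape
    W = np.tile(np.eye(n_src, dtype=complex), (n_bins, 1, 1))
    for _ in range(n_iter):
        Y = np.einsum('kij,ktj->kti', W, X)             # frequency domain estimates
        # per-frame weighting pooled over all bins: a simplified stand-in
        # for the per-harmonic-subset weighting coefficients
        r = np.sqrt((np.abs(Y) ** 2).sum(axis=0)) + eps     # (frames, srcs)
        for s in range(n_src):
            # weighted covariance of the noisy signals for source s
            V = np.einsum('t,kti,ktj->kij', 1.0 / r[:, s], X, np.conj(X)) / n_frames
            for k in range(n_bins):
                w = np.linalg.solve(W[k] @ V[k], np.eye(n_src)[:, s])
                scale = np.sqrt(np.real(np.conj(w) @ V[k] @ w)) + eps
                W[k, s, :] = np.conj(w) / scale         # updated separation row
    return np.einsum('kij,ktj->kti', W, X)              # separated estimates
```

A time-domain reconstruction would follow by inverse STFT and overlap-add of the returned estimates.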
2. The method according to claim 1, wherein said determining weighting coefficients for frequency points included in each of the harmonic subsets according to the frequency domain estimation signals for frequency points in each of the harmonic subsets comprises:
determining a distribution function of the frequency domain estimation signals according to the frequency domain estimation signals of the frequency points in the harmonic subsets;
and determining the weighting coefficient of each frequency point according to the distribution function.
3. The method of claim 2, wherein determining a distribution function of the frequency domain estimation signals according to the frequency domain estimation signals of the frequency points in the harmonic subsets comprises:
determining, for each frequency point in the frequency point set of each harmonic subset, the square of the ratio of the frequency domain estimation signal to its standard deviation;
summing the squares of the ratios over each frequency point set to determine a first sum;
summing the square roots of the first sums corresponding to the respective frequency point sets to obtain a second sum;
and determining the distribution function according to an exponential function with the second sum as a variable.
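Read together, the four steps of claim 3 admit the following formalization (the notation is ours, and the square-root reading of the third step is an interpretation, not claim language): with Y_s(k,t) the frequency domain estimation signal of source s at frequency point k and frame t, sigma_k its standard deviation, and Omega_h the frequency point set of harmonic subset h,

```latex
\begin{align*}
  r_h(t) &= \sum_{k \in \Omega_h} \frac{\lvert Y_s(k,t)\rvert^{2}}{\sigma_k^{2}}
            && \text{(first sum, one per frequency point set)} \\
  R(t)   &= \sum_{h} \sqrt{r_h(t)}
            && \text{(second sum)} \\
  p\bigl(Y_s(\cdot,t)\bigr) &\propto \exp\bigl(-R(t)\bigr)
            && \text{(distribution function)}
\end{align*}
```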
4. The method of claim 2, wherein determining a distribution function of the frequency domain estimation signals according to the frequency domain estimation signals of the frequency points in the harmonic subsets comprises:
determining, for each frequency point in the frequency point set of each harmonic subset, the square of the ratio between the frequency domain estimation signal and its standard deviation;
summing the squares of the ratios over each frequency point set to determine a third sum;
determining a fourth sum by summing a predetermined power of the third sum over the frequency point sets;
and determining the distribution function according to an exponential function with the fourth sum as a variable.
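Claim 4 generalizes the same construction by replacing the square root with a predetermined power, written here as beta (a symbol we introduce for illustration); claim 3 corresponds to beta = 1/2:

```latex
\begin{align*}
  p\bigl(Y_s(\cdot,t)\bigr) \propto
  \exp\Bigl(-\sum_{h}\Bigl(\sum_{k \in \Omega_h}
      \frac{\lvert Y_s(k,t)\rvert^{2}}{\sigma_k^{2}}\Bigr)^{\beta}\Bigr)
\end{align*}
```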
5. The method according to claim 3 or 4, characterized in that the method further comprises:
determining, for each harmonic subset, a fundamental frequency point, the first M multiple frequency points, and the frequency points within a preset bandwidth around each multiple frequency point;
and determining the frequency point set of each harmonic subset as the set consisting of the fundamental frequency point, the first M multiple frequency points, and the frequency points within the preset bandwidth around each multiple frequency point.
6. The method of claim 5, wherein determining, for each harmonic subset, the fundamental frequency point, the first M multiple frequency points, and the frequency points within the preset bandwidth around each multiple frequency point comprises:
determining the fundamental frequency point of each harmonic subset and the first M multiple frequency points corresponding to each fundamental frequency point according to the predetermined frequency band range and the preset number of harmonic subsets;
and determining the frequency points within the preset bandwidth according to the fundamental frequency point and the first M multiple frequency points of each harmonic subset.
7. An audio signal processing apparatus, comprising:
a first acquisition module, configured to acquire, with at least two microphones, audio signals respectively emitted by at least two sound sources, so as to obtain respective original noisy signals of the at least two microphones;
a second acquisition module, configured to acquire, for each frame in the time domain, respective frequency domain estimation signals of the at least two sound sources according to the respective original noisy signals of the at least two microphones;
a dividing module, configured to divide a predetermined frequency band range into a plurality of harmonic subsets, wherein each harmonic subset comprises a plurality of frequency point data;
a first determining module, configured to determine, according to the frequency domain estimation signal of each frequency point in each harmonic subset, a weighting coefficient for each frequency point included in the harmonic subset;
a second determining module, configured to determine the separation matrix of each frequency point according to the weighting coefficients;
and a third acquisition module, configured to obtain the audio signals respectively emitted by the at least two sound sources based on the separation matrix and the original noisy signals.
8. The apparatus of claim 7, wherein the first determining module comprises:
a first determining submodule, configured to determine a distribution function of the frequency domain estimation signals according to the frequency domain estimation signals of the frequency points in the harmonic subsets;
and a second determining submodule, configured to determine the weighting coefficient of each frequency point according to the distribution function.
9. The apparatus of claim 8, wherein the first determining submodule is specifically configured to:
determining, for each frequency point in the frequency point set of each harmonic subset, the square of the ratio of the frequency domain estimation signal to its standard deviation;
summing the squares of the ratios over each frequency point set to determine a first sum;
summing the square roots of the first sums corresponding to the respective frequency point sets to obtain a second sum;
and determining the distribution function according to an exponential function with the second sum as a variable.
10. The apparatus of claim 8, wherein the first determining submodule is specifically configured to:
determining, for each frequency point in the frequency point set of each harmonic subset, the square of the ratio between the frequency domain estimation signal and its standard deviation;
summing the squares of the ratios over each frequency point set to determine a third sum;
determining a fourth sum by summing a predetermined power of the third sum over the frequency point sets;
and determining the distribution function according to an exponential function with the fourth sum as a variable.
11. The apparatus of claim 9 or 10, further comprising:
a third determining module, configured to determine, for each harmonic subset, a fundamental frequency point, the first M multiple frequency points, and the frequency points within a preset bandwidth around each multiple frequency point;
and a fourth determining module, configured to determine the frequency point set of each harmonic subset as the set consisting of the fundamental frequency point, the first M multiple frequency points, and the frequency points within the preset bandwidth around each multiple frequency point.
12. The apparatus of claim 11, wherein the third determining module comprises:
a third determining submodule, configured to determine the fundamental frequency point of each harmonic subset and the first M multiple frequency points corresponding to each fundamental frequency point according to the predetermined frequency band range and the preset number of harmonic subsets;
and a fourth determining submodule, configured to determine the frequency points within the preset bandwidth according to the fundamental frequency point and the first M multiple frequency points of each harmonic subset.
13. An audio signal processing device, characterized in that it comprises at least: a processor and a memory for storing executable instructions operable on the processor, wherein:
the processor is configured to execute the executable instructions, and the executable instructions, when executed, perform the steps of the audio signal processing method provided in any one of claims 1 to 6.
14. A non-transitory computer-readable storage medium having stored therein computer-executable instructions that, when executed by a processor, implement the steps in the audio signal processing method provided in any one of claims 1 to 6.
CN202010153357.9A 2020-03-06 2020-03-06 Audio signal processing method and device and storage medium Active CN111179960B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010153357.9A CN111179960B (en) 2020-03-06 2020-03-06 Audio signal processing method and device and storage medium

Publications (2)

Publication Number Publication Date
CN111179960A true CN111179960A (en) 2020-05-19
CN111179960B CN111179960B (en) 2022-10-18

Family

ID=70656922

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010153357.9A Active CN111179960B (en) 2020-03-06 2020-03-06 Audio signal processing method and device and storage medium

Country Status (1)

Country Link
CN (1) CN111179960B (en)

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070025556A1 (en) * 2005-07-26 2007-02-01 Kabushiki Kaisha Kobe Seiko Sho Sound source separation apparatus and sound source separation method
CN102238456A (en) * 2010-03-31 2011-11-09 索尼公司 Signal processing device, signal processing method and program
CN103098132A (en) * 2010-08-25 2013-05-08 旭化成株式会社 Sound source separator device, sound source separator method, and program
WO2014079484A1 (en) * 2012-11-21 2014-05-30 Huawei Technologies Co., Ltd. Method for determining a dictionary of base components from an audio signal
WO2019016494A1 (en) * 2017-07-19 2019-01-24 Cedar Audio Ltd Acoustic source separation systems
CN109994120A (en) * 2017-12-29 2019-07-09 福州瑞芯微电子股份有限公司 Sound enhancement method, system, speaker and storage medium based on diamylose
CN108364659A (en) * 2018-02-05 2018-08-03 西安电子科技大学 Frequency domain convolution Blind Signal Separation method based on multiple-objection optimization
CN108806712A (en) * 2018-04-27 2018-11-13 深圳市沃特沃德股份有限公司 Reduce the method and apparatus of frequency domain treating capacity
CN108447500A (en) * 2018-04-27 2018-08-24 深圳市沃特沃德股份有限公司 The method and apparatus of speech enhan-cement
CN109003621A (en) * 2018-09-06 2018-12-14 广州酷狗计算机科技有限公司 A kind of audio-frequency processing method, device and storage medium
CN109410978A (en) * 2018-11-06 2019-03-01 北京智能管家科技有限公司 A kind of speech signal separation method, apparatus, electronic equipment and storage medium
CN110010148A (en) * 2019-03-19 2019-07-12 中国科学院声学研究所 A kind of blind separation method in frequency domain and system of low complex degree
US20200051580A1 (en) * 2019-07-30 2020-02-13 Lg Electronics Inc. Method and apparatus for sound processing

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111724801A (en) * 2020-06-22 2020-09-29 北京小米松果电子有限公司 Audio signal processing method and device and storage medium
EP3929920A1 (en) * 2020-06-22 2021-12-29 Beijing Xiaomi Pinecone Electronics Co., Ltd. Method and device for processing audio signal, and storage medium
US11430460B2 (en) 2020-06-22 2022-08-30 Beijing Xiaomi Pinecone Electronics Co., Ltd. Method and device for processing audio signal, and storage medium
CN113345435A (en) * 2020-07-03 2021-09-03 北京声智科技有限公司 Audio noise reduction method, device, equipment and medium
CN112863537A (en) * 2021-01-04 2021-05-28 北京小米松果电子有限公司 Audio signal processing method and device and storage medium
CN112863537B (en) * 2021-01-04 2024-06-04 北京小米松果电子有限公司 Audio signal processing method, device and storage medium
CN113053406A (en) * 2021-05-08 2021-06-29 北京小米移动软件有限公司 Sound signal identification method and device
CN113362848A (en) * 2021-06-08 2021-09-07 北京小米移动软件有限公司 Audio signal processing method, device and storage medium

Also Published As

Publication number Publication date
CN111179960B (en) 2022-10-18

Similar Documents

Publication Publication Date Title
CN111128221B (en) Audio signal processing method and device, terminal and storage medium
CN111429933B (en) Audio signal processing method and device and storage medium
CN111179960B (en) Audio signal processing method and device and storage medium
CN108510987B (en) Voice processing method and device
CN111009256B (en) Audio signal processing method and device, terminal and storage medium
CN111402917B (en) Audio signal processing method and device and storage medium
CN111009257B (en) Audio signal processing method, device, terminal and storage medium
CN110133594B (en) Sound source positioning method and device for sound source positioning
CN111883164A (en) Model training method and device, electronic equipment and storage medium
CN113053406A (en) Sound signal identification method and device
CN112447184A (en) Voice signal processing method and device, electronic equipment and storage medium
CN111724801A (en) Audio signal processing method and device and storage medium
CN113223553B (en) Method, apparatus and medium for separating voice signal
CN111667842B (en) Audio signal processing method and device
CN111429934B (en) Audio signal processing method and device and storage medium
CN113362848B (en) Audio signal processing method, device and storage medium
CN112863537B (en) Audio signal processing method, device and storage medium
CN113362847A (en) Audio signal processing method and device and storage medium
EP4113515A1 (en) Sound processing method, electronic device and storage medium
CN113421579B (en) Sound processing method, device, electronic equipment and storage medium
CN114724578A (en) Audio signal processing method and device and storage medium
CN113345456A (en) Echo separation method, device and storage medium
CN112863537A (en) Audio signal processing method and device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100085 unit C, building C, lin66, Zhufang Road, Qinghe, Haidian District, Beijing

Applicant after: Beijing Xiaomi pinecone Electronic Co.,Ltd.

Address before: 100085 unit C, building C, lin66, Zhufang Road, Qinghe, Haidian District, Beijing

Applicant before: BEIJING PINECONE ELECTRONICS Co.,Ltd.

GR01 Patent grant