CN112242145A - Voice filtering method, device, medium and electronic equipment

Publication number: CN112242145A
Authority: CN (China)
Prior art keywords: frequency domain, voice, frame, speech, block
Legal status: Pending (the listed status is an assumption and is not a legal conclusion)
Application number: CN201910645775.7A
Other languages: Chinese (zh)
Inventors: 范文之, 卢晶
Current Assignee: Nanjing Artificial Intelligence Advanced Research Institute Co ltd (the listed assignees may be inaccurate)
Original Assignee: Nanjing Artificial Intelligence Advanced Research Institute Co ltd
Application filed by: Nanjing Artificial Intelligence Advanced Research Institute Co ltd
Priority to: CN201910645775.7A (CN112242145A); PCT/CN2019/100985 (WO2021007902A1)

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L21/0216: Noise filtering characterised by the method used for estimating noise
    • G10L21/0232: Processing in the frequency domain

Abstract

A voice filtering method, apparatus, medium, and electronic device are disclosed. The method comprises: acquiring the frequency domain filtering parameters corresponding to each voice frequency domain block in the previous voice frame adjacent to the current voice frame; determining the frequency domain filter estimation parameters corresponding to each voice frequency domain block in the current voice frame according to those frequency domain filtering parameters and two first time domain constraint matrices; and filtering the current voice frame according to the frequency domain filter estimation parameters corresponding to the voice frequency domain blocks in the current voice frame to obtain a frequency domain error signal of the current voice frame. The technical solution provided by the disclosure helps improve voice filtering performance.

Description

Voice filtering method, device, medium and electronic equipment
Technical Field
The present disclosure relates to voice filtering technologies, and in particular, to a voice filtering method, a voice filtering apparatus, a storage medium, and an electronic device.
Background
Filtering techniques, particularly adaptive filtering techniques, have been widely used in a variety of applications such as echo cancellation, active noise control, and channel equalization.
The frequency-domain Kalman filter (FKF) is an adaptive filtering technique. It offers fast convergence and low steady-state error, but its high computational complexity leads to a long time delay. The partitioned-block frequency-domain Kalman filter (PFKF) uses a block structure to reduce computational complexity and delay while retaining fast convergence and low steady-state error. However, when the filter order is insufficient, the PFKF cannot guarantee convergence to the optimal solution, which degrades filtering performance.
Disclosure of Invention
The present disclosure is proposed to solve the above technical problems. Embodiments of the present disclosure provide a voice filtering method, a voice filtering apparatus, a storage medium, and an electronic device.
According to an aspect of the embodiments of the present disclosure, there is provided a speech filtering method, the method including: acquiring the frequency domain filtering parameters corresponding to each voice frequency domain block in the previous voice frame adjacent to the current voice frame; determining the frequency domain filter estimation parameters corresponding to each voice frequency domain block in the current voice frame according to those frequency domain filtering parameters and two first time domain constraint matrices; and filtering the current voice frame according to the frequency domain filter estimation parameters corresponding to the voice frequency domain blocks in the current voice frame to obtain a frequency domain error signal of the current voice frame.
According to another aspect of the embodiments of the present disclosure, there is provided a speech filtering apparatus, the apparatus including: an obtaining module, configured to obtain frequency domain filtering parameters corresponding to each voice frequency domain block in a previous voice frame adjacent to a current voice frame; a parameter determining module, configured to determine, according to the frequency domain filtering parameters and the two first time domain constraint matrices that the obtaining module obtains from the previous speech frame, the frequency domain filter estimation parameters that each speech frequency domain block in the current speech frame corresponds to; and the error determining module is used for carrying out filtering processing on the current voice frame according to the frequency domain filter estimation parameters corresponding to the voice frequency domain blocks in the current voice frame determined by the parameter determining module to obtain a frequency domain error signal of the current voice frame.
According to still another aspect of the embodiments of the present disclosure, there is provided a computer-readable storage medium storing a computer program for executing the above-mentioned voice filtering method.
According to still another aspect of an embodiment of the present disclosure, there is provided an electronic apparatus including: a processor; a memory for storing the processor-executable instructions; the processor is used for reading the executable instruction from the memory and executing the instruction to realize the voice filtering method.
With the voice filtering method and apparatus provided by the embodiments of the present disclosure, two time domain constraint matrices are used when obtaining the frequency domain filter estimation parameters corresponding to each voice frequency domain block in the current voice frame. This allows the filtering algorithm to converge to the optimal solution even when the filter order is insufficient, with essentially no effect on the computational complexity of the filtering. The technical solution provided by the disclosure therefore helps improve voice filtering performance.
The technical solution of the present disclosure is further described in detail by the accompanying drawings and examples.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description, serve to explain the principles of the disclosure.
The present disclosure may be more clearly understood from the following detailed description, taken with reference to the accompanying drawings, in which:
FIG. 1 is a schematic view of a scenario in which the present disclosure is applicable;
FIG. 2 is a schematic diagram of another scenario in which the present disclosure is applicable;
FIG. 3 is a flow chart of one embodiment of a speech filtering method of the present disclosure;
FIG. 4 is a schematic diagram illustrating an implementation of an embodiment of a speech filtering method according to the present disclosure;
FIG. 5 is a schematic structural diagram of an embodiment of a speech filtering apparatus according to the present disclosure;
FIG. 6 is a block diagram of an electronic device according to an exemplary embodiment of the present disclosure.
Detailed Description
Example embodiments according to the present disclosure will be described in detail below with reference to the accompanying drawings. It is to be understood that the described embodiments are merely a subset of the embodiments of the present disclosure and not all embodiments of the present disclosure, with the understanding that the present disclosure is not limited to the example embodiments described herein.
It should be noted that: the relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless specifically stated otherwise.
It will be understood by those of skill in the art that the terms "first," "second," and the like in the embodiments of the present disclosure are used merely to distinguish one element from another, and imply neither any particular technical meaning nor any necessary logical order between them.
It is also understood that in embodiments of the present disclosure, "a plurality" may refer to two or more than two and "at least one" may refer to one, two or more than two.
It is also to be understood that any reference to any component, data, or structure in the embodiments of the disclosure, may be generally understood as one or more, unless explicitly defined otherwise or stated otherwise.
In addition, the term "and/or" in the present disclosure merely describes an association relationship between associated objects and indicates that three relationships are possible. For example, "A and/or B" may mean: A exists alone, both A and B exist, or B exists alone. In addition, the character "/" in the present disclosure generally indicates that the associated objects before and after it are in an "or" relationship.
It should also be understood that the description of the various embodiments of the present disclosure emphasizes the differences between the various embodiments, and the same or similar parts may be referred to each other, so that the descriptions thereof are omitted for brevity.
Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
Embodiments of the present disclosure may be implemented in electronic devices such as terminal devices, computer systems, and servers, which are operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well-known terminal devices, computing systems, environments, and/or configurations that may be suitable for use with such electronic devices include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputer systems, mainframe computer systems, distributed cloud computing environments that include any of the above, and the like.
Electronic devices such as terminal devices, computer systems, servers, etc. may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc. that perform particular tasks or implement particular abstract data types. The computer system/server may be implemented in a distributed cloud computing environment. In a distributed cloud computing environment, tasks may be performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.
Summary of the disclosure
In the course of implementing the present disclosure, the inventor found that, in the existing partitioned-block frequency domain Kalman filtering (PFKF) technique, the frequency domain filter estimation parameter $\hat{W}_b(k+1)$ corresponding to the b-th speech frequency domain block in the (k+1)-th speech frame can be expressed in the form of the following formula (1):
Figure BDA0002133537560000042
In formula (1), A represents the uncertainty of the acoustic echo path and is a constant, for example A = 1; $\hat{W}_b(k)$ represents the frequency domain filter estimation parameter corresponding to the b-th speech frequency domain block in the k-th speech frame (capital W denotes the frequency domain); $G_{L,0}$ represents the first time domain constraint matrix; $\mu_b(k)$ represents the frequency domain filter step size matrix corresponding to the b-th speech frequency domain block in the k-th speech frame; $X_b(k)$ represents the frequency domain reference signal of the b-th speech frequency domain block in the k-th speech frame (capital X denotes the frequency domain); and $E(k)$ represents the frequency domain error signal of the k-th speech frame (capital E denotes the frequency domain).
To analyze the steady-state solution of the partitioned-block frequency domain Kalman filtering technique, the inventor assumes A = 1, multiplies both sides of formula (1) above by the inverse discrete Fourier transform matrix $F^{-1}$, whose size is the frame length by the frame length, and rearranges to obtain the following formula (2):
Figure BDA0002133537560000044
In formula (2) above, $M_b(k) = F\mu_b(k)F^{-1}$; $X_{C,b}(k) = F X_b(k) F^{-1}$; L represents the length of each speech frequency domain block in the speech frame, i.e. the block length; $I_L$ represents an identity matrix of size L × L; e(k) represents the time domain error signal of the k-th speech frame; and the subscript C denotes a circulant matrix, i.e. $X_{C,b}(k)$ is a circulant matrix.
$M_b(k)$ and $X_{C,b}(k)$ in formula (2) above are both circulant matrices and can each be expressed in the form of the following formula (3):
Figure BDA0002133537560000051
Figure BDA0002133537560000052
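The circulant structure claimed for $M_b(k)$ and $X_{C,b}(k)$ can be checked numerically. The sketch below is only an illustration under stated assumptions: it takes F to be the DFT matrix, assumes the frequency domain matrix is diagonal (one value per bin), and uses hypothetical helper names and a made-up 4-bin example that do not come from the patent.

```python
import cmath

def dft_matrix(n):
    """n x n DFT matrix F with entries exp(-2*pi*i*r*c/n) (assumed convention)."""
    return [[cmath.exp(-2j * cmath.pi * r * c / n) for c in range(n)]
            for r in range(n)]

def inverse_dft_matrix(n):
    """n x n inverse DFT matrix, i.e. conj(F) / n."""
    return [[cmath.exp(2j * cmath.pi * r * c / n) / n for c in range(n)]
            for r in range(n)]

def matmul(a, b):
    """Plain matrix product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[r][k] * b[k][c] for k in range(n)) for c in range(n)]
            for r in range(n)]

def diag(values):
    """Square matrix with the given values on the main diagonal."""
    n = len(values)
    return [[values[r] if r == c else 0.0 for c in range(n)] for r in range(n)]

def is_circulant(m, tol=1e-9):
    """True if each entry equals its down-right neighbour, indices taken modulo n."""
    n = len(m)
    return all(abs(m[r][c] - m[(r + 1) % n][(c + 1) % n]) < tol
               for r in range(n) for c in range(n))

# Hypothetical diagonal frequency domain matrix for one block (4 bins).
X_b = diag([1.0 + 0.5j, -2.0, 0.25j, 3.0 - 1.0j])
F, F_inv = dft_matrix(4), inverse_dft_matrix(4)

# Same ordering as written in the text: X_C,b = F X_b F^-1.
X_C_b = matmul(matmul(F, X_b), F_inv)
print(is_circulant(X_C_b))
```

Swapping the order to $F^{-1} X_b F$ gives the transpose of the same matrix, which is also circulant, so the check does not depend on which of the two conventions the original formulas use.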
e(k) in formula (2) above can be expressed in the form of the following formula (4):
Figure BDA0002133537560000053
In formula (4) above, e(k) represents the time domain error signal of the k-th speech frame (lower case e denotes the time domain); y(k) represents the time domain desired signal of the k-th speech frame (lower case y denotes the time domain); $\hat{w}_b(k)$ represents the time domain filter estimation parameter (lower case w denotes the time domain) corresponding to the b-th speech frequency domain block in the k-th speech frame, which can be regarded as the equivalent time domain form of the frequency domain filter estimation parameter; and B denotes the total number of speech frequency domain blocks contained in a speech frame.
Taking the mathematical expectation of both sides of formula (2) above yields the following formula (5):
Figure BDA0002133537560000055
In formula (5) above, E{ } represents the mathematical expectation; b denotes the b-th speech frequency domain block in the speech frame; $\hat{w}_b(k+1)$ represents the time domain filter estimation parameter corresponding to the b-th speech frequency domain block in the (k+1)-th speech frame; $\hat{w}_b(k)$ represents the time domain filter estimation parameter (lower case w denotes the time domain) corresponding to the b-th speech frequency domain block in the k-th speech frame; $\Lambda_{b,1}(k)$, $\Lambda_{b,2}(k)$, $R_{b,m}$,
Figure BDA0002133537560000062
and $r_b$ can be expressed in the form of the following formula (6); and B denotes the number of speech frequency domain blocks contained in one speech frame.
Figure BDA0002133537560000063
Figure BDA0002133537560000064
$r_b = E\{X_{C,b,2}(k)\,y(k)\}$
Figure BDA0002133537560000065
Figure BDA0002133537560000066
In formula (6) above, E{ } represents the mathematical expectation; the subscript C denotes a circulant matrix; $X_{C,b,2}(k)$ and $X_{C,b,1}(k)$ are both elements of $X_{C,b}(k)$, which is given by formula (3) above; $X_{C,m,2}^{T}(k)$ represents the transpose of $X_{C,m,2}(k)$; y(k) represents the time domain desired signal of the k-th speech frame; $M_{b,1}(k)$ and $M_{b,2}(k)$ are both elements of $M_b(k)$, which is given by formula (3) above; $\Lambda_{b,1}(k)$ represents the matrix obtained by taking the mathematical expectation of $M_{b,1}(k)$; and $\Lambda_{b,2}(k)$ represents the matrix obtained by taking the mathematical expectation of $M_{b,2}(k)$.
As can be seen from formulas (5) and (6) above, the steady-state solution of the frequency domain filter coefficients in the conventional partitioned-block frequency domain Kalman filtering technique can be expressed in the form of the following formula (7):
Figure BDA0002133537560000068
In formula (7) above, b denotes the b-th speech frequency domain block in the speech frame;
Figure BDA0002133537560000071
represents the time domain filter estimation parameter (lower case w denotes the time domain) corresponding to the b-th speech frequency domain block in the speech frame;
Figure BDA0002133537560000072
represents the steady-state solution of the time domain filter estimation parameter corresponding to the b-th speech frequency domain block in the speech frame; $\Lambda_{b,1}(\infty)$, $\Lambda_{b,2}(\infty)$, $R_{b,m}$,
Figure BDA0002133537560000073
$R_{b,b}$,
Figure BDA0002133537560000074
and $r_b$ can be expressed in the form of formula (6) above; and B denotes the number of speech frequency domain blocks contained in one speech frame.
Assuming that the order of the actually used adaptive filter is N', and the order of the unknown target system filter to be matched is N; if N' < N, the signal y (k) output by the unknown target system filter to be matched can be expressed in the form of the following equation (8):
Figure BDA0002133537560000075
In formula (8) above, s(k) represents the background noise of the k-th speech frame; B denotes the number of speech frequency domain blocks contained in one speech frame; $X_{C,b,1}^{T}(k)$ represents the transpose of $X_{C,b,1}(k)$, which is an element of $X_{C,b}(k)$ as shown in formula (3) above; and $w_{o,m}$ represents the m-th block of the optimal solution of the time domain filter coefficients (lower case w denotes the time domain).
Substituting formula (8) above into formula (7) above and rearranging yields the following formula (9):
Figure BDA0002133537560000077
As can be seen from formula (9) above, when the acoustic echo path uncertainty A is 1 and the order of the adaptive filter is sufficient, formula (9) shows that the steady-state solution of the partitioned-block frequency domain Kalman filter coefficients equals the optimal solution. However, when the adaptive filter order is insufficient, formula (8) above does not hold and formula (9) cannot be obtained, so formula (7) above does not bring the frequency domain Kalman filter coefficients to the optimal solution; that is, the steady-state solution of the frequency domain Kalman filter coefficients does not coincide with the optimal solution.
Exemplary overview
In applications such as on-site conferences, teleconferences, and voice interaction, the voice filtering technique of the present disclosure can be used for echo cancellation, active noise control, channel equalization, and the like.
One example is shown in figure 1. The microphone 101 is provided on the platform 100. The microphone 101 is connected to the data processing device 102, and the data processing device 102 may be connected to at least one loudspeaker. The loudspeaker 103 is only schematically shown in fig. 1, and the connection between the data processing device 102 and the loudspeaker 103 may be a wireless connection.
Assume that the main speaker stands in front of the microphone 101 to speak. The main speaker's voice, mixed with echo, is picked up by the microphone 101, yielding an audio signal in the time domain. The data processing device 102 converts the collected time domain audio signal into speech frames in the frequency domain and processes each speech frame with the speech filtering technique provided by the present disclosure (which may be referred to as dereverberation processing) to remove the echo in each speech frame. Thereafter, the data processing device 102 may perform speech separation processing or the like on the processed speech frames to obtain the separated sound source signal of the main speaker, form an output signal from that sound source signal, and play it through the loudspeaker 103. In this application scenario, the data processing device 102 can effectively prevent the collected echo from being played back together with the speech, which helps improve the clarity of the main speaker's voice so that participants in the on-site conference can hear the main speaker clearly.
In addition, the data processing device 102 may transmit the obtained sound source signal of the main speaker over a network in real time to a device at a remote conference site (e.g., a data processing device at that site), which plays back the received sound source signal, thereby implementing a teleconference.
Another example is shown in fig. 2. A microphone may be provided in a portable translation device 200 (e.g., a smart mobile phone, etc.). The translation apparatus 200 is used to implement bilingual translation.
During the dialog between the user 201 and the user 202, the user 201 puts the translation device 200 into a working state for bidirectional translation between a first language (e.g., Chinese) and a second language (e.g., English).
The translation device 200 converts the time domain audio signal collected in real time by its built-in microphone into speech frames in the frequency domain. The translation device 200 may process each speech frame with the speech filtering technique provided in the present disclosure (which may be referred to as noise suppression processing) to remove background noise from the speech frame, which helps avoid the influence of background noise on subsequent speech recognition processing and improves the clarity of the current speaker's voice. After that, the translation device 200 may perform speech separation processing, speech recognition processing, and the like on the processed speech frames, and may determine from the speech recognition result the language used by the current speaker and the content of what the current speaker said. Finally, the translation device 200 may convert that content into the other language and output the result; for example, the translation device 200 may display the converted content on its display screen or play it through its loudspeaker.
The operations of collecting the audio signal, background noise removal processing, speech separation processing, speech recognition processing, and language conversion processing are repeated, thereby helping the user 201 and the user 202 to have a continuous conversation.
Exemplary method
FIG. 3 is a flow chart of an embodiment of a speech filtering method according to the present disclosure. As shown in FIG. 3, the method of this embodiment includes steps S300, S301, and S302. Each step is described below.
S300, acquiring frequency domain filtering parameters corresponding to each voice frequency domain block in the previous voice frame adjacent to the current voice frame.
The current speech frame and the previous speech frame in this disclosure are both speech frames in the frequency domain. They may be obtained by applying a transform (such as a Fourier transform) to the time domain signal acquired by the audio acquisition apparatus. The previous speech frame refers to the speech frame that is chronologically adjacent to and immediately before the current speech frame. The current speech frame and the previous speech frame each include a plurality of speech frequency domain blocks, and both contain the same number of speech frequency domain blocks. That number is typically related to the filter block length. A speech frequency domain block in this disclosure may refer to a block in the partitioned-block frequency domain Kalman filtering (PFKF) algorithm.
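As a rough illustration of how frequency domain frames and their blocks might be produced from a time domain signal, the sketch below builds 2L-sample frames (previous block followed by the new block), transforms each one, and keeps the B most recent spectra as the blocks of the current frame. Everything here is an assumption borrowed from partitioned-block filters in general rather than taken from the patent: the overlap-save framing, the rule B = ceil(filter length / L), and the names `num_blocks` and `block_spectra`.

```python
import cmath
import math
from collections import deque

def dft(x):
    """Naive O(n^2) discrete Fourier transform, adequate for short blocks."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]

def num_blocks(filter_length, block_length):
    """Assumed relation: number of blocks B needed to cover the filter."""
    return math.ceil(filter_length / block_length)

def block_spectra(signal, block_length, filter_length):
    """Yield, for each frame k, the list [X_0(k), ..., X_{B-1}(k)].

    X_b(k) is taken to be the spectrum of the frame b steps in the past.
    """
    B = num_blocks(filter_length, block_length)
    zero = [0j] * (2 * block_length)
    history = deque([zero] * B, maxlen=B)
    previous = [0.0] * block_length
    for start in range(0, len(signal) - block_length + 1, block_length):
        current = list(signal[start:start + block_length])
        # A frame is the previous block followed by the new block (2L samples).
        history.appendleft(dft(previous + current))
        previous = current
        yield list(history)

# Hypothetical numbers: 16 samples, block length L = 4, 10-tap filter.
frames = list(block_spectra([float(n) for n in range(16)], 4, 10))
print(len(frames), len(frames[0]), len(frames[0][0]))
```

With these numbers each frame carries three blocks of eight bins, and block 1 of a frame is block 0 of the frame before it, which is why the block count stays the same from one frame to the next.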
The frequency domain filtering parameters in this disclosure may refer to parameters used by the filter. The frequency domain filtering parameters may be referred to as frequency domain filtering algorithm parameters. A speech frequency domain block typically corresponds to a set of frequency domain filter parameters. Different speech frequency domain blocks typically correspond to different sets of frequency domain filtering parameters. A set of frequency domain filtering parameters typically includes a plurality of parameters.
S301, determining the frequency domain filter estimation parameters corresponding to each voice frequency domain block in the current voice frame according to the frequency domain filtering parameters corresponding to each voice frequency domain block in the previous voice frame and two first time domain constraint matrices.
The first time domain constraint matrix in the present disclosure may refer to a matrix that converts frequency domain filter parameters to the time domain, constrains them in the time domain, and then converts the constrained time domain filter parameters back to the frequency domain. Constraining the filter parameters in the time domain may refer to limiting the values of some elements of the time domain filter parameters, for example, setting to zero the elements that lie beyond a preset filter length. The frequency domain filter estimation parameters corresponding to each speech frequency domain block in the current speech frame may refer to the frequency domain filter coefficients corresponding to that block. In the present disclosure, a corresponding frequency domain filtering algorithm (such as an iterative frequency domain filter parameter algorithm based on partitioned-block frequency domain Kalman filtering) is applied to the frequency domain filtering parameters corresponding to the speech frequency domain blocks in the previous speech frame and the two first time domain constraint matrices to obtain the frequency domain filter estimation parameters corresponding to the speech frequency domain blocks in the current speech frame. When the current speech frame becomes the previous speech frame, the frequency domain filter estimation parameters corresponding to each speech frequency domain block in the current speech frame become part of the frequency domain filtering parameters corresponding to each speech frequency domain block in the previous speech frame.
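The effect of such a time domain constraint can be sketched without forming the matrix explicitly. This is a minimal illustration, not the patent's matrices: the helper names `dft`, `idft`, and `constrain_time_domain`, the 8-bin block, and the choice to keep the first L taps are all assumptions made for the example.

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform (time domain to frequency domain)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]

def idft(X):
    """Naive inverse discrete Fourier transform (frequency domain to time domain)."""
    n = len(X)
    return [sum(X[f] * cmath.exp(2j * cmath.pi * f * t / n) for f in range(n)) / n
            for t in range(n)]

def constrain_time_domain(W, keep):
    """Apply a time domain constraint to frequency domain coefficients W.

    Converts W to the time domain, zeroes every tap at index >= keep,
    and converts the result back to the frequency domain.
    """
    w = idft(W)
    w_constrained = [tap if i < keep else 0.0 for i, tap in enumerate(w)]
    return dft(w_constrained)

# Hypothetical block: 2L = 8 taps, of which only the first L = 4 should survive.
L = 4
w_time = [0.5, -0.25, 0.125, 0.0625, 0.3, 0.2, 0.1, 0.05]
W = dft(w_time)
W_c = constrain_time_domain(W, keep=L)

# Back in the time domain, the first L taps are unchanged and the rest are zero.
print([round(v.real, 6) for v in idft(W_c)])
```

Written as a matrix, the same operation would be a product of a DFT matrix, a diagonal 0/1 selection matrix, and an inverse DFT matrix; which taps a given constraint matrix keeps is a design choice that this sketch does not settle.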
S302, filtering the current voice frame according to the frequency domain filter estimation parameters corresponding to the voice frequency domain blocks in the current voice frame to obtain a frequency domain error signal of the current voice frame.
The filtering process in the present disclosure may include obtaining the difference between the actual desired signal in the frequency domain and the estimated desired signal in the frequency domain. When the current voice frame becomes the previous voice frame, the frequency domain error signal obtained for the current voice frame becomes part of the frequency domain filtering parameters corresponding to the voice frequency domain blocks in the previous voice frame. The actual desired signal may be obtained from external input information.
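The difference described above can be sketched bin by bin. The sketch assumes that the estimated desired signal is the sum over blocks of the per-bin product of the reference spectrum and the filter coefficients; any windowing or constraint that a full implementation would apply to the estimate is deliberately left out. The function names and the numbers are hypothetical.

```python
def estimated_desired(X_blocks, W_blocks):
    """Estimated desired spectrum: sum over blocks of X_b[f] * W_b[f] per bin."""
    bins = len(X_blocks[0])
    return [sum(X[f] * W[f] for X, W in zip(X_blocks, W_blocks))
            for f in range(bins)]

def frequency_domain_error(Y, X_blocks, W_blocks):
    """Error spectrum E = Y - Y_hat, computed bin by bin."""
    Y_hat = estimated_desired(X_blocks, W_blocks)
    return [y - y_hat for y, y_hat in zip(Y, Y_hat)]

# Hypothetical frame with B = 2 blocks and 3 bins per block.
X_blocks = [[1 + 1j, 2 + 0j, 0.5j], [0.5 + 0j, -1j, 1 + 0j]]
W_true = [[0.2 + 0j, 0.1j, -0.3 + 0j], [1 + 0j, 0.5 + 0j, 0.25j]]
Y = estimated_desired(X_blocks, W_true)      # desired signal from the "true" filter

# An estimate that is wrong in a single coefficient (block 1, bin 2).
W_est = [[0.2 + 0j, 0.1j, -0.3 + 0j], [1 + 0j, 0.5 + 0j, 0j]]
E = frequency_domain_error(Y, X_blocks, W_est)
print(E)
```

Only the bin touched by the wrong coefficient carries a non-zero error, which is the quantity the next iteration uses to correct the filter.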
In the present disclosure, two time domain constraint matrices are used when obtaining the frequency domain filter estimation parameters corresponding to each voice frequency domain block in the current voice frame. This allows the filtering algorithm to converge to the optimal solution even when the filter order is insufficient, with essentially no effect on the computational complexity of the filtering. The technical solution provided by the disclosure therefore helps improve voice filtering performance.
In an alternative example, the frequency domain filtering parameters corresponding to each speech frequency domain block in the previous speech frame adjacent to the current speech frame in the present disclosure may include: the method comprises the steps of obtaining a frequency domain filter step size matrix corresponding to each voice frequency domain block in a previous voice frame adjacent to a current voice frame, a frequency domain filter estimation parameter corresponding to each voice frequency domain block in the previous voice frame, a frequency domain reference signal of each voice frequency domain block in the previous voice frame and a frequency domain error signal of the previous voice frame. The frequency domain reference signal in the present disclosure may be obtained from externally input information. The frequency domain filter step size matrix may also be referred to as an equivalent step size of the frequency domain filter. In this disclosure, the frequency domain filter step size matrix, the frequency domain filter estimation parameter, and the frequency domain error signal of the previous speech frame, which correspond to each speech frequency domain block in the previous speech frame, can all be obtained in an iterative computation manner, that is, in the process of the next iterative computation, the current speech frame will become the previous speech frame, and the frequency domain filter step size matrix, the frequency domain filter estimation parameter, and the frequency domain error signal of the current speech frame, which correspond to each speech frequency domain block in the current speech frame, obtained in the current iterative computation, will become the frequency domain filter step size matrix, the frequency domain filter estimation parameter, and the frequency domain error signal of the previous speech frame, which correspond to each speech frequency domain block in the previous speech frame. 
For the technical solution of the present disclosure, the frequency domain reference signal of each speech frequency domain block in the previous speech frame may be regarded as a known signal. For example, the present disclosure may obtain the frequency domain reference signal of each speech frequency domain block in the last speech frame from the external input information.
According to the method and the device, the frequency domain filter estimation parameters corresponding to the voice frequency domain blocks in the current voice frame can be conveniently obtained by utilizing the parameters, so that the filtering algorithm can be converged to an optimal solution.
In an optional example, one way for the present disclosure to obtain the frequency domain filter step size matrix corresponding to each speech frequency domain block in the previous speech frame of the current speech frame may be as follows: for the ith speech frequency domain block in the previous speech frame, the frequency domain filter step size matrix corresponding to that block is determined according to the covariance matrix of the frequency domain error corresponding to the ith speech frequency domain block in the previous speech frame, the frequency domain reference signal of the ith speech frequency domain block in the previous speech frame, the autocorrelation matrix of the background noise of the previous speech frame, the frame length, and the block length. For example, the present disclosure may determine the frequency domain filter step size matrix corresponding to each speech frequency domain block in the previous speech frame by using formula (10) below:
Figure BDA0002133537560000111
In formula (10) above, $\mu_b(k+1)$ represents the frequency domain filter step size matrix corresponding to the b-th speech frequency domain block in the (k+1)-th speech frame; $L$ represents the length of a speech frequency domain block; $M$ represents the length of a speech frame; $B$ represents the total number of speech frequency domain blocks contained in a speech frame; $P_b(k)$ represents the covariance matrix of the frequency domain error corresponding to the b-th speech frequency domain block in the k-th speech frame; $X_b(k)$ represents the frequency domain reference signal of the b-th speech frequency domain block in the k-th speech frame; $\Psi_{SS}(k)$ represents the autocorrelation matrix of the background noise of the k-th speech frame.
It should be particularly noted that the present disclosure may use $\mu_b(k+1)$ in formula (10) above as the frequency domain filter step size matrix corresponding to the b-th speech frequency domain block in the previous speech frame; that is, the (k+1)-th speech frame represents the previous speech frame of the current frame, and the k-th speech frame represents the frame preceding the previous speech frame.
According to the method and the device, the frequency domain filter step length matrix corresponding to each voice frequency domain block in the previous voice frame of the current voice frame can be conveniently obtained by utilizing the covariance matrix of the frequency domain error corresponding to the ith voice frequency domain block in the previous voice frame, the frequency domain reference signal of the ith voice frequency domain block in the previous voice frame, the autocorrelation matrix of the background noise of the previous voice frame, the frame length and the block length, so that the filtering algorithm can be converged to the optimal solution.
In an alternative example, one way for the present disclosure to determine the covariance matrix of the frequency domain error corresponding to the ith speech frequency domain block in the previous speech frame may be as follows: for the ith speech frequency domain block in the previous speech frame, this covariance matrix is determined according to the covariance matrix of the frequency domain error corresponding to the ith speech frequency domain block in the frame preceding the previous speech frame, the frequency domain reference signal of the ith speech frequency domain block in that preceding frame, the process noise of the ith speech frequency domain block of the previous speech frame, the frequency domain filter step size matrix corresponding to the ith speech frequency domain block in that preceding frame, an identity matrix of size frame length by frame length, the frame length, and the block length. Optionally, the present disclosure may obtain the covariance matrix of the frequency domain error corresponding to the ith speech frequency domain block in the previous speech frame by using formula (11) below:
Figure BDA0002133537560000121
In formula (11) above, $P_b(k+1)$ represents the covariance matrix of the frequency domain error corresponding to the b-th speech frequency domain block in the (k+1)-th speech frame; $A$ represents the acoustic echo path uncertainty, for example, $A$ may take the value 1; $I_M$ represents an identity matrix of size $M \times M$; $L$ represents the length of a speech frequency domain block, i.e. the block length; $M$ represents the length of a speech frame, i.e. the frame length; $\mu_b(k)$ represents the frequency domain filter step size matrix corresponding to the b-th speech frequency domain block in the k-th speech frame; $X_b(k)$ represents the frequency domain reference signal of the b-th speech frequency domain block in the k-th speech frame;
Figure BDA0002133537560000122
represents the conjugate transpose of $X_b(k)$; $P_b(k)$ represents the covariance matrix of the frequency domain error corresponding to the b-th speech frequency domain block in the k-th speech frame; $\Psi_{b,\Delta}(k)$ represents the process noise of the b-th speech frequency domain block in the k-th speech frame.
It should be particularly noted that the present disclosure may use $P_b(k+1)$ in formula (11) above as the covariance matrix of the frequency domain error corresponding to the b-th speech frequency domain block in the previous speech frame; that is, the (k+1)-th speech frame represents the previous speech frame of the current frame, and the k-th speech frame represents the frame preceding the previous speech frame.
Using the covariance matrix of the frequency domain error, the frequency domain reference signal, and the frequency domain filter step size matrix corresponding to each speech frequency domain block in the frame preceding the previous speech frame, together with the process noise of each speech frequency domain block of the previous speech frame, an identity matrix of size frame length by frame length, the frame length, and the block length, the present disclosure can conveniently obtain the covariance matrix of the frequency domain error corresponding to each speech frequency domain block in the previous speech frame, which helps the filtering algorithm converge to the optimal solution.
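The step size and covariance recursions can be sketched per frequency bin as follows. Formulas (10) and (11) are not reproduced in this text, so the sketch assumes the diagonalized form commonly used in partitioned block frequency domain Kalman filters, which takes the same inputs the text lists; it should not be read as the exact formulas of the disclosure. All function names and numeric values are hypothetical.

```python
def step_size(p_blocks, x_blocks, psi_ss, block_len, frame_len):
    """Per-bin step size mu_b = P_b / ((L/M) * sum_i |X_i|^2 P_i + Psi_SS) (assumed form)."""
    ratio = block_len / frame_len
    n_bins = len(psi_ss)
    denom = [ratio * sum(abs(x[f]) ** 2 * p[f] for x, p in zip(x_blocks, p_blocks))
             + psi_ss[f] for f in range(n_bins)]
    return [[p[f] / denom[f] for f in range(n_bins)] for p in p_blocks]

def covariance_update(p_blocks, mu_blocks, x_blocks, psi_delta,
                      block_len, frame_len, a=1.0):
    """Per-bin P_b(k+1) = A^2 (1 - (L/M) mu_b |X_b|^2) P_b(k) + Psi_delta (assumed form)."""
    ratio = block_len / frame_len
    return [[a * a * (1.0 - ratio * mu[f] * abs(x[f]) ** 2) * p[f] + q[f]
             for f in range(len(p))]
            for p, mu, x, q in zip(p_blocks, mu_blocks, x_blocks, psi_delta)]

# Hypothetical toy case: one block (B = 1), two frequency bins, L = 1, M = 2.
p_prev = [[0.1, 0.1]]
x_ref = [[1 + 0j, 2 + 0j]]
mu = step_size(p_prev, x_ref, psi_ss=[1e-3, 1e-3], block_len=1, frame_len=2)
p_next = covariance_update(p_prev, mu, x_ref, psi_delta=[[0.0, 0.0]],
                           block_len=1, frame_len=2)
print(mu, p_next)
```

In this toy case the error covariance shrinks in every bin while staying positive, which is the behavior expected of a Kalman-type recursion when no process noise is added.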
In an alternative example, the process of determining the frequency domain filter estimation parameter corresponding to each speech frequency domain block in the current speech frame according to the present disclosure may include: firstly, for the ith voice frequency domain block of the current voice frame, performing matrix multiplication on a frequency domain filter step size matrix corresponding to the ith voice frequency domain block of the previous voice frame, a frequency domain reference signal of the previous voice frame, a frequency domain error signal of the previous voice frame and two first time domain constraint matrixes. Secondly, the result obtained by the matrix multiplication is added with the frequency domain filter estimation parameter corresponding to the ith voice frequency domain block of the previous voice frame. And finally, determining a frequency domain filter estimation parameter corresponding to the ith voice frequency domain block in the current voice frame according to the addition result and the preset echo path uncertainty. The process of determining the frequency domain filter estimation parameter corresponding to each speech frequency domain block in the current speech frame according to the present disclosure can be represented in the form of the following formula (12):
Figure BDA0002133537560000131
in the above-mentioned formula (12),
Figure BDA0002133537560000132
represents the frequency domain filter estimation parameter corresponding to the b-th speech frequency domain block in the (k+1)-th speech frame (the capital W corresponds to the frequency domain); $A$ represents the acoustic echo path uncertainty, for example, $A$ may take the value 1;
Figure BDA0002133537560000133
represents the frequency domain filter estimation parameter corresponding to the b-th speech frequency domain block in the k-th speech frame; $G_{L,0}$ represents the first time domain constraint matrix; $\mu_b(k)$ represents the frequency domain filter step size matrix corresponding to the b-th speech frequency domain block in the k-th speech frame; $X_b(k)$ represents the frequency domain reference signal of the b-th speech frequency domain block in the k-th speech frame (the capital X corresponds to the frequency domain); $E(k)$ represents the frequency domain error signal of the k-th speech frame.
When $A$ in formula (12) above takes the value 1, formula (12) can be simplified to the form of formula (13) below:
Figure BDA0002133537560000141
It should be particularly noted that, in formulas (12) and (13) above, the present disclosure may use
Figure BDA0002133537560000142
as the frequency domain filter estimation parameter corresponding to the b-th speech frequency domain block in the current speech frame; that is, the (k+1)-th speech frame represents the current speech frame, and the k-th speech frame represents the previous speech frame of the current speech frame.
According to the method, the frequency domain filter estimation parameters corresponding to the voice frequency domain blocks in the current voice frame can be conveniently obtained by using the formula (12) or the formula (13), so that the filtering algorithm can be converged to an optimal solution.
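The three-stage update described above (multiply, add the previous estimate, scale by the echo path uncertainty) can be sketched as follows. Formula (12) is not reproduced in this text, so the positions at which the two constraint passes are applied are an assumption of this sketch, as is the treatment of the step size matrix as a per-bin vector. The `constrain` argument is a hypothetical callable standing in for multiplication by the first time domain constraint matrix.

```python
def update_coefficients(w_prev, mu, x_ref, err, constrain, a=1.0):
    """One coefficient update for a single block, per frequency bin.

    ASSUMPTION: the two constraint passes are placed around the step size
    product; the actual placement is defined by formula (12).
    """
    # Stage 1: step size, conjugated reference and error, with two constraint passes.
    inner = constrain([x.conjugate() * e for x, e in zip(x_ref, err)])
    step = constrain([m * v for m, v in zip(mu, inner)])
    # Stage 2: add the previous estimate. Stage 3: scale by the uncertainty A.
    return [a * (w + s) for w, s in zip(w_prev, step)]

# Hypothetical two-bin example; an identity function stands in for the constraint.
w_next = update_coefficients(
    w_prev=[0j, 0j],
    mu=[0.5, 0.5],
    x_ref=[1 + 1j, 2 + 0j],
    err=[1 + 0j, 1j],
    constrain=lambda v: v,
)
print(w_next)
```

With `a=1.0` the final scaling leaves the sum unchanged, which corresponds to the simplification from formula (12) to formula (13).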
In an optional example, the present disclosure may filter the current speech frame according to the frequency domain filter estimation parameters corresponding to each speech frequency domain block in the current speech frame, and thereby obtain the frequency domain error signal of the current speech frame, as follows: first, the frequency domain reference signal of each speech frequency domain block in the current speech frame is multiplied by the frequency domain filter estimation parameter corresponding to that block, giving one multiplication result per speech frequency domain block; second, these multiplication results are accumulated, and the accumulated result is multiplied by a second time domain constraint matrix to obtain the multiplication result corresponding to the current speech frame; finally, the frequency domain error signal of the current speech frame is determined according to the difference between the frequency domain desired signal (i.e. the actual frequency domain desired signal) of the current speech frame and the multiplication result corresponding to the current speech frame. This processing can be expressed in the form of formula (14) and formula (15) below:
Figure BDA0002133537560000143
In formula (14) above, $E(k)$ represents the frequency domain error signal of the k-th speech frame; $Y(k)$ represents the frequency domain desired signal of the k-th speech frame,
Figure BDA0002133537560000144
represents the estimated frequency domain desired signal, which may also be referred to as an estimate of the frequency domain desired signal, and
Figure BDA0002133537560000145
can be expressed in the form of the following formula (15).
Figure BDA0002133537560000146
In formula (15) above, $G_{0,L}$ represents a time domain constraint matrix, i.e. the second time domain constraint matrix; $B$ represents the total number of speech frequency domain blocks contained in a speech frame; $X_b(k)$ represents the frequency domain reference signal of the b-th speech frequency domain block in the k-th speech frame;
Figure BDA0002133537560000151
represents the frequency domain filter estimation parameter corresponding to the b-th speech frequency domain block in the k-th speech frame.
According to the method, the frequency domain error signal of the current voice frame can be conveniently obtained by using the formula (14) and the formula (15), so that the filtering algorithm can be converged to an optimal solution.
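The multiply, accumulate, constrain, and subtract sequence described for formulas (14) and (15) can be sketched per frequency bin as follows. The `second_constraint` argument is a hypothetical callable standing in for multiplication by the second time domain constraint matrix, whose definition is not given in this passage; it defaults to the identity purely for illustration.

```python
def frequency_domain_error(y_freq, x_blocks, w_blocks, second_constraint=lambda v: v):
    """E = Y - G * sum_b (X_b .* W_b), with G supplied as a callable."""
    n_bins = len(y_freq)
    acc = [0j] * n_bins
    # Multiply each block's reference by its filter estimate and accumulate.
    for x_b, w_b in zip(x_blocks, w_blocks):
        for f in range(n_bins):
            acc[f] += x_b[f] * w_b[f]
    # Apply the (hypothetical) second time domain constraint to the accumulated sum.
    y_est = second_constraint(acc)
    # The error is the actual desired signal minus its estimate.
    return [y - ye for y, ye in zip(y_freq, y_est)]

# Hypothetical example with B = 2 blocks and 4 frequency bins.
err = frequency_domain_error(
    y_freq=[1.0, 1.0, 1.0, 1.0],
    x_blocks=[[1.0, 1.0, 1.0, 1.0], [2.0, 0.0, 2.0, 0.0]],
    w_blocks=[[0.5] * 4, [0.25] * 4],
)
print(err)
```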
In an alternative example, the first time domain constraint matrix of the present disclosure may be determined according to a discrete Fourier transform matrix of size frame length by frame length, a four-element matrix, and an inverse discrete Fourier transform matrix of size frame length by frame length; for example, these three matrices are multiplied together and the first time domain constraint matrix is determined from the product. The upper right, lower right, and lower left elements of the four-element matrix are all-zero matrices of size block length by block length, and the upper left element is an identity matrix of size block length by block length. The first time domain constraint matrix may be represented in the form of formula (16) below:
Figure BDA0002133537560000152
In formula (16) above, $I_L$ represents an identity matrix of size $L \times L$; $L$ represents the length of a speech frequency domain block, i.e. the block length; $F$ represents a discrete Fourier transform matrix of size frame length by frame length; $F^{-1}$ represents an inverse discrete Fourier transform matrix of size frame length by frame length.
According to the method, the first time domain constraint matrix shown in the formula (16) is utilized, so that the frequency domain error signal of the current voice frame can be conveniently obtained under the condition that the filtering calculation complexity is basically not influenced, and the filtering algorithm can be favorably converged to the optimal solution.
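The construction of the first time domain constraint matrix as a product of a DFT matrix, a four-element selection matrix, and an inverse DFT matrix can be sketched as follows. The sketch assumes the frame length is twice the block length, which is what four square blocks of size block length by block length imply; the helper names are hypothetical.

```python
import cmath

def dft_matrix(m, inverse=False):
    """M x M DFT matrix F, or its inverse F^{-1} (including the 1/M factor)."""
    sign = 1 if inverse else -1
    scale = 1.0 / m if inverse else 1.0
    return [[scale * cmath.exp(sign * 2j * cmath.pi * r * c / m) for c in range(m)]
            for r in range(m)]

def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def first_constraint_matrix(block_len):
    """G = F [[I_L, 0], [0, 0]] F^{-1}, assuming frame length M = 2L."""
    m = 2 * block_len
    sel = [[1.0 if (r == c and r < block_len) else 0.0 for c in range(m)]
           for r in range(m)]
    return matmul(matmul(dft_matrix(m), sel), dft_matrix(m, inverse=True))

G = first_constraint_matrix(block_len=2)
GG = matmul(G, G)
# A projection applied twice equals itself, so G @ G should match G.
max_dev = max(abs(GG[r][c] - G[r][c]) for r in range(4) for c in range(4))
trace = sum(G[i][i] for i in range(4))
print(max_dev, trace)
```

Because the selection matrix is a projection, the resulting matrix is idempotent, and its trace equals the block length; both properties give a quick sanity check on an implementation.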
The following analysis shows that the present disclosure can obtain the optimal solution for the coefficients of the block frequency domain Kalman filter (the frequency domain filter estimation parameters).
The present disclosure may multiply both sides of formula (16) above by $F^{-1}$, rearrange the result, and take the expectation, to obtain formula (17) below:
Figure BDA0002133537560000153
the steady state solution of equation (17) above may be expressed in the form of equation (18) below:
Figure BDA0002133537560000161
An optimal solution for a filter of order N should generally satisfy the Wiener-Hopf equation shown in formula (19) below:
Figure BDA0002133537560000162
In formula (19) above, $R_x$ represents the $N \times N$ autocorrelation matrix of the reference signal;
Figure BDA0002133537560000163
represents the optimal solution of the time domain filter coefficients (the lower case w corresponds to the time domain); $p$ represents the $N \times 1$ cross-correlation vector between the time domain reference signal and the output signal of the unknown target system to be matched.
Splitting both sides of formula (19) above into B vectors of length L gives formula (20) below:
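As a small numerical illustration of the Wiener-Hopf equation, the sketch below solves $R_x w_o = p$ for a hypothetical 2-tap case by Gaussian elimination. The matrix, the vector, and the helper name `solve_linear` are illustrative assumptions and are not values from the disclosure.

```python
def solve_linear(matrix, rhs):
    """Solve matrix @ w = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        # Pick the row with the largest pivot to keep the elimination stable.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[row][c] -= factor * aug[col][c]
    # Back substitution.
    w = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][c] * w[c] for c in range(row + 1, n))
        w[row] = (aug[row][n] - tail) / aug[row][row]
    return w

# Hypothetical data: R_x(0) = 1, R_x(1) = 0.5, and p chosen as R_x @ [0.1, 0.2].
r_x = [[1.0, 0.5], [0.5, 1.0]]
p = [0.2, 0.25]
w_o = solve_linear(r_x, p)
print(w_o)
```

The solver recovers the taps that were used to build `p`, which is the sense in which the Wiener solution matches the unknown system when the reference statistics are known.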
Figure BDA0002133537560000164
In formula (20) above, $B$ represents the number of speech frequency domain blocks included in one speech frame; the value range of $b$ is $[0, B-1]$;
Figure BDA0002133537560000165
represents the inverse matrix of $R_{x,b,b}$; $R_{x,b,b}$ and $R_{x,b,m}$ can be expressed by formula (21) below; $p_b$ is the b-th block of the cross-correlation vector $p$, with $p_b = [p_{bL}, \ldots, p_{bL+L-1}]$; $w_{o,m}$ represents the optimal solution for the m-th block of the time domain filter coefficients, and
Figure BDA0002133537560000166
$L$ represents the block length.
Figure BDA0002133537560000167
In formula (21) above, $R_x(i) = E\{x(n)\,x(n-i)\}$, where $R_x(i)$ represents the autocorrelation function of the reference signal.
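The autocorrelation function $R_x(i) = E\{x(n)\,x(n-i)\}$ can be estimated from samples and arranged into a symmetric Toeplitz matrix, as in the sketch below. It assumes a real-valued signal and replaces the expectation with a sample average; the function names and the test signal are hypothetical, and the block structure of formula (21) itself is not reproduced.

```python
def autocorrelation(x, lag):
    """Sample estimate of R_x(lag) = E{x(n) x(n - lag)} for a real signal."""
    terms = [x[n] * x[n - lag] for n in range(lag, len(x))]
    return sum(terms) / len(terms)

def autocorrelation_matrix(x, order):
    """order x order symmetric Toeplitz matrix with entry (i, j) = R_x(|i - j|)."""
    r = [autocorrelation(x, i) for i in range(order)]
    return [[r[abs(i - j)] for j in range(order)] for i in range(order)]

# Hypothetical alternating signal, chosen so the estimates are easy to check by hand.
x = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
R = autocorrelation_matrix(x, 3)
print(R)
```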
The present disclosure may transform the above equation (6) into the form of the following equation (22):
Figure BDA0002133537560000171
$r_b = E\{X_{C,0,2}(k-b)\,y(k)\} = L\,p_b$    formula (22)
The present disclosure can obtain the following formula (23) by substituting the above formula (22) into the above formula (18):
Figure BDA0002133537560000172
As can be seen by comparing formula (20) and formula (23), the steady-state solution of the filtering algorithm in the present disclosure is consistent with the optimal solution; that is, the speech filtering technique provided in the present disclosure can make the filtering algorithm converge to the optimal solution even when the filter order is insufficient.
An example implementation flow of the speech filtering method of the present disclosure is shown in fig. 4.
Assume that the order N of the adaptive filter is 10. Assume that the time domain reference signal is obtained by passing zero-mean white noise through a finite impulse response (FIR) filter with coefficients [0.1, 0.2, -0.4, 0.7]; the frequency domain reference signal is obtained by applying a discrete Fourier transform to this time domain signal. Assume that the time domain desired signal (i.e. the actual time domain desired signal) is obtained by passing the time domain reference signal through a 16th-order FIR filter with coefficients [0.01, 0.02, -0.04, -0.08, 0.15, -0.3, 0.45, 0.6, 0.6, 0.45, -0.3, 0.15, -0.08, -0.04, 0.02, 0.01]; the frequency domain desired signal is obtained by applying a discrete Fourier transform to this time domain signal. The present disclosure may assume that the background noise $S(k)$ in the frequency domain desired signal is uncorrelated white noise with a mean of 0 and a variance of $10^{-3}$.
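The test signals described above can be generated with a short sketch. The FIR coefficients and the noise variance come from the text; the sample count, the random seed, the unit variance of the white noise, and the choice to add the background noise in the time domain are assumptions made only for illustration.

```python
import random

def fir_filter(signal, coeffs):
    """Causal FIR filtering: y[n] = sum_k coeffs[k] * signal[n - k]."""
    return [sum(c * signal[n - k] for k, c in enumerate(coeffs) if n - k >= 0)
            for n in range(len(signal))]

random.seed(0)        # hypothetical seed, for repeatability only
NUM_SAMPLES = 1000    # hypothetical length

# Reference: zero-mean white noise coloured by the 4-tap FIR filter from the text.
white = [random.gauss(0.0, 1.0) for _ in range(NUM_SAMPLES)]
reference = fir_filter(white, [0.1, 0.2, -0.4, 0.7])

# Desired: the reference passed through the 16-tap FIR filter from the text.
echo_path = [0.01, 0.02, -0.04, -0.08, 0.15, -0.3, 0.45, 0.6,
             0.6, 0.45, -0.3, 0.15, -0.08, -0.04, 0.02, 0.01]
clean = fir_filter(reference, echo_path)

# ASSUMPTION: background noise of variance 1e-3 is added in the time domain.
noise_std = (1e-3) ** 0.5
desired = [d + random.gauss(0.0, noise_std) for d in clean]
print(len(reference), len(desired))
```

Since the adaptive filter order (10) is smaller than the length of the 16-tap path, this setup exercises the insufficient-order case that the analysis above addresses.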
First, an initialization step is performed.
For example, the covariance matrix $P_b(0)$ of the frequency domain error corresponding to the b-th speech frequency domain block in the 0th speech frame is set to $\varepsilon I$, where $\varepsilon$ may take the value $10^{-1}$.
For another example, the frequency domain filter estimation parameter corresponding to the b-th speech frequency domain block in the 0th speech frame,
Figure BDA0002133537560000173
is set to 0 (the capital W corresponds to the frequency domain).
As another example, the acoustic echo path uncertainty $A$ is set to 1, and the number of blocks is set to 2 (i.e., the total number of speech frequency domain blocks contained in a speech frame is 2).
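The initialization values listed above can be collected in a small state object, as in the sketch below. The covariance and step size matrices are stored as per-bin vectors (a diagonal assumption), and the frame length of 10 is a hypothetical choice derived from a filter order of 10 split into 2 blocks with a frame length of twice the block length; the class name `FilterState` is also hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class FilterState:
    """Per-block state carried from one speech frame to the next."""
    num_blocks: int = 2      # B, from the text
    frame_len: int = 10      # M, hypothetical (assumes M = 2L with L = 5)
    epsilon: float = 1e-1    # initial error covariance scale, from the text
    a: float = 1.0           # acoustic echo path uncertainty A, from the text
    p: list = field(init=False)
    w: list = field(init=False)

    def __post_init__(self):
        # P_b(0) = epsilon * I, stored as its diagonal for each block.
        self.p = [[self.epsilon] * self.frame_len for _ in range(self.num_blocks)]
        # W_b(0) = 0 for each block.
        self.w = [[0j] * self.frame_len for _ in range(self.num_blocks)]

state = FilterState()
print(len(state.p), state.p[0][0], state.w[1][-1])
```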
Secondly, aiming at each voice frame in the iterative process, calculating according to the following steps:
Step 1: perform serial-to-parallel conversion on the received time domain reference signal x(n) of the nth sampling point, and buffer the converted time domain reference signal.
Step 2: perform a discrete Fourier transform (for example, a fast Fourier transform) on the buffered time domain reference signal to transform it to the frequency domain, thereby obtaining the frequency domain reference signal of each speech frequency domain block in the latest speech frame. When the total number of speech frequency domain blocks contained in a speech frame is 2, the obtained frequency domain reference signals may be $X_0(k)$ and $X_1(k)$.
Step 3: calculate the frequency domain desired signal (i.e. the estimated desired signal in the frequency domain) Y(k) of the latest speech frame using formula (15) above.
Specifically, the multiple delay operations in FIG. 4, the element-wise product of $X_0(k)$ and
Figure BDA0002133537560000181
, the element-wise product of $X_1(k)$ and
Figure BDA0002133537560000182
, ..., the element-wise product of $X_{B-1}(k)$ and
Figure BDA0002133537560000183
, and the summation of these products together implement formula (15) above. In addition, when adjacent speech frames overlap, the relations $X_1(k) = X_0(k-1)$, ..., $X_{B-1}(k) = X_0(k-B+1)$ shown in FIG. 4 may hold.
Step 4: perform an inverse discrete Fourier transform on the frequency domain desired signal Y(k) of the latest speech frame to obtain the time domain desired signal of the latest speech frame (i.e. the estimated desired signal y(k) in the time domain); after parallel-to-serial conversion, it forms a serial time domain desired signal.
Step 5: subtract the estimated desired signal in the time domain from the actual desired signal in the time domain to obtain the time domain error signal e(k) of the latest speech frame, zero-pad e(k), and perform a discrete Fourier transform on the zero-padded time domain error signal to obtain the frequency domain error signal E(k) of the latest speech frame.
Step 5 above may be equivalent to: performing serial-to-parallel conversion on the time domain desired signal y(n) of the latest speech frame (i.e. the current speech frame, taken below to be the k-th speech frame) and buffering it; performing a discrete Fourier transform (for example, a fast Fourier transform) on the buffered time domain desired signal to transform it to the frequency domain and obtain the frequency domain desired signal (i.e. the actual desired signal in the frequency domain); and subtracting the estimated desired signal in the frequency domain from the actual desired signal in the frequency domain to obtain the frequency domain error signal E(k) of the latest speech frame.
Step 6: update the filter coefficients using formula (12) or (13) above; the updated filter coefficient
Figure BDA0002133537560000191
can be used, in the next iteration, as the frequency domain filter estimation parameter corresponding to each speech frequency domain block in the latest speech frame
Figure BDA0002133537560000192
Step 7: update the frequency domain filter step size matrix $\mu_b(k+1)$ using formula (10) above to obtain the frequency domain filter step size matrix corresponding to each speech frequency domain block in the latest speech frame for the next iteration; update the covariance matrix $P_b(k+1)$ of the frequency domain error using formula (11) above to obtain the covariance matrix of the frequency domain error corresponding to each speech frequency domain block in the latest speech frame for the next iteration. Additionally, the present disclosure may also update the Kalman gain using formula (24) below:
Figure BDA0002133537560000193
In formula (24) above, $K_b(k+1)$ represents the Kalman gain corresponding to the b-th speech frequency domain block in the (k+1)-th speech frame; $\mu_b(k)$ represents the frequency domain filter step size matrix corresponding to the b-th speech frequency domain block in the k-th speech frame;
Figure BDA0002133537560000194
represents the conjugate transpose of $X_b(k)$; $X_b(k)$ represents the frequency domain reference signal of the b-th speech frequency domain block in the k-th speech frame.
In general, the optimal solution for the frequency domain filter coefficients should satisfy the Wiener-Hopf equation, i.e., the optimal solution should be consistent with the Wiener solution. Adding the first time domain constraint matrix to the block frequency domain Kalman filtering algorithm helps eliminate the influence of the unconstrained part while the coefficients of the block frequency domain Kalman filter are iteratively updated, so that the equivalent time domain filter coefficients of the improved block frequency domain Kalman filtering algorithm provided by the present disclosure are consistent with the Wiener solution and the filtering algorithm can converge quickly to the optimal solution.
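Two pieces of the per-frame procedure above can be sketched in isolation: the delay line that makes $X_b(k) = X_0(k-b)$ available for the block products (steps 2 and 3), and the zero-padded error transform (step 5). The text does not say where the zeros are placed, so putting them in front of the error samples is an assumption here; the class and function names are hypothetical.

```python
import cmath
from collections import deque

def dft(x):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]

class BlockDelayLine:
    """Keeps the B most recent frequency domain frames so that X_b(k) = X_0(k - b)."""
    def __init__(self, num_blocks, frame_len):
        self.frames = deque([[0j] * frame_len for _ in range(num_blocks)],
                            maxlen=num_blocks)

    def push(self, x0):
        # The newest frame becomes block 0; the oldest frame is dropped automatically.
        self.frames.appendleft(list(x0))
        return list(self.frames)

def error_spectrum(actual, estimated, frame_len):
    """Time domain error (actual minus estimated), zero-padded, then transformed."""
    e = [a - b for a, b in zip(actual, estimated)]
    padded = [0.0] * (frame_len - len(e)) + e  # ASSUMPTION: zeros go in front
    return dft(padded)

line = BlockDelayLine(num_blocks=2, frame_len=4)
line.push(dft([1.0, 0.0, 0.0, 0.0]))
blocks = line.push(dft([0.0, 1.0, 0.0, 0.0]))
E = error_spectrum(actual=[1.0, 2.0], estimated=[0.5, 2.0], frame_len=4)
print(blocks[1], E)
```

After the second push, `blocks[1]` holds the spectrum that was `blocks[0]` one frame earlier, mirroring the relation between adjacent blocks noted for FIG. 4.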
Exemplary devices
FIG. 5 is a schematic structural diagram of an embodiment of a speech filtering apparatus according to the present disclosure. The apparatus of this embodiment may be used to implement the method embodiments of the present disclosure described above. As shown in FIG. 5, the apparatus of this embodiment includes: an obtaining module 500, a parameter determining module 501, and an error determining module 502.
The obtaining module 500 is configured to obtain frequency domain filtering parameters corresponding to each voice frequency domain block in a previous voice frame adjacent to a current voice frame.
Optionally, the obtaining module 500 may obtain a frequency domain filter step size matrix corresponding to each speech frequency domain block in a previous speech frame adjacent to the current speech frame, a frequency domain filter estimation parameter corresponding to each speech frequency domain block in the previous speech frame, a frequency domain reference signal of each speech frequency domain block in the previous speech frame, and a frequency domain error signal of the previous speech frame. For example, for the ith speech frequency domain block in the previous speech frame, the obtaining module 500 may determine the frequency domain filter step size matrix corresponding to the ith speech frequency domain block in the previous speech frame according to the covariance matrix of the frequency domain error corresponding to the ith speech frequency domain block in the previous speech frame, the frequency domain reference signal of the ith speech frequency domain block in the previous speech frame, the autocorrelation matrix of the background noise of the previous speech frame, the frame length, and the block length.
Optionally, the obtaining module 500 may obtain the covariance matrix of the frequency domain error corresponding to the ith speech frequency domain block in the previous speech frame as follows: for the ith speech frequency domain block in the previous speech frame, this covariance matrix is determined according to the covariance matrix of the frequency domain error corresponding to the ith speech frequency domain block in the frame preceding the previous speech frame, the frequency domain reference signal of the ith speech frequency domain block in that preceding frame, the process noise of the ith speech frequency domain block of the previous speech frame, the frequency domain filter step size matrix corresponding to the ith speech frequency domain block in that preceding frame, an identity matrix of size frame length by frame length, the frame length, and the block length.
The parameter determining module 501 is configured to determine, according to the frequency domain filtering parameters and the two first time domain constraint matrices corresponding to the respective voice frequency domain blocks in the previous voice frame obtained by the obtaining module 500, frequency domain filter estimation parameters corresponding to the respective voice frequency domain blocks in the current voice frame.
Optionally, for the ith speech frequency domain block of the current speech frame, the parameter determining module 501 may perform matrix multiplication on the frequency domain filter step size matrix corresponding to the ith speech frequency domain block of the previous speech frame, the frequency domain reference signal of the previous speech frame, the frequency domain error signal of the previous speech frame, and the two first time domain constraint matrices; then, the parameter determining module 501 adds the result obtained by matrix multiplication to the frequency domain filter estimation parameter corresponding to the ith voice frequency domain block of the previous voice frame; then, the parameter determining module 501 determines the frequency domain filter estimation parameter corresponding to the ith speech frequency domain block in the current speech frame according to the result of the addition and the preset echo path uncertainty.
Optionally, the parameter determining module 501 may multiply together a discrete Fourier transform matrix of size frame length by frame length, a four-element matrix, and an inverse discrete Fourier transform matrix of size frame length by frame length, and determine the first time domain constraint matrix from the product. The upper right, lower right, and lower left elements of the four-element matrix are all-zero matrices of size block length by block length, and the upper left element is an identity matrix of size block length by block length.
The error determining module 502 is configured to perform filtering processing on the current speech frame according to the frequency domain filter estimation parameters corresponding to each speech frequency domain block in the current speech frame determined by the parameter determining module 501, so as to obtain a frequency domain error signal of the current speech frame.
Optionally, the error determining module 502 may first multiply the frequency domain reference signal of each speech frequency domain block in the current speech frame by the frequency domain filter estimation parameter corresponding to that speech frequency domain block, to obtain the multiplication result corresponding to each speech frequency domain block in the current speech frame. The error determining module 502 then accumulates the multiplication results of all the speech frequency domain blocks and multiplies the accumulated result by the second time domain constraint matrix to obtain the multiplication result corresponding to the current speech frame. Finally, the error determining module 502 determines the frequency domain error signal of the current speech frame according to the difference between the frequency domain expected signal of the current speech frame and the multiplication result corresponding to the current speech frame.
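The three steps performed by the error determining module 502 can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation: each block's frequency domain reference signal is assumed to be a diagonal matrix stored as a vector, so the per-block multiplication is element-wise, and the second time domain constraint matrix is passed in as an argument rather than constructed here. The function and parameter names are hypothetical.

```python
import numpy as np


def frequency_domain_error(Y, X_blocks, W_blocks, G2):
    """Compute the frequency domain error signal of one frame.

    Y        : (M,)   frequency domain expected signal of the frame
    X_blocks : (B, M) reference signal of each of the B blocks
               (diagonal matrices stored as vectors, an assumption)
    W_blocks : (B, M) filter estimation parameters of each block
    G2       : (M, M) second time domain constraint matrix
    """
    # Step 1: multiply reference signal and filter estimate per block.
    products = X_blocks * W_blocks
    # Step 2: accumulate over blocks, then apply the constraint matrix.
    frame_result = G2 @ products.sum(axis=0)
    # Step 3: error is the expected signal minus the frame result.
    return Y - frame_result


if __name__ == "__main__":
    Y = np.full(4, 5.0 + 0j)
    X = np.ones((2, 4), dtype=complex)
    W = np.array([[1, 2, 3, 4], [1, 1, 1, 1]], dtype=complex)
    print(frequency_domain_error(Y, X, W, np.eye(4)))
```

The identity matrix stands in for `G2` in the usage lines only to keep the arithmetic easy to follow; in practice the second time domain constraint matrix defined elsewhere in the disclosure would be supplied.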
Exemplary electronic device
An electronic device according to an embodiment of the present disclosure is described below with reference to fig. 6. FIG. 6 shows a block diagram of an electronic device in accordance with an embodiment of the disclosure. As shown in fig. 6, the electronic device 61 includes one or more processors 611 and a memory 612.
The processor 611 may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 61 to perform desired functions.
The memory 612 may include one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM) and/or cache memory (cache). The non-volatile memory may include, for example, Read Only Memory (ROM), a hard disk, flash memory, and the like. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 611 to implement the speech filtering methods of the various embodiments of the present disclosure described above and/or other desired functions. Various contents such as an input signal, a signal component, a noise component, etc. may also be stored in the computer-readable storage medium.
In one example, the electronic device 61 may further include an input device 613, an output device 614, etc., which are interconnected by a bus system and/or other form of connection mechanism (not shown). The input device 613 may include, for example, a keyboard, a mouse, and the like. The output device 614 can output various information to the outside. The output device 614 may include, for example, a display, speakers, a printer, and a communication network and remote output devices connected thereto, among others.
Of course, for simplicity, only some of the components of the electronic device 61 relevant to the present disclosure are shown in fig. 6, omitting components such as buses, input/output interfaces, and the like. In addition, the electronic device 61 may include any other suitable components, depending on the particular application.
Exemplary computer program product and computer-readable storage Medium
In addition to the above-described methods and apparatus, embodiments of the present disclosure may also be a computer program product comprising computer program instructions that, when executed by a processor, cause the processor to perform the steps in the speech filtering method according to various embodiments of the present disclosure described in the "exemplary methods" section of this specification above.
Program code for carrying out operations of embodiments of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java or C++ and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present disclosure may also be a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, cause the processor to perform steps in a speech filtering method according to various embodiments of the present disclosure described in the "exemplary methods" section above of this specification.
The computer-readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium may include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing describes the general principles of the present disclosure in conjunction with specific embodiments. It should be noted, however, that the advantages, effects, and the like mentioned in the present disclosure are merely examples and are not limiting, and they should not be considered essential to the various embodiments of the present disclosure. Furthermore, the foregoing disclosure of specific details is for the purpose of illustration and description and is not intended to be limiting, since the disclosure is not intended to be limited to the specific details so described.
In the present specification, the embodiments are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and for the same or similar parts the embodiments may be referred to one another. Since the system embodiment basically corresponds to the method embodiment, its description is relatively brief, and for the relevant points reference may be made to the corresponding description of the method embodiment.
The block diagrams of devices, apparatuses, and systems referred to in this disclosure are given only as illustrative examples and are not intended to require or imply that the connections, arrangements, and configurations must be made in the manner shown in the block diagrams. These devices, apparatuses, and systems may be connected, arranged, and configured in any manner, as will be appreciated by those skilled in the art. Words such as "including," "comprising," "having," and the like are open-ended words that mean "including, but not limited to," and are used interchangeably therewith. The words "or" and "and" as used herein mean, and are used interchangeably with, the phrase "and/or," unless the context clearly dictates otherwise. The phrase "such as" is used herein to mean, and is used interchangeably with, the phrase "such as but not limited to".
The methods and apparatus of the present disclosure may be implemented in a number of ways. For example, the methods and apparatus of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described order for the steps of the method is for illustration only, and the steps of the method of the present disclosure are not limited to the order specifically described above unless specifically stated otherwise. Further, in some embodiments, the present disclosure may also be embodied as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
It is also noted that in the devices, apparatuses, and methods of the present disclosure, each component or step can be decomposed and/or recombined. These decompositions and/or recombinations are to be considered equivalents of the present disclosure.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these aspects, and the like, will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, the description is not intended to limit embodiments of the disclosure to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.

Claims (12)

1. A method of speech filtering, comprising:
acquiring frequency domain filtering parameters corresponding to each voice frequency domain block in a previous voice frame adjacent to the current voice frame;
determining the frequency domain filter estimation parameters corresponding to each voice frequency domain block in the current voice frame according to the frequency domain filtering parameters corresponding to each voice frequency domain block in the previous voice frame and the two first time domain constraint matrixes;
and filtering the current voice frame according to the frequency domain filter estimation parameters corresponding to the voice frequency domain blocks in the current voice frame to obtain a frequency domain error signal of the current voice frame.
2. The speech filtering method according to claim 1, wherein the obtaining of the frequency domain filtering parameters corresponding to each speech frequency domain block in the previous speech frame adjacent to the current speech frame comprises:
acquiring a frequency domain filter step size matrix corresponding to each voice frequency domain block in a previous voice frame adjacent to a current voice frame, a frequency domain filter estimation parameter corresponding to each voice frequency domain block in the previous voice frame, a frequency domain reference signal of each voice frequency domain block in the previous voice frame and a frequency domain error signal of the previous voice frame.
3. The speech filtering method according to claim 2, wherein said obtaining a frequency-domain filter step size matrix corresponding to each speech frequency-domain block in a previous speech frame of a current speech frame comprises:
and for the ith voice frequency domain block in the previous voice frame, determining a frequency domain filter step size matrix corresponding to the ith voice frequency domain block in the previous voice frame according to the covariance matrix of the frequency domain error corresponding to the ith voice frequency domain block in the previous voice frame, the frequency domain reference signal of the ith voice frequency domain block in the previous voice frame, the autocorrelation matrix of the background noise of the previous voice frame, the frame length and the block length.
4. The speech filtering method according to claim 3, wherein the method further comprises:
and for the ith voice frequency domain block in the previous voice frame, determining the covariance matrix of the frequency domain error corresponding to the ith voice frequency domain block in the previous voice frame according to the covariance matrix of the frequency domain error corresponding to the ith voice frequency domain block in the voice frame preceding the previous voice frame, the frequency domain reference signal of the ith voice frequency domain block in the voice frame preceding the previous voice frame, the process noise of the ith voice frequency domain block of the previous voice frame, the frequency domain reference signal of the ith voice frequency domain block in the voice frame preceding the previous voice frame, the frequency domain filter step size matrix corresponding to the ith voice frequency domain block in the voice frame preceding the previous voice frame, the identity matrix with the size of the frame length multiplied by the frame length, the frame length and the block length.
5. The speech filtering method according to any one of claims 2 to 4, wherein the determining, according to the frequency domain filtering parameters and the two first time domain constraint matrices corresponding to the respective speech frequency domain blocks in the previous speech frame, the frequency domain filter estimation parameters corresponding to the respective speech frequency domain blocks in the current speech frame comprises:
for the ith voice frequency domain block of the current voice frame, performing matrix multiplication on a frequency domain filter step length matrix corresponding to the ith voice frequency domain block of the previous voice frame, a frequency domain reference signal of the previous voice frame, a frequency domain error signal of the previous voice frame and two first time domain constraint matrixes;
adding the result obtained by multiplying the matrix with the frequency domain filter estimation parameter corresponding to the ith voice frequency domain block of the previous voice frame;
and determining a frequency domain filter estimation parameter corresponding to the ith voice frequency domain block in the current voice frame according to the addition result and the preset echo path uncertainty.
6. The speech filtering method according to any one of claims 2 to 5, wherein the filtering the current speech frame according to the frequency domain filter estimation parameter corresponding to each speech frequency domain block in the current speech frame to obtain the frequency domain error signal of the current speech frame includes:
multiplying the frequency domain reference signal of each voice frequency domain block in the current voice frame with the frequency domain filter estimation parameter corresponding to each voice frequency domain block in the current voice frame respectively to obtain the multiplication result corresponding to each voice frequency domain block in the current voice frame;
accumulating the multiplication result corresponding to each voice frequency domain block, and multiplying the accumulated result by a second time domain constraint matrix to obtain the multiplication result corresponding to the current voice frame;
and determining a frequency domain error signal of the current voice frame according to the difference between the frequency domain expected signal of the current voice frame and the multiplication result corresponding to the current voice frame.
7. The speech filtering method according to any one of claims 1 to 6, wherein the method further comprises:
multiplying a discrete Fourier transform matrix with the size of the frame length multiplied by the frame length, a four-element matrix and an inverse discrete Fourier transform matrix with the size of the frame length multiplied by the frame length, wherein the first time domain constraint matrix is determined according to the matrix obtained by the multiplication;
the upper right corner element, the lower right corner element and the lower left corner element of the four-element matrix are all-zero matrices with the size of the block length multiplied by the block length, and the upper left corner element is an identity matrix with the size of the frame length multiplied by the frame length.
8. A speech filtering apparatus, wherein the apparatus comprises:
an obtaining module, configured to obtain frequency domain filtering parameters corresponding to each voice frequency domain block in a previous voice frame adjacent to a current voice frame;
a parameter determining module, configured to determine, according to the two first time domain constraint matrices and the frequency domain filtering parameters, obtained by the obtaining module, corresponding to each speech frequency domain block in the previous speech frame, the frequency domain filter estimation parameters corresponding to each speech frequency domain block in the current speech frame;
and the error determining module is used for carrying out filtering processing on the current voice frame according to the frequency domain filter estimation parameters corresponding to the voice frequency domain blocks in the current voice frame determined by the parameter determining module to obtain a frequency domain error signal of the current voice frame.
9. The speech filtering apparatus of claim 8, wherein the obtaining module is further configured to:
acquiring a frequency domain filter step size matrix corresponding to each voice frequency domain block in a previous voice frame adjacent to a current voice frame, a frequency domain filter estimation parameter corresponding to each voice frequency domain block in the previous voice frame, a frequency domain reference signal of each voice frequency domain block in the previous voice frame and a frequency domain error signal of the previous voice frame.
10. The speech filtering apparatus of claim 9, wherein the parameter determining module is further configured to:
for the ith voice frequency domain block of the current voice frame, performing matrix multiplication on a frequency domain filter step length matrix corresponding to the ith voice frequency domain block of the previous voice frame, a frequency domain reference signal of the previous voice frame, a frequency domain error signal of the previous voice frame and two first time domain constraint matrixes;
adding the result obtained by multiplying the matrix with the frequency domain filter estimation parameter corresponding to the ith voice frequency domain block of the previous voice frame;
and determining a frequency domain filter estimation parameter corresponding to the ith voice frequency domain block in the current voice frame according to the addition result and the preset echo path uncertainty.
11. A computer-readable storage medium, the storage medium storing a computer program for performing the method of any of the preceding claims 1-7.
12. An electronic device, the electronic device comprising:
a processor;
a memory for storing the processor-executable instructions;
the processor is configured to read the executable instructions from the memory and execute the instructions to implement the method of any one of claims 1-7.
CN201910645775.7A 2019-07-17 2019-07-17 Voice filtering method, device, medium and electronic equipment Pending CN112242145A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910645775.7A CN112242145A (en) 2019-07-17 2019-07-17 Voice filtering method, device, medium and electronic equipment
PCT/CN2019/100985 WO2021007902A1 (en) 2019-07-17 2019-08-16 Voice filtering method and apparatus, medium, and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910645775.7A CN112242145A (en) 2019-07-17 2019-07-17 Voice filtering method, device, medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN112242145A true CN112242145A (en) 2021-01-19

Family

ID=74167020

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910645775.7A Pending CN112242145A (en) 2019-07-17 2019-07-17 Voice filtering method, device, medium and electronic equipment

Country Status (2)

Country Link
CN (1) CN112242145A (en)
WO (1) WO2021007902A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022183326A1 (en) * 2021-03-01 2022-09-09 深圳市大疆创新科技有限公司 Filtering method and apparatus, movable platform, and storage medium
WO2023093292A1 (en) * 2021-11-26 2023-06-01 腾讯科技(深圳)有限公司 Multi-channel echo cancellation method and related apparatus

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101777887A (en) * 2010-01-08 2010-07-14 西安电子科技大学 FPGA (Field Programmable Gate Array)-based unscented kalman filter system and parallel implementation method
EP3217399A1 (en) * 2016-03-11 2017-09-13 GN ReSound A/S Kalman filtering based speech enhancement using a codebook based approach
CN107393550A (en) * 2017-07-14 2017-11-24 深圳永顺智信息科技有限公司 Method of speech processing and device
CN108806709A (en) * 2018-06-13 2018-11-13 南京大学 Adaptive acoustic echo cancellation method based on frequency domain Kalman filtering
CN109584898A (en) * 2018-12-29 2019-04-05 上海瑾盛通信科技有限公司 A kind of processing method of voice signal, device, storage medium and electronic equipment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101727906B (en) * 2008-10-29 2012-02-01 华为技术有限公司 Method and device for coding and decoding of high-frequency band signals
CN103854655B (en) * 2013-12-26 2016-10-19 上海交通大学 A kind of low bit-rate speech coder and decoder

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
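Lu Caixia et al.: "A delay-free frequency domain Kalman filter howling suppression algorithm with fast re-tracking", Acta Electronica Sinica (电子学报), vol. 46, no. 8, 31 August 2018 (2018-08-31), pages 1954 - 1959 *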

Also Published As

Publication number Publication date
WO2021007902A1 (en) 2021-01-21

Similar Documents

Publication Publication Date Title
CN109686381B (en) Signal processor for signal enhancement and related method
US10123113B2 (en) Selective audio source enhancement
Enzner et al. Acoustic echo control
US7925007B2 (en) Multi-input channel and multi-output channel echo cancellation
CN107924684B (en) Acoustic keystroke transient canceller for communication terminals using semi-blind adaptive filter models
US6377637B1 (en) Sub-band exponential smoothing noise canceling system
CN108172231B (en) Dereverberation method and system based on Kalman filtering
US8462962B2 (en) Sound processor, sound processing method and recording medium storing sound processing program
WO2007139621A1 (en) Adaptive acoustic echo cancellation
JP2014502074A (en) Echo suppression including modeling of late reverberation components
CN112863535B (en) Residual echo and noise elimination method and device
CN110428852B (en) Voice separation method, device, medium and equipment
CN112242145A (en) Voice filtering method, device, medium and electronic equipment
CN112185411A (en) Voice separation method, device, medium and electronic equipment
CN109215672B (en) Method, device and equipment for processing sound information
Schneider et al. The generalized frequency-domain adaptive filtering algorithm as an approximation of the block recursive least-squares algorithm
US8208649B2 (en) Methods and systems for robust approximations of impulse responses in multichannel audio-communication systems
Kamarudin et al. Acoustic echo cancellation using adaptive filtering algorithms for Quranic accents (Qiraat) identification
JP7270869B2 (en) Information processing device, output method, and output program
WO2023093292A1 (en) Multi-channel echo cancellation method and related apparatus
JP6648436B2 (en) Echo suppression device, echo suppression program, and echo suppression method
JP4425114B2 (en) Echo canceling method, echo canceling apparatus, echo canceling program, and recording medium recording the same
Ruiz et al. Cascade algorithms for combined acoustic feedback cancelation and noise reduction
CN113783551A (en) Filter coefficient determining method, echo eliminating method and device
JP2016152455A (en) Echo suppression device, echo suppression program and echo suppression method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination