Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
As used in this disclosure, "module," "device," "system," and the like are intended to refer to a computer-related entity, either hardware, a combination of hardware and software, or software in execution. In particular, for example, an element may be, but is not limited to being, a process running on a processor, an object, an executable, a thread of execution, a program, and/or a computer. Also, an application or script running on a server, or a server, may be an element. One or more elements may be in a process and/or thread of execution and an element may be localized on one computer and/or distributed between two or more computers and may be operated by various computer-readable media. The elements may also communicate by way of local and/or remote processes based on a signal having one or more data packets, e.g., from a data packet interacting with another element in a local system, distributed system, and/or across a network in the internet with other systems by way of the signal.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
As shown in fig. 1, an embodiment of the present invention provides a delay estimation method for echo cancellation of an electronic device (e.g., a smart speaker, a voice-controlled television set-top box, etc.), including the following steps:
S10, acquiring a reference signal and a microphone signal collected by a microphone, and performing a fast Fourier transform on each to obtain a frequency-domain reference signal and a frequency-domain microphone signal;
S20, inputting the frequency-domain reference signal to an adaptive filter to obtain a correlated frequency-domain reference signal, contained in the frequency-domain microphone signal, that corresponds to the frequency-domain reference signal, where the frequency-domain microphone signal is used to update the adaptive filter;
S30, calculating adaptive filter energy according to the correlated frequency-domain reference signal output by the adaptive filter, for determining a time delay value.
The embodiment of the invention uses the adaptive filter to determine the correlated frequency-domain reference signal, i.e. the part of the microphone signal collected by the microphone that is correlated with the reference signal, and then determines the time delay value of the reference signal by calculating the energy of the adaptive filter. This addresses the problems of the prior-art cross-correlation method, whose performance drops sharply under environmental interference and whose delay estimates are very unstable in complex environments or under double-talk.
In addition, because the adaptive filter is updated block by block in the frequency domain, based on the frequency-domain reference signal and the frequency-domain microphone signal, rather than sample by sample, the complexity of updating the adaptive filter is reduced, and with it the complexity of the delay estimation. Adaptive filtering of the reference signal yields the block, or the corresponding sampling point, that best matches the reference-correlated component contained in the microphone signal, which gives the delay; a more stable result can be obtained by continuing the adaptation and adding some post-processing.
As shown in fig. 2, in some embodiments of the invention, the step S10: the acquiring a reference signal and a microphone signal collected by a microphone, and performing fast fourier transform to obtain a frequency domain reference signal and a frequency domain microphone signal includes:
S11, acquiring a prestored reference signal and a microphone signal collected by a microphone; illustratively, for a smart speaker, when a song is played, the played song is stored as the reference signal, and the user's voice command collected by the smart speaker's microphone is the microphone signal.
S12, inputting the reference signal and the microphone signal into a low-pass filter for filtering; the low-pass filter is a 15th-order FIR filter, and, considering complexity and stability, 8× down-sampling is used.
S13, down-sampling the filtered reference signal and the filtered microphone signal respectively; specifically, in order to prevent frequency aliasing, the microphone signal and the reference signal are first low-pass filtered and then down-sampled to obtain the down-sampled signals. The filter removes the high-frequency content, so that no aliasing occurs when the spectrum expands during down-sampling; aliasing would fold high-frequency components into the low-frequency band.
S14, performing a fast Fourier transform on the down-sampled reference signal and the down-sampled microphone signal respectively to obtain the frequency-domain reference signal and the frequency-domain microphone signal. The down-sampled microphone signal and reference signal are converted from the time domain to the frequency domain to reduce the complexity of data processing. The frame length of the fast Fourier transform (FFT) is 128, 256, 512, or another size, which is not limited by the present invention.
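Steps S12 and S13 can be illustrated with a minimal sketch. The text fixes only the filter order (15) and the down-sampling factor (8); the Hamming-windowed sinc design, the cutoff at the post-decimation Nyquist rate, the function names `lowpass_taps` and `decimate`, and the placeholder input signals are all assumptions made for illustration.

```python
import math


def lowpass_taps(order=15, factor=8):
    """Hamming-windowed sinc FIR (assumed design); cutoff at the new Nyquist rate."""
    count = order + 1                      # an order-15 FIR has 16 taps
    cutoff = 0.5 / factor                  # as a fraction of the input sample rate
    centre = (count - 1) / 2
    taps = []
    for i in range(count):
        t = i - centre
        ideal = 2 * cutoff if t == 0 else math.sin(2 * math.pi * cutoff * t) / (math.pi * t)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * i / (count - 1))
        taps.append(ideal * window)
    gain = sum(taps)
    return [h / gain for h in taps]        # normalise to unit gain at DC


def decimate(signal, taps, factor=8):
    """Low-pass filter, then keep every `factor`-th output sample."""
    out = []
    for n in range(0, len(signal), factor):
        acc = 0.0
        for i, h in enumerate(taps):
            if n - i >= 0:                 # samples before the start count as zero
                acc += h * signal[n - i]
        out.append(acc)
    return out


taps = lowpass_taps()
reference = [1.0] * 64                     # placeholder signals, for illustration only
microphone = [0.5] * 64
ref_ds = decimate(reference, taps)         # both signals are decimated the same way,
mic_ds = decimate(microphone, taps)        # so their lengths stay consistent
```

Only the retained output samples are computed, so the filtering cost also drops by the down-sampling factor.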
As shown in fig. 3, in some embodiments of the invention, the above step S30: said calculating filter energy from said correlated frequency domain reference signal output by said adaptive filter for determining a delay value comprises:
S31, calculating the energy of each filter block in the frequency domain according to the correlated frequency-domain reference signal; for example, if the values of a filter block in the frequency domain are a number of complex points $[a_1 + j b_1,\ a_2 + j b_2,\ \ldots,\ a_n + j b_n]$, then the energy of the block can be expressed as $a_1^2 + b_1^2 + a_2^2 + b_2^2 + \ldots + a_n^2 + b_n^2$, i.e. the sum of the squared magnitudes (squared absolute values) of the complex points.
S32, determining the time delay value according to the maximum value among the energies of the filter blocks. For example, if the energies of the filter blocks are [1,2,4,2,1], the maximum energy is 4 (the 3rd value from left to right), so the corresponding delay is 3 (i.e. the index of the value 4), indicating a delay of 3 blocks.
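The block-energy peak search of S31 and S32 can be sketched as follows. The function names and the sample blocks are hypothetical; the returned position is counted from 1, following the worked example in the text.

```python
def block_energy(block):
    """Sum of squared magnitudes of the complex points in one filter block."""
    return sum(c.real ** 2 + c.imag ** 2 for c in block)


def block_delay(filter_blocks):
    """Return (1-based position of the highest-energy block, list of energies)."""
    energies = [block_energy(b) for b in filter_blocks]
    return energies.index(max(energies)) + 1, energies


# Hypothetical single-point blocks chosen so the energies are [1, 2, 4, 2, 1].
blocks = [[1 + 0j], [1 + 1j], [0 + 2j], [1 - 1j], [0 - 1j]]
delay, energies = block_delay(blocks)
print(delay, energies)
```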
This embodiment determines the time delay value by calculating the filter block energy in the frequency domain; the amount of data to be calculated and processed is small, which reduces the complexity of estimating the delay value.
As shown in fig. 4, in some embodiments of the invention, the above step S30: said calculating filter energy from said correlated frequency domain reference signal output by said adaptive filter for determining a delay value comprises:
S31', performing an inverse Fourier transform on the correlated frequency-domain reference signal to obtain the corresponding correlated time-domain reference signal;
S32', calculating the energy of each sampling point in the time domain according to the correlated time-domain reference signal; for example, for a series of input samples $[1, 2, 3, 4]$, the energy of each sampling point is $1^2, 2^2, 3^2, 4^2$, i.e. the square of the corresponding sample.
S33', determining the delay value according to the maximum value among the energies of the sampling points. For example, assuming that the energies of the sampling points are [1,2,4,2,1], the maximum energy is 4 (the 3rd value from left to right), so the corresponding delay is 3 (i.e. the index of the value 4), meaning that the delay is 3 sampling points. This differs from the block delay in its unit: to convert a block delay into a sample-point delay, the block delay is multiplied by the size of each block.
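A minimal sketch of the time-domain variant, together with the unit conversion just described. The function names and the sample values are assumptions for illustration, and positions are again counted from 1 as in the text's example.

```python
def sample_energies(samples):
    """Energy of each time-domain sample: the square of the sample."""
    return [s * s for s in samples]


def sample_delay(samples):
    """1-based position of the sample with the largest energy."""
    energies = sample_energies(samples)
    return energies.index(max(energies)) + 1


def blocks_to_samples(block_delay, block_size):
    """Convert a delay in blocks to a delay in sampling points."""
    return block_delay * block_size


print(sample_energies([1, 2, 3, 4]))              # squares of the input samples
print(sample_delay([1.0, -1.4, 2.0, 1.4, -1.0]))  # peak is the 3rd sample
print(blocks_to_samples(3, 512))
```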
In the embodiment of the invention, the time delay value is estimated by calculating the energy of the sampling points in the time domain, and the estimation precision of the time delay value can be improved by estimating the energy of a plurality of sampling points in the time domain.
As shown in fig. 5, in some embodiments of the invention, the above step S30: said calculating filter energy from said correlated frequency domain reference signal output by said adaptive filter for determining a delay value comprises:
S31', calculating the energy of each filter block in the frequency domain according to the correlated frequency-domain reference signal;
S32', determining a first delay value according to the maximum value among the energies of the filter blocks;
S33', performing an inverse Fourier transform on the correlated frequency-domain reference signal to obtain the corresponding correlated time-domain reference signal;
S34', calculating the energy of each sampling point in the time domain according to the correlated time-domain reference signal;
S35', determining a second delay value according to the maximum value among the energies of the sampling points;
S36', determining the delay value according to the first delay value and the second delay value. For example, suppose each filter block has 512 points. If the sample-point delay estimate is 1024 and the block delay estimate is 1, the sample-point delay converts to a block delay of 1024/512 = 2, which does not equal the estimated block delay; the estimate is therefore invalidated, and the previous delay value is output again. If instead the block delay estimate is also 2, it agrees with the sample-point delay estimate, and the current result can be output.
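The cross-check in S36' can be written as a short sketch. The function name `fuse_delays` is hypothetical, and two details are assumptions the text does not settle: integer division is used when the sample-point delay is not a multiple of the block size, and the sample-point delay is taken as "the current result".

```python
def fuse_delays(block_delay, sample_delay, block_size, previous):
    """Accept the sample-point delay only if it agrees with the block delay."""
    if sample_delay // block_size == block_delay:
        return sample_delay        # the two estimates agree: output the current result
    return previous                # mismatch: keep the previous delay value


# The two cases from the text, with 512 points per block and a hypothetical
# previous output of 512 samples.
print(fuse_delays(1, 1024, 512, previous=512))   # mismatch, previous value kept
print(fuse_delays(2, 1024, 512, previous=512))   # agreement, current value output
```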
In the embodiment of the invention, both the sample-point delay and the block delay are obtained by peak search over the filter coefficients W; each is the delay corresponding to the peak found. The embodiment of the invention considers the sample-point delay and the block delay together, each serving as a reference for the other, which further improves the precision of the delay estimation.
As shown in fig. 6, which is a flowchart of an embodiment of the delay estimation method of the present invention, the method specifically includes the following steps: down-sampling, Fourier transform, adaptive filtering, peak search, and post-processing. The input is two signals (a microphone signal and a reference signal), and a result is output for each frame. Each step is described separately below.
1) Down sampling
The reference signal and the microphone signal are low-pass filtered and then down-sampled, which reduces the complexity of the algorithm.
In order to prevent frequency aliasing, the signals first pass through a low-pass filter; a 15th-order or 7th-order FIR filter is used here, which is not limited by the present invention. For the down-sampling, considering complexity and stability, 8× or 4× down-sampling is adopted, which is likewise not limited by the present invention. The microphone signal and the reference signal are down-sampled together so that the data length processed each time is consistent; down-sampling here includes both the low-pass filtering and the sample-rate reduction.
The filter removes the high-frequency content, so that no aliasing occurs when the spectrum expands during down-sampling; aliasing would fold high-frequency components into the low-frequency band.
2) FFT (Fourier transform)
The down-sampled microphone signal and the reference signal are subjected to FFT (fourier transform) respectively. In order to reduce the complexity, processing is performed in the frequency domain, and therefore an FFT is required, and the frame length of the FFT is 128, 256, 512 or other sizes, which is not limited by the present invention.
3) Adaptive filtering
Here, circular convolution is used in place of linear convolution, implemented with the overlap-save method with 50% overlap. The purpose of the adaptive filtering is to estimate the part of the microphone signal that is correlated with the reference signal. The input of the adaptive filtering is the reference signal, and the output is the estimate of the correlated part of the microphone signal.
For the k-th filter block and the reference signal, the filtered output of the reference signal is:
$$y(k) = \text{second half of } \mathrm{IFFT}[\,X(k)\,W(k)\,],$$
where $X(k)$ is the far-end block signal and $W(k)$ is the filter block coefficients. Only the second half of the elements is retained, because only that half of the circular convolution output coincides with the linear convolution result. Here an element is a sampling point, and the far-end block signal is the reference signal divided into blocks.
The time-domain block error signal is:
$$e(k) = d(k) - y(k),$$
where $d(k)$ represents the microphone signal.
The frequency-domain block error signal is:
$$E(k) = \mathrm{FFT}[\,0 \;\; e(k)\,],$$
where 0 means that a half-length block of zeros is placed before $e(k)$.
$E(k)$ is then normalized using the smoothed energy of the reference signal, denoted $|X(k)|$, together with a fixed value $\delta$ that prevents the filter from diverging.
The update amount of the filter, $\Phi(k)$, is the first half of the elements of the update term computed from the normalized error, scaled by the step-size factor $\mu$; only the first half of the result is correct, and the second half is discarded. This method is called the overlap-save method, meaning that only a portion is saved.
The filter update formula is:
$$W(k+1) = W(k) + \mathrm{FFT}[\,\Phi(k) \;\; 0\,],$$
where 0 means that a half-length block of zeros is appended after $\Phi(k)$.
Updating the filter is a critical step. $\Phi(k)$ depends on the error $E(k)$; the error depends on the near-end signal and on the estimate of the reference signal at the near end; and that estimate in turn requires filtering the reference signal. This is an iterative process.
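The iteration above can be sketched end to end as a partitioned overlap-save adaptive filter. This is an illustration under several assumptions rather than the embodiment's exact implementation: the output is formed by summing over all filter blocks, the normalization is taken as $E(k)$ divided by an exponentially smoothed $|X(k)|^2$ plus $\delta$, and the update term is the inverse transform of the conjugate reference spectrum times the normalized error. A naive DFT stands in for the FFT. The function names, the parameter values (`block`, `parts`, `mu`, `lam`, `delta`), the synthetic signals, and the 0-based indices are all hypothetical.

```python
import cmath
import random


def dft(values, inverse=False):
    """Naive O(n^2) discrete Fourier transform, a stand-in for an FFT."""
    n = len(values)
    sign = 1 if inverse else -1
    out = [sum(v * cmath.exp(sign * 2j * cmath.pi * f * t / n)
               for t, v in enumerate(values)) for f in range(n)]
    return [c / n for c in out] if inverse else out


def estimate_filter(x, d, block=4, parts=4, mu=0.1, lam=0.9, delta=1e-6):
    """Return the frequency-domain filter blocks W[p] after adapting on x and d."""
    n = 2 * block
    W = [[0j] * n for _ in range(parts)]
    X_hist = [[0j] * n for _ in range(parts)]          # X(k), X(k-1), ...
    power, prev = None, [0.0] * block
    for k in range(len(x) // block):
        cur = x[k * block:(k + 1) * block]
        X = dft(prev + cur)                            # 50% overlap with the previous block
        prev = cur
        X_hist = [X] + X_hist[:-1]
        mag2 = [abs(c) ** 2 for c in X]
        power = mag2 if power is None else [lam * p + (1 - lam) * m
                                            for p, m in zip(power, mag2)]
        # y(k): second half of the inverse transform of the filtered spectrum
        Y = [sum(X_hist[p][f] * W[p][f] for p in range(parts)) for f in range(n)]
        y = [c.real for c in dft(Y, inverse=True)[block:]]
        e = [dk - yk for dk, yk in zip(d[k * block:(k + 1) * block], y)]
        E = dft([0.0] * block + e)                     # zeros placed before e(k)
        E_norm = [E[f] / (power[f] + delta) for f in range(n)]   # assumed normalization
        for p in range(parts):
            grad = dft([X_hist[p][f].conjugate() * E_norm[f] for f in range(n)],
                       inverse=True)
            phi = [mu * g.real for g in grad[:block]]  # keep the first half only
            update = dft(phi + [0.0] * block)          # zeros appended after phi
            W[p] = [w + u for w, u in zip(W[p], update)]
    return W


BLOCK, DELAY = 4, 9
random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(2000)]      # synthetic reference signal
d = [0.0] * DELAY + x[:-DELAY]                         # microphone = delayed reference
W = estimate_filter(x, d, block=BLOCK)
block_energy = [sum(abs(c) ** 2 for c in Wp) for Wp in W]
block_index = block_energy.index(max(block_energy))    # 0-based block delay
taps = [c.real for Wp in W for c in dft(Wp, inverse=True)[:BLOCK]]
tap_index = max(range(len(taps)), key=lambda i: taps[i] ** 2)   # 0-based sample delay
print(block_index, tap_index)
```

With a pure delay of 9 samples and blocks of 4 samples, the filter energy should concentrate in the block covering taps 8 to 11, which is what the peak search described in the text relies on.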
4) Peak search
Since both a sample-point delay and a block delay are to be output, two different branches are needed:
For the sample-point delay, the whole filter is transformed to the time domain by IFFT, the energy of each filter coefficient is calculated, and the sampling point with the largest energy is selected as the estimated sample-point delay.
For the block delay, the energy of each filter block is calculated in the frequency domain, and the block with the maximum energy is taken as the block delay.
Both the sample-point delay and the block delay are obtained by peak search over the filter coefficients W; each is the delay corresponding to the peak found. They can be used together or separately to estimate the delay. The block delay has low precision and low complexity but a larger error; the sample-point delay has a smaller error but higher complexity. When the two are estimated together, each serves as a reference for the other, which further improves the precision.
5) Post-processing
The post-processing mainly addresses two problems: (1) the delay lies outside the filter length; (2) the delay shows abnormal short-term jitter.
To alleviate the first problem, a short-time analysis is performed over 20 consecutive frames: the average energy and the peak energy of the filter are obtained for each frame and then averaged over the 20 frames, and the two averages are compared. If the peak energy is less than a certain multiple of the average energy, the estimate for these 20 frames is considered unreliable. Collecting statistics over 20 frames at a time guards against sudden, sporadic delay values within a short time, which are meaningless for the subsequent AEC filter adjustment, and it also reduces the complexity somewhat.
To alleviate the second problem, the currently estimated delay is compared with the average estimated delay of the previous 20 frames; if the difference exceeds a threshold, the estimate for the current frame is considered unreliable and the result of the previous frame is output. If the current frame is considered reliable, and several consecutive frames have also been considered reliable, the delay result of the current frame is output, multiplied by the down-sampling factor, as the final sample-point delay result.
Post-processing of the block delay is relatively simple, because the block delay estimate fluctuates little and excessive post-processing would affect its accuracy. Only problem (1) is considered: the energy of the current block is compared with the average energy of the whole filter, and if it is smaller than a certain threshold, the current delay result is considered invalid.
A frame here is the number of samples, i.e. the length, of the reference signal and the microphone signal processed at one time. Since real-time output is required, a whole audio segment cannot be processed at once, so the output must be produced while processing. A frame has the same length as the block described earlier, while the filter consists of several such blocks. The input to the post-processing is the delay result previously estimated for each frame. Operations such as smoothing are performed on the multi-frame results; the purpose of the post-processing is to make the results more stable. The output is also a delay result, but one that combines multiple frames.
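The two post-processing checks can be sketched as follows. The text does not give the energy multiple or the jitter threshold, so `ratio=3.0` and `threshold=2.0` are placeholder values; the names `window_is_reliable` and `JitterGate` are hypothetical, and the additional requirement that several consecutive frames be reliable is omitted for brevity.

```python
from collections import deque


def window_is_reliable(frame_energies, ratio=3.0):
    """Check (1): frame_energies holds one list of filter energies per frame."""
    count = len(frame_energies)
    mean_avg = sum(sum(f) / len(f) for f in frame_energies) / count
    peak_avg = sum(max(f) for f in frame_energies) / count
    return peak_avg >= ratio * mean_avg


class JitterGate:
    """Check (2): reject a delay that strays too far from the recent average."""

    def __init__(self, history=20, threshold=2.0, factor=8):
        self.past = deque(maxlen=history)   # raw estimates of the previous frames
        self.threshold = threshold
        self.factor = factor                # down-sampling factor
        self.last_output = None

    def push(self, delay):
        if self.past and abs(delay - sum(self.past) / len(self.past)) > self.threshold:
            result = self.last_output       # unreliable: repeat the previous output
        else:
            result = delay * self.factor    # reliable: convert to full-rate samples
        self.past.append(delay)
        self.last_output = result
        return result


peaked = [[1.0, 1.0, 10.0, 1.0]] * 20       # clear peak in every frame
flat = [[1.0, 1.0, 1.0, 1.0]] * 20          # no peak, e.g. delay outside the filter
gate = JitterGate()
outputs = [gate.push(d) for d in [5, 5, 5, 30]]
print(window_is_reliable(peaked), window_is_reliable(flat), outputs)
```

In this sketch the history stores the raw estimates, including rejected ones, so a genuine change of delay eventually moves the average; whether the embodiment does the same is not stated.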
It should be noted that for simplicity of explanation, the foregoing method embodiments are described as a series of acts or combination of acts, but those skilled in the art will appreciate that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention. In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
As shown in fig. 7, an embodiment of the present invention further provides a delay estimation system 700, including:
a signal obtaining program module 710, configured to obtain a reference signal and a microphone signal collected by a microphone, and perform fast fourier transform to obtain a frequency domain reference signal and a frequency domain microphone signal;
an adaptive filter program module 720, configured to input the frequency-domain reference signal to an adaptive filter to obtain a correlated frequency-domain reference signal, contained in the frequency-domain microphone signal, that corresponds to the frequency-domain reference signal, where the frequency-domain microphone signal is used to update the adaptive filter;
a delay determination module 730, configured to calculate filter energy for determining a delay value according to the correlated frequency domain reference signal output by the adaptive filter.
The embodiment of the invention uses the adaptive filter to determine the correlated frequency-domain reference signal, i.e. the part of the microphone signal collected by the microphone that is correlated with the reference signal, and then determines the time delay value of the reference signal by calculating the energy of the adaptive filter. This addresses the problems of the prior-art cross-correlation method, whose performance drops sharply under environmental interference and whose delay estimates are very unstable in complex environments or under double-talk.
In addition, because the adaptive filter is updated block by block in the frequency domain, based on the frequency-domain reference signal and the frequency-domain microphone signal, rather than sample by sample, the complexity of updating the adaptive filter is reduced, and with it the complexity of the delay estimation. Adaptive filtering of the reference signal yields the block, or the corresponding sampling point, that best matches the reference-correlated component contained in the microphone signal, which gives the delay; a more stable result can be obtained by continuing the adaptation and adding some post-processing.
As shown in fig. 8, in some embodiments of the invention, the signal acquisition program module 710 includes:
a signal acquisition program unit 711 for acquiring a reference signal stored in advance and a microphone signal acquired by a microphone;
a filter processing program unit 712, configured to input the reference signal and the microphone signal to a low-pass filter for filter processing;
a down-sampling program unit 713, configured to down-sample the filtered reference signal and the filtered microphone signal, respectively;
a fourier transform program unit 714, configured to perform fast fourier transform on the down-sampled reference signal and the down-sampled microphone signal obtained by down-sampling to obtain the frequency-domain reference signal and the frequency-domain microphone signal, respectively.
As shown in fig. 9, in some embodiments of the invention, the delay determination module 730 includes:
an energy calculation program unit 731 for calculating the energy of each filter block in the frequency domain from the correlated frequency domain reference signal;
a delay determination unit 732, configured to determine the delay value according to a maximum value of the energy of each block of the filter.
As shown in fig. 10, in some embodiments of the invention, the delay determination module 730 includes:
a signal conversion program unit 731' for performing inverse fourier transform on the correlated frequency domain reference signal to obtain a corresponding correlated time domain reference signal;
an energy calculation program unit 732' for calculating the energy of each sample point in the time domain from the correlated time domain reference signal;
a delay determination program unit 733' for determining said delay value from a maximum value of the energy of said each sample point.
As shown in fig. 11, in some embodiments of the invention, the delay determination module 730 includes:
a first energy calculation program unit 731' for calculating the energy per filter block in the frequency domain from the correlated frequency domain reference signal;
a first delay determination program unit 732' for determining a first delay value from a maximum of the energy of said each block of filters;
an inverse fourier transform program unit 733' for performing an inverse fourier transform on the correlated frequency domain reference signal to obtain a corresponding correlated time domain reference signal;
a second energy calculation program unit 734' for calculating the energy of each sampling point in the time domain according to the correlated time domain reference signal;
a second delay determining program unit 735' for determining a second delay value based on a maximum value of the energy of said each sample point;
a delay determining program unit 736' configured to determine the delay value according to the first delay value and the second delay value.
In some embodiments, the present invention provides a non-transitory computer-readable storage medium, in which one or more programs including executable instructions are stored, where the executable instructions can be read and executed by an electronic device (including but not limited to a computer, a server, or a network device, etc.) to perform any of the latency estimation methods of the present invention.
In some embodiments, the present invention further provides a computer program product comprising a computer program stored on a non-volatile computer-readable storage medium, the computer program comprising program instructions that, when executed by a computer, cause the computer to perform any of the latency estimation methods described above.
In some embodiments, an embodiment of the present invention further provides an electronic device, which includes: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a latency estimation method.
In some embodiments, the present invention further provides a storage medium having a computer program stored thereon, wherein the computer program is executed by a processor to perform the time delay estimation method.
The delay estimation system of the embodiment of the present invention may be used to execute the delay estimation method of the embodiment of the present invention, and accordingly achieves the technical effect achieved by implementing that method, which is not described herein again. In the embodiment of the present invention, the relevant functional modules may be implemented by a hardware processor.
Fig. 12 is a schematic diagram of a hardware structure of an electronic device for executing a delay estimation method according to another embodiment of the present application, and as shown in fig. 12, the electronic device includes:
one or more processors 1210 and a memory 1220, with one processor 1210 being an example in fig. 12.
The apparatus for performing the delay estimation method may further include: an input device 1230 and an output device 1240.
The processor 1210, memory 1220, input device 1230, and output device 1240 may be connected by a bus or other means, such as by a bus connection in fig. 12.
The memory 1220 is a non-volatile computer-readable storage medium, and can be used for storing non-volatile software programs, non-volatile computer-executable programs, and modules, such as program instructions/modules corresponding to the time delay estimation method in the embodiment of the present application. The processor 1210 executes various functional applications of the server and data processing by running nonvolatile software programs, instructions and modules stored in the memory 1220, so as to implement the latency estimation method of the above method embodiment.
The memory 1220 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created from use of the delay estimation device, and the like. Further, the memory 1220 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the memory 1220 may optionally include memory located remotely from the processor 1210, and such remote memory may be connected to the latency estimation apparatus via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 1230 may receive input numeric or character information and generate signals related to user settings and function control of the delay estimation device. The output device 1240 may include a display device such as a display screen.
The one or more modules are stored in the memory 1220 and, when executed by the one or more processors 1210, perform the latency estimation method of any of the method embodiments described above.
The product can execute the method provided by the embodiment of the application, and has the corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the methods provided in the embodiments of the present application.
The electronic device of the embodiments of the present application exists in various forms, including but not limited to:
(1) Mobile communication devices, which are characterized by mobile communication capabilities and primarily aim to provide voice and data communication. Such terminals include smart phones (e.g., iPhone), multimedia phones, feature phones, and low-end phones, among others.
(2) Ultra-mobile personal computer devices, which belong to the category of personal computers, have computing and processing functions, and generally also have mobile internet access. Such terminals include PDA, MID, and UMPC devices, such as the iPad.
(3) Portable entertainment devices, which can display and play multimedia content. Such devices include audio and video players (e.g., iPod), handheld game consoles, e-book readers, as well as smart toys and portable car navigation devices.
(4) Servers, which are similar in architecture to general-purpose computers but, because they must provide highly reliable services, have higher requirements on processing capability, stability, reliability, security, scalability, manageability, and the like.
(5) Other electronic devices with data interaction functions.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a general hardware platform, and certainly can also be implemented by hardware. Based on such understanding, the above technical solutions, in essence or in the part that contributes over the related art, may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disk, and which includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.