CN109727607B - Time delay estimation method and device and electronic equipment

Info

Publication number: CN109727607B (other versions: CN109727607A)
Application number: CN201711043361.4A
Authority: CN (China)
Prior art keywords: audio, fingerprint, far, signal, time
Legal status: Active (application granted)
Inventor: 王天宝
Assignee: Tencent Technology Shenzhen Co Ltd (original assignee and current assignee)
Priority: CN201711043361.4A, filed by Tencent Technology Shenzhen Co Ltd

Abstract

The disclosure relates to a time delay estimation method and device and an electronic device. The time delay estimation method includes: acquiring the sound signal collected by a microphone and the far-end speech signal output by a loudspeaker; taking the sound signal and the far-end speech signal respectively as fingerprint input signals and extracting the dynamic-change features of the audio energy of each, to obtain an audio fingerprint of the sound signal and an audio fingerprint of the far-end speech signal; and comparing the two audio fingerprints to obtain a time delay estimation result. With the method and device, the response speed of time delay estimation is improved, and the echo leakage that occurs when echo cancellation cannot work normally during a delay change is avoided.

Description

Time delay estimation method and device and electronic equipment
Technical Field
The present disclosure relates to the field of communications technologies, and in particular, to a time delay estimation method and apparatus, and an electronic device.
Background
With the development of communication technology, more and more application scenarios involve calls: a user makes a video or voice call on a smartphone, participants hold a teleconference through a video conference system, or a chat robot takes part in a conversation through a dialogue system.
During a call, the client of one party first receives the far-end speech signal and outputs it through the loudspeaker, its microphone then collects the sound signal, and finally the collected sound signal is output for the other party's client to receive.
If the sound signal collected by the microphone contains the far-end speech signal in addition to the near-end speech signal (generated by the user speaking during the call), the call suffers echo interference, which directly degrades call quality. The prior art therefore proposes echo cancellation: first find the time delay of the sound signal relative to the far-end speech signal, and then use that delay to cancel the far-end speech signal from the sound signal.
However, on many electronic devices the time delay changes in real time, even within the same call. If the response to such a delay change is too slow, echo cancellation cannot work normally while the delay is changing, resulting in echo leakage.
Disclosure of Invention
In order to solve the above technical problem, an object of the present disclosure is to provide a delay estimation method, a delay estimation device and an electronic device.
The technical solutions adopted by the present disclosure are as follows:
in one aspect, a method for estimating delay includes: acquiring a sound signal acquired by a microphone and a far-end voice signal output by a loudspeaker; respectively taking the sound signal and the far-end voice signal as fingerprint input signals, and extracting dynamic change characteristics of audio energy of the fingerprint input signals to obtain an audio fingerprint of the sound signal and an audio fingerprint of the far-end voice signal; and comparing the audio fingerprint of the sound signal with the audio fingerprint of the far-end voice signal to obtain a time delay estimation result.
In another aspect, a delay estimation apparatus includes: the signal acquisition module is used for acquiring the sound signal acquired by the microphone and the far-end voice signal output by the loudspeaker; the feature extraction module is used for taking the sound signal and the far-end voice signal as fingerprint input signals respectively, and performing dynamic change feature extraction on the fingerprint input signals to obtain an audio fingerprint of the sound signal and an audio fingerprint of the far-end voice signal; and the fingerprint comparison module is used for comparing the audio fingerprint of the sound signal with the audio fingerprint of the far-end voice signal to obtain a time delay estimation result.
In another aspect, an electronic device includes a processor and a memory, the memory having stored thereon computer-readable instructions which, when executed by the processor, implement the time delay estimation method described above.
In another aspect, a computer-readable storage medium has a computer program stored thereon, wherein the computer program, when executed by a processor, implements the time delay estimation method described above.
Compared with the prior art, the method has the following beneficial effects:
The audio fingerprint of the sound signal is obtained by extracting the dynamic-change features of the audio energy of the sound signal collected by the microphone, and the audio fingerprint of the far-end speech signal is obtained likewise from the far-end speech output by the loudspeaker. Comparing the two audio fingerprints yields a time delay estimation result, according to which echo cancellation can then be performed during the call.
Because the audio fingerprint represents the dynamic-change features of a signal's audio energy, fingerprint comparison reflects in real time the time difference at which the sound signal and the far-end speech signal are similar, i.e., the time delay of the sound signal relative to the far-end speech signal. Once the delay changes, that time difference changes as well, so the delay change can be responded to promptly. This effectively improves the response speed of delay estimation and solves the problem of echo leakage caused by echo cancellation failing to work normally while the delay changes.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
Fig. 1 is a block diagram illustrating a hardware configuration of an electronic device according to an exemplary embodiment.
Fig. 2 is a flow chart illustrating a method of delay estimation in accordance with an example embodiment.
Fig. 3 is a flow chart illustrating another method of delay estimation in accordance with an example embodiment.
Fig. 4 is a flowchart of an embodiment of the step of extracting the dynamic change feature of the audio energy from the fingerprint input signal to obtain the audio fingerprint in the embodiment corresponding to fig. 2 or fig. 3.
Fig. 5 is a flow chart illustrating another method of delay estimation in accordance with an example embodiment.
FIG. 6 is a flow diagram of one embodiment of step 250 of the corresponding embodiment of FIG. 2.
FIG. 7 is a flowchart of one embodiment of step 255 of the corresponding embodiment of FIG. 6.
Fig. 8 is a flowchart of a specific implementation of a delay estimation method in an application scenario.
Fig. 9 is a block diagram illustrating a delay estimation apparatus according to an example embodiment.
Fig. 10 is a block diagram illustrating another delay estimation apparatus according to an example embodiment.
Fig. 11 is a block diagram of one embodiment of the feature extraction module 730 in the corresponding embodiment of fig. 9.
Fig. 12 is a block diagram illustrating another delay estimation apparatus according to an example embodiment.
Fig. 13 is a block diagram of one embodiment of the fingerprint comparison module 750 in the corresponding embodiment of fig. 9.
Fig. 14 is a block diagram of one embodiment of the result update unit 757 in the corresponding embodiment of fig. 12.
While specific embodiments of the disclosure have been shown and described in detail in the drawings and foregoing description, such drawings and description are not intended to limit the scope of the disclosed concepts in any way, but rather to explain the concepts of the disclosure to those skilled in the art by reference to the particular embodiments.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
Currently, to cancel the far-end speech signal contained in a sound signal, two delay estimation methods are generally used: the time stamp method and the binary spectral line method.
The time stamp method adds a time stamp to each signal and derives the delay from it: since the sound signal collected by the microphone lags behind the far-end speech signal output by the loudspeaker, the delay of the sound signal relative to the far-end speech signal can be obtained as the difference between their time stamps. This method is only suitable for personal computer (PC) devices and cannot be applied to other electronic devices capable of making calls; its universality is poor and its delay estimation accuracy is low.
The binary spectral line method estimates the delay from signal characteristics, but is limited by the design of its algorithm: when the delay changes, a long time is needed to re-estimate the new delay, and during the re-estimation, echo cancellation performed with the old delay causes echo leakage.
For this reason, the present disclosure proposes a delay estimation method applicable to an electronic device, for example a smartphone, a desktop computer, a laptop computer, a tablet computer, or any other electronic device capable of making calls; no limitation is imposed here. The method responds promptly to delay changes within the same call, so the delay estimation result is updated immediately. It thus improves the response speed of delay estimation, achieves higher delay estimation accuracy, and facilitates echo cancellation, improving call quality.
Referring to fig. 1, fig. 1 is a block diagram illustrating an electronic device according to an example embodiment. It should be noted that the electronic device 100 is only an example adapted to the present disclosure and should not be considered to limit the scope of the disclosure in any way; nor should the electronic device 100 be construed as needing to depend on, or to include, every component of the exemplary electronic device 100 shown in fig. 1.
As shown in fig. 1, the electronic device 100 includes a memory 101, a memory controller 103, one or more processors 105 (only one shown), a peripheral interface 107, a radio frequency module 109, a positioning module 111, a camera module 113, an audio module 115, a touch screen 117, and a key module 119. These components communicate with each other via one or more communication buses/signal lines 121.
The memory 101 may be used to store software programs and modules, such as program instructions and modules corresponding to the time delay estimation method and apparatus in the exemplary embodiments of the present disclosure, and the processor 105 executes various functions and data processing by executing the program instructions stored in the memory 101, that is, implements the time delay estimation method.
As the carrier of resource storage, the memory 101 may be a random access medium such as high-speed random access memory, or non-volatile memory such as one or more magnetic storage devices, flash memory, or other solid-state memory. Storage may be transient or permanent.
The peripheral interface 107 may include at least one wired or wireless network interface, at least one serial-to-parallel conversion interface, at least one input/output interface, at least one USB interface, and the like, for coupling various external input/output devices to the memory 101 and the processor 105 to realize communication with various external input/output devices.
The rf module 109 is configured to receive and transmit electromagnetic waves, and achieve interconversion between the electromagnetic waves and electrical signals, so as to communicate with other devices through a communication network. Communication networks include cellular telephone networks, wireless local area networks, or metropolitan area networks, which may use various communication standards, protocols, and technologies.
The positioning module 111 is used for acquiring the current geographic position of the electronic device 100. Examples of the positioning module 111 include, but are not limited to, the Global Positioning System (GPS), wireless-local-area-network-based positioning technology, and mobile-communication-network-based positioning technology.
The camera module 113 is attached to a camera and is used for taking pictures or videos. The shot pictures or videos can be stored in the memory 101 and also can be sent to an upper computer through the radio frequency module 109.
The audio module 115 provides an audio interface to the user, and may include one or more microphone interfaces, one or more speaker interfaces, and one or more headphone interfaces, through which audio data is exchanged with other devices. The audio data may be stored in the memory 101 and may also be transmitted through the radio frequency module 109.
The touch screen 117 provides an input-output interface between the electronic device 100 and a user. Specifically, the user may perform an input operation, such as a gesture operation, e.g., clicking, touching, sliding, etc., through the touch screen 117, so that the electronic apparatus 100 responds to the input operation. The electronic device 100 displays and outputs output contents formed by any one or combination of text, pictures or videos to the user through the touch screen 117.
The key module 119 includes at least one key for providing an interface for a user to input to the electronic device 100, and the user can cause the electronic device 100 to perform different functions by pressing different keys. For example, the sound adjustment key may allow a user to effect an adjustment of the volume of sound played by the electronic device 100.
It will be appreciated that the configuration shown in FIG. 1 is merely illustrative and that electronic device 100 may include more or fewer components than shown in FIG. 1 or different components than shown in FIG. 1.
Furthermore, the present disclosure can be implemented equally as hardware circuitry or hardware circuitry in combination with software instructions, and thus implementation of the present disclosure is not limited to any specific hardware circuitry, software, or combination of both.
Referring to fig. 2, in an exemplary embodiment, a delay estimation method is applied to an electronic device, and the electronic device may adopt the hardware structure shown in fig. 1.
The delay estimation method may be executed by the electronic device 100 in fig. 1, and may include the following steps:
step 210, acquiring a sound signal collected by a microphone and a far-end voice signal output by a loudspeaker.
It should be noted that all signals generated during a call are audio signals. For ease of distinction, the audio signal output by the loudspeaker is defined as the far-end speech signal, and the audio signal collected by the microphone is defined as the sound signal. The sound signal includes not only the audio related to the actual call content, e.g., what the user says during a voice call, but possibly also the far-end speech signal and other noise.
During a call, the sound signal is transmitted between the clients of the two parties. If it contains the far-end speech signal, the call may suffer echo interference, affecting call quality. To avoid such interference, an echo cancellation technique must be applied to the sound signal to remove its echo component, i.e., the far-end speech signal.
For example, when user A and user B hold a voice call, an audio signal corresponding to what user A says, i.e., the near-end speech signal C0, is generated first; user A's client then collects, through the microphone A1, a sound signal including the near-end speech signal C0 and outputs it to user B's client.
For user B, the client receives that sound signal and outputs it through the speaker B1, forming the far-end speech signal C1. As user B hears the far-end speech signal C1 and replies, an audio signal corresponding to user B's speech is generated, i.e., the near-end speech signal C2, and a sound signal including the near-end speech signal C2 is collected by the microphone B2 and output to user A's client.
In the above call process, the far-end speech signal C1 is actually formed by outputting the near-end speech signal C0 generated from user A's speech. If the sound signal collected by the microphone B2 includes the far-end speech signal C1 in addition to the near-end speech signal C2, then when that sound signal is output to user A's client, user A will hear his or her own speech (i.e., the near-end speech signal C0) coming back; that is, an echo has been generated during the call.
Therefore, in order to eliminate the echo component (i.e. far-end voice signal) in the sound signal later, the sound signal is collected by the microphone first, and the far-end voice signal output by the loudspeaker is acquired.
Step 230: take the sound signal and the far-end speech signal respectively as fingerprint input signals, and extract the dynamic-change features of the audio energy of each fingerprint input signal, obtaining the audio fingerprint of the sound signal and the audio fingerprint of the far-end speech signal.
In order to perform echo cancellation on a voice signal, it is first necessary to know the time delay of the voice signal relative to a far-end voice signal.
It will be appreciated that the far-end speech signal is first output by the loudspeaker and may then be picked up by the microphone to form the echo component of the sound signal. In other words, the echo component in the sound signal is derived from, and lags behind, the far-end speech signal currently output by the speaker. This lag forms the time delay of the sound signal relative to the far-end speech signal.
Based on the above, the time delay of the sound signal relative to the far-end speech signal is essentially the time difference at which the two signals are similar.
In this embodiment, the signal characteristics of the sound signal and the far-end speech signal are first expressed as audio fingerprints, so that the similarity between the two can subsequently be obtained from their audio fingerprints, and the relative delay between them determined from the time difference at which that similarity appears.
In a specific implementation of an exemplary embodiment, as shown in fig. 4, the audio fingerprint acquisition process may include the following steps:
step 411, performing time-frequency transformation on the fingerprint input signal to obtain multi-frame frequency domain data.
First, the fingerprint input signal refers to the sound signal collected by the microphone or the far-end speech signal output by the loudspeaker. Thanks to the analog-to-digital conversion modules built into the microphone and loudspeaker, the fingerprint input signal is a discrete digital signal rather than a continuous analog signal.
The discrete digital signal is first subjected to time-domain framing, i.e., windowing, so that time-domain data of a specified number of sampling points is extracted from it frame by frame. The specified number of sampling points can be adjusted flexibly to the application scenario: when high delay estimation accuracy is required, a larger number is set, e.g., 4096; when a fast delay estimation response is required, a smaller number is set, e.g., 10.
It should be noted that, in some other embodiments, if the fingerprint input signal, i.e., the sound signal collected by the microphone or the far-end speech signal output by the speaker, is a continuous analog signal rather than a discrete digital signal, it must first be sampled at a specified sampling frequency to convert it into a discrete digital signal before the time-domain framing is performed. The specified sampling frequency is configured to be above 5000 Hz, so that enough of the high-frequency components of the analog signal are sampled and the accuracy of the delay estimation is preserved. For example, the specified sampling frequency may be 32000 Hz, 44100 Hz, 48000 Hz, and so on, without limitation.
Second, the time-frequency transformation converts each frame of time-domain data into a frame of frequency-domain data by means of the Fast Fourier Transform (FFT).
Specifically, the calculation formula (1) of the fast fourier transform is as follows:
X[k] = Σ_{n=0}^{N−1} x[n] · e^(−j2πnk/N), k = 0, 1, …, N−1 (1)
where N is the specified number of sampling points, x[n] denotes the nth sample in the time-domain data, and X[k] denotes the kth frequency point datum in the frequency-domain data.
Further, the time-domain framing is configured with an inter-frame overlap of 3/4, which improves the accuracy of the delay estimation.
Therefore, the fingerprint input signal can obtain multi-frame frequency domain data through time-frequency transformation, so that frequency band division is carried out on the basis of the multi-frame frequency domain data, and dynamic change characteristics of audio energy are extracted according to the frequency bands obtained through division.
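The framing and time-frequency transformation above can be sketched as follows (a minimal pure-Python illustration, not the patent's implementation; the naive DFT is mathematically equivalent to the FFT of formula (1), only slower):

```python
import cmath

def frame_signal(x, frame_len, overlap=0.75):
    """Split a discrete digital signal into frames with 3/4 inter-frame overlap."""
    hop = int(frame_len * (1 - overlap))  # step between consecutive frame starts
    return [x[i:i + frame_len]
            for i in range(0, len(x) - frame_len + 1, hop)]

def dft(frame):
    """Naive DFT of one frame: X[k] = sum_n x[n] * exp(-j*2*pi*n*k/N)."""
    n_pts = len(frame)
    return [sum(frame[n] * cmath.exp(-2j * cmath.pi * n * k / n_pts)
                for n in range(n_pts))
            for k in range(n_pts)]

frames = frame_signal(list(range(16)), frame_len=8)  # hop = 2 samples
spectra = [dft(f) for f in frames]                   # one spectrum per frame
```

With a frame length of 8 and 3/4 overlap, a 16-sample signal yields 5 frames, each transformed into 8 frequency points.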
Step 413, performing frequency band division on the current frame frequency domain data to obtain a plurality of frequency bands, and calculating audio energy of the plurality of frequency bands respectively.
Frequency band division and the subsequent sub-audio-fingerprint generation are performed on one frame of frequency-domain data at a time. For ease of distinction, the frame being processed is called the current frame, the frame waiting to be processed is the next frame, and the frame already processed is the previous frame.
The frequency band division may be performed according to specified numbers: a specified number of frequency bands and a specified number of frequency points, i.e., the frequency-domain data is divided into the specified number of bands, each containing the specified number of frequency point data.
These numbers can be adjusted flexibly to the application scenario. For example, at low frequencies each band may contain fewer frequency points and at high frequencies more, i.e., the specified number of frequency points is adjustable. Likewise, a smaller number of bands reduces the processing load and increases the response speed of the delay estimation, while a larger number of bands increases its accuracy, i.e., the specified number of bands is adjustable.
Alternatively, the frequency band division may be performed according to the following calculation formula (2), and the calculation formula (2) is specifically as follows:
F(m) = F_min · (F_max / F_min)^((m−1)/M), m = 1, 2, …, M+1 (2)
where M is the number of frequency bands contained in the frequency-domain data, F(m) denotes the lower frequency limit of the mth band, F_min is the lower limit of the frequency range of the frequency point data contained in the frequency-domain data, and F_max is its upper limit.
For example, when m = 1, F(1) = F_min; when m = 2, F(2) = F_min · (F_max / F_min)^(1/M).
Thus, the frequency point data of the 1st band falls in the range between F(1) and F(2).
Preferably, considering the hearing range of the user, the frequency band division is configured over the frequency-domain data in the range of 300 Hz to 2000 Hz, which helps reduce the processing load and improve the response speed of the delay estimation.
The above process provides a sufficient basis for generating the sub-audio fingerprint corresponding to the frequency-domain data from the variation rule of the differences between the audio energies of adjacent frequency bands across adjacent frames.
Further, the audio energy of each frequency band is calculated according to the following formula (3):
E(n, m) = Σ_k |X(k)|², summing over the frequency points k of the mth band (3)
where m = 1, 2, …, M, with M the number of frequency bands of the nth frame of frequency-domain data; E(n, m) denotes the audio energy of the mth band of the nth frame; X(k) denotes the kth frequency point datum in that band; and each frequency point in the band has a frequency between F(m) and F(m+1).
Step 415: for the plurality of frequency bands, generate the sub-audio fingerprint corresponding to the current frame of frequency-domain data according to the variation rule between the audio energy difference of each band and its adjacent band and the corresponding difference at the same position in the previous frame.
Here, every frequency band contained in the current frame of frequency-domain data is used in turn to reflect the variation rule of the audio energy differences for generating the sub-audio fingerprint.
Of course, in other embodiments, e.g., when high delay estimation accuracy is not required, only some of the bands of the current frame may be selected to reflect the variation rule of the energy differences, further improving the response speed of the delay estimation.
The sub-audio fingerprint generation process is as follows: first, obtain a number of audio energy variation characteristic values from the variation rule between the audio energy difference of each band and its adjacent band and the corresponding difference at the same position in the previous frame; then, generate the sub-audio fingerprint corresponding to the current frame from these characteristic values.
Specifically, the audio energy variation characteristic value of a frequency band is generated according to the following formula (4):
F(n, m) = 1 if (E(n, m) − E(n, m+1)) − (E(n−1, m) − E(n−1, m+1)) > 0, and F(n, m) = 0 otherwise (4)
where F(n, m) denotes the audio energy variation characteristic value of the mth band of the nth frame of frequency-domain data; E(n, m) and E(n, m+1) denote the audio energies of the mth and (m+1)th bands of the nth frame; and E(n−1, m) and E(n−1, m+1) denote the audio energies of the mth and (m+1)th bands of the (n−1)th frame.
The term (E(n, m) − E(n, m+1)) − (E(n−1, m) − E(n−1, m+1)) expresses the variation rule between the audio energy difference of the mth band and its adjacent (m+1)th band in the nth frame and the corresponding difference at the same position in the previous, (n−1)th frame: F(n, m) is 0 when this value is not greater than zero, and 1 when it is greater than zero.
Of course, in other embodiments, the audio energy change characteristic value may be configured to take other values according to the practical application, as long as it represents the change between the inter-band audio energy difference in the current frame and the corresponding difference in the previous frame of frequency domain data.
Therefore, taking the nth frame of frequency domain data as the current frame, M-1 audio energy change characteristic values can be obtained from calculation formula (4) for the M frequency bands it contains, and these M-1 values form the sub-audio fingerprint corresponding to the nth frame of frequency domain data.
That is, G(l) = { F(l,1), F(l,2), …, F(l,M-1) }. (5)
Wherein G(l) represents the sub-audio fingerprint corresponding to the lth frame of frequency domain data, and M represents the number of frequency bands of the lth frame of frequency domain data.
It should be noted that, since the Mth band of each frame of frequency domain data has no adjacent (M+1)th band, the number of audio energy change characteristic values obtained for the M bands is M-1.
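The per-band bit extraction described by formulas (4) and (5) can be sketched in a few lines of Python; the function name and the list representation of a fingerprint are illustrative, not part of the disclosure:

```python
def sub_audio_fingerprint(curr_energy, prev_energy):
    """curr_energy, prev_energy: lists of M per-band energies E(n, m)
    for the current and previous frame of frequency domain data."""
    M = len(curr_energy)
    bits = []
    for m in range(M - 1):
        # change between this frame's inter-band difference and the
        # previous frame's inter-band difference, per formula (4)
        delta = ((curr_energy[m] - curr_energy[m + 1])
                 - (prev_energy[m] - prev_energy[m + 1]))
        bits.append(1 if delta > 0 else 0)
    return bits  # M-1 characteristic values, i.e. G(l) per formula (5)
```

For M bands the result has M-1 entries, matching the note above.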
And step 417, taking the sub-audio fingerprint corresponding to the frequency domain data of the specified frame number as the audio fingerprint.
Due to the continuous variability of the audio signal, the current frame of frequency domain data collected by the microphone, or output by the speaker, changes continuously. Accordingly, sub-audio fingerprints corresponding to several frames of frequency domain data can be generated, for example the sub-audio fingerprint corresponding to the current frame of frequency domain data and the one corresponding to the previous frame.
It can be understood that if the time length of the signal on which the delay estimation is based is too short, only a shallow portion of the variation between the audio energies of adjacent frequency bands across adjacent frames is captured, which may cause the delay estimation result to jump severely and thus affect the accuracy of the delay estimation.
Therefore, in the embodiment, the audio fingerprint comprises a sub-audio fingerprint corresponding to the frequency domain data of the specified frame number, so that the stability and the accuracy of the time delay estimation result are fully ensured.
That is, H = { G(1), G(2), …, G(L-1), G(L) }. (6)
Wherein L represents a specified frame number, i.e., the number of sub-audio fingerprints included in the audio fingerprint.
The number of designated frames L may be flexibly adjusted according to an actual application scenario, for example, in an application scenario with a high requirement on response speed, a smaller number of designated frames is set, and in an application scenario with a high requirement on accuracy, a larger number of designated frames is set, which is not limited herein.
Preferably, in one specific implementation, the specified number of frames L is dynamically adjusted based on the audio energy of the fingerprint input signal and a signal energy threshold, as shown in FIG. 5.
Wherein, the audio energy calculation formula (7) of the fingerprint input signal is as follows:
E(n) = |X(0)|^2 + |X(1)|^2 + … + |X(N-1)|^2. (7)
Wherein E(n) represents the audio energy of the nth frame of frequency domain data, X(k) represents the kth frequency point data in the nth frame of frequency domain data, and N represents the number of designated sampling points.
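A minimal sketch of calculation formula (7), summing the squared magnitudes over a frame's frequency bins (the function name is illustrative):

```python
def frame_energy(freq_bins):
    """Audio energy E(n) of one frame of frequency domain data,
    per formula (7): sum of |X(k)|^2 over the N designated bins."""
    return sum(abs(x) ** 2 for x in freq_bins)
```

The bins may be complex FFT outputs; `abs()` then gives the magnitude.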
It will be appreciated that if the audio energy of the fingerprint input signal is low, the cost of a delay error is small, and even if echo leakage results from such an error, it is hardly noticeable.
For this reason, when the audio energy of the fingerprint input signal is lower than the signal energy threshold, the audio energy of the fingerprint input signal is considered low, and the specified frame number is adjusted downwards, so as to further improve the response speed to delay changes.
Correspondingly, when the audio energy of the fingerprint input signal is higher than the signal energy threshold, the audio energy of the fingerprint input signal is considered high, and the specified frame number is adjusted upwards, fully ensuring the accuracy of the delay estimation and reducing the probability of echo leakage.
In the process, the dynamic planning of the specified frame number is realized, and the balance between the delay error cost and the response speed is fully ensured.
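The dynamic planning described above might be sketched as follows; the bounds L_MIN and L_MAX and the step size of 1 are illustrative assumptions, since the disclosure does not fix them:

```python
# Assumed bounds on the specified frame number L (not from the disclosure).
L_MIN, L_MAX = 4, 32

def adjust_frame_count(L, frame_energy, energy_threshold):
    """Shrink L for low-energy input (cheap delay error, favor response
    speed); grow L for high-energy input (favor estimation accuracy)."""
    if frame_energy < energy_threshold:
        return max(L_MIN, L - 1)
    return min(L_MAX, L + 1)
```

A real implementation might adapt the step size as well; the threshold comparison is the essential part.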
Further, it can be understood that if the audio fingerprint of the sound signal is identical everywhere to the audio fingerprint of the far-end speech signal, for example when both are silent signals and their audio fingerprints coincide everywhere, the delay of the sound signal relative to the far-end speech signal obtained from them has no practical meaning.
Therefore, as shown in fig. 3, for the far-end speech signal, to avoid silence interference, before performing audio fingerprint generation on the far-end speech signal, random noise preprocessing needs to be performed on the far-end speech signal first, that is, random noise is added to the far-end speech signal.
For random noise acquisition, a pseudo-random sequence is generated with a random function generation algorithm and used as the random noise. Random function generation algorithms include, but are not limited to, the mixed congruential method, the multiplicative congruential method, the iterative extraction method, and the like.
It is worth mentioning that the random noise added to the far-end speech signal has a very low decibel number, for example, -80dB, so that the influence of the random noise addition on the far-end speech signal audio fingerprint generation is negligible.
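As one possible sketch of this preprocessing, a mixed (linear) congruential generator produces the pseudo-random sequence, scaled to roughly -80 dB of full scale. The LCG constants are common textbook values chosen here for illustration, not taken from the disclosure:

```python
def lcg(seed, a=1664525, c=1013904223, m=2 ** 32):
    """Mixed congruential generator: x' = (a*x + c) mod m,
    yielded as a uniform value in [0, 1)."""
    while True:
        seed = (a * seed + c) % m
        yield seed / m

def add_random_noise(samples, seed=1, level_db=-80.0):
    """Add very low-level random noise to the far-end signal so that
    silence does not produce identical fingerprints everywhere."""
    rng = lcg(seed)
    amp = 10.0 ** (level_db / 20.0)  # -80 dB of full scale, i.e. 1e-4
    return [s + amp * (2.0 * next(rng) - 1.0) for s in samples]
```

At -80 dB the added noise amplitude stays below 1e-4 of full scale, so its effect on fingerprint generation is negligible, as noted above.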
Step 250, comparing the audio fingerprint of the sound signal with the audio fingerprint of the far-end voice signal to obtain a time delay estimation result.
The far-end voice signal is output by a loudspeaker, and the audio fingerprint of the far-end voice signal is used for representing the dynamic change characteristic of the audio energy of the far-end voice signal.
Since the signal characteristics of the sound signal and the far-end speech signal are both represented as audio fingerprints, in this embodiment the signal characteristic comparison is converted into a fingerprint comparison in order to determine the similarity between the two signals; the time difference between the positions at which they are similar is then obtained as the delay estimation result.
In a specific implementation of an exemplary embodiment, as shown in fig. 6, the delay estimation process may include the following steps:
Step 251, comparing the audio fingerprint of the sound signal with the audio fingerprint of the far-end speech signal to obtain the audio fingerprint similarity.
The audio fingerprint similarity is used to reflect the correlation between the sound signal and the far-end speech signal, i.e. is used to indicate whether there is similarity between the audio fingerprint of the sound signal and the audio fingerprint of the far-end speech signal, so as to determine whether the sound signal contains echo component.
As described above, referring to the calculation formulas (5), (6), the audio fingerprint is expressed as:
H = { G(1), G(2), …, G(L) } = { F(1,1), …, F(1,M-1), F(2,1), …, F(2,M-1), …, F(L,1), …, F(L,M-1) }.
Therefore, comparing the audio fingerprints of the sound signal and the far-end speech signal means determining whether the audio energy change characteristic values at corresponding positions in the two fingerprints are the same: if all of them are the same, the audio fingerprint similarity is 1; if half of them are the same, the similarity is 0.5; and so on.
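This bitwise comparison reduces to a matching fraction; a minimal sketch (names illustrative):

```python
def fingerprint_similarity(fp_a, fp_b):
    """Fraction of positions where the audio-energy-change values agree:
    1.0 for identical fingerprints, 0.5 when half the values match."""
    assert len(fp_a) == len(fp_b)
    matches = sum(1 for a, b in zip(fp_a, fp_b) if a == b)
    return matches / len(fp_a)
```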
Step 253, when the audio fingerprint similarity exceeds a similarity threshold, acquiring a first similar time indicated by the most similar audio fingerprint position in the far-end speech signal, and a second similar time indicated by the current frame of frequency domain data collected from the sound signal.
When the audio fingerprint similarity exceeds the similarity threshold, the sound signal is determined to be correlated with the far-end speech signal, i.e., the sound signal contains an echo component, and the delay estimation needs to be performed, i.e., step 255 is executed.
Otherwise, when the audio fingerprint similarity does not exceed the similarity threshold, the sound signal is considered uncorrelated with the far-end speech signal, that is, the sound signal does not contain an echo component, and the delay estimation result is kept unchanged.
The delay estimation determines the audio fingerprint position in the far-end speech signal that is most similar to the sound signal, obtains a first similar time from the frame time indicated by that most similar position, and takes the frame time indicated by the current frame of frequency domain data collected from the sound signal as a second similar time, thereby obtaining the delay of the sound signal relative to the far-end speech signal.
Specifically, the sub-audio fingerprint corresponding to the current frame of frequency domain data in the sound signal is compared with the sub-audio fingerprints corresponding to the previous several frames of frequency domain data in the far-end speech signal, so as to obtain the sub-audio fingerprint similarities.
And determining the sub-audio fingerprint with the maximum similarity in the far-end voice signal according to the similarity of the sub-audio fingerprints, and taking the frequency domain data corresponding to the sub-audio fingerprint with the maximum similarity as the most similar audio fingerprint position to further obtain the first similar time.
It should be noted that, in the sub-audio fingerprint comparison, the number of previous frames of frequency domain data in the far-end speech signal that need to be used may be set flexibly according to the actual application scenario; since the delay by which the sound signal lags the far-end speech signal is limited, the number of frames may be set within the range of 3 to 5 frames, which is not limited herein.
For example, when the audio fingerprint similarity exceeds the similarity threshold, the sub-audio fingerprint similarity is further determined.
Suppose that the sub-audio fingerprint corresponding to the current frame of frequency domain data in the sound signal is G0(l0), and that the sub-audio fingerprints corresponding to the previous three frames of frequency domain data in the far-end speech signal are G1(l0-1), G1(l0-2) and G1(l0-3), respectively. Here G1(l0-1) represents the sub-audio fingerprint corresponding to the previous frame of frequency domain data in the far-end speech signal, G1(l0-2) the one corresponding to the frame two frames back, and G1(l0-3) the one corresponding to the frame three frames back.
Compare whether each audio energy change characteristic value in sub-audio fingerprint G0(l0) is the same as the corresponding value in sub-audio fingerprint G1(l0-1): if all are the same, the sub-audio fingerprint similarity between the two is 1; if half are the same, it is 0.5; and so on.
In the same way, the audio energy change characteristic values in sub-audio fingerprint G0(l0) are compared with those in G1(l0-2) and G1(l0-3), and the corresponding sub-audio fingerprint similarities are obtained respectively.
And after all the comparison is completed to obtain all the sub-audio fingerprint similarity, determining the maximum sub-audio fingerprint similarity.
Assume that the similarity between sub-audio fingerprint G0(l0) and sub-audio fingerprint G1(l0-3) is the greatest; that is, sub-audio fingerprint G1(l0-3) is determined to be the one with the greatest similarity in the far-end speech signal, and the frequency domain data corresponding to it is the audio fingerprint position in the far-end speech signal most similar to the sound signal.
For the far-end speech signal, assume that the frame time indicated by the current frame of frequency domain data output by the speaker, i.e., the current frame time at which the far-end speech signal is output, is t1. If the time length of each frame of frequency domain data is T, then the frame time indicated by the audio fingerprint position in the far-end speech signal most similar to the sound signal, i.e., the first similar time, is t1 - 3T.
For the sound signal, the frame time indicated by the collected current frame of frequency domain data, i.e., the current frame time t0 of the sound signal, is taken as the second similar time.
After the first similar time and the second similar time are obtained, the delay of the sound signal relative to the far-end speech signal can be obtained from their difference, and the delay estimation result is then updated.
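Under the assumptions of this worked example, the delay computation can be sketched as a small helper; the function name and argument layout are illustrative:

```python
def estimate_delay(t_mic_frame, t_spk_frame, frames_back, frame_period):
    """Delay of the sound signal relative to the far-end speech signal:
    second similar time (microphone's current frame time) minus first
    similar time (speaker's current frame time minus the number of
    frames back at which the most similar sub-fingerprint was found)."""
    first_similar = t_spk_frame - frames_back * frame_period
    return t_mic_frame - first_similar
```

With frames_back = 3 this reproduces the example's first similar time t1 - 3T.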
Further, in a double-talk scenario, that is, when the speaker continuously outputs the far-end speech signal while the microphone continuously collects the sound signal, for example when both users are talking at once, the correlation between the sound signal and the far-end speech signal may still not be good enough even if the audio fingerprint similarity reaches a certain level. If the delay estimation were redone every time, it would not only waste memory resources but could also affect the accuracy of the delay estimation.
Thus, to ensure the accuracy of the delay estimation, it will also be determined whether the maximum sub-audio fingerprint similarity exceeds the similarity threshold before the delay estimation is performed again.
Only when the maximum sub-audio fingerprint similarity exceeds the similarity threshold is the sound signal considered truly correlated with the far-end speech signal, and the delay estimation result determined anew. Otherwise, when the maximum sub-audio fingerprint similarity does not exceed the similarity threshold, the delay estimation might introduce a delay error, and the delay estimation result is kept unchanged.
It should be noted that the electronic device opens up a storage space for storing the sub-audio fingerprints corresponding to the previous several frames of frequency domain data in the far-end speech signal, for use in echo cancellation. Since the far-end speech signal output by the speaker changes continuously, the sub-audio fingerprints stored in the storage space also change dynamically.
For example, suppose the storage space stores the sub-audio fingerprints corresponding to frames 1 to 5 of frequency domain data in the far-end speech signal. When frame 6 of frequency domain data in the far-end speech signal is output by the speaker, the sub-audio fingerprint corresponding to frame 1 is deleted from the storage space, and the sub-audio fingerprint corresponding to frame 6 is stored in its place. This ensures the effective utilization of the storage space as well as the smooth progress of echo cancellation.
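The sliding storage in this example behaves like a fixed-capacity queue; a sketch using Python's standard deque, with the capacity of 5 taken from the example:

```python
from collections import deque

# Fixed-capacity store for far-end sub-audio fingerprints: appending a
# new fingerprint automatically evicts the oldest, mirroring the
# frame-1-deleted / frame-6-stored behavior described above.
store = deque(maxlen=5)
for frame in range(1, 7):        # frames 1..6 arrive in turn
    store.append(f"G({frame})")  # after frame 6, G(1) has been evicted
```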
Step 255, updating the time difference between the first similar time and the second similar time as the time delay estimation result.
Specifically, as shown in fig. 7, step 255 may include the following steps:
in step 2551, a time difference between the first similar time and the second similar time is calculated.
Continuing the example above, considering that the sound signal lags the far-end speech signal, the time difference between the first similar time t1 - 3T and the second similar time t0 is t0 - t1 + 3T.
And step 2553, performing fluctuation elimination processing on the time difference, and updating the processed time difference into a time delay estimation result.
As mentioned above, in order to improve the accuracy of the delay estimation, the time-domain framing may be configured with an inter-frame overlap of 3/4. In that case, each sub-audio fingerprint describes a time length of T/4, which means that when the delay of the sound signal relative to the far-end speech signal reaches T/8, the delay estimate fluctuates, i.e., the delay estimation result swings back and forth between 0 and T/4.
For this reason, it is necessary to perform a fluctuation elimination process on the time difference, and thus perform a delay estimation result update.
The fluctuation elimination processing may set an update decision rule, for example an invariance-priority rule. Suppose the old delay estimation result is 0 and the new delay estimation result is T/4. Since the two differ by exactly one time granularity T/4 (the time length described by a sub-audio fingerprint), the delay estimation result would swing back and forth between 0 and T/4. To avoid this fluctuation, 0 is still used as the delay estimation result, that is, the old delay estimation result is preferentially kept unchanged.
In the process, the similarity between the sound signal and the far-end voice signal can be reflected in real time through the audio fingerprint technology, the complexity is low, and timely response to time delay change is facilitated.
After the time delay estimation result is obtained, echo cancellation processing can be performed on the far-end voice signal in the sound signal.
Specifically, the far-end voice signal output by the loudspeaker is delayed according to the time delay estimation result, so that the delayed far-end voice signal is aligned with the sound signal collected by the microphone, and the aligned sound signal and the far-end voice signal are offset with each other, thereby realizing the elimination of the echo component in the sound signal.
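As a heavily simplified illustration of this alignment-and-offset step, the sketch below delays the far-end reference by the estimated delay and subtracts it from the microphone signal. A real echo canceller would use an adaptive filter to model the echo path; the unit-gain echo path assumed here is purely illustrative:

```python
def cancel_echo(mic, far_end, delay_samples):
    """Idealized sketch: zero-pad the far-end reference by the estimated
    delay so it aligns with the microphone signal, then subtract it.
    Assumes a unit-gain echo path (not claimed by the disclosure)."""
    delayed = [0.0] * delay_samples + far_end[:len(mic) - delay_samples]
    return [m - d for m, d in zip(mic, delayed)]
```

For a microphone signal that is exactly the far-end signal delayed by one sample, the sketch cancels the echo completely.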
After the echo cancellation process is completed, the sound signal with the echo component eliminated can be output, namely transmitted to the client of the opposite user, thereby avoiding the echo interference in the conversation process.
Through the process, once the time delay of the sound signal relative to the far-end voice signal is changed, the time delay change is reflected on the position change of the similarity place existing between the sound signal and the far-end voice signal, so that the time delay estimation result can be immediately updated along with the real-time update of the position of the similarity place when the time delay is changed in the same conversation process, and the problem of echo leakage caused by the fact that echo cancellation cannot normally work in the time delay change period in the prior art is solved.
Fig. 8 is a flowchart of a specific implementation of a delay estimation method in an application scenario. In the application scenario, the echo cancellation process in the voice call process is illustrated by taking the electronic device as the smart phone.
When a speaker configured in the smart phone outputs a far-end voice signal, if a sound signal collected by a microphone configured in the smart phone includes not only a near-end voice signal but also the far-end voice signal, the voice communication process will be interfered by echo.
Therefore, firstly, a sound signal collected by a microphone and a far-end voice signal output by a loudspeaker are obtained, the sound signal and the far-end voice signal are respectively used as fingerprint input signals, and dynamic change characteristics of audio energy are extracted to obtain a corresponding audio fingerprint.
Specifically, step 603 is first performed to perform framing and time-frequency transformation on the fingerprint input signal.
Assuming that the fingerprint input signal is 324, 423, 5423, 8763, 7425, 2445, 832, 432, that each frame of time domain data contains 4 sample points, and that framing is performed with an inter-frame overlap of 3/4, the fingerprint input signal is framed in turn into the following time domain data ("[ ]" marks one frame of time domain data):
first frame time domain data: [ 324, 423, 5423, 8763 ], 7425, 2445, 832, 432,
second frame time domain data: 324, [ 423, 5423, 8763, 7425 ], 2445, 832, 432,
third frame time domain data: 324, 423, [ 5423, 8763, 7425, 2445 ], 832, 432,
fourth frame time domain data: 324, 423, 5423, [ 8763, 7425, 2445, 832 ], 432.
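The framing above can be reproduced with a short sketch. With a 4-sample frame and 3/4 overlap the hop is one sample, so 8 samples actually yield five full frames; the listing above shows the first four:

```python
def frames_with_overlap(samples, frame_len=4, overlap=3):
    """Frame a signal with the given inter-frame overlap;
    overlap=3 of frame_len=4 gives the 3/4 overlap used above."""
    hop = frame_len - overlap
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

x = [324, 423, 5423, 8763, 7425, 2445, 832, 432]
frames = frames_with_overlap(x)
```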
After the framing processing is finished, time-frequency transformation is performed on each frame of time domain data according to calculation formula (1) to obtain the corresponding frequency domain data. For example, the first frame of time domain data [ 324, 423, 5423, 8763 ] corresponds to the first frame of frequency domain data [ X(0), X(1), X(2), X(3) ], where X(k), k = 0, 1, 2, 3, represents the frequency point data.
After obtaining the multiple frames of frequency domain data, by executing step 604, performing frequency band division on each frame of frequency domain data according to the calculation formula (2), and calculating the audio energy of the divided frequency band according to the calculation formula (3).
Taking the first frame frequency domain data as an example, it is assumed that the first frame frequency domain data is divided into 3 frequency bands, the frequency point data included in the first frequency band is X (0), the frequency point data included in the second frequency band is X (1) and X (2), and the frequency point data included in the third frequency band is X (3).
Accordingly, the audio energy of the first band in the first frame of frequency domain data is E(1,1) = |X(0)|^2, that of the second band is E(1,2) = |X(1)|^2 + |X(2)|^2, and that of the third band is E(1,3) = |X(3)|^2.
After the audio energy of each frequency band is calculated, by executing step 605, the audio energy change characteristic value of each frequency band in each frame of frequency domain data is generated according to calculation formula (4). For example, for the 3 bands of the second frame of frequency domain data, 2 audio energy change characteristic values are obtained: F(2,1), determined by whether (E(2,1) - E(2,2)) - (E(1,1) - E(1,2)) is greater than zero, and F(2,2), determined by whether (E(2,2) - E(2,3)) - (E(1,2) - E(1,3)) is greater than zero.
By executing step 606, the sub-audio fingerprint corresponding to each frame of frequency domain data is generated according to calculation formula (5); for example, G(2) = { F(2,1), F(2,2) } represents the sub-audio fingerprint corresponding to the second frame of frequency domain data. Further, by executing step 607, the audio fingerprint H = { G(1), G(2), …, G(L-1), G(L) } of the fingerprint input signal is generated according to calculation formula (6).
Therefore, in the process, the audio fingerprint of the sound signal and the audio fingerprint of the far-end voice signal can be obtained correspondingly.
By performing step 608, the audio fingerprint of the sound signal and the audio fingerprint of the far-end speech signal are subjected to fingerprint comparison, and the audio fingerprint position in the far-end speech signal that is most similar to the sound signal is determined, and thus the first similar time indicated by the audio fingerprint position is obtained.
The fourth frame frequency domain data is assumed to be the current frame frequency domain data, and at this time, the frame time indicated by the fourth frame frequency domain data in the sound signal is the second similar time.
The sub-audio fingerprint corresponding to the fourth frame of frequency domain data in the sound signal is compared with the sub-audio fingerprints corresponding to the first, second and third frames of frequency domain data in the far-end speech signal, respectively. Assuming that the audio fingerprint position in the far-end speech signal most similar to the sound signal is the first frame of frequency domain data, the first similar time is the frame time indicated by the first frame of frequency domain data in the far-end speech signal.
By executing step 609, the time difference between the first similar time and the second similar time is updated to be the time delay estimation result, and then the updated time delay estimation result is utilized to perform echo cancellation on the sound signal, that is, step 610 is executed.
Finally, the sound signal with the echo component eliminated is output, so that echo interference in the voice call process is avoided.
In the application scenario, the quick and accurate time delay change response is realized, and the echo leakage phenomenon during the time delay change period is effectively reduced.
The following are embodiments of the apparatus of the present disclosure, which may be used to perform the delay estimation method of the present disclosure. For details that are not disclosed in the embodiments of the apparatus of the present disclosure, please refer to the embodiments of the delay estimation method related to the present disclosure.
Referring to fig. 9, in an exemplary embodiment, a delay estimation apparatus 700 is applied to a user equipment, including but not limited to: a signal acquisition module 710, a feature extraction module 730, and a fingerprint comparison module 750.
The signal collecting module 710 is configured to obtain a sound signal collected by a microphone and a far-end speech signal output by a speaker.
The feature extraction module 730 is configured to extract a dynamic change feature of audio energy of the fingerprint input signal by using the sound signal and the far-end speech signal as the fingerprint input signal, respectively, to obtain an audio fingerprint of the sound signal and an audio fingerprint of the far-end speech signal.
The fingerprint comparison module 750 is configured to perform fingerprint comparison on the audio fingerprint of the sound signal and the audio fingerprint of the remote voice signal to obtain a delay estimation result.
Referring to fig. 10, in an exemplary embodiment, the apparatus 700 as described above further includes, but is not limited to: a speech signal processing module 810.
The speech signal processing module 810 is configured to perform random noise preprocessing on the far-end speech signal.
The preprocessed far-end voice signal is used as a fingerprint input signal and is notified to the feature extraction module 730.
Referring to FIG. 11, in an exemplary embodiment, the feature extraction module 730 includes, but is not limited to: a frequency domain data acquisition unit 731, a band division unit 733, a feature value acquisition unit 735, a sub-fingerprint generation unit 737, and an audio fingerprint generation unit 739.
The frequency domain data obtaining unit 731 is configured to perform time-frequency transformation on the fingerprint input signal to obtain multiple frames of frequency domain data.
The band division unit 733 divides a frequency band of the current frame frequency domain data into a plurality of frequency bands, and calculates audio energy of the plurality of frequency bands, respectively.
The feature value obtaining unit 735 is configured to, for a plurality of frequency bands, obtain a plurality of audio energy change feature values according to a change rule between an audio energy difference value of a frequency band and an adjacent frequency band thereof and an audio energy difference value of a frequency band at a corresponding position in previous frame frequency domain data.
The sub-fingerprint generating unit 737 is configured to generate a sub-audio fingerprint corresponding to the current frame frequency domain data according to the multiple audio energy variation feature values.
The audio fingerprint generation unit 739 is configured to use a sub-audio fingerprint corresponding to the frequency domain data of the specified frame number as an audio fingerprint.
Referring to fig. 12, in an exemplary embodiment, the apparatus 700 as described above further includes, but is not limited to: an energy calculation module 910 and a parameter adjustment module 930.
Wherein the energy calculating module 910 is configured to calculate the audio energy of the fingerprint input signal.
The parameter adjustment module 930 is configured to dynamically adjust the specified frame number according to the audio energy of the fingerprint input signal and the signal energy threshold.
Referring to FIG. 13, in an exemplary embodiment, the fingerprint comparison module 750 includes, but is not limited to: a similarity acquisition unit 751, a fingerprint position acquisition unit 753, a time acquisition unit 755, and a result update unit 757.
The similarity obtaining unit 751 is configured to compare the audio fingerprint of the sound signal with the audio fingerprint of the far-end speech signal, and obtain an audio fingerprint similarity.
The fingerprint position obtaining unit 753 is configured to obtain an audio fingerprint position in the far-end speech signal that is most similar to the sound signal when the similarity of the audio fingerprint exceeds a similarity threshold.
The time obtaining unit 755 is configured to use the frame time indicated by the most similar audio fingerprint position as the first similar time, and use the current frame time at which the sound signal is collected as the second similar time.
The result updating unit 757 is configured to update a time difference between the first similar time and the second similar time as a delay estimation result.
Referring to FIG. 14, in an exemplary embodiment, the result update unit 757 includes, but is not limited to: a time difference calculation unit 7571 and a fluctuation elimination unit 7573.
Wherein the time difference calculation unit 7571 is configured to calculate a time difference between the first similar time and the second similar time.
The fluctuation elimination unit 7573 is configured to perform fluctuation elimination processing on the time difference, and update the processed time difference to the delay estimation result.
It should be noted that, when the delay estimation device provided in the foregoing embodiment performs echo cancellation processing, only the division of the functional modules is illustrated, in practical applications, the functions may be distributed to different functional modules according to needs, that is, the internal structure of the delay estimation device is divided into different functional modules, so as to complete all or part of the functions described above.
In addition, the embodiments of the delay estimation apparatus and the delay estimation method provided by the above embodiments belong to the same concept, and the specific manner in which each module performs operations has been described in detail in the method embodiments, and is not described herein again.
In an exemplary embodiment, an electronic device includes a processor and a memory.
Wherein the memory has stored thereon computer readable instructions which, when executed by the processor, implement the time delay estimation method in the embodiments as described above.
In an exemplary embodiment, a computer-readable storage medium has stored thereon a computer program which, when executed by a processor, implements the time delay estimation method in the embodiments described above.
The above disclosure describes only preferred exemplary embodiments of the present disclosure and is not intended to limit its embodiments. Those skilled in the art can readily make changes or modifications within the main concept and spirit of the present disclosure, so the protection scope of the present disclosure is defined by the protection scope of the claims.

Claims (13)

1. A method for time delay estimation, comprising:
acquiring a sound signal acquired by a microphone and a far-end voice signal output by a loudspeaker;
respectively taking the sound signal and the far-end voice signal as fingerprint input signals, and performing audio energy dynamic change feature extraction on the fingerprint input signals to obtain an audio fingerprint of the sound signal and an audio fingerprint of the far-end voice signal; wherein performing audio energy dynamic change feature extraction on a fingerprint input signal to obtain an audio fingerprint comprises: performing time-frequency transformation on the fingerprint input signal to obtain multiple frames of frequency domain data; calculating an audio energy of the fingerprint input signal; dynamically adjusting a specified number of frames according to the audio energy of the fingerprint input signal and a signal energy threshold; and taking the sub-audio fingerprints corresponding to the specified number of frames of frequency domain data as the audio fingerprint;
and comparing the audio fingerprint of the sound signal with the audio fingerprint of the far-end voice signal to obtain a time delay estimation result.
2. The method of claim 1, wherein before the sound signal and the far-end voice signal are respectively taken as fingerprint input signals and audio energy dynamic change feature extraction is performed on the fingerprint input signals to obtain the audio fingerprint of the sound signal and the audio fingerprint of the far-end voice signal, the method further comprises:
carrying out random noise preprocessing on the far-end voice signal;
and taking the preprocessed far-end voice signal as a fingerprint input signal, and returning to perform the step of respectively taking the sound signal and the far-end voice signal as fingerprint input signals and performing audio energy dynamic change feature extraction on the fingerprint input signals to obtain the audio fingerprint of the sound signal and the audio fingerprint of the far-end voice signal.
3. The method of claim 1 or 2, wherein before taking the sub-audio fingerprints corresponding to the specified number of frames of frequency domain data as the audio fingerprint, the method further comprises:
carrying out frequency band division on current frame frequency domain data to obtain a plurality of frequency bands, and respectively calculating audio energy of the frequency bands;
for the plurality of frequency bands, obtaining a plurality of audio energy change characteristic values according to the variation between the audio energy difference of each frequency band relative to its adjacent frequency band and the audio energy difference of the frequency band at the corresponding position in the previous frame of frequency domain data;
and generating a sub-audio fingerprint corresponding to the current frame frequency domain data according to the plurality of audio energy change characteristic values.
4. The method of claim 1, wherein comparing the audio fingerprint of the sound signal with the audio fingerprint of the far-end voice signal to obtain a time delay estimation result comprises:
comparing the audio fingerprint of the sound signal with the audio fingerprint of the far-end voice signal to obtain audio fingerprint similarity;
when the similarity of the audio fingerprints exceeds a similarity threshold, acquiring the audio fingerprint position which is most similar to the sound signal in the far-end voice signal;
taking the frame time indicated by the most similar audio fingerprint position as a first similar time, and taking the current frame time of the collected sound signal as a second similar time;
and updating the time difference between the first similar time and the second similar time as the time delay estimation result.
5. The method of claim 4, wherein when the similarity of the audio fingerprint exceeds a similarity threshold, acquiring a position of the audio fingerprint in the far-end speech signal that is most similar to the sound signal comprises:
comparing the sub-audio fingerprint corresponding to the current frame of frequency domain data in the sound signal with the sub-audio fingerprints corresponding to previous frames of frequency domain data in the far-end voice signal to obtain a plurality of sub-audio fingerprint similarities;
determining the sub-audio fingerprint with the maximum similarity in the far-end voice signal according to the similarity of the sub-audio fingerprints;
and taking the frequency domain data corresponding to the sub audio fingerprint with the maximum similarity as the most similar audio fingerprint position.
6. The method of claim 4, wherein the updating the time difference between the first similar time and the second similar time as the delay estimation result comprises:
calculating a time difference between the first similar time and the second similar time;
and carrying out fluctuation elimination processing on the time difference, and updating the processed time difference into the time delay estimation result.
7. A delay estimation apparatus, comprising:
the signal acquisition module is used for acquiring the sound signal acquired by the microphone and the far-end voice signal output by the loudspeaker;
the feature extraction module is used for respectively taking the sound signal and the far-end voice signal as fingerprint input signals, and performing audio energy dynamic change feature extraction on the fingerprint input signals to obtain an audio fingerprint of the sound signal and an audio fingerprint of the far-end voice signal; wherein performing audio energy dynamic change feature extraction on a fingerprint input signal to obtain an audio fingerprint comprises: performing time-frequency transformation on the fingerprint input signal to obtain multiple frames of frequency domain data; calculating an audio energy of the fingerprint input signal; dynamically adjusting a specified number of frames according to the audio energy of the fingerprint input signal and a signal energy threshold; and taking the sub-audio fingerprints corresponding to the specified number of frames of frequency domain data as the audio fingerprint;
and the fingerprint comparison module is used for comparing the audio fingerprint of the sound signal with the audio fingerprint of the far-end voice signal to obtain a time delay estimation result.
8. The apparatus of claim 7, wherein the apparatus further comprises:
the voice signal processing module is used for carrying out random noise preprocessing on the far-end voice signal;
and taking the preprocessed far-end voice signal as a fingerprint input signal and informing the feature extraction module.
9. The apparatus of claim 7 or 8, wherein the feature extraction module comprises:
the frequency band dividing unit is used for carrying out frequency band division on current frame frequency domain data to obtain a plurality of frequency bands and respectively calculating audio energy of the frequency bands;
the characteristic value obtaining unit is used for obtaining, for the plurality of frequency bands, a plurality of audio energy change characteristic values according to the variation between the audio energy difference of each frequency band relative to its adjacent frequency band and the audio energy difference of the frequency band at the corresponding position in the previous frame of frequency domain data;
and the sub-fingerprint generating unit is used for generating a sub-audio fingerprint corresponding to the current frame frequency domain data according to the plurality of audio energy change characteristic values.
10. The apparatus of claim 7, wherein the fingerprint comparison module comprises:
the similarity obtaining unit is used for comparing the audio fingerprint of the sound signal with the audio fingerprint of the far-end voice signal to obtain audio fingerprint similarity;
the fingerprint position acquisition unit is used for acquiring the audio fingerprint position which is most similar to the sound signal in the far-end voice signal when the similarity of the audio fingerprint exceeds a similarity threshold;
the time acquisition unit is used for taking the frame time indicated by the most similar audio fingerprint position as a first similar time and taking the current frame time of the collected sound signal as a second similar time;
and the result updating unit is used for updating the time difference between the first similar time and the second similar time into the time delay estimation result.
11. The apparatus of claim 10, wherein the result update unit comprises:
a time difference calculation unit for calculating a time difference between the first similar time and the second similar time;
and the fluctuation elimination unit is used for carrying out fluctuation elimination processing on the time difference and updating the processed time difference into the time delay estimation result.
12. An electronic device, comprising:
a processor; and
a memory having stored thereon computer readable instructions which, when executed by the processor, implement the time delay estimation method of any one of claims 1 to 6.
13. A computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the time delay estimation method of any one of claims 1 to 6.
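Read together, claims 1 and 3 recite the fingerprint extraction pipeline: time-frequency transformation into frames of frequency domain data, frequency band division, per-band audio energies, and feature bits derived from how inter-band energy differences change against the previous frame. A minimal Python sketch of that pipeline, with the naive DFT, the equal-width band split, and the sign-based bit rule all filled in as illustrative assumptions (the claims do not fix these details), is:

```python
import cmath
import math

def frame_spectrum(frame):
    """Naive DFT of one time-domain frame, keeping the first half of the bins
    (a real system would use an FFT, e.g. numpy.fft.rfft)."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n // 2)]

def band_energies(spectrum, num_bands=8):
    """Divide the spectrum into equal-width frequency bands and sum |X[k]|^2
    per band (the equal-width split and the band count are assumptions)."""
    per_band = len(spectrum) // num_bands
    return [sum(abs(x) ** 2 for x in spectrum[b * per_band:(b + 1) * per_band])
            for b in range(num_bands)]

def sub_fingerprint(cur_bands, prev_bands):
    """One feature bit per pair of adjacent bands: 1 when the inter-band energy
    difference grew relative to the same position in the previous frame, else 0."""
    bits = []
    for b in range(len(cur_bands) - 1):
        cur_diff = cur_bands[b] - cur_bands[b + 1]
        prev_diff = prev_bands[b] - prev_bands[b + 1]
        bits.append(1 if cur_diff - prev_diff > 0 else 0)
    return bits
```

Per claim 1, a fingerprint would then consist of the sub-fingerprints of a specified number of consecutive frames, with that number adjusted according to the signal's audio energy versus an energy threshold (not shown here).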
CN201711043361.4A 2017-10-31 2017-10-31 Time delay estimation method and device and electronic equipment Active CN109727607B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711043361.4A CN109727607B (en) 2017-10-31 2017-10-31 Time delay estimation method and device and electronic equipment


Publications (2)

Publication Number Publication Date
CN109727607A CN109727607A (en) 2019-05-07
CN109727607B true CN109727607B (en) 2022-08-05

Family

ID=66292954

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711043361.4A Active CN109727607B (en) 2017-10-31 2017-10-31 Time delay estimation method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN109727607B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110400571B (en) * 2019-08-08 2022-04-22 Oppo广东移动通信有限公司 Audio processing method and device, storage medium and electronic equipment
CN110931032B (en) * 2019-11-19 2022-08-02 西安合谱声学科技有限公司 Dynamic echo cancellation method and device
CN113345394B (en) * 2020-02-17 2023-12-22 抖音视界有限公司 Audio data processing method and device, electronic equipment and storage medium
CN113382119B (en) * 2020-02-25 2022-12-06 北京字节跳动网络技术有限公司 Method, device, readable medium and electronic equipment for eliminating echo
CN113077805A (en) * 2021-03-18 2021-07-06 厦门视云联科技有限公司 Echo cancellation method and system based on timestamp synchronization
CN113593540B (en) * 2021-07-28 2023-08-11 展讯半导体(成都)有限公司 Voice processing method, device and equipment
CN115602184A (en) * 2022-09-23 2023-01-13 北京沃东天骏信息技术有限公司(Cn) Echo cancellation method, echo cancellation device, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104317967A (en) * 2014-11-17 2015-01-28 北京航空航天大学 Two-layer advertisement audio retrieval method based on audio fingerprints
CN106157964A (en) * 2016-07-14 2016-11-23 西安元智系统技术有限责任公司 A kind of determine the method for system delay in echo cancellor

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102446392B1 (en) * 2015-09-23 2022-09-23 삼성전자주식회사 Electronic device and method for recognizing voice of speech



Similar Documents

Publication Publication Date Title
CN109727607B (en) Time delay estimation method and device and electronic equipment
US9640194B1 (en) Noise suppression for speech processing based on machine-learning mask estimation
US9246545B1 (en) Adaptive estimation of delay in audio systems
WO2016078369A1 (en) Mobile terminal conversation voice noise reduction method and apparatus and storage medium
US9491545B2 (en) Methods and devices for reverberation suppression
WO2016003955A1 (en) Variable step size echo cancellation with accounting for instantaneous interference
CN111009257B (en) Audio signal processing method, device, terminal and storage medium
US9773510B1 (en) Correcting clock drift via embedded sine waves
US9185506B1 (en) Comfort noise generation based on noise estimation
CN110265054A (en) Audio signal processing method, device, computer readable storage medium and computer equipment
CN112489670B (en) Time delay estimation method, device, terminal equipment and computer readable storage medium
JP6295722B2 (en) Echo suppression device, program and method
CN109767780A (en) A kind of audio signal processing method, device, equipment and readable storage medium storing program for executing
CN112602150A (en) Noise estimation method, noise estimation device, voice processing chip and electronic equipment
US11380312B1 (en) Residual echo suppression for keyword detection
US11164591B2 (en) Speech enhancement method and apparatus
CN113744748A (en) Network model training method, echo cancellation method and device
WO2024000854A1 (en) Speech denoising method and apparatus, and device and computer-readable storage medium
US9392365B1 (en) Psychoacoustic hearing and masking thresholds-based noise compensator system
US10403301B2 (en) Audio signal processing apparatus for processing an input earpiece audio signal upon the basis of a microphone audio signal
WO2021143249A1 (en) Transient noise suppression-based audio processing method, apparatus, device, and medium
KR20220157475A (en) Echo Residual Suppression
CN112750452A (en) Voice processing method, device and system, intelligent terminal and electronic equipment
CN111667842A (en) Audio signal processing method and device
JP6648436B2 (en) Echo suppression device, echo suppression program, and echo suppression method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant