CN110648679B - Method and device for determining echo suppression parameters, storage medium and electronic device - Google Patents

Method and device for determining echo suppression parameters, storage medium and electronic device

Info

Publication number
CN110648679B
CN110648679B (application CN201910913057.3A)
Authority
CN
China
Prior art keywords
voice signal
voice
signal
determining
mask vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910913057.3A
Other languages
Chinese (zh)
Other versions
CN110648679A
Inventor
赵珺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201910913057.3A
Publication of CN110648679A
Application granted
Publication of CN110648679B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/04 - Segmentation; Word boundary detection
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the type of extracted parameters
    • G10L25/18 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/45 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the type of analysis window
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 - Reducing energy consumption in communication networks
    • Y02D30/70 - Reducing energy consumption in communication networks in wireless communication networks

Abstract

The invention discloses a method and a device for determining echo suppression parameters, a storage medium and an electronic device. The method includes: generating a first mask vector of a first speech signal using the frequencies of the first speech signal; generating a second mask vector of a second speech signal using the frequencies of the second speech signal; determining a first weight value between the first mask vector and a third speech signal and a second weight value between the second mask vector and the third speech signal, respectively; and determining an echo suppression control parameter matched with the second speech signal based on the first weight value and the second weight value, wherein the echo suppression control parameter is used to indicate the result of echo suppression performed on the echo signal in the first speech signal. The invention solves the technical problem that echo suppression effects are difficult to evaluate in the related art.

Description

Method and device for determining echo suppression parameters, storage medium and electronic device
Technical Field
The present invention relates to the field of echo processing, and in particular, to a method and apparatus for determining an echo suppression parameter, a storage medium, and an electronic apparatus.
Background
Whether in public switched telephone network (Public Switched Telephone Network, abbreviated as PSTN) calls or voice over IP (Voice over Internet Protocol, abbreviated as VoIP) calls, echo is easily generated in hands-free mode and can greatly degrade the user experience. Echo cancellation is therefore required, both at the terminal device hardware level and at the software algorithm level.
Echo cancellation must satisfy two requirements: the picked-up echo should be suppressed as much as possible, while the scene sounds other than the echo should pass through the algorithm as unaffected as possible. In the prior art, however, the effect of echo cancellation is measured only from a signal point of view, without considering the structural characteristics of the human ear.
In view of the above problems, no effective solution has been proposed at present.
Disclosure of Invention
The embodiment of the invention provides a method and a device for determining echo suppression parameters, a storage medium and an electronic device, so as to at least solve the technical problem that echo suppression effects are difficult to evaluate in the related art.
According to an aspect of an embodiment of the present invention, there is provided a method for determining an echo suppression parameter, including: generating a first mask vector of a first voice signal by using the frequency of the first voice signal, wherein the first mask vector is used for identifying the relation between the frequency of a voice segment contained in the first voice signal and the frequency of an adjacent voice segment, and the first voice signal is a voice signal sent by a first client to a second client; generating a second mask vector of a second voice signal by using a frequency of the second voice signal, wherein the second mask vector is used for identifying a relationship between a frequency of a voice segment included in the second voice signal and a frequency of an adjacent voice segment, and the second voice signal is a voice signal sent to the first client by the second client; respectively determining a first weight value between the first mask vector and a third voice signal and a second weight value between the second mask vector and the third voice signal, wherein the third voice signal is a voice signal obtained by performing echo suppression processing on an echo signal in the first voice signal; and determining an echo suppression control parameter matched with the second voice signal based on the first weight value and the second weight value, wherein the echo suppression control parameter is used for indicating a result of echo suppression on the echo signal in the first voice signal.
According to another aspect of the embodiment of the present invention, there is also provided a device for determining an echo suppression parameter, including: a first generating module, configured to generate a first mask vector of a first voice signal by using a frequency of the first voice signal, where the first mask vector is used to identify a relationship between a frequency of a voice segment included in the first voice signal and a frequency of an adjacent voice segment, and the first voice signal is a voice signal sent by a first client to a second client; a second generating module, configured to generate a second mask vector of a second voice signal by using a frequency of the second voice signal, where the second mask vector is used to identify a relationship between a frequency of a voice segment included in the second voice signal and a frequency of an adjacent voice segment, and the second voice signal is a voice signal sent by the second client to the first client; a first determining module, configured to determine a first weight value between the first mask vector and a third speech signal, and a second weight value between the second mask vector and the third speech signal, where the third speech signal is a speech signal obtained by performing echo suppression processing on an echo signal in the first speech signal; and a second determining module, configured to determine an echo suppression control parameter that matches the second speech signal based on the first weight value and the second weight value, where the echo suppression control parameter is used to indicate a result of echo suppression on the echo signal in the first speech signal.
Optionally, the first generating module includes: a first determining unit, configured to perform windowing segmentation on the first speech signal according to a preset signal duration, to obtain N segments of speech segments, where N is a natural number; a second determining unit, configured to perform fast fourier transform on the N segments of speech segments, respectively, so as to extract frequencies in the N segments of speech segments, and obtain N frequencies; and a third determining unit, configured to compare features between frequencies of adjacent speech segments in the N frequencies, respectively, to obtain a first mask vector of the first speech signal.
Optionally, the second generating module includes: a third determining unit, configured to perform windowing segmentation on the second speech signal according to a preset signal duration, to obtain M segments of speech segments, where M is a natural number; a fourth determining unit, configured to perform fast fourier transform on the M segments of speech segments, respectively, so as to extract frequencies in the M segments of speech segments, thereby obtaining M frequencies; and a fifth determining unit, configured to compare features between frequencies of adjacent speech segments in the M frequencies, respectively, to obtain a second mask vector of the second speech signal.
Optionally, the first determining module includes: a sixth determining unit, configured to perform a first weighting operation on the first speech signal, the first mask vector, and the third speech signal, to obtain the first weight value; and a seventh determining unit, configured to perform a second weighting operation on the second speech signal, the second mask vector, and the third speech signal, to obtain the second weight value.
Optionally, the second determining module includes: an eighth determining unit, configured to perform a classification duty ratio operation on the first weight value and the second weight value to obtain a classification duty ratio table corresponding to the second speech signal, where an attribute parameter in the classification duty ratio table is used to represent a result of echo suppression on the echo signal in the first speech signal; and a ninth determining unit configured to determine the attribute parameter in the classification duty ratio table as the echo suppression control parameter.
Optionally, the apparatus further includes: a third determining module, configured to determine, after the echo suppression control parameter matched with the second speech signal is determined based on the first weight value and the second weight value, that the effect of echo suppression satisfies a first mode in a case where the result is a first level, where the first mode identifies that the third speech signal contains no component of the first speech signal and is the same as the signal in the second speech signal; and a fourth determining module, configured to determine that the effect of echo suppression does not satisfy a second mode in a case where the result is a second level, where the suppression rate of echo suppression in the first mode is greater than the suppression rate of echo suppression in the second mode.
Optionally, the apparatus further includes: a fifth determining module, configured to determine a transmission time stamp and a target delay of the first voice signal before generating a first mask vector of the first voice signal using a frequency of the first voice signal; and a sixth determining module, configured to perform delay compensation on the transmission timestamp of the first voice signal according to the target delay, so as to obtain a reception timestamp of the first voice signal.
Optionally, the fifth determining module includes: a tenth determining unit configured to determine a first time delay of the first speech signal and the third speech signal in a time domain; an eleventh determining unit, configured to determine a second time delay of the second speech signal and the third speech signal in a time domain; a twelfth determining unit, configured to determine a third delay of the first speech signal and the third speech signal in a frequency domain; a thirteenth determining unit configured to determine a fourth delay of the second speech signal and the third speech signal in a frequency domain; a fourteenth determining unit, configured to determine a mean variance of the first delay, the second delay, the third delay, and the fourth delay, to obtain the target delay.
Optionally, the apparatus further includes: a seventh determining module, configured to add a target voice segment to the first voice signal before determining the sending timestamp and the target delay of the first voice signal, to obtain a fourth voice signal, where the target voice segment is a voice segment in the first voice signal under a preset frequency; and an eighth determining module, configured to determine a delay range of the first speech signal from a frequency change between the target speech segment and the fourth speech signal, where the target delay is included in the delay range.
According to still another aspect of the embodiments of the present invention, there is also provided a computer-readable storage medium having a computer program stored therein, wherein the computer program is configured to perform the above-described method of determining an echo suppression parameter when run.
According to still another aspect of the embodiments of the present invention, there is further provided an electronic device including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor executes the method for determining the echo suppression parameter described above through the computer program.
In the embodiment of the invention, a first mask vector of a first voice signal is generated using the frequencies of the first voice signal, where the first mask vector identifies the relationship between the frequency of a voice segment contained in the first voice signal and the frequency of the adjacent voice segment, and the first voice signal is the voice signal sent by a first client to a second client; a second mask vector of a second voice signal is generated using the frequencies of the second voice signal, where the second mask vector identifies the relationship between the frequency of a voice segment contained in the second voice signal and the frequency of the adjacent voice segment, and the second voice signal is the voice signal sent by the second client to the first client; a first weight value between the first mask vector and a third voice signal and a second weight value between the second mask vector and the third voice signal are determined respectively, where the third voice signal is the voice signal obtained by performing echo suppression processing on the echo signal in the first voice signal; and an echo suppression control parameter matched with the second voice signal is determined based on the first weight value and the second weight value, where the echo suppression control parameter indicates the result of echo suppression on the echo signal in the first voice signal. Determining the echo suppression control parameter in this way makes the assessment of echo suppression closer to the characteristics of the human ear, thereby achieving the technical effect of effectively evaluating echo suppression and solving the technical problem that echo suppression effects are difficult to evaluate in the related art.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiments of the invention and together with the description serve to explain the invention and do not constitute a limitation on the invention. In the drawings:
FIG. 1 is a schematic illustration of an application environment of an alternative method for determining echo suppression parameters according to an embodiment of the present invention;
FIG. 2 is a flow chart of an alternative method of determining echo suppression parameters according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an alternative echo suppression according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an alternative determination of a preset vector according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an alternative classification duty cycle table according to an embodiment of the invention;
FIG. 6 is an overall flow chart of an alternative echo suppression assessment according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of an alternative echo suppression parameter determination device according to an embodiment of the present invention;
fig. 8 is a schematic structural view of an alternative electronic device according to an embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without making any inventive effort shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
According to an aspect of the embodiment of the present invention, there is provided a method for determining an echo suppression parameter, optionally, as an optional implementation manner, the method for determining an echo suppression parameter may be, but is not limited to, applied to an environment as shown in fig. 1.
The first client 102 in fig. 1 may run in a user device 104. The user device 104 includes a memory 106 for storing the voice signal sent by the first client 102 and a processor 108 for processing the voice signal of the first client 102. The user device 104 sends the voice signal to the server 112 over the network 110. The user device 120 includes a memory 122 for storing voice signals and a processor 124 for processing voice signals, and the user device 104 and the user device 120 may communicate with the server 112 via the network 110. The server 112 includes a database 114 for storing voice signals and a processing engine 116 for processing the voice signals. As shown in fig. 1, a first voice signal sent on the first client 102 may be sent over the network to the user device 120 where the second client 118 is located, and a second voice signal sent on the second client 118 is sent over the network to the user device 104 where the first client 102 is located. The user device 120 performs echo suppression processing on the echo signal in the first voice signal to obtain a third voice signal. The server 112 generates a first mask vector of the first speech signal using the frequencies of the first speech signal and a second mask vector of the second speech signal using the frequencies of the second speech signal, where each mask vector identifies the relationship between the frequency of a speech segment and the frequency of the adjacent speech segment. The server 112 then determines a first weight value between the first mask vector and the third speech signal and a second weight value between the second mask vector and the third speech signal, where the third speech signal is the speech signal obtained by performing echo suppression processing on the echo signal in the first speech signal, and determines an echo suppression control parameter matched with the second voice signal based on the first weight value and the second weight value, where the echo suppression control parameter is used to indicate the result of echo suppression on the echo signal in the first voice signal.
Optionally, in this embodiment, the user device 104 and the user device 120 may be, but are not limited to, terminal devices supporting running application clients, such as mobile phones, tablet computers, notebook computers, and PCs. The server 112 and the user devices may exchange data over, but not limited to, a network, which may include, but is not limited to, a wireless network or a wired network. The wireless network includes Bluetooth, WIFI, and other networks that enable wireless communication. The wired network may include, but is not limited to, a wide area network, a metropolitan area network, and a local area network. The above is merely an example, and no limitation is imposed in this embodiment.
Optionally, as an optional implementation manner, as shown in fig. 2, the method for determining the echo suppression parameter includes:
s202: generating a first mask vector of the first voice signal by using the frequency of the first voice signal, wherein the first mask vector is used for identifying the relation between the frequency of the voice fragments contained in the first voice signal and the frequency of the adjacent voice fragments, and the first voice signal is the voice signal sent by the first client to the second client;
s204: generating a second mask vector of the second voice signal by using the frequency of the second voice signal, wherein the second mask vector is used for identifying the relation between the frequency of the voice fragment contained in the second voice signal and the frequency of the adjacent voice fragment, and the second voice signal is the voice signal sent to the first client by the second client;
S206: respectively determining a first weight value between the first mask vector and a third voice signal and a second weight value between the second mask vector and the third voice signal, wherein the third voice signal is a voice signal obtained by performing echo suppression processing on an echo signal in the first voice signal;
s208: and determining an echo suppression control parameter matched with the second voice signal based on the first weight value and the second weight value, wherein the echo suppression control parameter is used for indicating the result of echo suppression on the echo signal in the first voice signal.
Optionally, in this embodiment, the method for determining the echo suppression parameter may be applied to, but is not limited to, two-way call or one-way call scenarios. The first client and the second client may be, but are not limited to, various terminals with call functions, for example, calls between mobile phones, network calls, video calls, etc. Specifically, the method and the device can be applied to, but are not limited to, PSTN call scenarios and call scenarios in which various over-the-top (OTT) application services are provided to users through the Internet, so as to improve the accuracy of echo suppression effect evaluation. The above is merely an example, and no limitation is imposed in this embodiment.
Optionally, in this embodiment, the first voice signal and the second voice signal include, but are not limited to, the content of various voice calls; a voice call is typically completed through a terminal device such as a mobile phone or a fixed-line phone. The first mask vector and the second mask vector include, but are not limited to, vectors in a "0101" format, where the value of each dimension represents the relationship between the frequency of the corresponding speech segment and the frequency of the adjacent segment. For example, suppose the first speech signal contains 5 speech segments with frequencies of 3 Hz, 2 Hz, 1 Hz, 4 Hz and 6 Hz respectively. Since the frequency of the first segment is greater than that of the second segment, the evaluation values of the first and the second segment are both set to 1. The evaluation value of the third segment is calculated as follows: the frequency of the first segment minus the frequency of the third segment equals 2 Hz; with a target threshold of 1 Hz, 2 is greater than 1, so the evaluation value of the third segment is determined to be "1". By analogy, the evaluation value of the fourth segment is "0" and that of the fifth segment is "0", so the first mask vector is "11100". The second mask vector is calculated in the same way as the first mask vector and is not described again.
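As an illustrative sketch only (not the patented implementation), the following Python snippet reproduces the "11100" example above under one possible reading of the comparison rule, namely that a segment is marked "1" when its frequency drops by at least the target threshold relative to the previous segment and that the first segment defaults to "1". The function name and the default threshold are assumptions:

```python
# Minimal sketch, not the patented implementation: one reading of the
# mask-vector rule described above. A segment is marked 1 when its
# frequency drops by at least `threshold` Hz relative to the previous
# segment; the first segment defaults to 1.
def mask_vector(freqs, threshold=1.0):
    mask = [1]  # first segment is taken as the reference
    for prev, cur in zip(freqs, freqs[1:]):
        mask.append(1 if prev - cur >= threshold else 0)
    return mask

print(mask_vector([3, 2, 1, 4, 6]))  # -> [1, 1, 1, 0, 0], i.e. "11100"
```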
Optionally, an echo signal is a repetition of a sound caused by reflection of the acoustic wave. In this embodiment, the echo signal refers to the echo induced between the loudspeaker and the microphone through the acoustic feedback path formed in the air. Echo suppression, also referred to as acoustic echo cancellation (Acoustic Echo Cancelling, abbreviated as AEC), cancels, by means of acoustic interference, the sound that returns from the loudspeaker to the microphone through this air path. Echo leakage may remain after echo suppression, i.e. the residual echo left by the echo cancellation algorithm.
Optionally, the third speech signal is the echo-suppressed speech signal. Fig. 3 shows a processing scenario for suppressing an echo signal. For example, the device at the A end (near end) is in speaker mode. After the device at the B end (far end) sends out the first voice signal and the first voice signal is played at the A end, the echo of the first voice signal and the second voice signal uttered at the A end are picked up together by the A-end microphone. If no processing were performed, this sound would be heard at the B end, i.e. the B end would hear the sound it emitted at the previous time instant, which is the echo phenomenon. In this embodiment, acoustic echo cancellation (Acoustic Echo Cancelling, abbreviated as AEC) is introduced to suppress the echo signal, i.e. the sound that returns from the loudspeaker to the microphone through the air path is cancelled by acoustic interference. Using the sound emitted by the loudspeaker as a reference, the echo component in the sound collected at the A end is suppressed, and the third speech signal (the processed signal after echo processing) is output.
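The embodiment does not prescribe a particular AEC algorithm; purely for illustration, the sketch below models the A-end processing with a basic normalized-LMS adaptive filter standing in for the AEC block. The filter length, step size and function name are assumptions, not taken from the patent:

```python
import numpy as np

def aec_nlms(mic, far_ref, taps=128, mu=0.5, eps=1e-6):
    """Toy AEC stand-in: mic = near-end speech + echo(far_ref); returns an
    approximation of the 'third speech signal' (processed after echo suppression)."""
    mic = np.asarray(mic, dtype=float)
    far_ref = np.asarray(far_ref, dtype=float)
    w = np.zeros(taps)
    out = np.zeros_like(mic)
    for n in range(taps, len(mic)):
        x = far_ref[n - taps:n][::-1]      # far-end reference frame
        echo_est = w @ x                   # estimated echo component
        e = mic[n] - echo_est              # error = processed sample
        w += mu * e * x / (x @ x + eps)    # NLMS weight update
        out[n] = e
    return out
```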
By determining the echo suppression control parameter, the evaluation of how well the echo is suppressed is brought closer to the characteristics of the human ear, thereby achieving the technical effect of effectively evaluating echo suppression.
In an alternative embodiment, generating a first mask vector of the first speech signal using the frequencies of the first speech signal includes:
s1, windowing and dividing a first voice signal according to a preset signal duration to obtain N sections of voice fragments, wherein N is a natural number;
s2, respectively performing fast Fourier transform on the N sections of voice fragments to extract the frequencies in the N sections of voice fragments so as to obtain N frequencies;
s3, respectively comparing the characteristics between the frequencies of the adjacent voice fragments in the N frequencies to obtain a first mask vector of the first voice signal.
Optionally, in this embodiment, the purpose of the windowed segmentation is to divide the first speech signal into N speech segments, each with its own frequency, and to evaluate each segment. For example, suppose the first speech signal contains 5 speech segments with frequencies of 3 Hz, 2 Hz, 1 Hz, 4 Hz and 6 Hz respectively. Since the frequency of the first segment is greater than that of the second segment, the evaluation values of the first and the second segment are both set to 1. The evaluation value of the third segment is calculated as follows: the frequency of the first segment minus the frequency of the third segment equals 2 Hz; with a target threshold of 1 Hz, 2 is greater than 1, so the evaluation value of the third segment is determined to be "1". By analogy, the evaluation value of the fourth segment is "0" and that of the fifth segment is "0", so the first mask vector is "11100"; that is, each dimension of the first mask vector is the evaluation value of the corresponding speech segment.
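A minimal sketch of steps S1 to S3 follows (windowed segmentation, FFT-based frequency extraction, adjacent-segment comparison). The 20 ms segment length, the Hann window, the dominant-frequency criterion and the threshold are illustrative assumptions rather than values specified by the embodiment:

```python
import numpy as np

def segment_mask(signal, sample_rate, seg_dur=0.02, threshold=1.0):
    """S1: split into N windowed segments; S2: FFT each segment and take its
    dominant frequency; S3: compare adjacent frequencies to build the mask
    vector (same comparison rule as the earlier sketch)."""
    seg_len = int(seg_dur * sample_rate)
    n_seg = len(signal) // seg_len
    window = np.hanning(seg_len)
    bins = np.fft.rfftfreq(seg_len, d=1.0 / sample_rate)
    freqs = []
    for i in range(n_seg):
        frame = np.asarray(signal[i * seg_len:(i + 1) * seg_len], dtype=float) * window
        spectrum = np.abs(np.fft.rfft(frame))
        freqs.append(bins[np.argmax(spectrum)])  # dominant frequency of segment i
    mask = [1]
    for prev, cur in zip(freqs, freqs[1:]):
        mask.append(1 if prev - cur >= threshold else 0)
    return mask
```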
According to this embodiment, by segmenting the speech signal and determining an evaluation value for each speech segment, the mask vector can be determined accurately.
In an alternative embodiment, generating a second mask vector of the second speech signal using the frequencies of the second speech signal includes:
s1, windowing and dividing a second voice signal according to a preset signal duration to obtain M sections of voice fragments, wherein M is a natural number;
s2, respectively performing fast Fourier transform on the M sections of voice fragments to extract frequencies in the M sections of voice fragments so as to obtain M frequencies;
s3, respectively comparing the characteristics between the frequencies of the adjacent voice fragments in the M frequencies to obtain a second mask vector of the second voice signal.
Optionally, in this embodiment, the determination manner of the second mask vector is the same as the determination manner of the first mask vector, which is not described herein. The specific flow is shown in part (c) of fig. 6.
According to this embodiment, by segmenting the speech signal and determining an evaluation value for each speech segment, the mask vector can be determined accurately.
In an alternative embodiment, determining a first weight value between the first mask vector and a third speech signal and a second weight value between the second mask vector and the third speech signal, respectively, comprises:
S1, performing first weighting operation on the first voice signal, the first mask vector and the third voice signal to obtain the first weight value;
s2, performing a second weighting operation on the second voice signal, the second mask vector and the third voice signal to obtain the second weight value.
Optionally, in this embodiment, the first weighting operation includes, but is not limited to, an AND operation among the first speech signal, the first mask vector, and the third speech signal. The second weighting operation includes, but is not limited to, an AND operation among the second speech signal, the second mask vector, and the third speech signal.
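The embodiment describes the weighting only as an "AND" operation among a signal, its mask vector and the third signal; the sketch below shows one plausible reading, in which the mask gates per-segment energies of the reference and processed signals before they are combined into a single normalized weight value. The energy inputs and the normalized inner product are assumptions made for illustration:

```python
import numpy as np

def weight_value(ref_seg_energy, mask, proc_seg_energy):
    """One possible reading of the 'weighting operation': the mask vector
    selects which segments participate, and the weight is the masked,
    normalized correlation between reference and processed segment energies."""
    m = np.asarray(mask, dtype=float)
    ref = np.asarray(ref_seg_energy, dtype=float) * m    # gate reference by its mask
    proc = np.asarray(proc_seg_energy, dtype=float) * m  # gate processed signal likewise
    denom = np.linalg.norm(ref) * np.linalg.norm(proc) + 1e-12
    return float(ref @ proc) / denom                     # similarity in [0, 1]

# first_weight  = weight_value(first_energy,  first_mask,  third_energy)
# second_weight = weight_value(second_energy, second_mask, third_energy)
```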
According to this embodiment, by determining the weight values, the degree of echo leakage and the degree to which normal speech is erroneously clipped can be quantified under the mask vectors, so as to determine whether the echo suppression effect is close to human-ear perception.
In an alternative embodiment, determining an echo suppression control parameter matching the second speech signal based on the first weight value and the second weight value comprises:
s1, performing classification duty ratio operation on a first weight value and a second weight value to obtain a classification duty ratio table corresponding to a second voice signal, wherein attribute parameters in the classification duty ratio table are used for representing the result of echo suppression on an echo signal in the first voice signal;
S2, determining the attribute parameters in the classification duty ratio table as echo suppression control parameters.
Optionally, in this embodiment, the classification duty ratio operation performed on the first weight value and the second weight value is shown in fig. 4: the first weight value and the second weight value are input into the level-time statistics according to the flow shown in fig. 4, and a classification duty ratio table (shown in fig. 5) is output. The classification duty ratio table indicates the degree of echo leakage and the degree to which normal speech is erroneously clipped, and these degrees are determined as the preset vector. The overall flow chart is shown in fig. 6.
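The concrete level definitions belong to figs. 4 and 5 and are not reproduced here; the sketch below only illustrates the idea of a level-time statistics step, in which per-frame weight values are binned into hypothetical levels (A1, A2, B, ..., G) and the table records the proportion of time spent in each level. All bin boundaries are assumptions:

```python
# Hypothetical level boundaries; the real ones come from fig. 5.
LEVELS = [("A1", 0.9), ("A2", 0.8), ("B", 0.6), ("C", 0.4),
          ("D", 0.3), ("E", 0.2), ("F", 0.1), ("G", 0.0)]

def classification_duty_ratio(frame_weights):
    """Bin per-frame weight values into levels and return the fraction of
    frames (duty ratio) falling into each level."""
    counts = {name: 0 for name, _ in LEVELS}
    for w in frame_weights:
        for name, lower in LEVELS:      # LEVELS is ordered from high to low
            if w >= lower:
                counts[name] += 1
                break
    total = max(len(frame_weights), 1)
    return {name: counts[name] / total for name, _ in LEVELS}
```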
By outputting the classification duty ratio table, the echo suppression degree represented by each attribute in the classification duty ratio table can be determined.
In an alternative embodiment, after determining the echo suppression control parameter matching the second speech signal based on the first weight value and the second weight value, the method further comprises:
S1, in a case where the result is the first level, determining that the effect of echo suppression satisfies a first mode, where the first mode identifies that the third speech signal contains no component of the first speech signal and is the same as the signal in the second speech signal;
and S2, in a case where the result is the second level, determining that the effect of echo suppression does not satisfy the second mode, where the suppression rate of echo suppression in the first mode is greater than that in the second mode.
Optionally, in this embodiment, the first level may be A1 or A2 in the classification duty ratio table shown in fig. 5, and the second level may be B, C, D, E, F or G in the same table. A1 and A2 indicate that the effect of echo suppression is close to human-ear perception, i.e. the third speech signal does not include the signal of the first speech signal and is identical to the signal in the second speech signal.
According to this embodiment, by distinguishing the different modes, the degree of echo leakage and the degree to which normal speech is erroneously clipped can be determined under the mask vectors, so as to determine whether the echo suppression effect is close to human-ear perception.
In an alternative embodiment, the method further comprises, prior to generating the first mask vector of the first speech signal using the frequencies of the first speech signal:
S1, determining a sending time stamp of the first voice signal and a target delay;
and S2, performing time delay compensation on the sending time stamp of the first voice signal according to the target time delay to obtain the receiving time stamp of the first voice signal.
Optionally, in this embodiment, in order to ensure that the first voice signal sent by the first client and the first voice signal received by the second client are aligned in time sequence, delay compensation needs to be performed on the received first voice signal: the target delay is applied as a compensation value to the receiving time of the first voice signal.
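A minimal sketch of the compensation step, assuming the target delay is expressed in samples, the receiving timestamp equals the sending timestamp plus the target delay, and the same delay shifts the signal for alignment; the sign convention is an assumption:

```python
import numpy as np

# receive_timestamp = send_timestamp + target_delay  (per this embodiment's description)

def compensate_delay(far_ref, target_delay_samples):
    """Shift the far-end reference by the estimated target delay so that it
    lines up with the processed (third) signal; output length is preserved."""
    d = int(target_delay_samples)
    far_ref = np.asarray(far_ref, dtype=float)
    if d >= 0:
        return np.concatenate([np.zeros(d), far_ref[:len(far_ref) - d]])
    return np.concatenate([far_ref[-d:], np.zeros(-d)])
```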
In this way, the target delay is used to compensate the reception of the voice signal, which guarantees that the two-way call is aligned in time sequence and improves the user experience.
In an alternative embodiment, determining the target delay includes:
s1, determining a first time delay of a first voice signal and a third voice signal in a time domain;
s2, determining a second time delay of a second voice signal and a third voice signal in a time domain, wherein the third voice signal is sent by a second client;
s3, determining a third time delay of the first voice signal and the third voice signal on a frequency domain;
s4, determining a fourth time delay of the second voice signal and the third voice signal on a frequency domain;
s5, determining the mean variances of the first time delay, the second time delay, the third time delay and the fourth time delay to obtain the target time delay.
Optionally, fig. 6 shows the overall flow chart of this embodiment; the process of determining the preset vector in part (a) is not described here again. Part (b) determines the target delay, which is calculated using the first voice signal (far-end data), the second voice signal (near-end data) and the third voice signal (processed data). A time-domain cross-correlation method is used to calculate delay_time(processed, near_end) and delay_time(processed, far_end); the speech signals are then transformed by a fast Fourier transform (FFT), and a frequency-domain cross-correlation technique is used to calculate delay_freq(processed, near_end) and delay_freq(processed, far_end). The delay_time and delay_freq values are output to a consistency & voting unit to determine the target delay.
It should be noted that if the consistency is low, the target delay estimated by the characteristic-frequency method cannot satisfy the conditions of the subsequent calculation, and the whole process is interrupted. If the consistency meets the requirement, the target delay is determined and the delay compensation operation is performed on the data.
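A minimal sketch of the delay estimation in part (b), assuming sample-domain signals. The consistency & voting step is simplified to a variance check over the four estimates followed by averaging, which is only one reading of the "mean variance" combination described above; the tolerance value is an assumption:

```python
import numpy as np

def xcorr_delay_time(a, b):
    """Time-domain cross-correlation: lag (in samples) by which a trails b."""
    corr = np.correlate(a, b, mode="full")
    return int(np.argmax(corr)) - (len(b) - 1)

def xcorr_delay_freq(a, b):
    """Frequency-domain (FFT-based) cross-correlation, same lag convention."""
    n = len(a) + len(b) - 1
    corr = np.fft.irfft(np.fft.rfft(a, n) * np.conj(np.fft.rfft(b, n)), n)
    lags = np.concatenate([np.arange(0, len(a)), np.arange(-(len(b) - 1), 0)])
    return int(lags[np.argmax(corr)])

def target_delay(processed, near_end, far_end, tol=32):
    delays = [xcorr_delay_time(processed, near_end),
              xcorr_delay_time(processed, far_end),
              xcorr_delay_freq(processed, near_end),
              xcorr_delay_freq(processed, far_end)]
    if np.var(delays) > tol ** 2:   # consistency & voting: estimates disagree
        return None                 # interrupt the evaluation flow
    return int(round(np.mean(delays)))
```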
By determining the target time delay, the time delay compensation can be accurately performed on the time sequence of the voice signal.
In an alternative embodiment, before determining the sending time stamp of the first voice signal and the target delay, the method further comprises:
s1, adding a target voice segment into a first voice signal to obtain a fourth voice signal, wherein the target voice segment is a voice segment in the first voice signal under a preset frequency;
s2, determining a time delay range of the first voice signal from the frequency change between the target voice segment and the fourth voice signal, wherein the target time delay is included in the time delay range.
Optionally, in the related art, the delay is estimated by calculating the time-sequence cross-correlation between the near-end data and the processed data, and by default the delay is judged within a time window of ±0.5 s. If the signal delay is too large, or the AEC processing capability is poor so that near-end speech is suppressed by mistake and much far-end speech leaks through, the evaluation of the echo signal is affected. In this embodiment, before calculating the delay, the delay range of the first voice signal, i.e. the maximum delay range, is determined from the frequency change between the target speech segment and the fourth speech signal.
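A minimal sketch of bounding the delay search range with a known marker segment: a tone at a preset frequency is prepended to the first voice signal to form the fourth voice signal, and the received or processed signal is scanned for frames in which that frequency dominates. The tone frequency, marker duration, frame size and detection rule are illustrative assumptions:

```python
import numpy as np

def add_marker(first_signal, sample_rate, marker_freq=1000.0, marker_dur=0.1):
    """Prepend a tone at a preset frequency (the 'target voice segment') to
    the first voice signal, giving the 'fourth voice signal'."""
    t = np.arange(int(marker_dur * sample_rate)) / sample_rate
    return np.concatenate([np.sin(2 * np.pi * marker_freq * t),
                           np.asarray(first_signal, dtype=float)])

def marker_delay_range(received, sample_rate, marker_freq=1000.0, frame_dur=0.02):
    """Scan the received signal frame by frame and return the sample span in
    which the marker frequency dominates; this span bounds the delay search."""
    frame = int(frame_dur * sample_rate)
    bins = np.fft.rfftfreq(frame, d=1.0 / sample_rate)
    hits = []
    for i in range(len(received) // frame):
        seg = np.asarray(received[i * frame:(i + 1) * frame], dtype=float)
        dominant = bins[np.argmax(np.abs(np.fft.rfft(seg)))]
        if abs(dominant - marker_freq) < bins[1]:   # within one frequency bin
            hits.append(i)
    if not hits:
        return None
    return hits[0] * frame, (hits[-1] + 1) * frame
```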
It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of acts, but it should be understood by those skilled in the art that the present invention is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the present invention. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required for the present invention.
According to another aspect of the embodiment of the present invention, there is also provided a device for determining an echo suppression parameter for implementing the method for determining an echo suppression parameter. As shown in fig. 7, the apparatus includes:
a first generating module 72, configured to generate a first mask vector of a first voice signal by using a frequency of the first voice signal, where the first mask vector is used to identify a relationship between a frequency of a voice segment included in the first voice signal and a frequency of an adjacent voice segment, and the first voice signal is a voice signal sent by a first client to a second client;
a second generating module 74, configured to generate a second mask vector of a second voice signal by using a frequency of the second voice signal, where the second mask vector is used to identify a relationship between a frequency of a voice segment included in the second voice signal and a frequency of an adjacent voice segment, and the second voice signal is a voice signal sent by a second client to the first client;
A first determining module 76, configured to determine a first weight value between the first mask vector and a third speech signal, and a second weight value between the second mask vector and the third speech signal, where the third speech signal is a speech signal obtained by performing echo suppression processing on an echo signal in the first speech signal;
a second determining module 78, configured to determine an echo suppression control parameter matched with the second voice signal based on the first weight value and the second weight value, where the echo suppression control parameter is used to indicate the result of echo suppression on the echo signal in the first voice signal.
Optionally, the first generating module includes:
a first determining unit, configured to perform windowing segmentation on the first speech signal according to a preset signal duration, to obtain N segments of speech segments, where N is a natural number;
a second determining unit, configured to perform fast fourier transform on the N segments of speech segments, respectively, so as to extract frequencies in the N segments of speech segments, and obtain N frequencies;
and a third determining unit, configured to compare features between frequencies of adjacent speech segments in the N frequencies, respectively, to obtain a first mask vector of the first speech signal.
Optionally, the second generating module includes:
a third determining unit, configured to perform windowing segmentation on the second speech signal according to a preset signal duration, to obtain M segments of speech segments, where M is a natural number;
a fourth determining unit, configured to perform fast fourier transform on the M segments of speech segments, respectively, so as to extract frequencies in the M segments of speech segments, thereby obtaining M frequencies;
and a fifth determining unit, configured to compare features between frequencies of adjacent speech segments in the M frequencies, respectively, to obtain a second mask vector of the second speech signal.
Optionally, the first determining module includes:
a sixth determining unit, configured to perform a first weighting operation on the first speech signal, the first mask vector, and the third speech signal, to obtain the first weight value;
and a seventh determining unit, configured to perform a second weighting operation on the second speech signal, the second mask vector, and the third speech signal, to obtain the second weight value.
Optionally, the second determining module includes:
an eighth determining unit, configured to perform a classification duty ratio operation on the first weight value and the second weight value to obtain a classification duty ratio table corresponding to the second speech signal, where an attribute parameter in the classification duty ratio table is used to represent a result of echo suppression on the echo signal in the first speech signal;
And a ninth determining unit configured to determine the attribute parameter in the classification duty ratio table as the echo suppression control parameter.
Optionally, the apparatus further includes:
a third determining module, configured to determine, after the echo suppression control parameter matched with the second speech signal is determined based on the first weight value and the second weight value, that the effect of echo suppression satisfies a first mode in a case where the result is a first level, where the first mode identifies that the third speech signal contains no component of the first speech signal and is the same as the signal in the second speech signal;
and a fourth determining module, configured to determine that the effect of echo suppression does not satisfy the second mode in a case where the result is the second level, where the suppression rate of echo suppression in the first mode is greater than the suppression rate of echo suppression in the second mode.
Optionally, the apparatus further includes:
a fifth determining module, configured to determine a transmission time stamp and a target delay of the first voice signal before generating a first mask vector of the first voice signal using a frequency of the first voice signal;
And a sixth determining module, configured to perform delay compensation on the transmission timestamp of the first voice signal according to the target delay, so as to obtain a reception timestamp of the first voice signal.
Optionally, the fifth determining module includes:
a tenth determining unit configured to determine a first time delay of the first speech signal and the third speech signal in a time domain;
an eleventh determining unit, configured to determine a second time delay of the second speech signal and the third speech signal in a time domain;
a twelfth determining unit, configured to determine a third delay of the first speech signal and the third speech signal in a frequency domain;
a thirteenth determining unit configured to determine a fourth delay of the second speech signal and the third speech signal in a frequency domain;
a fourteenth determining unit, configured to determine a mean variance of the first delay, the second delay, the third delay, and the fourth delay, to obtain the target delay.
Optionally, the apparatus further includes:
a seventh determining module, configured to add a target voice segment to the first voice signal before determining the sending timestamp and the target delay of the first voice signal, to obtain a fourth voice signal, where the target voice segment is a voice segment in the first voice signal under a preset frequency;
And an eighth determining module, configured to determine a delay range of the first speech signal from a frequency change between the target speech segment and the fourth speech signal, where the target delay is included in the delay range.
According to a further aspect of the embodiments of the present invention, there is also provided an electronic device for implementing the above-mentioned method of determining echo suppression parameters, as shown in fig. 8, the electronic device comprising a memory 802 and a processor 804, the memory 802 storing a computer program, the processor 804 being arranged to perform the steps of any of the method embodiments described above by means of the computer program.
Alternatively, in this embodiment, the electronic apparatus may be located in at least one network device of a plurality of network devices of the computer network.
Alternatively, in the present embodiment, the above-described processor may be configured to execute the following steps by a computer program:
s1: generating a first mask vector of the first voice signal by using the frequency of the first voice signal, wherein the first mask vector is used for identifying the relation between the frequency of the voice fragments contained in the first voice signal and the frequency of the adjacent voice fragments, and the first voice signal is the voice signal sent by the first client to the second client;
S2: generating a second mask vector of the second voice signal by using the frequency of the second voice signal, wherein the second mask vector is used for identifying the relation between the frequency of the voice fragment contained in the second voice signal and the frequency of the adjacent voice fragment, and the second voice signal is the voice signal sent to the first client by the second client;
s3: respectively determining a first weight value between the first mask vector and a third voice signal and a second weight value between the second mask vector and the third voice signal, wherein the third voice signal is a voice signal obtained by performing echo suppression processing on an echo signal in the first voice signal;
and S4, determining an echo suppression control parameter matched with the second voice signal based on the first weight value and the second weight value, wherein the echo suppression control parameter is used for indicating the result of echo suppression on the echo signal in the first voice signal.
Alternatively, it will be understood by those skilled in the art that the structure shown in fig. 8 is only schematic, and the electronic device may also be a terminal device such as a smart phone (e.g. an Android phone, an iOS phone, etc.), a tablet computer, a palm computer, a mobile Internet device (Mobile Internet Devices, MID), a PAD, etc. Fig. 8 does not limit the structure of the electronic device. For example, the electronic device may also include more or fewer components (e.g. network interfaces, etc.) than shown in fig. 8, or have a different configuration from that shown in fig. 8.
The memory 802 may be used to store software programs and modules, such as program instructions/modules corresponding to the method and apparatus for determining echo suppression parameters in the embodiment of the present invention, and the processor 804 executes the software programs and modules stored in the memory 802 to perform various functional applications and data processing, that is, implement the method for determining echo suppression parameters described above. Memory 802 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, memory 802 may further include memory remotely located relative to processor 804, which may be connected to the terminal via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof. The memory 802 may be used for storing information such as voice signals, among other things. As an example, as shown in fig. 8, the memory 802 may include, but is not limited to, the first generating module 72, the second generating module 74, the first determining module 76, and the second determining module 78 in the determining device including the echo suppression parameter. In addition, other module units in the above-mentioned echo suppression parameter determination device may be further included, but are not limited thereto, and are not described in detail in this example.
Optionally, the transmission device 806 is used to receive or transmit data via a network. Specific examples of the network described above may include wired networks and wireless networks. In one example, the transmission means 806 includes a network adapter (Network Interface Controller, NIC) that can connect to other network devices and routers via a network cable to communicate with the internet or a local area network. In one example, the transmission device 806 is a Radio Frequency (RF) module for communicating wirelessly with the internet.
In addition, the electronic device further includes: a display 808 for displaying the voice signal; and a connection bus 810 for connecting the respective module parts in the above-described electronic device.
According to a further aspect of embodiments of the present invention, there is also provided a computer readable storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the method embodiments described above when run.
Alternatively, in the present embodiment, the above-described computer-readable storage medium may be configured to store a computer program for executing the steps of:
S1: generating a first mask vector of the first voice signal by using the frequency of the first voice signal, wherein the first mask vector is used for identifying the relation between the frequency of the voice fragments contained in the first voice signal and the frequency of the adjacent voice fragments, and the first voice signal is the voice signal sent by the first client to the second client;
s2: generating a second mask vector of the second voice signal by using the frequency of the second voice signal, wherein the second mask vector is used for identifying the relation between the frequency of the voice fragment contained in the second voice signal and the frequency of the adjacent voice fragment, and the second voice signal is the voice signal sent to the first client by the second client;
s3: respectively determining a first weight value between the first mask vector and a third voice signal and a second weight value between the second mask vector and the third voice signal, wherein the third voice signal is a voice signal obtained by performing echo suppression processing on an echo signal in the first voice signal;
and S4, determining an echo suppression control parameter matched with the second voice signal based on the first weight value and the second weight value, wherein the echo suppression control parameter is used for indicating the result of echo suppression on the echo signal in the first voice signal.
Alternatively, in this embodiment, it will be understood by those skilled in the art that all or part of the steps in the methods of the above embodiments may be performed by a program for instructing a terminal device to execute the steps, where the program may be stored in a computer readable storage medium, and the storage medium may include: flash disk, read-Only Memory (ROM), random-access Memory (Random Access Memory, RAM), magnetic or optical disk, and the like.
The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
The integrated units in the above embodiments may be stored in the above-described computer-readable storage medium if implemented in the form of software functional units and sold or used as separate products. Based on such understanding, the technical solution of the present invention may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a storage medium, comprising several instructions for causing one or more computer devices (which may be personal computers, servers or network devices, etc.) to perform all or part of the steps of the method described in the embodiments of the present invention.
In the foregoing embodiments of the present invention, the description of each embodiment has its own emphasis; for portions that are not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed client may be implemented in other manners. The apparatus embodiments described above are merely exemplary; for example, the division of the units is merely a logical function division, and there may be another division manner in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be implemented through some interfaces, units, or modules, and may be in electrical or other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The foregoing is merely a preferred embodiment of the present invention. It should be noted that several modifications and adaptations may be made by those skilled in the art without departing from the principles of the present invention, and such modifications and adaptations shall also fall within the protection scope of the present invention.

Claims (12)

1. A method for determining echo suppression parameters, comprising:
generating a first mask vector of a first voice signal by utilizing the frequency of the first voice signal, wherein the first mask vector is used for identifying the relation between the frequency of a voice segment contained in the first voice signal and the frequency of an adjacent voice segment, and the first voice signal is a voice signal sent by a first client to a second client;
generating a second mask vector of a second voice signal by using the frequency of the second voice signal, wherein the second mask vector is used for identifying the relation between the frequency of a voice segment contained in the second voice signal and the frequency of an adjacent voice segment, and the second voice signal is a voice signal sent to the first client by the second client;
respectively determining a first weight value between the first mask vector and a third voice signal and a second weight value between the second mask vector and the third voice signal, wherein the third voice signal is a voice signal obtained by performing echo suppression processing on an echo signal in the first voice signal;
and determining an echo suppression control parameter matched with the second voice signal based on the first weight value and the second weight value, wherein the echo suppression control parameter is used for indicating a result of echo suppression on the echo signal in the first voice signal.
2. The method of claim 1, wherein generating a first mask vector for the first speech signal using the frequency of the first speech signal comprises:
windowing and dividing the first voice signal according to a preset signal duration to obtain N voice segments, wherein N is a natural number;
respectively performing a fast Fourier transform on the N voice segments to extract the frequency of each of the N voice segments, so as to obtain N frequencies;
and respectively comparing the characteristics between the frequencies of adjacent voice segments among the N frequencies to obtain the first mask vector of the first voice signal.
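A minimal sketch of this mask-vector construction, assuming 20 ms Hann-windowed segments, the spectral peak as each segment's frequency, and a rise/fall sign as the compared characteristic (none of which are fixed by the claim), could look like this:

```python
import numpy as np

def first_mask_vector(first_signal, sample_rate=16000, segment_ms=20):
    """Sketch of claim 2: window and divide, FFT each segment, compare adjacent segments."""
    seg_len = int(sample_rate * segment_ms / 1000)     # preset signal duration, in samples
    n_segments = len(first_signal) // seg_len          # N segments
    window = np.hanning(seg_len)

    peak_hz = []
    for i in range(n_segments):
        segment = first_signal[i * seg_len:(i + 1) * seg_len] * window
        spectrum = np.abs(np.fft.rfft(segment))        # fast Fourier transform
        bins = np.fft.rfftfreq(seg_len, d=1.0 / sample_rate)
        peak_hz.append(bins[np.argmax(spectrum)])      # frequency extracted from this segment

    # compare each segment's frequency with that of the adjacent (previous) segment
    mask = [0.0]
    for prev, cur in zip(peak_hz, peak_hz[1:]):
        mask.append(np.sign(cur - prev))
    return np.asarray(mask)
```

The same sketch, applied to the second voice signal, yields the second mask vector of claim 3.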
3. The method of claim 1, wherein generating a second mask vector for the second speech signal using the frequency of the second speech signal comprises:
windowing and dividing the second voice signal according to a preset signal duration to obtain M voice segments, wherein M is a natural number;
respectively performing a fast Fourier transform on the M voice segments to extract the frequency of each of the M voice segments, so as to obtain M frequencies;
and respectively comparing the characteristics between the frequencies of adjacent voice segments among the M frequencies to obtain the second mask vector of the second voice signal.
4. The method of claim 1, wherein determining a first weight value between the first mask vector and a third speech signal and a second weight value between the second mask vector and the third speech signal, respectively, comprises:
performing a first weighting operation on the first voice signal, the first mask vector and the third voice signal to obtain the first weight value;
and performing a second weighting operation on the second voice signal, the second mask vector and the third voice signal to obtain the second weight value.
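The claim leaves the weighting operation itself open. One possible reading, shown purely for illustration, weights a per-segment similarity between each voice signal and the third (echo-suppressed) voice signal by the corresponding mask vector; the segment length and the energy-based feature are assumptions:

```python
import numpy as np

def segment_energy(signal, seg_len):
    """Per-segment energy, used here as a simple frame-level feature."""
    n = len(signal) // seg_len
    frames = signal[:n * seg_len].reshape(n, seg_len)
    return np.sum(frames ** 2, axis=1)

def weighted_similarity(signal, mask, third_signal, seg_len=320):
    """One possible 'weighting operation': a mask-weighted normalized correlation
    between the per-segment energies of a signal and of the echo-suppressed signal."""
    a = segment_energy(signal, seg_len)
    b = segment_energy(third_signal, seg_len)
    n = min(len(a), len(b), len(mask))
    w = np.abs(mask[:n])                 # emphasize segments whose frequency changed
    num = np.sum(w * a[:n] * b[:n])
    den = np.sqrt(np.sum(w * a[:n] ** 2) * np.sum(w * b[:n] ** 2)) + 1e-12
    return float(num / den)

# first_weight  = weighted_similarity(first_signal,  first_mask,  third_signal)
# second_weight = weighted_similarity(second_signal, second_mask, third_signal)
```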
5. The method of claim 1, wherein determining an echo suppression control parameter that matches the second speech signal based on the first weight value and the second weight value comprises:
performing a classification duty ratio operation on the first weight value and the second weight value to obtain a classification duty ratio table corresponding to the second voice signal, wherein attribute parameters in the classification duty ratio table are used for representing the result of echo suppression on the echo signal in the first voice signal;
and determining the attribute parameters in the classification duty ratio table as the echo suppression control parameters.
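The classification duty ratio operation is likewise not spelled out. A minimal sketch that expresses each weight as a share of their sum and buckets the far-end share into two levels (the 0.3 threshold and the key names are illustrative assumptions) might be:

```python
def classification_duty_ratio(first_weight, second_weight):
    """Express each weight as a share of the total and bucket the far-end share
    into two coarse levels; the threshold and key names are illustrative."""
    total = (first_weight + second_weight) or 1e-12
    table = {
        "near_end_ratio": first_weight / total,   # share tied to the first (captured) signal
        "far_end_ratio": second_weight / total,   # share tied to the second (far-end) signal
    }
    # attribute parameter describing how well the echo signal was suppressed
    table["level"] = "first" if table["far_end_ratio"] < 0.3 else "second"
    return table
```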
6. The method of claim 1, wherein after determining an echo suppression control parameter that matches the second speech signal based on the first weight value and the second weight value, the method further comprises:
in the case that the result is a first level, determining that the effect of the echo suppression satisfies a first mode, wherein the first mode is used for identifying that the third voice signal does not include a signal of the first voice signal that is the same as a signal in the second voice signal;
and in the case that the result is a second level, determining that the effect of the echo suppression does not satisfy a second mode, wherein the suppression rate of the echo suppression in the first mode is greater than that in the second mode.
7. The method of claim 1, wherein prior to generating the first mask vector for the first speech signal using the frequency of the first speech signal, the method further comprises:
determining a sending time stamp of the first voice signal and a target time delay;
and performing time delay compensation on the sending time stamp of the first voice signal according to the target time delay to obtain a receiving time stamp of the first voice signal.
8. The method of claim 7, wherein determining the target delay comprises:
determining a first time delay between the first voice signal and the third voice signal in the time domain;
determining a second time delay between the second voice signal and the third voice signal in the time domain;
determining a third time delay between the first voice signal and the third voice signal in the frequency domain;
determining a fourth time delay between the second voice signal and the third voice signal in the frequency domain;
and determining the mean variance of the first time delay, the second time delay, the third time delay and the fourth time delay to obtain the target time delay.
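As an illustration, the four time delays could be estimated with a plain time-domain cross-correlation and a frequency-domain GCC-PHAT, and then combined; treating their mean as the target time delay, with the variance kept as a confidence hint, is an assumption about what the claimed mean variance computation produces:

```python
import numpy as np

def lag_time_domain(x, y):
    """Delay of y relative to x (in samples) from the peak of the cross-correlation."""
    corr = np.correlate(y, x, mode="full")
    return int(np.argmax(corr) - (len(x) - 1))

def lag_freq_domain(x, y):
    """Delay estimate from generalized cross-correlation with phase transform (GCC-PHAT)."""
    n = len(x) + len(y)
    X, Y = np.fft.rfft(x, n), np.fft.rfft(y, n)
    spec = Y * np.conj(X)
    cc = np.fft.irfft(spec / (np.abs(spec) + 1e-12), n)
    lag = int(np.argmax(np.abs(cc)))
    return lag if lag <= n // 2 else lag - n

def target_delay(first, second, third):
    """Combine the four delay estimates of claim 8; the mean is used as the target
    time delay and the variance is kept only as a confidence hint."""
    delays = np.array([
        lag_time_domain(first, third),    # first time delay, time domain
        lag_time_domain(second, third),   # second time delay, time domain
        lag_freq_domain(first, third),    # third time delay, frequency domain
        lag_freq_domain(second, third),   # fourth time delay, frequency domain
    ], dtype=float)
    return float(delays.mean()), float(delays.var())
```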
9. The method of claim 7, wherein prior to determining the sending time stamp of the first voice signal and the target time delay, the method further comprises:
adding a target voice segment into the first voice signal to obtain a fourth voice signal, wherein the target voice segment is a voice segment of the first voice signal under a preset frequency;
and determining a delay range of the first voice signal according to the frequency change between the target voice segment and the fourth voice signal, wherein the target time delay is included in the delay range.
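One possible reading of claim 9 is a pilot-tone scheme. The sketch below, with an assumed 1 kHz tone, a 20 ms segment length, and hypothetical helper names, appends the target voice segment to form the fourth voice signal and later locates that frequency in a received copy to bound the delay:

```python
import numpy as np

SAMPLE_RATE = 16000
PILOT_HZ = 1000      # preset frequency of the target voice segment (assumed)
PILOT_MS = 20

def add_target_segment(first_signal):
    """Append a short tone at the preset frequency; the result plays the role of the
    fourth voice signal, and the insertion index is kept for later comparison."""
    n = int(SAMPLE_RATE * PILOT_MS / 1000)
    t = np.arange(n) / SAMPLE_RATE
    pilot = 0.5 * np.sin(2 * np.pi * PILOT_HZ * t)
    return np.concatenate([first_signal, pilot]), len(first_signal)

def delay_range(received_signal, insert_index, margin_ms=40):
    """Locate the pilot frequency in a received copy of the fourth voice signal and
    bound the delay by comparing its onset with the known insertion index."""
    seg = int(SAMPLE_RATE * PILOT_MS / 1000)
    k = int(round(PILOT_HZ * seg / SAMPLE_RATE))       # FFT bin of the pilot tone
    energy = [np.abs(np.fft.rfft(received_signal[i:i + seg]))[k]
              for i in range(0, len(received_signal) - seg + 1, seg)]
    onset = int(np.argmax(energy)) * seg               # where the pilot tone appears
    delay = max(0, onset - insert_index)
    margin = int(SAMPLE_RATE * margin_ms / 1000)
    return max(0, delay - margin), delay + margin      # range containing the target delay
```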
10. An apparatus for determining an echo suppression parameter, comprising:
a first generating module, configured to generate a first mask vector of a first voice signal by using a frequency of the first voice signal, where the first mask vector is used to identify a relationship between a frequency of a voice segment included in the first voice signal and a frequency of an adjacent voice segment, and the first voice signal is a voice signal sent by a first client to a second client;
a second generating module, configured to generate a second mask vector of a second voice signal by using a frequency of the second voice signal, where the second mask vector is used to identify a relationship between a frequency of a voice segment included in the second voice signal and a frequency of an adjacent voice segment, and the second voice signal is a voice signal sent by the second client to the first client;
a first determining module, configured to respectively determine a first weight value between the first mask vector and a third voice signal and a second weight value between the second mask vector and the third voice signal, wherein the third voice signal is a voice signal obtained by performing echo suppression processing on an echo signal in the first voice signal;
and a second determining module, configured to determine an echo suppression control parameter that matches the second speech signal based on the first weight value and the second weight value, where the echo suppression control parameter is used to indicate a result of echo suppression on the echo signal in the first speech signal.
11. A computer readable storage medium comprising a stored program, wherein the program, when run, performs the method of any one of claims 1 to 9.
12. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to execute the method according to any of the claims 1 to 9 by means of the computer program.
CN201910913057.3A 2019-09-25 2019-09-25 Method and device for determining echo suppression parameters, storage medium and electronic device Active CN110648679B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910913057.3A CN110648679B (en) 2019-09-25 2019-09-25 Method and device for determining echo suppression parameters, storage medium and electronic device

Publications (2)

Publication Number Publication Date
CN110648679A CN110648679A (en) 2020-01-03
CN110648679B true CN110648679B (en) 2023-07-14

Family

ID=68992099

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910913057.3A Active CN110648679B (en) 2019-09-25 2019-09-25 Method and device for determining echo suppression parameters, storage medium and electronic device

Country Status (1)

Country Link
CN (1) CN110648679B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111613235A (en) * 2020-05-11 2020-09-01 浙江华创视讯科技有限公司 Echo cancellation method and device

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8744844B2 (en) * 2007-07-06 2014-06-03 Audience, Inc. System and method for adaptive intelligent noise suppression
US8325909B2 (en) * 2008-06-25 2012-12-04 Microsoft Corporation Acoustic echo suppression
US8538035B2 (en) * 2010-04-29 2013-09-17 Audience, Inc. Multi-microphone robust noise suppression
US8750494B2 (en) * 2011-08-17 2014-06-10 Alcatel Lucent Clock skew compensation for acoustic echo cancellers using inaudible tones
US9633671B2 (en) * 2013-10-18 2017-04-25 Apple Inc. Voice quality enhancement techniques, speech recognition techniques, and related systems
JP6201949B2 * 2014-10-08 2017-09-27 JVC Kenwood Corporation Echo cancel device, echo cancel program and echo cancel method

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1405991A * 2001-09-20 2003-03-26 Mitsubishi Electric Corporation Echo treatment apparatus
JP2003295900A * 2002-04-05 2003-10-15 Nippon Telegr & Teleph Corp <Ntt> Method, apparatus, and program for speech processing
CN101933306A * 2007-12-31 2010-12-29 Alcatel-Lucent USA Inc. Method and apparatus for detecting and suppressing echo in packet networks
CN101953145A * 2008-01-25 2011-01-19 Fraunhofer-Gesellschaft zur Foerderung der angewandten Forschung e.V. Apparatus and method for calculating control information of an echo suppression filter, and apparatus and method for calculating a delay value
US8472616B1 * 2009-04-02 2013-06-25 Audience, Inc. Self calibration of envelope-based acoustic echo cancellation
US9378754B1 * 2010-04-28 2016-06-28 Knowles Electronics, Llc Adaptive spatial classifier for multi-microphone systems
CN103391381A * 2012-05-10 2013-11-13 ZTE Corporation Method and device for canceling echo

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Microphone array noise cancellation method using phase time-frequency masking; He Li; Signal Processing; full text *
Speech enhancement method combining time-varying filtering and time-frequency masking; Cheng Shuai; Zhang Haijian; Sun Hong; Signal Processing (04); full text *

Also Published As

Publication number Publication date
CN110648679A (en) 2020-01-03

Similar Documents

Publication Publication Date Title
US10644972B2 (en) Media connection channel quality testing system
CN107211063B (en) Nonlinear echo path detection
US9344579B2 (en) Variable step size echo cancellation with accounting for instantaneous interference
US9489963B2 (en) Correlation-based two microphone algorithm for noise reduction in reverberation
US20150358768A1 (en) Intelligent device connection for wireless media in an ad hoc acoustic network
US20090323926A1 (en) Methods and systems for voice communication
US20130083936A1 (en) Processing Audio Signals
US20150358767A1 (en) Intelligent device connection for wireless media in an ad hoc acoustic network
US20130028432A1 (en) Reverberation suppression device, reverberation suppression method, and computer-readable recording medium storing reverberation suppression program
CN109727607B (en) Time delay estimation method and device and electronic equipment
US9246545B1 (en) Adaptive estimation of delay in audio systems
KR102169993B1 (en) Echo suppression
CN110992923B (en) Echo cancellation method, electronic device, and storage device
US20160267923A1 (en) Communication apparatus, communication system, method of storing log data, and storage medium
CN103534942A (en) Processing audio signals
CN110648679B (en) Method and device for determining echo suppression parameters, storage medium and electronic device
CN104580764A (en) Ultrasound pairing signal control in teleconferencing system
WO2017173777A1 (en) Method and terminal for testing audio quality
CN113450755A (en) Method, device, storage medium and electronic device for reducing noise
CN112489679A (en) Evaluation method and device for acoustic echo cancellation algorithm and terminal equipment
CN112997249B (en) Voice processing method, device, storage medium and electronic equipment
JP5792877B1 (en) Delay time adjusting apparatus, method and program
CN112489680B (en) Evaluation method and device of acoustic echo cancellation algorithm and terminal equipment
CN103002171B (en) Method and device for processing audio signals
US20180254056A1 (en) Sounding device, audio transmission system, and audio analysis method thereof

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code
Ref country code: HK
Ref legal event code: DE
Ref document number: 40020282
Country of ref document: HK

SE01 Entry into force of request for substantive examination
GR01 Patent grant