WO2022227932A1 - Sound signal processing method, apparatus, and electronic device - Google Patents

Sound signal processing method, apparatus, and electronic device

Info

Publication number
WO2022227932A1
Authority
WO
WIPO (PCT)
Prior art keywords
spectrum, signal, end signal, signal spectrum, microphone
Application number
PCT/CN2022/081979
Other languages
English (en)
French (fr)
Inventor
周楠
徐杨飞
Original Assignee
北京有竹居网络技术有限公司
Application filed by 北京有竹居网络技术有限公司
Publication of WO2022227932A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M9/00 Arrangements for interconnection not involving centralised switching
    • H04M9/08 Two-way loud-speaking telephone systems with means for conditioning the signal, e.g. for suppressing echoes for one or both directions of traffic
    • H04M9/082 Two-way loud-speaking telephone systems with means for conditioning the signal, e.g. for suppressing echoes for one or both directions of traffic using echo cancellers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 Processing in the frequency domain

Definitions

  • Embodiments of the present disclosure relate to the field of computer technology, and in particular, to a sound signal processing method, apparatus, and electronic device.
  • In a voice call, the sound signal sent from another terminal to the terminal can generate an echo signal after being played through the speaker.
  • The microphone of the terminal can collect this echo signal. Therefore, when the terminal sends the sound signal collected by the microphone to the other terminal, the sound signal received by the other terminal may be mixed with the echo signal.
  • As a result, the quality of the voice call may be poor.
  • Embodiments of the present disclosure provide a sound signal processing method, apparatus, and electronic device, which improve the quality of a voice call between a first terminal and a second terminal by removing linear echo signals and nonlinear echo signals contained in a microphone signal.
  • In a first aspect, an embodiment of the present disclosure provides a sound signal processing method. The method includes: linearly filtering, based on a far-end signal from a second terminal, the microphone signal spectrum of a microphone signal collected by a first terminal to generate a linearly filtered signal spectrum, where the microphone signal is the sound signal collected after the far-end signal is played; determining, based on the far-end signal spectrum, the microphone signal spectrum, and the linearly filtered signal spectrum, an echo signal masking value of at least one frequency point in the linearly filtered signal spectrum; masking, using the determined at least one echo signal masking value, the echo signal spectrum superimposed in the linearly filtered signal spectrum to generate a target near-end signal spectrum; and converting the target near-end signal spectrum into a target near-end signal.
  • In a second aspect, an embodiment of the present disclosure provides a sound signal processing apparatus. The apparatus includes: a first generating unit configured to linearly filter, based on a far-end signal from a second terminal, the microphone signal spectrum of a microphone signal collected by a first terminal to generate a linearly filtered signal spectrum, where the microphone signal is the sound signal collected after the far-end signal is played; a determining unit configured to determine, based on the far-end signal spectrum, the microphone signal spectrum, and the linearly filtered signal spectrum, an echo signal masking value of at least one frequency point in the linearly filtered signal spectrum; a second generating unit configured to mask, using the determined at least one echo signal masking value, the echo signal spectrum superimposed in the linearly filtered signal spectrum to generate a target near-end signal spectrum; and a conversion unit configured to convert the target near-end signal spectrum into a target near-end signal.
  • In a third aspect, embodiments of the present disclosure provide an electronic device, comprising: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the sound signal processing method described in the first aspect.
  • In a fourth aspect, an embodiment of the present disclosure provides a computer-readable medium on which a computer program is stored; when the program is executed by a processor, the steps of the sound signal processing method described in the first aspect are implemented.
  • The sound signal processing method, apparatus, and electronic device remove the linear echo signal spectrum superimposed in the microphone signal spectrum by linearly filtering the microphone signal spectrum, that is, they remove the linear echo signal superimposed in the microphone signal.
  • Using the echo signal masking value of at least one frequency point in the linearly filtered signal spectrum, the superimposed nonlinear echo signal spectrum and residual linear echo signal spectrum in the linearly filtered signal spectrum are masked, that is, the superimposed nonlinear echo signal and residual linear echo signal are removed. By removing both the linear and nonlinear echo signals superimposed in the microphone signal, a target near-end signal containing fewer echo components is obtained, and the quality of the voice call between the first terminal and the second terminal is improved.
  • FIG. 1 is a flowchart of some embodiments of a sound signal processing method according to the present disclosure
  • FIG. 2 is a flowchart of determining an echo signal masking value in accordance with some embodiments of the present disclosure
  • FIG. 3 is a schematic structural diagram of a spectrum separation structure according to some embodiments of the present disclosure.
  • FIG. 4 is a schematic structural diagram of some embodiments of a sound signal processing apparatus according to the present disclosure.
  • FIG. 5 is an exemplary system architecture to which the sound signal processing method of some embodiments of the present disclosure may be applied;
  • FIG. 6 is a schematic diagram of the basic structure of an electronic device provided according to some embodiments of the present disclosure.
  • the term “including” and variations thereof are open-ended, i.e., “including but not limited to”.
  • the term “based on” is “based at least in part on.”
  • the term “one embodiment” means “at least one embodiment”; the term “another embodiment” means “at least one additional embodiment”; the term “some embodiments” means “at least some embodiments”. Relevant definitions of other terms will be given in the description below.
  • FIG. 1 shows the flow of some embodiments of the sound signal processing method according to the present disclosure.
  • the sound signal processing method includes the following steps:
  • Step 101: Perform linear filtering on the microphone signal spectrum of the microphone signal collected by the first terminal based on the far-end signal from the second terminal to generate a linearly filtered signal spectrum.
  • the first terminal may turn on the speaker.
  • the first terminal may acquire the far-end signal from the second terminal and the microphone signal collected by the first terminal.
  • the far-end signal may be a sound signal sent by the second terminal to the first terminal.
  • the first terminal can play the far-end signal through the speaker.
  • the microphone signal may be a sound signal collected by the first terminal through the microphone. It can be seen that when the speaker is turned on by the first terminal, the sound signal played by the speaker may be superimposed on the microphone signal.
  • the far-end signal collected by the first terminal and played through the speaker is called an echo signal.
  • the echo signal collected by the first terminal through the microphone includes a linear echo signal and a non-linear echo signal.
  • The speaker of the second terminal may or may not be turned on. When the second terminal turns on its speaker, the echo signal collected by the second terminal through its microphone may be superimposed on the above far-end signal; when the second terminal does not turn on its speaker, no such echo signal is superimposed on the far-end signal.
  • the first terminal may linearly filter the spectrum of the microphone signal based on the far-end signal to generate the spectrum of the linearly filtered signal.
  • the microphone signal spectrum may be the spectrum of the microphone signal.
  • the linearly filtered signal spectrum may be a spectrum formed by linearly filtering the microphone signal spectrum.
  • the first terminal may input the above-mentioned far-end signal into an echo signal spectrum prediction model to obtain a predicted echo signal spectrum.
  • the predicted echo signal spectrum may be the spectrum of the predicted echo signal.
  • the first terminal may eliminate the predicted echo signal spectrum from the microphone signal spectrum to obtain the linearly filtered signal spectrum.
  • the echo signal spectrum prediction model generates a predicted echo signal spectrum by processing the above-mentioned far-end signal.
  • the linearly filtered signal spectrum may still be superimposed with the spectrum of the nonlinear echo signal and the spectrum of the residual linear echo signal.
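The patent does not fix a particular linear filter; a common concrete choice for frequency-domain echo cancellation is a per-bin adaptive filter. The sketch below uses a single-tap normalized LMS filter per frequency bin as an illustrative stand-in — the function name, array shapes, and step size are assumptions, not the disclosed implementation:

```python
import numpy as np

def linear_filter_frames(far_spec, mic_spec, mu=0.5, eps=1e-8):
    """Per-bin single-tap NLMS echo canceller (an illustrative stand-in for
    the linear filter described above).

    far_spec, mic_spec: complex arrays of shape (frames, bins) holding the
    far-end and microphone signal spectra, frame by frame.
    Returns the linearly filtered signal spectrum (same shape)."""
    n_frames, n_bins = far_spec.shape
    W = np.zeros(n_bins, dtype=complex)          # estimated echo path per bin
    out = np.empty_like(mic_spec)
    for t in range(n_frames):
        X, D = far_spec[t], mic_spec[t]
        echo_est = W * X                         # predicted echo signal spectrum
        E = D - echo_est                         # linearly filtered signal spectrum
        out[t] = E
        # normalized LMS update of the per-bin filter weight
        W += mu * np.conj(X) * E / (np.abs(X) ** 2 + eps)
    return out
```

Here the weight vector `W` plays the role of the echo signal spectrum prediction model: `W * X` is the predicted echo signal spectrum, and subtracting it from the microphone signal spectrum yields the linearly filtered signal spectrum.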
  • Step 102: Determine, based on the spectrum of the far-end signal, the spectrum of the microphone signal, and the spectrum of the linearly filtered signal, an echo signal masking value of at least one frequency point in the spectrum of the linearly filtered signal.
  • the first terminal may determine the echo signal masking value of at least one frequency point in the spectrum of the linearly filtered signal based on the spectrum of the far-end signal, the spectrum of the microphone signal, and the spectrum of the linearly filtered signal.
  • the far-end signal spectrum may be the spectrum of the far-end signal.
  • the echo signal masking value can mask the echo signal spectrum superimposed in the linear filtered signal spectrum.
  • the spectrum of the echo signal superimposed in the spectrum of the linearly filtered signal may include the spectrum of the nonlinear echo signal and the spectrum of the residual linear echo signal.
  • Step 103: Mask the spectrum of the echo signal superimposed in the spectrum of the linearly filtered signal using the determined at least one echo signal masking value to generate the spectrum of the target near-end signal.
  • the first terminal may use the at least one echo signal masking value to mask the echo signal spectrum superimposed in the linearly filtered signal spectrum, thereby generating the target near-end signal spectrum.
  • the target near-end signal may be a sound signal collected by the first terminal that does not contain an echo signal.
  • the target near-end signal spectrum may be the frequency spectrum of the target near-end signal.
  • Step 104: Convert the target near-end signal spectrum into the target near-end signal.
  • the first terminal may convert the spectrum of the target near-end signal into the target near-end signal.
  • the first terminal may perform an inverse short-time Fourier transform on the target near-end signal spectrum to obtain the target near-end signal.
  • the spectrum of the far-end signal can be obtained by Fourier transform of the far-end signal
  • the spectrum of the microphone signal can be obtained by the Fourier transform of the microphone signal.
  • the Fourier transform may be a short-time Fourier transform.
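As an illustration of the transforms mentioned here, the following sketch converts a signal to its short-time spectrum and back, matching the forward STFT of step 101 and the inverse STFT of step 104. It uses `scipy.signal.stft`/`istft`; the signal content, sample rate, and frame length are arbitrary choices for the example:

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000                                   # assumed sample rate (illustrative)
t = np.arange(fs) / fs
mic = np.sin(2 * np.pi * 440 * t)            # stand-in for a microphone signal

# Forward STFT: microphone signal -> microphone signal spectrum
f, frames, mic_spec = stft(mic, fs=fs, nperseg=512)

# ... masking of echo components would modify mic_spec here ...

# Inverse STFT: target near-end signal spectrum -> target near-end signal
_, recovered = istft(mic_spec, fs=fs, nperseg=512)

# With no masking applied, the round trip reconstructs the signal.
assert np.allclose(recovered[:len(mic)], mic, atol=1e-6)
```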
  • With linear filtering alone, a terminal can remove the linear echo signal mixed in the microphone signal; however, because nonlinear echo remains mixed in the microphone signal, the overall echo removal is poor, and the quality of users' voice calls remains poor.
  • the spectrum of the linear echo signal superimposed in the spectrum of the microphone signal is removed, that is, the linear echo signal superimposed in the microphone signal is removed.
  • Using the echo signal masking value of at least one frequency point in the spectrum of the linearly filtered signal, the superimposed nonlinear echo signal spectrum and residual linear echo signal spectrum in the linearly filtered signal spectrum are masked, that is, the superimposed nonlinear echo signal and residual linear echo signal are removed. Therefore, the final target near-end signal contains fewer echo components.
  • the quality of the voice call between the first terminal and the second terminal is improved.
  • the first terminal may execute the foregoing step 102 according to the process shown in FIG. 2 .
  • the process includes:
  • Step 201: Input the spectrum of the far-end signal, the spectrum of the microphone signal, and the spectrum of the linearly filtered signal into a masking value determination model to obtain an echo signal masking value of at least one frequency point in the spectrum of the linearly filtered signal.
  • the masking value determination model generates an echo signal masking value of at least one frequency point in the linearly filtered signal spectrum by processing the far-end signal spectrum, the microphone signal spectrum and the linearly filtered signal spectrum. In some scenarios, the masking value determination model may output the echo signal masking value for each frequency bin in the spectrum of the linearly filtered signal.
  • the far-end signal spectrum, the microphone signal spectrum, and the linearly filtered signal spectrum can be processed by using the machine learning model to determine the echo signal masking value of at least one frequency point in the linearly filtered signal spectrum. Therefore, the echo signal masking value of at least one frequency point in the spectrum of the linearly filtered signal can be determined with higher accuracy and faster speed.
  • the execution body that trains the masking value determination model may train and generate the model in the following manner.
  • the first step is to obtain a sample set.
  • the samples in the sample set include a sample far-end signal spectrum, a sample microphone signal spectrum, a sample linearly filtered signal spectrum, and a sample echo signal masking value of at least one frequency point in the sample linearly filtered signal spectrum.
  • a sample far-end signal and a sample microphone signal can be collected. Further, in a manner similar to that described in other embodiments, the sample far-end signal is converted into a sample far-end signal spectrum, and the sample microphone signal is converted into a sample microphone signal spectrum. Also, the sample linearly filtered signal spectrum is generated in a manner similar to the generation of the linearly filtered signal spectrum.
  • the sample microphone signal may be a sound signal collected after the terminal plays the sample far-end signal through the speaker.
  • the sample microphone signal may be superimposed with an echo signal formed after the sample far-end signal is played by the speaker of the terminal.
  • the sample far-end signal spectrum, sample microphone signal spectrum, and sample linearly filtered signal spectrum included in a sample selected from the sample set are used as the input of the initial model, the at least one sample echo signal masking value included in the selected sample is used as the expected output of the initial model, and the initial model is trained to generate the masking value determination model.
  • the execution body that trains the masking value determination model may train and generate the model according to steps L1 to L6 shown below.
  • Step L1: Select samples from the sample set.
  • Step L2: Input the sample far-end signal spectrum, sample microphone signal spectrum, and sample linearly filtered signal spectrum included in the selected sample into the initial model to obtain at least one echo signal masking value output by the initial model.
  • the initial model may be a neural network model built for training the generated mask value determination model.
  • the initial model can generate at least one echo signal masking value by processing the input sample far-end signal spectrum, sample microphone signal spectrum and sample linear filter signal spectrum. There is a difference between at least one echo signal masking value output by the initial model and at least one sample echo signal masking value included in the selected sample.
  • Step L3: Use a preset loss function to calculate the degree of difference between the at least one echo signal masking value output by the initial model and the at least one sample echo signal masking value included in the selected sample.
  • the above loss function may include at least one of the following types of loss functions: 0-1 loss function, absolute value loss function, squared loss function, exponential loss function, logarithmic loss function, and the like.
  • Step L4: Adjust the model parameters of the initial model according to the calculated degree of difference.
  • the execution body that trains the masking value determination model can use the BP (backpropagation) algorithm, the GD (gradient descent) algorithm, or the like to adjust the model parameters of the initial model.
  • Step L5: In response to reaching a preset training end condition, use the trained initial model as the masking value determination model.
  • the above training end condition may include at least one of the following: the training time exceeds the preset time length, the number of training times exceeds the preset number of times, and the calculated difference degree is less than or equal to the preset difference threshold.
  • Step L6: In response to not reaching the above training end condition, continue to perform steps L1 to L5.
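Steps L1 to L6 form a standard supervised training loop. The toy sketch below is only a shape-level illustration: random features stand in for the sample spectra, a single sigmoid layer stands in for the initial model, and a squared loss with gradient descent (both named above) drives the updates; every name and dimension here is an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the sample set: rows concatenate magnitude features of
# the sample far-end, microphone, and linearly filtered spectra; targets are
# per-frequency-point sample echo-signal masking values in [0, 1].
n_samples, n_feat = 256, 12
X = rng.standard_normal((n_samples, n_feat))
true_w = rng.standard_normal(n_feat)
y = 1.0 / (1.0 + np.exp(-(X @ true_w)))       # sample masking values

def forward(w, X):
    """Initial model: a single sigmoid layer predicting masking values."""
    return 1.0 / (1.0 + np.exp(-(X @ w)))

w = np.zeros(n_feat)                          # initial model parameters
lr, loss_goal = 0.1, 1e-4
losses = []
for step in range(500):
    pred = forward(w, X)                      # L1+L2: run model on the samples
    err = pred - y
    losses.append(np.mean(err ** 2))          # L3: squared-loss difference
    grad = 2 * X.T @ (err * pred * (1 - pred)) / n_samples
    w -= lr * grad                            # L4: gradient-descent update
    if losses[-1] <= loss_goal:               # L5: training end condition met
        break                                 # L6: otherwise keep iterating
```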
  • Training the initial model with enough samples from the sample set improves the accuracy and speed with which the finally generated masking value determination model calculates echo signal masking values.
  • the accuracy and speed of calculating the echo signal masking value of at least one frequency point in the spectrum of the linearly filtered signal by the first terminal can be improved.
  • the masking value determination model includes a spectral separation structure.
  • the spectrum separation structure fits the first near-end signal spectrum and the residual signal spectrum contained in the linearly filtered signal spectrum by processing the far-end signal spectrum, the microphone signal spectrum, and the linearly filtered signal spectrum input to the masking value determination model.
  • the input of the spectrum separation structure includes the far-end signal spectrum, the microphone signal spectrum and the linear filter signal spectrum input to the mask value determination model.
  • the output of the spectral separation structure includes fitting the first near-end signal spectrum and the residual signal spectrum contained in the linearly filtered signal spectrum.
  • the fitted first near-end signal spectrum may still be superimposed with a certain echo signal spectrum.
  • a certain near-end signal spectrum may still be superimposed on the fitted residual signal spectrum.
  • the masking value determination model can use the spectrum separation structure included in the model to fit the spectrum of the first near-end signal included in the spectrum of the linearly filtered signal and the residual signal spectrum.
  • the spectral separation structure described above includes a plurality of spectral separation blocks connected in sequence.
  • the first-order spectrum separation block fits the first near-end signal spectrum and the residual signal spectrum contained in the linearly filtered signal spectrum based on the processing of the input far-end signal spectrum, microphone signal spectrum, and linearly filtered signal spectrum.
  • the spectral separation block of greater than or equal to the second order fits the first near-end signal spectrum and the residual signal spectrum contained in the linearly filtered signal spectrum based on the processing of the input spectrum and output spectrum of the spectral separation block of the previous order.
  • For the first-order spectrum separation block, its input includes the far-end signal spectrum, the microphone signal spectrum, and the linearly filtered signal spectrum input to the above spectrum separation structure, and its output includes the fitted first near-end signal spectrum and residual signal spectrum contained in the linearly filtered signal spectrum.
  • For a spectrum separation block of second order or higher, its input includes the input spectra and output spectra of the previous-order spectrum separation block, and its output includes the fitted first near-end signal spectrum and residual signal spectrum contained in the linearly filtered signal spectrum.
  • the spectrum separation structure shown in FIG. 3 includes spectrum separation block A, spectrum separation block B, and spectrum separation block C.
  • For spectrum separation block A, its input includes far-end signal spectrum 301, microphone signal spectrum 302, and linearly filtered signal spectrum 303, and its output includes first near-end signal spectrum 304 and residual signal spectrum 305.
  • For spectrum separation block B, its input includes far-end signal spectrum 301, microphone signal spectrum 302, linearly filtered signal spectrum 303, first near-end signal spectrum 304, and residual signal spectrum 305, and its output includes first near-end signal spectrum 306 and residual signal spectrum 307.
  • For spectrum separation block C, its input includes far-end signal spectrum 301, microphone signal spectrum 302, linearly filtered signal spectrum 303, first near-end signal spectrum 304, residual signal spectrum 305, first near-end signal spectrum 306, and residual signal spectrum 307, and its output includes first near-end signal spectrum 308 and residual signal spectrum 309.
  • N is an integer greater than or equal to 1.
  • A later spectrum separation block can fit the first near-end signal spectrum and the residual signal spectrum contained in the linearly filtered signal spectrum while comprehensively considering the input and output of the preceding block, and can therefore fit them more accurately. Thus, by sequentially processing the far-end signal spectrum, the microphone signal spectrum, and the linearly filtered signal spectrum input to the spectrum separation structure through multiple spectrum separation blocks, the first near-end signal spectrum and the residual signal spectrum contained in the linearly filtered signal spectrum are fitted more accurately.
  • each spectral separation block includes a first feature upscaling layer and a first feature compression layer.
  • the first feature upscaling layer is used to perform feature upscaling on the spectrum input to the spectrum separation block
  • the first feature compression layer is used for feature compression of partial frequency bands on the frequency spectrum output by the first feature upscaling layer.
  • the partial frequency bands on which the first feature compression layers of different spectrum separation blocks perform feature compression may be the same or different.
  • the partial frequency bands on which the first feature compression layers of different spectrum separation blocks perform feature compression may also overlap.
  • the width of a partial frequency band for feature compression performed by the first feature compression layer included in each spectrum separation block may be set according to specific requirements.
  • the first feature upscaling layer can be used to perform feature upscaling on the spectrum input to the spectrum separation block, and then the first feature compression layer can be used to perform feature compression on the upscaled spectrum.
  • the noise features contained in the spectrum can be reduced.
  • by first raising the feature dimension of the spectrum and then compressing the features of the upscaled spectrum, the noise features contained in the spectrum can be reduced more accurately.
  • the spectrum separation block can be assisted to more accurately fit the spectrum of the first near-end signal and the spectrum of the residual signal contained in the spectrum of the linearly filtered signal.
  • the masking value determination model includes a spectral synthesis layer.
  • the spectrum synthesis layer is used for synthesizing the spectrum of the first near-end signal and the spectrum of the residual signal output by the spectrum separation structure into the spectrum of the second near-end signal.
  • the second near-end signal spectrum may be a spectrum formed by synthesizing the near-end signal spectrum and the remaining signal spectrum input to the spectrum synthesis layer.
  • the spectrum of the first near-end signal and the spectrum of the remaining signals input to the spectrum synthesis layer may be integrated into the second near-end signal spectrum according to corresponding weights.
  • the input of the spectrum synthesis layer includes the first near-end signal spectrum F1 and the remaining signal spectrum F2.
  • the first near-end signal spectrum F1 and the residual signal spectrum F2 can be synthesized into the second near-end signal spectrum according to the formula "a1*F1 + a2*F2".
  • a1 is the weight corresponding to the first near-end signal spectrum F1
  • a2 is the weight corresponding to the remaining signal spectrum F2.
  • the weight corresponding to the first near-end signal spectrum may include a weight corresponding to each frequency point in the first near-end signal spectrum
  • the weight corresponding to the residual signal spectrum may include a weight corresponding to each frequency point in the residual signal spectrum. It should be noted that the weight corresponding to the first near-end signal spectrum and the weight corresponding to the residual signal spectrum may be set according to actual requirements and are not specifically limited here.
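The per-frequency-point synthesis "a1*F1 + a2*F2" can be written directly in array form; the spectra and weights below are illustrative values only:

```python
import numpy as np

F1 = np.array([1.0 + 1.0j, 0.5 + 0.0j, 0.0 + 2.0j])  # first near-end spectrum
F2 = np.array([0.2 + 0.0j, 0.1 + 0.1j, 0.0 + 0.5j])  # residual signal spectrum

# One weight per frequency point (chosen arbitrarily here; the patent leaves
# the weights to be set "according to actual requirements").
a1 = np.array([0.9, 0.8, 0.7])
a2 = 1.0 - a1

F_second = a1 * F1 + a2 * F2          # second near-end signal spectrum
# first frequency point: 0.9*(1+1j) + 0.1*(0.2) = 0.92 + 0.9j
```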
  • a certain near-end signal spectrum may still be superimposed on the residual signal spectrum output by the spectrum separation structure.
  • the masking value determination model includes a second feature compression layer.
  • the second feature compression layer fits the third near-end signal spectrum by performing full-band feature compression on the second near-end signal spectrum output by the spectrum synthesis layer.
  • the second feature compression layer its input includes the second near-end signal spectrum output by the spectrum synthesis layer, and its output includes the third near-end signal spectrum.
  • the full-band feature compression is performed on the spectrum of the second near-end signal, that is, feature compression is performed on the spectrum of the second near-end signal in the entire frequency range.
  • the spectrum of the echo signal superimposed in the spectrum of the second near-end signal can be further reduced.
  • the first feature compression layer and the second feature compression layer are Gated Recurrent Unit (GRU) layers.
  • A gated recurrent unit processes data by combining the input data of the model with intermediate data generated by the model. Therefore, when performing feature compression, the first feature compression layer and the second feature compression layer can combine the input spectra of the masking value determination model with the fitted spectra, thereby improving the accuracy of feature compression.
  • the masking value determination model includes a fully connected layer. Based on the third near-end signal spectrum output by the second feature compression layer, the fully connected layer determines the echo signal masking value of at least one frequency point in the linearly filtered signal spectrum input to the masking value determination model.
  • the echo signal spectrum superimposed in the second near-end signal spectrum is first reduced by the second feature compression layer, and the echo signal masking value of at least one frequency point in the linearly filtered signal spectrum is then determined by the fully connected layer, so that the echo signal masking value of the at least one frequency point can be determined more accurately.
  • the echo signal masking value is the ratio of the modulus of the amplitude of the third near-end signal spectrum output by the second feature compression layer to the modulus of the amplitude of the linearly filtered signal spectrum at the same frequency point.
  • in the third near-end signal spectrum, frequency point f1 corresponds to amplitude m1;
  • in the linearly filtered signal spectrum, frequency point f1 corresponds to amplitude m2.
  • the echo signal masking value of the frequency point f1 may be the ratio of the modulus of m1 to the modulus of m2.
  • the amplitude corresponding to the frequency point may be a complex number.
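Since the amplitudes may be complex, the masking value is the ratio of their moduli. A minimal sketch with illustrative values (the first frequency point reproduces the m1/m2 example above):

```python
import numpy as np

# Complex amplitudes at the same frequency points (illustrative values).
third_near = np.array([3.0 + 4.0j, 1.0 + 0.0j])    # m1 at f1 is 3+4j, |m1| = 5
lin_filtered = np.array([5.0 + 0.0j, 2.0 + 0.0j])  # m2 at f1 is 5,    |m2| = 5

# Masking value per frequency point: ratio of the amplitude moduli.
mask = np.abs(third_near) / np.abs(lin_filtered)
# mask is [1.0, 0.5]
```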
  • the first terminal may perform the foregoing step 101 in the following manner.
  • In the first step, short-time Fourier transform is performed on the microphone signal and the far-end signal, respectively, to generate the microphone signal spectrum and the far-end signal spectrum.
  • the second step is to input the far-end signal spectrum into the linear filter to obtain the predicted echo signal spectrum.
  • the predicted echo signal spectrum may be a linear echo signal spectrum predicted by a linear filter.
  • In the third step, the predicted echo signal spectrum is removed from the microphone signal spectrum to generate the linearly filtered signal spectrum.
  • the spectrum extracted by the short-time Fourier transform has high stability. Therefore, performing a short-time Fourier transform on the far-end signal helps the linear filter predict the linear echo signal spectrum, which in turn helps generate the linearly filtered signal spectrum.
  • the first terminal may perform the foregoing step 103 in the following manner.
  • the amplitude of each frequency point in the at least one frequency point is multiplied by the corresponding echo signal masking value to generate the target near-end signal spectrum.
  • the non-linear echo signal spectrum and the residual linear echo signal spectrum superimposed in the linearly filtered signal spectrum can be removed.
  • the present disclosure provides some embodiments of a sound signal processing apparatus.
  • the apparatus embodiments correspond to the method embodiments shown in FIG. 1 .
  • the sound signal processing apparatus of this embodiment includes: a first generating unit 401 , a determining unit 402 , a second generating unit 403 and a converting unit 404 .
  • the first generating unit 401 is configured to: based on the far-end signal from the second terminal, perform linear filtering on the microphone signal spectrum of the microphone signal collected by the first terminal to generate a linearly filtered signal spectrum, wherein the microphone signal is the sound signal collected after the far-end signal is played.
  • the determining unit 402 is configured to: determine an echo signal masking value of at least one frequency point in the spectrum of the linearly filtered signal based on the spectrum of the far-end signal, the spectrum of the microphone signal and the spectrum of the linearly filtered signal.
  • the second generating unit 403 is configured to: mask the spectrum of the echo signal superimposed in the spectrum of the linearly filtered signal by using the determined at least one echo signal masking value to determine the spectrum of the target near-end signal.
  • the converting unit 404 is configured to: convert the target near-end signal spectrum into the target near-end signal.
  • for the specific processing of the first generating unit 401, the determining unit 402, the second generating unit 403 and the converting unit 404 of the sound signal processing apparatus, and the technical effects brought about by them, reference may be made to the related descriptions of step 101, step 102, step 103 and step 104 in the embodiment corresponding to FIG. 1, which will not be repeated here.
  • the determining unit 402 is further configured to: input the spectrum of the far-end signal, the spectrum of the microphone signal, and the spectrum of the linearly filtered signal into the masking value determination model to obtain the masking value of the echo signal of at least one frequency point in the spectrum of the linearly filtered signal .
  • the masking value determination model is trained and generated by: acquiring a sample set, wherein the samples in the sample set include a sample far-end signal spectrum, a sample microphone signal spectrum, a sample linearly filtered signal spectrum, and a sample echo signal masking value of at least one frequency point in the sample linearly filtered signal spectrum; and taking the sample far-end signal spectrum, the sample microphone signal spectrum and the sample linearly filtered signal spectrum included in a sample selected from the sample set as the input of an initial model, and the at least one sample echo signal masking value included in the selected sample as the expected output of the initial model, to train and generate the masking value determination model.
  • the masking value determination model includes a spectral separation structure, wherein the spectral separation structure fits the first near-end signal spectrum and the remaining signal spectrum contained in the linearly filtered signal spectrum, based on processing of the far-end signal spectrum, the microphone signal spectrum and the linearly filtered signal spectrum input to the masking value determination model.
  • the spectrum separation structure includes a plurality of spectrum separation blocks connected in sequence, wherein the first-order spectrum separation block fits the first near-end signal spectrum and the remaining signal spectrum contained in the linearly filtered signal spectrum based on processing of the input far-end signal spectrum, microphone signal spectrum and linearly filtered signal spectrum, and each spectrum separation block of the second order or higher fits the first near-end signal spectrum and the remaining signal spectrum contained in the linearly filtered signal spectrum based on processing of the input spectra and output spectra of the preceding spectrum separation block.
  • each spectral separation block includes a first feature upscaling layer and a first feature compression layer, wherein the first feature upscaling layer is used to perform feature upscaling on the spectra input to the spectral separation block, and the first feature compression layer is used to perform feature compression over partial frequency bands on the spectra output by the first feature upscaling layer.
  • the masking value determination model includes a spectral integration layer, wherein the spectral integration layer is configured to integrate the first near-end signal spectrum and the residual signal spectrum output by the spectral separation structure into a second near-end signal spectrum.
  • the masking value determination model includes a second feature compression layer, wherein the second feature compression layer fits the third near-end signal spectrum by performing full-band feature compression on the second near-end signal spectrum output by the spectral integration layer.
  • the masking value determination model includes a fully connected layer, wherein the fully connected layer determines the echo signal masking value of at least one frequency point in the spectrum of the linearly filtered signal based on the third near-end signal spectrum output by the second feature compression layer .
  • the echo signal masking value is the ratio of the amplitude moduli, at the same frequency point, of the third near-end signal spectrum output by the second feature compression layer and the linearly filtered signal spectrum.
  • the first feature compression layer and the second feature compression layer are gated recurrent unit layers.
  • the first generating unit 401 is further configured to: perform a short-time Fourier transform on the microphone signal and the far-end signal, respectively, to generate the microphone signal spectrum and the far-end signal spectrum; input the far-end signal spectrum into the linear filter to obtain the predicted echo signal spectrum; and remove the predicted echo signal spectrum from the microphone signal spectrum to generate the linearly filtered signal spectrum.
  • the second generating unit 403 is further configured to: for the linearly filtered signal spectrum, multiply the amplitude of each frequency point in the at least one frequency point by the corresponding echo signal masking value to generate the target near-end signal spectrum.
  • FIG. 5 illustrates an exemplary system architecture in which the sound signal processing method of some embodiments of the present disclosure may be applied.
  • the system architecture may include a terminal 501 and a terminal 502 .
  • the terminal 501 and the terminal 502 may interact through the network.
  • a network may include various connection types such as wired, wireless communication links, or fiber optic cables.
  • Various applications may be installed on the terminal 501 and the terminal 502 .
  • the terminal 501 and the terminal 502 may have voice calling applications installed.
  • the terminal 501 and the terminal 502 may send the sound signal collected by the microphone to each other.
  • the terminal 501 and the terminal 502 may be hardware or software.
  • when terminal 501 and terminal 502 are hardware, they may be various electronic devices equipped with microphones and speakers, including but not limited to smart phones, tablet computers, laptop computers, desktop computers, and the like.
  • when the terminal 501 and the terminal 502 are software, they can be installed in the electronic devices listed above, and can be implemented as a plurality of software or software modules, or as a single software or software module. There is no specific limitation here.
  • the terminal 501 may perform linear filtering on the microphone signal spectrum of the collected microphone signal based on the far-end signal from the terminal 502 to generate the linearly filtered signal spectrum. Then, the terminal 501 may determine an echo signal masking value of at least one frequency point in the linearly filtered signal spectrum based on the far-end signal spectrum, the microphone signal spectrum, and the linearly filtered signal spectrum. Further, the terminal 501 may use the determined at least one echo signal masking value to mask the echo signal spectrum superimposed in the linear filtered signal spectrum to determine the target near-end signal spectrum. Finally, the terminal 501 can convert the target near-end signal spectrum into the target near-end signal.
  • the sound signal processing method provided by the embodiments of the present disclosure may be executed by the terminal 501 or the terminal 502 , and correspondingly, the sound signal processing apparatus may be provided in the terminal 501 or the terminal 502 .
  • the number of terminals in FIG. 5 is merely illustrative. There can be any number of terminals according to implementation needs.
  • referring to FIG. 6, it shows a schematic structural diagram of an electronic device (e.g., the terminal in FIG. 5) suitable for implementing some embodiments of the present disclosure.
  • terminals in some embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players) and vehicle-mounted terminals (e.g., car navigation terminals), as well as stationary terminals such as digital TVs, desktop computers, and the like.
  • the electronic device shown in FIG. 6 is only an example, and should not impose any limitation on the function and scope of use of the embodiments of the present disclosure.
  • the electronic device may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 601, which may execute various appropriate operations and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage device 608 into a random access memory (RAM) 603.
  • in the RAM 603, various programs and data required for the operation of the electronic device are also stored.
  • the processing device 601, the ROM 602, and the RAM 603 are connected to each other through a bus 604.
  • An input/output (I/O) interface 605 is also connected to bus 604 .
  • the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 607 including, for example, a liquid crystal display (LCD), speakers, vibrators, etc.; storage devices 608 including, for example, magnetic tape, hard disks, etc.; and a communication device 609.
  • the communication device 609 may allow the electronic device to communicate wirelessly or by wire with other devices to exchange data. While FIG. 6 illustrates an electronic device having various devices, it should be understood that not all of the illustrated devices are required to be implemented or provided; more or fewer devices may alternatively be implemented or provided. Each block shown in FIG. 6 may represent one device, or may represent multiple devices as required.
  • the processes described above with reference to the flowcharts may be implemented as computer software programs.
  • some embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated in the flowchart.
  • the computer program may be downloaded and installed from the network via the communication device 609, or from the storage device 608, or from the ROM 602.
  • when the computer program is executed by the processing apparatus 601, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are executed.
  • the computer-readable medium described in some embodiments of the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the above two.
  • the computer-readable storage medium can be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or a combination of any of the above. More specific examples of computer-readable storage media may include, but are not limited to: electrical connections with one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fibers, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
  • a computer-readable storage medium can be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code therein. Such propagated data signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • a computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can transmit, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device .
  • Program code embodied on a computer readable medium may be transmitted using any suitable medium including, but not limited to, electrical wire, optical fiber cable, RF (radio frequency), etc., or any suitable combination of the foregoing.
  • the client and the server may communicate using any currently known or future-developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital data communication in any form or medium (e.g., a communication network).
  • examples of communication networks include local area networks ("LAN"), wide area networks ("WAN"), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future-developed networks.
  • the above-mentioned computer-readable medium may be included in the above-mentioned electronic device, or may exist independently without being assembled into the electronic device.
  • the above-mentioned computer-readable medium carries one or more programs, and when the above-mentioned one or more programs are executed by the electronic device, the electronic device: based on the remote signal from the second terminal, the microphone signal collected by the first terminal is The spectrum of the microphone signal is linearly filtered to generate a linearly filtered signal spectrum, where the microphone signal is the sound signal collected after playing the far-end signal; The echo signal masking value of at least one frequency point; using the determined at least one echo signal masking value to mask the echo signal spectrum superimposed in the linear filter signal spectrum to generate the target near-end signal spectrum; convert the target near-end signal spectrum into the target near-end signal.
  • computer program code for carrying out operations of some embodiments of the present disclosure may be written in one or more programming languages or combinations thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., connected via the Internet using an Internet service provider).
  • each block in the flowchart or block diagrams may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions.
  • the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by dedicated hardware-based systems that perform the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
  • the units described in some embodiments of the present disclosure may be implemented by means of software or by means of hardware. In some cases, the names of these units do not limit the units themselves; for example, the conversion unit can also be described as a unit that "converts the target near-end signal spectrum into the target near-end signal".
  • exemplary types of hardware logic components include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chips (SOCs), Complex Programmable Logic Devices (CPLDs), and more.
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with the instruction execution system, apparatus or device.
  • the machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the foregoing.
  • more specific examples of machine-readable storage media would include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fibers, compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.


Abstract

Embodiments of the present disclosure disclose a sound signal processing method and apparatus and an electronic device. One specific implementation of the method includes: based on a far-end signal from a second terminal, performing linear filtering on the microphone signal spectrum of a microphone signal collected by a first terminal to generate a linearly filtered signal spectrum, where the microphone signal is the sound signal collected after the far-end signal is played; determining, based on the far-end signal spectrum, the microphone signal spectrum and the linearly filtered signal spectrum, an echo signal masking value of at least one frequency point in the linearly filtered signal spectrum; masking the echo signal spectrum superimposed in the linearly filtered signal spectrum with the determined at least one echo signal masking value to generate a target near-end signal spectrum; and converting the target near-end signal spectrum into a target near-end signal. Thus, by removing the linear and non-linear echo signals contained in the microphone signal, the quality of the voice call between the first terminal and the second terminal is improved.

Description

Sound signal processing method and apparatus, and electronic device
Cross-Reference to Related Applications
This application claims priority to Chinese Patent Application No. 202110456216.9, filed on April 26, 2021 and entitled "Sound Signal Processing Method, Apparatus and Electronic Device", the entire contents of which are incorporated herein by reference.
Technical Field
Embodiments of the present disclosure relate to the field of computer technology, and in particular to a sound signal processing method and apparatus and an electronic device.
Background
During a voice call between different terminals, if one of the terminals turns on its speaker, the sound signals sent by the other terminals to that terminal produce echo signals after being played through the speaker. The microphone of that terminal can then pick up the produced echo signals. As a result, when that terminal sends the sound signal collected by its microphone to the other terminals, the sound signal received by the other terminals may be mixed with echo signals.
If the sound signal provided to a user is mixed with many echo signals, the quality of the voice call may be poor.
Summary
This summary is provided to introduce concepts in a brief form; these concepts will be described in detail in the detailed description that follows. This summary is not intended to identify key or essential features of the claimed technical solution, nor is it intended to be used to limit the scope of the claimed technical solution.
Embodiments of the present disclosure provide a sound signal processing method and apparatus and an electronic device, which improve the quality of a voice call between a first terminal and a second terminal by removing the linear and non-linear echo signals contained in a microphone signal.
In a first aspect, embodiments of the present disclosure provide a sound signal processing method, including: based on a far-end signal from a second terminal, performing linear filtering on the microphone signal spectrum of a microphone signal collected by a first terminal to generate a linearly filtered signal spectrum, where the microphone signal is the sound signal collected after the far-end signal is played; determining, based on the far-end signal spectrum, the microphone signal spectrum and the linearly filtered signal spectrum, an echo signal masking value of at least one frequency point in the linearly filtered signal spectrum; masking the echo signal spectrum superimposed in the linearly filtered signal spectrum with the determined at least one echo signal masking value to generate a target near-end signal spectrum; and converting the target near-end signal spectrum into a target near-end signal.
In a second aspect, embodiments of the present disclosure provide a sound signal processing apparatus, including: a first generating unit configured to, based on a far-end signal from a second terminal, perform linear filtering on the microphone signal spectrum of a microphone signal collected by a first terminal to generate a linearly filtered signal spectrum, where the microphone signal is the sound signal collected after the far-end signal is played; a determining unit configured to determine, based on the far-end signal spectrum, the microphone signal spectrum and the linearly filtered signal spectrum, an echo signal masking value of at least one frequency point in the linearly filtered signal spectrum; a second generating unit configured to mask the echo signal spectrum superimposed in the linearly filtered signal spectrum with the determined at least one echo signal masking value to determine a target near-end signal spectrum; and a converting unit configured to convert the target near-end signal spectrum into the target near-end signal.
In a third aspect, embodiments of the present disclosure provide an electronic device, including: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the sound signal processing method described in the first aspect.
In a fourth aspect, embodiments of the present disclosure provide a computer-readable medium on which a computer program is stored, where the program, when executed by a processor, implements the steps of the sound signal processing method described in the first aspect.
In the sound signal processing method and apparatus and the electronic device provided by the embodiments of the present disclosure, linear filtering of the microphone signal spectrum removes the linear echo signal spectrum superimposed in the microphone signal spectrum, that is, removes the linear echo signal superimposed in the microphone signal. The echo signal masking values of at least one frequency point in the linearly filtered signal spectrum mask the non-linear echo signal spectrum and the residual linear echo signal spectrum superimposed in the linearly filtered signal spectrum, that is, remove the non-linear echo signal and the residual linear echo signal superimposed in the microphone signal. Thus, by removing the linear and non-linear echo signals superimposed in the microphone signal, a target near-end signal containing fewer echo signals is obtained, thereby improving the quality of the voice call between the first terminal and the second terminal.
Brief Description of the Drawings
The above and other features, advantages and aspects of the embodiments of the present disclosure will become more apparent with reference to the following detailed description taken in conjunction with the accompanying drawings. Throughout the drawings, identical or similar reference numerals denote identical or similar elements. It should be understood that the drawings are schematic and that components and elements are not necessarily drawn to scale.
FIG. 1 is a flowchart of some embodiments of a sound signal processing method according to the present disclosure;
FIG. 2 is a flowchart of determining an echo signal masking value in some embodiments of the present disclosure;
FIG. 3 is a schematic structural diagram of a spectrum separation structure in some embodiments of the present disclosure;
FIG. 4 is a schematic structural diagram of some embodiments of a sound signal processing apparatus according to the present disclosure;
FIG. 5 is an exemplary system architecture in which the sound signal processing method of some embodiments of the present disclosure may be applied;
FIG. 6 is a schematic diagram of the basic structure of an electronic device provided according to some embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure can be implemented in various forms and should not be construed as being limited to the embodiments set forth here; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of protection of the present disclosure.
It should be understood that the steps described in the method implementations of the present disclosure may be performed in different orders and/or in parallel. Furthermore, the method implementations may include additional steps and/or omit performing the steps shown. The scope of the present disclosure is not limited in this respect.
The term "including" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" means "at least partially based on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions of other terms will be given in the description below.
It should be noted that concepts such as "first" and "second" mentioned in the present disclosure are only used to distinguish different apparatuses, modules or units, and are not used to limit the order of, or the interdependence between, the functions performed by these apparatuses, modules or units.
It should be noted that the modifiers "one" and "multiple" mentioned in the present disclosure are illustrative rather than restrictive, and those skilled in the art should understand that, unless the context clearly indicates otherwise, they should be understood as "one or more".
The names of the messages or information exchanged between multiple apparatuses in the implementations of the present disclosure are for illustrative purposes only and are not used to limit the scope of these messages or information.
Referring to FIG. 1, which shows the flow of some embodiments of a sound signal processing method according to the present disclosure. As shown in FIG. 1, the sound signal processing method includes the following steps:
Step 101: based on a far-end signal from a second terminal, perform linear filtering on the microphone signal spectrum of a microphone signal collected by a first terminal to generate a linearly filtered signal spectrum.
In this embodiment, during a call between the first terminal (e.g., terminal 501 shown in FIG. 5) and the second terminal (e.g., terminal 502 shown in FIG. 5), the first terminal may turn on its speaker. The first terminal can obtain the far-end signal from the second terminal and the microphone signal collected by the first terminal. The far-end signal may be the sound signal sent by the second terminal to the first terminal. The first terminal can play the far-end signal through the speaker.
The microphone signal may be the sound signal collected by the first terminal through its microphone. It can be seen that, when the first terminal turns on its speaker, the sound signal played by the speaker may be superimposed in the microphone signal.
In practical applications, the played far-end signal collected by the first terminal is called an echo signal. The echo signal collected by the first terminal through its microphone includes a linear echo signal and a non-linear echo signal.
It should be noted that, during a call, the second terminal may or may not turn on its speaker. Therefore, when the second terminal turns on its speaker, the above far-end signal may be superimposed with the echo signal collected by the second terminal through its microphone; when the second terminal does not turn on its speaker, the above far-end signal may not be superimposed with such an echo signal.
In this embodiment, the first terminal may perform linear filtering on the microphone signal spectrum based on the far-end signal to generate the linearly filtered signal spectrum.
The microphone signal spectrum may be the spectrum of the microphone signal.
The linearly filtered signal spectrum may be the spectrum formed after linear filtering is performed on the microphone signal spectrum.
In some scenarios, the first terminal may input the above far-end signal into an echo signal spectrum prediction model to obtain a predicted echo signal spectrum. Here, the predicted echo signal spectrum may be the spectrum of the predicted echo signal. Further, the first terminal may remove the predicted echo signal spectrum from the microphone signal spectrum to obtain the linearly filtered signal spectrum. Here, the echo signal spectrum prediction model generates the predicted echo signal spectrum by processing the above far-end signal.
In practical applications, after linear filtering, a non-linear echo signal spectrum and a residual linear echo signal spectrum may still be superimposed in the microphone signal spectrum.
Step 102: determine, based on the far-end signal spectrum, the microphone signal spectrum and the linearly filtered signal spectrum, an echo signal masking value of at least one frequency point in the linearly filtered signal spectrum.
In this embodiment, the first terminal may determine the echo signal masking value of at least one frequency point in the linearly filtered signal spectrum based on the far-end signal spectrum, the microphone signal spectrum and the linearly filtered signal spectrum.
The far-end signal spectrum may be the spectrum of the far-end signal.
The echo signal masking value can mask the echo signal spectrum superimposed in the linearly filtered signal spectrum. In practical applications, the echo signal spectrum superimposed in the linearly filtered signal spectrum may include a non-linear echo signal spectrum and a residual linear echo signal spectrum.
Step 103: mask the echo signal spectrum superimposed in the linearly filtered signal spectrum with the determined at least one echo signal masking value to generate a target near-end signal spectrum.
In this embodiment, the first terminal may use the above at least one echo signal masking value to mask the echo signal spectrum superimposed in the linearly filtered signal spectrum, thereby generating the target near-end signal spectrum.
The target near-end signal may be the sound signal collected by the first terminal that contains no echo signal. The target near-end signal spectrum may be the spectrum of the target near-end signal.
Step 104: convert the target near-end signal spectrum into the target near-end signal.
In this embodiment, the first terminal may convert the target near-end signal spectrum into the target near-end signal.
In some scenarios, the first terminal may perform an inverse short-time Fourier transform on the target near-end signal spectrum to obtain the target near-end signal.
It should be noted that the far-end signal spectrum can be obtained by a Fourier transform of the far-end signal, and the microphone signal spectrum can be obtained by a Fourier transform of the microphone signal. In some scenarios, the Fourier transform may be a short-time Fourier transform.
In the related art, after the microphone signal is collected, the terminal can remove the linear echo signal mixed in the microphone signal. Because non-linear echo is still mixed in the microphone signal, the echo removal effect is poor, and the quality of the user's voice call remains poor.
In this embodiment, linear filtering of the microphone signal spectrum removes the linear echo signal spectrum superimposed in the microphone signal spectrum, that is, removes the linear echo signal superimposed in the microphone signal. The echo signal masking values of at least one frequency point in the linearly filtered signal spectrum mask the non-linear echo signal spectrum and the residual linear echo signal spectrum superimposed in the linearly filtered signal spectrum, that is, remove the non-linear echo signal and the residual linear echo signal superimposed in the microphone signal. As a result, the finally obtained target near-end signal contains fewer echo signals, thereby improving the quality of the voice call between the first terminal and the second terminal.
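The four steps above can be sketched end to end. This is a minimal illustration, not the patented implementation: it uses SciPy's STFT/iSTFT, a zero echo prediction as a stand-in for the linear filter of step 101, and a fixed masking value standing in for the masking value determination model of step 102:

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000
rng = np.random.default_rng(0)
mic = rng.standard_normal(fs)        # microphone signal (near-end + echo), stand-in data

# Step 101 (simplified): STFT of the microphone signal; the linear filter's
# predicted echo spectrum would be subtracted here. A zero prediction is used
# as a placeholder for the filter output.
_, _, mic_spec = stft(mic, fs=fs, nperseg=256)
predicted_echo_spec = np.zeros_like(mic_spec)
filtered_spec = mic_spec - predicted_echo_spec

# Steps 102-103 (simplified): per-frequency-point masking values in [0, 1]
# (a fixed 0.9 here, standing in for the masking value determination model).
mask = np.full(filtered_spec.shape, 0.9)
target_spec = filtered_spec * mask

# Step 104: an inverse STFT converts the target near-end spectrum back to a signal.
_, target = istft(target_spec, fs=fs, nperseg=256)
```

With the default Hann window at 50% overlap the STFT/iSTFT pair reconstructs perfectly, so with this constant mask the output is simply `0.9 * mic`; in the patented method the mask varies per frequency point and suppresses only the echo components.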
In some embodiments, the first terminal may perform the above step 102 according to the flow shown in FIG. 2. The flow includes:
Step 201: input the far-end signal spectrum, the microphone signal spectrum and the linearly filtered signal spectrum into a masking value determination model to obtain the echo signal masking value of at least one frequency point in the linearly filtered signal spectrum.
The masking value determination model generates the echo signal masking value of at least one frequency point in the linearly filtered signal spectrum by processing the far-end signal spectrum, the microphone signal spectrum and the linearly filtered signal spectrum. In some scenarios, the masking value determination model can output the echo signal masking value of every frequency point in the linearly filtered signal spectrum.
Thus, a machine learning model can be used to process the far-end signal spectrum, the microphone signal spectrum and the linearly filtered signal spectrum to determine the echo signal masking value of at least one frequency point in the linearly filtered signal spectrum. The masking values can therefore be determined with higher accuracy and at higher speed.
In some embodiments, the executing body that trains the masking value determination model may train and generate it in the following manner.
In the first step, a sample set is acquired.
A sample in the sample set includes a sample far-end signal spectrum, a sample microphone signal spectrum, a sample linearly filtered signal spectrum, and a sample echo signal masking value of at least one frequency point in the sample linearly filtered signal spectrum.
In practical applications, during a call between two terminals, a sample far-end signal and a sample microphone signal can be collected. Further, in a manner similar to that described in other embodiments, the sample far-end signal is converted into the sample far-end signal spectrum and the sample microphone signal into the sample microphone signal spectrum, and the sample linearly filtered signal spectrum is generated in a manner similar to that of generating the linearly filtered signal spectrum.
It is easy to understand that the sample microphone signal may be the sound signal collected by a terminal after the sample far-end signal is played through its speaker. The sample microphone signal may be superimposed with the echo signal formed after the sample far-end signal is played through the speaker of the terminal.
In the second step, the sample far-end signal spectrum, the sample microphone signal spectrum and the sample linearly filtered signal spectrum included in a sample selected from the sample set are taken as the input of an initial model, and the at least one sample echo signal masking value included in the selected sample is taken as the expected output of the initial model, to train and generate the masking value determination model.
Specifically, the executing body that trains the masking value determination model may train and generate it according to the following steps L1 to L6.
Step L1: select a sample from the sample set.
Step L2: input the sample far-end signal spectrum, the sample microphone signal spectrum and the sample linearly filtered signal spectrum included in the selected sample into the initial model to obtain at least one echo signal masking value output by the initial model.
The initial model may be a neural network model built for training and generating the masking value determination model.
In practical applications, the initial model can generate at least one echo signal masking value by processing the input sample far-end signal spectrum, sample microphone signal spectrum and sample linearly filtered signal spectrum. There is a difference between the at least one echo signal masking value output by the initial model and the at least one sample echo signal masking value included in the selected sample.
Step L3: calculate, using a preset loss function, the degree of difference between the at least one echo signal masking value output by the initial model and the at least one sample echo signal masking value included in the selected sample.
The above loss function may include at least one of the following classes of loss functions: 0-1 loss function, absolute value loss function, squared loss function, exponential loss function, logarithmic loss function, and the like.
Step L4: adjust the model parameters of the initial model according to the calculated degree of difference.
In some scenarios, the executing body that trains the masking value determination model may use the BP (Back Propagation) algorithm, the GD (Gradient Descent) algorithm, or the like to adjust the model parameters of the initial model.
Step L5: in response to a preset training end condition being reached, use the trained initial model as the masking value determination model.
The above training end condition may include at least one of the following: the training time exceeds a preset duration, the number of training iterations exceeds a preset number, or the calculated degree of difference is less than or equal to a preset difference threshold.
Step L6: in response to the above training end condition not being reached, continue to perform steps L1 to L5.
In practical applications, training the initial model with a sufficient number of samples can improve the computational accuracy and speed of the finally generated masking value determination model. Therefore, training the initial model with the samples in the sample set can improve the accuracy and speed with which the finally generated masking value determination model calculates echo signal masking values. Further, the accuracy and speed with which the first terminal calculates the echo signal masking value of at least one frequency point in the linearly filtered signal spectrum can be improved.
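Steps L1 to L6 can be illustrated with a deliberately tiny stand-in for the initial model: a single linear layer with a sigmoid that maps three per-frequency-point spectral magnitudes (far-end, microphone, linearly filtered) to a masking value, trained with the squared loss and plain gradient descent. All data, sizes and hyperparameters below are assumptions made for the sketch; the patent's initial model is a neural network with the structure described later:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the sample set: 256 sample frequency points, each with
# 3 spectral features, labelled with a sample echo signal masking value.
X = rng.random((256, 3))
true_w = np.array([0.2, -0.4, 0.8])
y = 1.0 / (1.0 + np.exp(-(X @ true_w)))       # sample masking values (expected output)

w = np.zeros(3)                               # initial model parameters
lr = 0.5
losses = []
for _ in range(200):                          # steps L1-L6, looped to the end condition
    pred = 1.0 / (1.0 + np.exp(-(X @ w)))     # step L2: model output
    err = pred - y
    losses.append(float(np.mean(err ** 2)))   # step L3: squared (MSE) loss
    grad = X.T @ (err * pred * (1 - pred)) * (2 / len(X))  # backpropagated gradient
    w -= lr * grad                            # step L4: gradient descent update
```

The loss decreases over the iterations, mirroring how training on enough samples improves the masking value determination model's accuracy.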
In some embodiments, the masking value determination model includes a spectrum separation structure. Based on the processing of the far-end signal spectrum, the microphone signal spectrum and the linearly filtered signal spectrum input into the masking value determination model, the spectrum separation structure fits the first near-end signal spectrum and the remaining signal spectrum contained in the linearly filtered signal spectrum.
It can be seen that the input of the spectrum separation structure includes the far-end signal spectrum, the microphone signal spectrum and the linearly filtered signal spectrum input into the masking value determination model, and its output includes the fitted first near-end signal spectrum and remaining signal spectrum contained in the linearly filtered signal spectrum.
In practical applications, a certain amount of echo signal spectrum may still be superimposed in the fitted first near-end signal spectrum. Correspondingly, a certain amount of near-end signal spectrum may still be superimposed in the fitted remaining signal spectrum.
Thus, in the process of determining the echo signal masking value of at least one frequency point in the linearly filtered signal spectrum, the masking value determination model can use the spectrum separation structure it contains to fit the first near-end signal spectrum and the remaining signal spectrum contained in the linearly filtered signal spectrum.
In some embodiments, the above spectrum separation structure includes multiple spectrum separation blocks connected in sequence. The first spectrum separation block fits the first near-end signal spectrum and the remaining signal spectrum contained in the linearly filtered signal spectrum based on the processing of the input far-end signal spectrum, microphone signal spectrum and linearly filtered signal spectrum. Each spectrum separation block of the second order or higher fits the first near-end signal spectrum and the remaining signal spectrum contained in the linearly filtered signal spectrum based on the processing of the input spectra and output spectra of the preceding spectrum separation block.
It can be seen that, for the first spectrum separation block, its input includes the far-end signal spectrum, the microphone signal spectrum and the linearly filtered signal spectrum input into the above spectrum separation structure, and its output includes the fitted first near-end signal spectrum and remaining signal spectrum contained in the linearly filtered signal spectrum. For a spectrum separation block of the second order or higher, its input includes the input spectra and output spectra of the preceding spectrum separation block, and its output includes the fitted first near-end signal spectrum and remaining signal spectrum contained in the linearly filtered signal spectrum.
As an example, the spectrum separation structure shown in FIG. 3 includes spectrum separation block A, spectrum separation block B and spectrum separation block C. For spectrum separation block A, its input includes the far-end signal spectrum 301, the microphone signal spectrum 302 and the linearly filtered signal spectrum 303, and its output includes the first near-end signal spectrum 304 and the remaining signal spectrum 305. For spectrum separation block B, its input includes the far-end signal spectrum 301, the microphone signal spectrum 302, the linearly filtered signal spectrum 303, the first near-end signal spectrum 304 and the remaining signal spectrum 305, and its output includes the first near-end signal spectrum 306 and the remaining signal spectrum 307. For spectrum separation block C, its input includes the far-end signal spectrum 301, the microphone signal spectrum 302, the linearly filtered signal spectrum 303, the first near-end signal spectrum 304, the remaining signal spectrum 305, the first near-end signal spectrum 306 and the remaining signal spectrum 307, and its output includes the first near-end signal spectrum 308 and the remaining signal spectrum 309.
It is easy to see that the total number of spectra input into the N-th spectrum separation block is 2N+1, where N is an integer greater than or equal to 1.
It can be seen that each subsequent spectrum separation block can fit the first near-end signal spectrum and the remaining signal spectrum contained in the linearly filtered signal spectrum on the basis of comprehensively considering the input and output of the preceding spectrum separation block. Thus, later spectrum separation blocks can fit the first near-end signal spectrum and the remaining signal spectrum more accurately. Accordingly, by having multiple spectrum separation blocks successively process the far-end signal spectrum, the microphone signal spectrum and the linearly filtered signal spectrum input into the spectrum separation structure, the first near-end signal spectrum and the remaining signal spectrum contained in the linearly filtered signal spectrum are fitted more accurately.
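The input counts follow directly from the cascade: each block receives its predecessor's inputs plus the predecessor's two outputs. A small helper (the function name is ours, for illustration) makes the 2N+1 relation explicit:

```python
def spectra_into_block(n: int) -> int:
    """Total number of spectra fed into the n-th spectrum separation block.

    The first block takes 3 spectra (far-end, microphone, linearly filtered);
    each later block takes its predecessor's inputs plus the predecessor's
    2 outputs (a first near-end signal spectrum and a remaining signal spectrum),
    i.e. 2 more than the block before it.
    """
    return 2 * n + 1

# Blocks A, B, C in the FIG. 3 example correspond to n = 1, 2, 3.
print([spectra_into_block(n) for n in (1, 2, 3)])  # [3, 5, 7]
```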
In some embodiments, each spectrum separation block includes a first feature dimension-raising layer and a first feature compression layer. The first feature dimension-raising layer is used to raise the feature dimensionality of the spectra input into the spectrum separation block, and the first feature compression layer is used to perform feature compression over partial frequency bands on the spectra output by the first feature dimension-raising layer.
In practical applications, the partial frequency bands over which the first feature compression layers included in different spectrum separation blocks perform feature compression may be the same or different. In some scenarios, the bands handled by the first feature compression layers of different spectrum separation blocks may overlap. In practical applications, the width of the partial frequency bands over which the first feature compression layer of each spectrum separation block performs feature compression can be set according to specific requirements.
Thus, in a spectrum separation block, the first feature dimension-raising layer can first raise the feature dimensionality of the spectra input into the block, and the first feature compression layer can then compress the features of the dimension-raised spectra. In practical applications, feature compression can reduce the noise features contained in the spectrum; moreover, first raising the feature dimensionality of the spectrum and then compressing the features of the dimension-raised spectrum can reduce the noise features contained in the spectrum more accurately.
Further, with the help of the first feature dimension-raising layer and the first feature compression layer, the spectrum separation block can be assisted to fit more accurately the first near-end signal spectrum and the remaining signal spectrum contained in the linearly filtered signal spectrum.
In some embodiments, the masking value determination model includes a spectrum synthesis layer. The spectrum synthesis layer is used to synthesize the first near-end signal spectrum and the remaining signal spectrum output by the spectrum separation structure into a second near-end signal spectrum.
It can be seen that, for the spectrum synthesis layer, its input includes the first near-end signal spectrum and the remaining signal spectrum output by the spectrum separation structure, and its output includes the second near-end signal spectrum.
The second near-end signal spectrum may be the spectrum formed after the near-end signal spectrum and the remaining signal spectrum input into the spectrum synthesis layer are synthesized.
In some scenarios, the first near-end signal spectrum and the remaining signal spectrum input into the spectrum synthesis layer can be synthesized into the second near-end signal spectrum according to corresponding weights. As an example, the input of the spectrum synthesis layer includes a first near-end signal spectrum F1 and a remaining signal spectrum F2; in this case, F1 and F2 can be synthesized into the second near-end signal spectrum according to the formula "a1×F1+a2×F2", where a1 is the weight corresponding to the first near-end signal spectrum F1 and a2 is the weight corresponding to the remaining signal spectrum F2. In some scenarios, the weight corresponding to the first near-end signal spectrum may include a weight corresponding to each frequency point in the first near-end signal spectrum, and the weight corresponding to the remaining signal spectrum may include a weight corresponding to each frequency point in the remaining signal spectrum. It should be noted that the weight corresponding to the first near-end signal spectrum and the weight corresponding to the remaining signal spectrum can be set according to actual requirements and are not specifically limited here.
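The weighted synthesis "a1×F1+a2×F2" can be sketched per frequency point as follows; the spectra and weights are illustrative assumptions, since the patent leaves the weight values to practical requirements:

```python
import numpy as np

# Two-bin illustrative spectra output by the spectrum separation structure.
F1 = np.array([1.0 + 0.0j, 0.5 + 0.5j])   # first near-end signal spectrum
F2 = np.array([0.2 + 0.0j, 0.1 - 0.1j])   # remaining signal spectrum

# Per-frequency-point weights (assumed values).
a1 = np.array([0.9, 0.8])
a2 = np.array([0.1, 0.2])

# Spectrum synthesis layer: a1*F1 + a2*F2, applied bin by bin.
second_near_end = a1 * F1 + a2 * F2
```

Giving each frequency point its own pair of weights lets the synthesis recover the near-end content still present in the remaining signal spectrum.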
Referring to the foregoing analysis, a certain amount of near-end signal spectrum may still be superimposed in the remaining signal spectrum output by the spectrum separation structure. By synthesizing, through the spectrum synthesis layer, the first near-end signal spectrum and the remaining signal spectrum output by the spectrum separation structure, the second near-end signal spectrum superimposed in the linearly filtered signal can be fitted more accurately.
In some embodiments, the masking value determination model includes a second feature compression layer. The second feature compression layer fits a third near-end signal spectrum by performing full-band feature compression on the second near-end signal spectrum output by the spectrum synthesis layer.
It can be seen that, for the second feature compression layer, its input includes the second near-end signal spectrum output by the spectrum synthesis layer, and its output includes the third near-end signal spectrum.
In practical applications, performing full-band feature compression on the second near-end signal spectrum means compressing the features of the second near-end signal spectrum over the entire frequency range.
Thus, performing full-band feature compression on the second near-end signal spectrum through the second feature compression layer can further reduce the echo signal spectrum superimposed in the second near-end signal spectrum.
In some embodiments, the first feature compression layer and the second feature compression layer are gated recurrent unit (GRU) layers.
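For reference, one step of a gated recurrent unit in its standard formulation, written out in NumPy. The dimensions and random weights are arbitrary assumptions for the sketch; an actual feature compression layer would have trained parameters and operate on spectral feature frames:

```python
import numpy as np

def gru_cell(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU step: combines the current input x with the intermediate
    state h carried over from earlier processing (cf. the note below on
    how the GRU mixes model input with model-generated intermediate data)."""
    sig = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = sig(Wz @ x + Uz @ h)                   # update gate
    r = sig(Wr @ x + Ur @ h)                   # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))   # candidate state
    return (1 - z) * h + z * h_tilde           # new hidden state

rng = np.random.default_rng(0)
d_in, d_h = 4, 8                               # illustrative sizes
params = [rng.standard_normal((d_h, d_in)) if i % 2 == 0
          else rng.standard_normal((d_h, d_h)) for i in range(6)]
h = np.zeros(d_h)
for _ in range(5):                             # process a short sequence of frames
    h = gru_cell(rng.standard_normal(d_in), h, *params)
```

Because the new state is a convex combination of the previous state and a tanh-bounded candidate, the hidden state stays bounded while still accumulating information across frames.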
In practical applications, a gated recurrent unit processes data by combining the input data of the model with the intermediate data generated by the model. Thus, with the first feature compression layer and the second feature compression layer, feature compression can be realized by combining the input spectra of the masking value determination model with the fitted spectra, thereby improving the accuracy of feature compression.
In some embodiments, the masking value determination model includes a fully connected layer. Based on the third near-end signal spectrum output by the second feature compression layer, the fully connected layer determines the echo signal masking value of at least one frequency point in the linearly filtered signal spectrum input into the masking value determination model.
Thus, by first reducing, through the second feature compression layer, the echo signal spectrum superimposed in the third near-end signal spectrum, and then determining, through the fully connected layer, the echo signal masking value of at least one frequency point in the linearly filtered signal spectrum, the echo signal masking value of the above at least one frequency point can be determined more accurately.
In some embodiments, the echo signal masking value is the ratio of the amplitude moduli, at the same frequency point, of the third near-end signal spectrum output by the second feature compression layer and the linearly filtered signal spectrum.
As an example, in the linearly filtered signal spectrum, the frequency point f1 corresponds to an amplitude m1, and in the third near-end signal spectrum output by the second feature compression layer, the frequency point f1 corresponds to an amplitude m2. In this case, the echo signal masking value of the frequency point f1 may be the ratio of the modulus of m2 to the modulus of m1.
In practical applications, the amplitude corresponding to a frequency point in a spectrum may be a complex number.
In some embodiments, the first terminal may perform the above step 101 in the following manner.
In the first step, a short-time Fourier transform is performed on the microphone signal and the far-end signal respectively to generate the microphone signal spectrum and the far-end signal spectrum.
In the second step, the far-end signal spectrum is input into a linear filter to obtain a predicted echo signal spectrum.
The predicted echo signal spectrum may be the linear echo signal spectrum predicted by the linear filter.
In the third step, the predicted echo signal spectrum is removed from the microphone signal spectrum to generate the linearly filtered signal spectrum.
In practical applications, the spectrum extracted by the short-time Fourier transform has high stability. Therefore, performing a short-time Fourier transform on the far-end signal helps the linear filter predict the linear echo signal spectrum, which in turn helps generate the linearly filtered signal spectrum.
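These three steps can be illustrated with a per-frequency-point adaptive linear filter. The sketch below assumes a single complex gain per bin and an NLMS-style update; the patent does not prescribe a particular linear filter, so this is merely one common choice, with synthetic data and no near-end speech:

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames, eps, mu = 200, 1e-8, 0.5

# One frequency point; the true echo path is a single complex gain per bin
# (an assumption for this sketch - real echo paths need multi-tap filters).
h_true = 0.7 - 0.3j
X = rng.standard_normal(n_frames) + 1j * rng.standard_normal(n_frames)  # far-end spectrum
D = h_true * X                                   # microphone spectrum (echo only)

w = 0.0 + 0.0j                                   # linear filter estimate for this bin
errs = []
for x, d in zip(X, D):
    y = w * x                                    # predicted echo signal spectrum
    e = d - y                                    # linearly filtered signal spectrum
    errs.append(abs(e))
    w += mu * np.conj(x) * e / (abs(x) ** 2 + eps)  # NLMS update toward the echo path
```

As the filter converges toward the echo path, the residual (the linearly filtered spectrum) shrinks; in a real call the residual would instead converge to the near-end speech plus the non-linear and residual linear echo that the masking stage then removes.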
In some embodiments, the first terminal may perform the above step 103 in the following manner.
Specifically, for the linearly filtered signal spectrum, the amplitude of each of the at least one frequency point is multiplied by the corresponding echo signal masking value to generate the target near-end signal spectrum.
Thus, by multiplying the amplitude of at least one frequency point in the linearly filtered signal spectrum by the corresponding echo signal masking value, the non-linear echo signal spectrum and the residual linear echo signal spectrum superimposed in the linearly filtered signal spectrum can be removed.
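A minimal sketch of this per-frequency-point multiplication (the spectrum values and masking values below are illustrative assumptions):

```python
import numpy as np

# Three-bin linearly filtered signal spectrum (complex amplitudes).
filtered_spec = np.array([0.8 + 0.6j, 0.3 - 0.4j, 1.0 + 0.0j])

# Per-frequency-point echo signal masking values: 1.0 keeps a bin untouched,
# values below 1.0 attenuate residual/non-linear echo, 0.0 removes a bin.
mask_values = np.array([0.5, 1.0, 0.0])

# Multiplying each bin's amplitude by its masking value yields the
# target near-end signal spectrum.
target_spec = filtered_spec * mask_values
```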
With further reference to FIG. 4, as an implementation of the methods shown in the above figures, the present disclosure provides some embodiments of a sound signal processing apparatus. These apparatus embodiments correspond to the method embodiments shown in FIG. 1, and the apparatus can be applied to various electronic devices.
As shown in FIG. 4, the sound signal processing apparatus of this embodiment includes: a first generating unit 401, a determining unit 402, a second generating unit 403 and a converting unit 404. The first generating unit 401 is configured to perform, based on a far-end signal from a second terminal, linear filtering on the microphone signal spectrum of a microphone signal collected by a first terminal to generate a linearly filtered signal spectrum, where the microphone signal is the sound signal collected after the far-end signal is played. The determining unit 402 is configured to determine, based on the far-end signal spectrum, the microphone signal spectrum and the linearly filtered signal spectrum, an echo signal masking value of at least one frequency point in the linearly filtered signal spectrum. The second generating unit 403 is configured to mask the echo signal spectrum superimposed in the linearly filtered signal spectrum with the determined at least one echo signal masking value to determine a target near-end signal spectrum. The converting unit 404 is configured to convert the target near-end signal spectrum into the target near-end signal.
In this embodiment, for the specific processing of the first generating unit 401, the determining unit 402, the second generating unit 403 and the converting unit 404 of the sound signal processing apparatus, and the technical effects brought about by them, reference may be made to the related descriptions of step 101, step 102, step 103 and step 104 in the embodiment corresponding to FIG. 1, which will not be repeated here.
In some embodiments, the determining unit 402 is further configured to input the far-end signal spectrum, the microphone signal spectrum and the linearly filtered signal spectrum into a masking value determination model to obtain the echo signal masking value of at least one frequency point in the linearly filtered signal spectrum.
In some embodiments, the masking value determination model is trained and generated in the following manner: acquiring a sample set, where a sample in the sample set includes a sample far-end signal spectrum, a sample microphone signal spectrum, a sample linearly filtered signal spectrum, and a sample echo signal masking value of at least one frequency point in the sample linearly filtered signal spectrum; and taking the sample far-end signal spectrum, the sample microphone signal spectrum and the sample linearly filtered signal spectrum included in a sample selected from the sample set as the input of an initial model, and the at least one sample echo signal masking value included in the selected sample as the expected output of the initial model, to train and generate the masking value determination model.
In some embodiments, the masking value determination model includes a spectrum separation structure, where the spectrum separation structure fits, based on the processing of the far-end signal spectrum, the microphone signal spectrum and the linearly filtered signal spectrum input into the masking value determination model, the first near-end signal spectrum and the remaining signal spectrum contained in the linearly filtered signal spectrum.
In some embodiments, the spectrum separation structure includes multiple spectrum separation blocks connected in sequence, where the first spectrum separation block fits the first near-end signal spectrum and the remaining signal spectrum contained in the linearly filtered signal spectrum based on the processing of the input far-end signal spectrum, microphone signal spectrum and linearly filtered signal spectrum, and each spectrum separation block of the second order or higher fits the first near-end signal spectrum and the remaining signal spectrum contained in the linearly filtered signal spectrum based on the processing of the input spectra and output spectra of the preceding spectrum separation block.
In some embodiments, each spectrum separation block includes a first feature dimension-raising layer and a first feature compression layer, where the first feature dimension-raising layer is used to raise the feature dimensionality of the spectra input into the spectrum separation block, and the first feature compression layer is used to perform feature compression over partial frequency bands on the spectra output by the first feature dimension-raising layer.
In some embodiments, the masking value determination model includes a spectrum synthesis layer, where the spectrum synthesis layer is used to synthesize the first near-end signal spectrum and the remaining signal spectrum output by the spectrum separation structure into a second near-end signal spectrum.
In some embodiments, the masking value determination model includes a second feature compression layer, where the second feature compression layer fits a third near-end signal spectrum by performing full-band feature compression on the second near-end signal spectrum output by the spectrum synthesis layer.
In some embodiments, the masking value determination model includes a fully connected layer, where the fully connected layer determines, based on the third near-end signal spectrum output by the second feature compression layer, the echo signal masking value of at least one frequency point in the linearly filtered signal spectrum.
In some embodiments, the echo signal masking value is the ratio of the amplitude moduli, at the same frequency point, of the third near-end signal spectrum output by the second feature compression layer and the linearly filtered signal spectrum.
In some embodiments, the first feature compression layer and the second feature compression layer are gated recurrent unit layers.
In some embodiments, the first generating unit 401 is further configured to: perform a short-time Fourier transform on the microphone signal and the far-end signal respectively to generate the microphone signal spectrum and the far-end signal spectrum; input the far-end signal spectrum into a linear filter to obtain a predicted echo signal spectrum; and remove the predicted echo signal spectrum from the microphone signal spectrum to generate the linearly filtered signal spectrum.
In some embodiments, the second generating unit 403 is further configured to: for the linearly filtered signal spectrum, multiply the amplitude of each of the at least one frequency point by the corresponding echo signal masking value to generate the target near-end signal spectrum.
With further reference to FIG. 5, FIG. 5 shows an exemplary system architecture in which the sound signal processing method of some embodiments of the present disclosure may be applied.

As shown in FIG. 5, the system architecture may include a terminal 501 and a terminal 502. In practice, terminal 501 and terminal 502 may interact via a network. The network may include various connection types, such as wired or wireless communication links, or fiber-optic cables.

Various applications (Apps) may be installed on terminal 501 and terminal 502. For example, voice call applications may be installed on terminal 501 and terminal 502.

In practice, terminal 501 and terminal 502 may send the sound signals captured by their microphones to each other.

Terminal 501 and terminal 502 may be hardware or software. When terminal 501 and terminal 502 are hardware, they may be various electronic devices equipped with a microphone and a speaker, including but not limited to smartphones, tablet computers, laptop portable computers, desktop computers and the like. When terminal 501 and terminal 502 are software, they may be installed in the electronic devices listed above, and may be implemented as multiple pieces of software or software modules, or as a single piece of software or software module. No specific limitation is made here.

In some scenarios, terminal 501 may perform linear filtering, based on a far-end signal from terminal 502, on the microphone signal spectrum of a captured microphone signal, to generate a linearly filtered signal spectrum. Then, terminal 501 may determine, based on the far-end signal spectrum, the microphone signal spectrum and the linearly filtered signal spectrum, an echo signal masking value for at least one frequency bin of the linearly filtered signal spectrum. Further, terminal 501 may mask, using the at least one determined echo signal masking value, the echo signal spectrum superimposed on the linearly filtered signal spectrum, to determine a target near-end signal spectrum. Finally, terminal 501 may convert the target near-end signal spectrum into a target near-end signal.

It should be noted that the sound signal processing method provided by the embodiments of the present disclosure may be performed by terminal 501 or terminal 502; accordingly, the sound signal processing apparatus may be provided in terminal 501 or terminal 502.

It should be understood that the number of terminals in FIG. 5 is merely illustrative. Any number of terminals may be provided as required by the implementation.
Referring now to FIG. 6, it shows a schematic structural diagram of an electronic device (e.g., a terminal in FIG. 5) suitable for implementing some embodiments of the present disclosure. Terminals in some embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players) and vehicle-mounted terminals (e.g., vehicle navigation terminals), as well as fixed terminals such as digital TVs and desktop computers. The electronic device shown in FIG. 6 is merely an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
As shown in FIG. 6, the electronic device may include a processing apparatus (e.g., a central processing unit, a graphics processing unit, etc.) 601, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage apparatus 608 into a random access memory (RAM) 603. The RAM 603 also stores various programs and data required for the operation of the electronic device. The processing apparatus 601, the ROM 602 and the RAM 603 are connected to one another via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.

Generally, the following apparatuses may be connected to the I/O interface 605: an input apparatus 606 including, for example, a touch screen, a touch pad, a keyboard, a mouse, a camera, a microphone, an accelerometer, a gyroscope and the like; an output apparatus 607 including, for example, a liquid crystal display (LCD), a speaker, a vibrator and the like; a storage apparatus 608 including, for example, a magnetic tape, a hard disk and the like; and a communication apparatus 609. The communication apparatus 609 may allow the electronic device to communicate wirelessly or by wire with other devices to exchange data. Although FIG. 6 shows an electronic device with various apparatuses, it should be understood that implementing or providing all of the illustrated apparatuses is not required. More or fewer apparatuses may alternatively be implemented or provided. Each block shown in FIG. 6 may represent one apparatus, or may represent multiple apparatuses as needed.
In particular, according to some embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, some embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for performing the method shown in the flowchart. In such embodiments, the computer program may be downloaded and installed from a network via the communication apparatus 609, or installed from the storage apparatus 608, or installed from the ROM 602. When executed by the processing apparatus 601, the computer program performs the above functions defined in the methods of the embodiments of the present disclosure.

It should be noted that the computer-readable medium described in some embodiments of the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or any combination of the above. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In some embodiments of the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus or device. In some embodiments of the present disclosure, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, which carries computer-readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, and may send, propagate or transmit a program for use by or in connection with an instruction execution system, apparatus or device. Program code contained on a computer-readable medium may be transmitted by any suitable medium, including but not limited to: an electric wire, an optical cable, RF (radio frequency), or any suitable combination of the above.

In some implementations, the client and the server may communicate using any currently known or future-developed network protocol such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet) and a peer-to-peer network (e.g., an ad hoc peer-to-peer network), as well as any currently known or future-developed network.

The above computer-readable medium may be contained in the above electronic device, or may exist separately without being assembled into the electronic device. The above computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: perform linear filtering, based on a far-end signal from a second terminal, on the microphone signal spectrum of a microphone signal captured by a first terminal, to generate a linearly filtered signal spectrum, where the microphone signal is a sound signal captured after the far-end signal is played; determine, based on the far-end signal spectrum, the microphone signal spectrum and the linearly filtered signal spectrum, an echo signal masking value for at least one frequency bin of the linearly filtered signal spectrum; mask, using the at least one determined echo signal masking value, the echo signal spectrum superimposed on the linearly filtered signal spectrum, to generate a target near-end signal spectrum; and convert the target near-end signal spectrum into a target near-end signal.

Computer program code for carrying out operations of some embodiments of the present disclosure may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the accompanying drawings illustrate the possible architectures, functions and operations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code, which contains one or more executable instructions for implementing the specified logical functions. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.

The units described in some embodiments of the present disclosure may be implemented in software or in hardware. The name of a unit does not, in some cases, constitute a limitation of the unit itself; for example, the conversion unit may also be described as a unit that "converts the target near-end signal spectrum into a target near-end signal".

The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that can be used include: field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), and the like.

In the context of the present disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or any suitable combination of the above. More specific examples of the machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.

The above description is merely a description of some preferred embodiments of the present disclosure and of the technical principles employed. Those skilled in the art should understand that the scope of the disclosure involved in the embodiments of the present disclosure is not limited to technical solutions formed by the specific combinations of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalents without departing from the above disclosed concept — for example, technical solutions formed by replacing the above features with technical features having similar functions disclosed (but not limited to those disclosed) in the present disclosure.

In addition, although the operations are depicted in a particular order, this should not be understood as requiring that these operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, although several specific implementation details are contained in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features described in the context of separate embodiments may also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment may also be implemented in multiple embodiments separately or in any suitable sub-combination.

Although the subject matter has been described in language specific to structural features and/or methodological logical acts, it should be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are merely example forms of implementing the claims.

Claims (16)

  1. A sound signal processing method, applied to a first terminal, comprising:
    performing linear filtering, based on a far-end signal from a second terminal, on a microphone signal spectrum of a microphone signal captured by the first terminal, to generate a linearly filtered signal spectrum, wherein the microphone signal is a sound signal captured after the far-end signal is played;
    determining, based on a far-end signal spectrum, the microphone signal spectrum and the linearly filtered signal spectrum, an echo signal masking value for at least one frequency bin of the linearly filtered signal spectrum;
    masking, using the at least one determined echo signal masking value, an echo signal spectrum superimposed on the linearly filtered signal spectrum, to generate a target near-end signal spectrum;
    converting the target near-end signal spectrum into a target near-end signal.
  2. The method according to claim 1, wherein the determining, based on the far-end signal spectrum, the microphone signal spectrum and the linearly filtered signal spectrum, an echo signal masking value for at least one frequency bin of the linearly filtered signal spectrum comprises:
    inputting the far-end signal spectrum, the microphone signal spectrum and the linearly filtered signal spectrum into a masking value determination model to obtain the echo signal masking value for at least one frequency bin of the linearly filtered signal spectrum.
  3. The method according to claim 2, wherein the masking value determination model is trained and generated by:
    acquiring a sample set, wherein a sample in the sample set comprises a sample far-end signal spectrum, a sample microphone signal spectrum, a sample linearly filtered signal spectrum, and a sample echo signal masking value for at least one frequency bin of the sample linearly filtered signal spectrum;
    taking the sample far-end signal spectrum, the sample microphone signal spectrum and the sample linearly filtered signal spectrum of a sample selected from the sample set as an input of an initial model, taking the at least one sample echo signal masking value of the selected sample as an expected output of the initial model, and training the initial model to generate the masking value determination model.
  4. The method according to claim 2, wherein the masking value determination model comprises a spectrum separation structure, and the spectrum separation structure fits, by processing the far-end signal spectrum, the microphone signal spectrum and the linearly filtered signal spectrum input into the masking value determination model, a first near-end signal spectrum and a residual signal spectrum contained in the linearly filtered signal spectrum.
  5. The method according to claim 4, wherein the spectrum separation structure comprises a plurality of sequentially connected spectrum separation blocks, the first spectrum separation block fits the first near-end signal spectrum and the residual signal spectrum contained in the linearly filtered signal spectrum by processing the input far-end signal spectrum, microphone signal spectrum and linearly filtered signal spectrum, and each of the second and subsequent spectrum separation blocks fits the first near-end signal spectrum and the residual signal spectrum contained in the linearly filtered signal spectrum by processing the input spectrum and the output spectrum of the preceding spectrum separation block.
  6. The method according to claim 5, wherein each spectrum separation block comprises a first feature dimension-raising layer and a first feature compression layer, the first feature dimension-raising layer is configured to raise the feature dimension of the spectrum input into the spectrum separation block, and the first feature compression layer is configured to perform sub-band feature compression on the spectrum output by the first feature dimension-raising layer.
  7. The method according to claim 2, wherein the masking value determination model comprises a spectrum synthesis layer, and the spectrum synthesis layer is configured to synthesize the first near-end signal spectrum and the residual signal spectrum output by the spectrum separation structure into a second near-end signal spectrum.
  8. The method according to claim 2, wherein the masking value determination model comprises a second feature compression layer, and the second feature compression layer fits a third near-end signal spectrum by performing full-band feature compression on the second near-end signal spectrum output by the spectrum synthesis layer.
  9. The method according to claim 2, wherein the masking value determination model comprises a fully connected layer, and the fully connected layer determines, based on the third near-end signal spectrum output by the second feature compression layer, the echo signal masking value for at least one frequency bin of the linearly filtered signal spectrum.
  10. The method according to claim 9, wherein the echo signal masking value is the ratio of the magnitude modulus of the third near-end signal spectrum output by the second feature compression layer to that of the linearly filtered signal spectrum at the same frequency bin.
  11. The method according to claim 6 or 8, wherein the first feature compression layer and the second feature compression layer are gated recurrent unit layers.
  12. The method according to claim 1, wherein the performing linear filtering, based on the far-end signal from the second terminal, on the microphone signal spectrum of the microphone signal captured by the first terminal, to generate the linearly filtered signal spectrum comprises:
    performing a short-time Fourier transform on the microphone signal and on the far-end signal respectively, to generate the microphone signal spectrum and the far-end signal spectrum;
    inputting the far-end signal spectrum into a linear filter to obtain a predicted echo signal spectrum;
    removing the predicted echo signal spectrum from the microphone signal spectrum to generate the linearly filtered signal spectrum.
  13. The method according to any one of claims 1-12, wherein the masking, using the at least one determined echo signal masking value, the echo signal spectrum superimposed on the linearly filtered signal spectrum, to generate the target near-end signal spectrum comprises:
    for the linearly filtered signal spectrum, multiplying the magnitude at each of the at least one frequency bin by the corresponding echo signal masking value to generate the target near-end signal spectrum.
  14. A sound signal processing apparatus, applied to a first terminal, comprising:
    a first generation unit, configured to perform linear filtering, based on a far-end signal from a second terminal, on a microphone signal spectrum of a microphone signal captured by the first terminal, to generate a linearly filtered signal spectrum, wherein the microphone signal is a sound signal captured after the far-end signal is played;
    a determination unit, configured to determine, based on a far-end signal spectrum, the microphone signal spectrum and the linearly filtered signal spectrum, an echo signal masking value for at least one frequency bin of the linearly filtered signal spectrum;
    a second generation unit, configured to mask, using the at least one determined echo signal masking value, an echo signal spectrum superimposed on the linearly filtered signal spectrum, to determine a target near-end signal spectrum;
    a conversion unit, configured to convert the target near-end signal spectrum into a target near-end signal.
  15. An electronic device, comprising:
    one or more processors;
    a storage apparatus for storing one or more programs,
    wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1-13.
  16. A computer-readable medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the method according to any one of claims 1-13.
PCT/CN2022/081979 2021-04-26 2022-03-21 Sound signal processing method and apparatus, and electronic device WO2022227932A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110456216.9A 2021-04-26 2021-04-26 Sound signal processing method and apparatus, and electronic device
CN202110456216.9 2021-04-26

Publications (1)

Publication Number Publication Date
WO2022227932A1 true WO2022227932A1 (zh) 2022-11-03

Family

ID=76926295


Country Status (2)

Country Link
CN (1) CN113179354B (zh)
WO (1) WO2022227932A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116612778B * 2023-07-18 2023-11-14 腾讯科技(深圳)有限公司 Echo and noise suppression method, related apparatus and medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109841206A * 2018-08-31 2019-06-04 大象声科(深圳)科技有限公司 Deep-learning-based echo cancellation method
US20190222691A1 (en) * 2018-01-18 2019-07-18 Knowles Electronics, Llc Data driven echo cancellation and suppression
CN111951819A * 2020-08-20 2020-11-17 Beijing ByteDance Network Technology Co., Ltd. Echo cancellation method, apparatus and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6833616B2 * 2017-05-29 2021-02-24 株式会社トランストロン Echo suppression device, echo suppression method and echo suppression program
CN111341336B * 2020-03-16 2023-08-08 Beijing ByteDance Network Technology Co., Ltd. Echo cancellation method, apparatus, terminal device and medium


Also Published As

Publication number Publication date
CN113179354A (zh) 2021-07-27
CN113179354B (zh) 2023-10-10


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 22794408; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 22794408; Country of ref document: EP; Kind code of ref document: A1)