CN115472176A - Voice signal enhancement method and device - Google Patents


Info

Publication number
CN115472176A
Authority
CN
China
Prior art keywords
signal
microphone signal
microphone
enhancement
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211110479.5A
Other languages
Chinese (zh)
Inventor
韩润强
赵昊然
吕新亮
李楠
郑羲光
张晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202211110479.5A
Publication of CN115472176A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M3/00 Automatic or semi-automatic exchanges
    • H04M3/42 Systems providing special services or facilities to subscribers
    • H04M3/56 Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities
    • H04M3/568 Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities audio processing specific to telephonic conferencing, e.g. spatial distribution, mixing of participants
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L2021/02082 Noise filtering the noise being echo, reverberation of the speech
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02165 Two microphones, one receiving mainly the noise signal and the other one mainly the speech signal

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The disclosure relates to a speech signal enhancement method and device. The speech signal enhancement method comprises the following steps: selecting a first microphone signal from at least one original microphone signal based on audio quality data of the at least one original microphone signal; inputting the first microphone signal and a reference signal into an echo cancellation model to obtain a second microphone signal, wherein the reference signal is an original signal corresponding to a predetermined audio signal in the first microphone signal, and the predetermined audio signal is an audio signal played by a player; and obtaining a target enhanced speech signal based on the second microphone signal.

Description

Voice signal enhancement method and device
Technical Field
The present disclosure relates to the field of audio processing, and in particular, to a method and an apparatus for enhancing a speech signal.
Background
In a multi-person meeting, the sound pickup of an ordinary computer or mobile phone is usually unsatisfactory: the device's pickup is designed for close-range use, so at longer distances the captured audio quality is low, participants at the opposite end have difficulty hearing clearly, and the meeting experience is poor. Some conference microphone hardware on the market uses traditional signal processing: directional microphones point in different directions, so that when a person in the conference room speaks, the microphone pointing toward that person captures the microphone signal with the highest audio quality, and a microphone selection algorithm selects that microphone. Other systems employ additional extension microphones or cascade multiple pieces of conference microphone hardware. However, such schemes merely select a microphone signal by audio quality and transmit it to the opposite end as the speaker's voice, so the speech quality of the speaker cannot be effectively improved.
Disclosure of Invention
The present disclosure provides a method and an apparatus for enhancing a speech signal, so as to at least solve the problem that the speech quality of a speaker cannot be effectively improved in the related art.
According to a first aspect of the embodiments of the present disclosure, there is provided a speech signal enhancement method, including: selecting a first microphone signal from at least one original microphone signal according to audio quality data of the at least one original microphone signal; inputting the first microphone signal and a reference signal into an echo cancellation model to obtain a second microphone signal, wherein the reference signal is an original signal corresponding to a predetermined audio signal in the first microphone signal, and the predetermined audio signal is an audio signal played by a player; and obtaining a target enhanced speech signal based on the second microphone signal.
Optionally, selecting a first microphone signal from the at least one original microphone signal based on the audio quality data of the at least one original microphone signal comprises: acquiring at least one original microphone signal; respectively carrying out voice enhancement processing on at least one original microphone signal to obtain at least one enhanced microphone signal; a target enhancement microphone signal is selected from the at least one enhancement microphone signal according to the audio quality data of the at least one enhancement microphone signal, and an original microphone signal corresponding to the target enhancement microphone signal is determined as the first microphone signal.
Optionally, the speech enhancement processing includes at least one of linear echo cancellation processing, non-linear echo cancellation processing, and noise reduction processing.
Optionally, inputting the first microphone signal and the reference signal into the echo cancellation model to obtain the second microphone signal includes: inputting the first microphone signal, a third microphone signal, and the reference signal into the echo cancellation model to obtain the second microphone signal, wherein the third microphone signal is a microphone signal obtained by performing speech enhancement processing on the first microphone signal.
Optionally, in a case where the speech enhancement processing includes linear echo cancellation processing, the third microphone signal is a microphone signal of the first microphone signal after the linear echo cancellation processing.
Optionally, deriving the target enhanced speech signal based on the second microphone signal comprises: taking the second microphone signal as a target enhanced speech signal; or gain processing is carried out on the second microphone signal to obtain the target enhanced voice signal.
Optionally, inputting the first microphone signal and the reference signal into the echo cancellation model to obtain the second microphone signal includes: acquiring frequency domain signals of the first microphone signal and of the reference signal, respectively; performing logarithmic processing on the frequency domain signals to obtain processed frequency domain signals; and inputting the processed frequency domain signals into the echo cancellation model to obtain the second microphone signal.
According to a second aspect of the embodiments of the present disclosure, there is provided a speech signal enhancement apparatus including: a microphone signal acquisition unit configured to select a first microphone signal from at least one original microphone signal according to audio quality data of the at least one original microphone signal; an enhancement processing unit configured to input the first microphone signal and a reference signal into an echo cancellation model to obtain a second microphone signal, wherein the reference signal is an original signal corresponding to a predetermined audio signal in the first microphone signal, and the predetermined audio signal is an audio signal played by a player; and an enhanced speech signal acquisition unit configured to obtain a target enhanced speech signal based on the second microphone signal.
Optionally, the microphone signal acquisition unit is further configured to: acquire at least one original microphone signal; perform speech enhancement processing on each of the at least one original microphone signal to obtain at least one enhanced microphone signal; select a target enhanced microphone signal from the at least one enhanced microphone signal according to the audio quality data of the at least one enhanced microphone signal; and determine the original microphone signal corresponding to the target enhanced microphone signal as the first microphone signal.
Optionally, the speech enhancement processing comprises at least one of linear echo cancellation processing, non-linear echo cancellation processing and noise reduction processing.
Optionally, the enhancement processing unit is further configured to input the first microphone signal, a third microphone signal and a reference signal into an echo cancellation model to obtain a second microphone signal, where the third microphone signal is a microphone signal obtained by performing speech enhancement processing on the first microphone signal.
Optionally, the third microphone signal is a microphone signal of the first microphone signal after linear echo cancellation processing.
Optionally, the enhanced speech signal obtaining unit is further configured to take the second microphone signal as the target enhanced speech signal; or gain processing is carried out on the second microphone signal to obtain the target enhanced voice signal.
Optionally, the enhancement processing unit is further configured to obtain frequency domain signals of the first microphone signal and the reference signal, respectively; carrying out logarithmic processing on the frequency domain signal to obtain a processed frequency domain signal; and inputting the processed frequency domain signal into an echo cancellation model to obtain a second microphone signal.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to execute the instructions to implement a speech signal enhancement method according to the present disclosure.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium storing instructions which, when executed by at least one processor, cause the at least one processor to perform the speech signal enhancement method according to the present disclosure.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product comprising computer instructions which, when executed by a processor, implement a speech signal enhancement method according to the present disclosure.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
According to the speech signal enhancement method and device, a first microphone signal is selected according to audio quality data of at least one original microphone signal; the first microphone signal is then input into an echo cancellation model together with the original signal corresponding to the predetermined audio signal in the first microphone signal, to obtain a second microphone signal, from which a target enhanced speech signal can be obtained. Because echo cancellation is performed by the added echo cancellation model, that is, by introducing deep learning, the speech quality of the enhanced microphone signal, and therefore of the speaker, can be effectively improved. The present disclosure thus solves the problem that the related art cannot effectively improve the speech quality of the speaker.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 is a schematic diagram illustrating an implementation scenario of a speech signal enhancement method according to an exemplary embodiment of the present disclosure;
FIG. 2 is a flow diagram illustrating a method of speech signal enhancement according to an exemplary embodiment;
FIG. 3 is a schematic diagram illustrating a conference microphone in accordance with an exemplary embodiment;
FIG. 4 is a diagram illustrating an echo cancellation model according to an exemplary embodiment;
FIG. 5 is a schematic diagram illustrating a microphone processing system in accordance with an exemplary embodiment;
FIG. 6 is a block diagram illustrating a speech signal enhancement apparatus according to an exemplary embodiment;
FIG. 7 is a block diagram of an electronic device 700 according to an embodiment of the disclosure.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The embodiments described in the following examples do not represent all embodiments consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
Here, the expression "at least one of the items" in the present disclosure covers three parallel cases: "any one of the items", "a combination of any several of the items", and "all of the items". For example, "including at least one of A and B" covers the following three parallel cases: (1) including A; (2) including B; and (3) including A and B. As another example, "performing at least one of step one and step two" covers the following three parallel cases: (1) performing step one; (2) performing step two; and (3) performing step one and step two.
The present disclosure provides a speech signal enhancement method, which can effectively improve the speech quality of a speaker, and is described below by taking a video conference as an example.
Fig. 1 is a schematic diagram illustrating an implementation scenario of a speech signal enhancement method according to an exemplary embodiment of the present disclosure. As shown in Fig. 1, the implementation scenario includes a server 100, a user terminal 110, and a user terminal 120. The number of user terminals is not limited to two, and the user terminals include, but are not limited to, mobile phones and personal computers. A user terminal may have a video application installed. The server may be a single server, a server cluster composed of several servers, a cloud computing platform, or a virtualization center.
An application on the user terminal 110 or the user terminal 120 acquires at least one original microphone signal through a microphone and then selects a first microphone signal from the at least one original microphone signal according to audio quality data of the at least one original microphone signal. The first microphone signal and a reference signal are input into an echo cancellation model to obtain a second microphone signal; a target enhanced speech signal, namely the speech signal of the speaker, is obtained based on the second microphone signal and transmitted to the opposite end. Here, the reference signal is the original signal corresponding to a predetermined audio signal in the first microphone signal, and the predetermined audio signal is an audio signal played by a player. It should be noted that the user terminal 110 and the user terminal 120 may perform this task independently, or the server 100 may provide data services for them, which is not limited in this disclosure.
When the server 100 provides data services for the application on the user terminal 110 and/or the user terminal 120, the application acquires at least one original microphone signal through a microphone and sends it to the server 100. The server 100 selects a first microphone signal from the at least one original microphone signal according to audio quality data of the at least one original microphone signal, and inputs the first microphone signal and a reference signal into an echo cancellation model to obtain a second microphone signal. A target enhanced speech signal, namely the speech signal of the speaker, is obtained based on the second microphone signal and transmitted to the opposite end. Here, the reference signal is the original signal corresponding to a predetermined audio signal in the first microphone signal, and the predetermined audio signal is an audio signal played by a player.
Hereinafter, a voice signal enhancement method and apparatus according to an exemplary embodiment of the present disclosure will be described in detail with reference to the accompanying drawings.
Fig. 2 is a flowchart illustrating a speech signal enhancement method according to an exemplary embodiment, and as shown in fig. 2, the speech signal enhancement method includes the steps of:
In step S201, a first microphone signal is selected from the at least one original microphone signal according to the audio quality data of the at least one original microphone signal. In this step, the microphone signal with the best audio quality among the at least one original microphone signal may be selected as the first microphone signal. The audio quality may be determined from the audio quality data, which may include, but is not limited to, signal-to-noise ratio and signal energy.
According to an exemplary embodiment of the present disclosure, selecting a first microphone signal from at least one original microphone signal according to audio quality data of the at least one original microphone signal may include: acquiring at least one original microphone signal; performing speech enhancement processing on each of the at least one original microphone signal to obtain at least one enhanced microphone signal; selecting a target enhanced microphone signal from the at least one enhanced microphone signal according to the audio quality data of the at least one enhanced microphone signal; and determining the original microphone signal corresponding to the target enhanced microphone signal as the first microphone signal. According to this embodiment, speech enhancement processing is performed on the original microphone signals before the first microphone signal is selected, which reduces the useless components in the microphone signals and improves the audio quality of all the original microphone signals as a whole. Selecting among the enhanced microphone signals therefore yields a first microphone signal that contains richer speaker speech.
For example, taking an audio conference as an example, the at least one original microphone signal may be acquired by conference microphone hardware. Fig. 3 is a schematic diagram of a conference microphone according to an exemplary embodiment. As shown in Fig. 3, the conference microphone hardware may include two parts. One part is a main microphone array, which carries a circular microphone array of three or more microphones and also has a loudspeaker, i.e., the player, for playing the sound from the opposite end. The other part is an extension microphone, which can optionally work alone or in conjunction with one or more microphones; the disclosure is not limited in this respect. After the at least one original microphone signal is acquired by the conference microphone shown in Fig. 3, some of the useless components in it may be removed, for example by speech enhancement processing (removing noise signals, echo signals, and the like), to obtain at least one enhanced microphone signal. According to the audio quality data of the at least one enhanced microphone signal, an enhanced microphone signal with high audio quality is selected as the target enhanced microphone signal, and the original microphone signal corresponding to the target enhanced microphone signal is determined as the first microphone signal.
According to an exemplary embodiment of the present disclosure, the speech enhancement processing includes at least one of linear echo cancellation processing, nonlinear echo cancellation processing, and noise reduction processing. According to the embodiment, the voice enhancement processing can be quickly and conveniently carried out through echo cancellation and noise reduction processing, and useless signals can be removed.
For example, still taking an audio conference as an example, each microphone signal acquired by the conference microphone may be subjected to linear echo cancellation processing, nonlinear echo cancellation processing, and noise reduction processing. It should be noted that the three processes may each be performed alone, all three may be performed together, or any two of them may be performed together, which is not limited in this disclosure. For example, when all three are performed, conventional linear echo cancellation (e.g., conventional Acoustic Echo Cancellation, abbreviated as AEC), nonlinear echo cancellation (e.g., Non-Linear Processing, abbreviated as NLP), and conventional noise reduction (e.g., Noise Suppression, abbreviated as NS) may be applied, and the order of the three processes is not limited in this disclosure. After the echo and noise are removed, a microphone with high audio quality may be selected based on the processed microphone signals; for example, the corresponding microphone index may be obtained, and the original microphone signal corresponding to that index is then taken as the first microphone signal.
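The enhance-then-select flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosure's implementation: `select_first_microphone`, `toy_enhance`, and `signal_energy` are hypothetical names, the enhancement is a placeholder noise gate standing in for AEC/NLP/NS, and signal energy is used as the quality score.

```python
from typing import Callable, List, Sequence, Tuple

Signal = Sequence[float]


def signal_energy(signal: Signal) -> float:
    """Mean squared amplitude, used here as a stand-in audio quality score."""
    return sum(x * x for x in signal) / max(len(signal), 1)


def select_first_microphone(
    originals: Sequence[Signal],
    enhance: Callable[[Signal], Signal],
    quality: Callable[[Signal], float] = signal_energy,
) -> Tuple[int, Signal]:
    """Score the enhanced signals, but return the matching ORIGINAL signal."""
    if not originals:
        raise ValueError("at least one original microphone signal is required")
    enhanced = [enhance(sig) for sig in originals]
    best_index = max(range(len(enhanced)), key=lambda i: quality(enhanced[i]))
    return best_index, originals[best_index]


def toy_enhance(signal: Signal) -> List[float]:
    """Placeholder for AEC/NLP/NS (assumption): zero out low-level samples."""
    return [x if abs(x) > 0.05 else 0.0 for x in signal]


mics = [
    [0.01, 0.02, -0.01, 0.03],   # mostly background noise
    [0.4, -0.5, 0.6, -0.3],      # close to the speaker
    [0.1, 0.04, -0.2, 0.02],     # farther from the speaker
]
index, first_mic = select_first_microphone(mics, toy_enhance)
print(index)
```

Returning the index alongside the signal mirrors the text: the enhanced signals are used only for ranking, while the unprocessed signal at the winning index is what goes on to the echo cancellation model.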
According to an exemplary embodiment of the present disclosure, selecting a first microphone signal from at least one original microphone signal in dependence of audio quality data of the at least one original microphone signal comprises: acquiring a signal-to-noise ratio of at least one original microphone signal according to the audio quality data of the at least one original microphone signal; the original microphone signal with the largest signal-to-noise ratio is determined as the first microphone signal. According to the embodiment, the microphone signals with high audio quality can be determined quickly and accurately by comparing the signal-to-noise ratios of the microphone signals.
According to an exemplary embodiment of the present disclosure, selecting a target enhancement microphone signal from at least one enhancement microphone signal in dependence of audio quality data of the at least one enhancement microphone signal comprises: acquiring a signal-to-noise ratio of the at least one enhanced microphone signal according to the audio quality data of the at least one enhanced microphone signal; and determining the enhanced microphone signal with the largest signal-to-noise ratio as the target enhanced microphone signal. According to the embodiment, the microphone signals with high audio quality can be determined quickly and accurately by comparing the signal-to-noise ratios of the microphone signals.
For example, after the echo and noise are removed, the signal-to-noise ratio of each processed microphone signal may be determined, and the microphone signal with the highest signal-to-noise ratio is taken as the microphone signal with the highest audio quality. It should be noted that using the signal-to-noise ratio to find the microphone signal with the highest audio quality is only one option; signal energy, or other schemes, may also be used, which is not limited in this disclosure.
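As a rough illustration of ranking microphone signals by signal-to-noise ratio, the sketch below estimates SNR without a separate noise recording. The disclosure does not say how the SNR is computed; treating the quietest frames as the noise floor is an assumption made here, and `estimate_snr_db`, `pick_highest_snr`, `make_mic`, the frame length, and `noise_fraction` are all hypothetical.

```python
import math
import random
from typing import List, Sequence


def frame_energies(signal: Sequence[float], frame_len: int) -> List[float]:
    """Mean squared amplitude of each non-overlapping frame."""
    n_frames = len(signal) // frame_len
    return [
        sum(x * x for x in signal[i * frame_len:(i + 1) * frame_len]) / frame_len
        for i in range(n_frames)
    ]


def estimate_snr_db(signal: Sequence[float], frame_len: int = 160,
                    noise_fraction: float = 0.2) -> float:
    """Assumed estimator: the quietest frames approximate the noise power."""
    energies = sorted(frame_energies(signal, frame_len))
    if not energies:
        raise ValueError("signal is shorter than one frame")
    n_noise = max(1, int(len(energies) * noise_fraction))
    noise_power = sum(energies[:n_noise]) / n_noise
    total_power = sum(energies) / len(energies)
    speech_power = max(total_power - noise_power, 0.0)
    eps = 1e-12  # avoids log(0) and division by zero
    return 10.0 * math.log10((speech_power + eps) / (noise_power + eps))


def pick_highest_snr(signals: Sequence[Sequence[float]]) -> int:
    """Index of the signal with the largest estimated SNR."""
    return max(range(len(signals)), key=lambda i: estimate_snr_db(signals[i]))


def make_mic(noise_amp: float, rng: random.Random, n: int = 1600) -> List[float]:
    """Synthetic test signal: silence, then a 200 Hz burst, plus uniform noise."""
    burst = [0.5 * math.sin(2 * math.pi * 200 * t / 16000) if t >= n // 2 else 0.0
             for t in range(n)]
    return [b + rng.uniform(-noise_amp, noise_amp) for b in burst]


rng = random.Random(0)
mics = [make_mic(0.01, rng), make_mic(0.1, rng)]  # quiet room vs. noisy room
snrs = [estimate_snr_db(m) for m in mics]
best = pick_highest_snr(mics)
```

Swapping `estimate_snr_db` for a signal-energy score would give the alternative selection criterion the text mentions.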
Returning to Fig. 2, in step S202, the first microphone signal and a reference signal are input into the echo cancellation model to obtain a second microphone signal, where the reference signal is an original signal corresponding to a predetermined audio signal in the first microphone signal, and the predetermined audio signal is an audio signal played by the player. The echo cancellation model in this step may adopt a deep learning network. For example, Fig. 4 is a schematic diagram of an echo cancellation model according to an exemplary embodiment. As shown in Fig. 4, the first microphone signal and the reference signal are each subjected to a short-time Fourier transform (STFT) and converted into frequency-domain power spectrum signals, and the logarithm of the power spectrum signals is taken to obtain I_16 shown in Fig. 4. I_16 is input into a convolutional network and then passed through a GRU network and a Dense network to obtain an amplitude spectrum M_16speech (i.e., a mask) that removes echo and noise simultaneously. The amplitude spectrum M_16speech is multiplied by the original microphone signal corresponding to the index, an inverse short-time Fourier transform is performed, and the enhanced microphone signal is output. It should be noted that when the first microphone signal is collected, the audio signal played by the player is also picked up; this audio signal is generally transmitted from the opposite end. The reference signal is the original signal before it is played by the player: because the signal suffers some loss once it is played, the unplayed original signal is chosen as the reference signal. The original signal may be obtained through an internal recording interface or by copying, which is not limited in this disclosure.
According to an exemplary embodiment of the present disclosure, inputting the first microphone signal and the reference signal into the echo cancellation model to obtain the second microphone signal includes: inputting the first microphone signal, a third microphone signal, and the reference signal into the echo cancellation model to obtain the second microphone signal, wherein the third microphone signal is a microphone signal obtained by performing speech enhancement processing on the first microphone signal. According to this embodiment, the third microphone signal is added as a model input and assists the reference signal in performing echo cancellation on the first microphone signal, so that a microphone signal with a better enhancement effect can be obtained.
For example, the target enhanced microphone signal may also be input into the model, so that it assists the reference signal in echo-cancelling the first microphone signal. Specifically, the target enhanced microphone signal may be the first microphone signal after linear echo cancellation, after nonlinear echo cancellation, or after noise reduction, which is not limited in this disclosure.
According to an exemplary embodiment of the disclosure, the third microphone signal is the first microphone signal after linear echo cancellation processing. According to this embodiment, because the microphone signal after linear echo cancellation processing has relatively little distortion, it better assists the reference signal in echo-cancelling the first microphone signal, so that the enhancement effect is further improved. For example, the linear-AEC-processed signal may generally be preferred because its distortion is relatively small.
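The final stage of the pipeline in this step (multiplying a mask with the microphone spectrum and transforming back to the time domain) can be sketched with NumPy. This is an illustrative sketch only: the network that predicts the mask is omitted, the mask here is a hand-made array, and the frame length, hop size, Hann window, and reuse of the microphone phase are assumptions rather than details from the disclosure. The helper names `stft`, `istft`, and `apply_mask` are hypothetical.

```python
import numpy as np


def _hann(frame_len):
    """Periodic Hann window."""
    return np.hanning(frame_len + 1)[:-1]


def stft(x, frame_len=512, hop=256):
    """Windowed frames -> complex spectrum of shape (frames, frame_len // 2 + 1)."""
    x = np.asarray(x, dtype=float)
    window = _hann(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)


def istft(spec, frame_len=512, hop=256):
    """Weighted overlap-add inverse of `stft`."""
    window = _hann(frame_len)
    frames = np.fft.irfft(spec, n=frame_len, axis=1) * window
    out = np.zeros(hop * (spec.shape[0] - 1) + frame_len)
    norm = np.zeros_like(out)
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + frame_len] += frame
        norm[i * hop:i * hop + frame_len] += window ** 2
    # The floor only matters at the signal edges, where the window is near zero.
    return out / np.maximum(norm, 1e-8)


def apply_mask(mic, mask, frame_len=512, hop=256):
    """Scale each time-frequency bin of the microphone spectrum by the mask."""
    spec = stft(mic, frame_len, hop)
    if mask.shape != spec.shape:
        raise ValueError("mask must have shape (frames, frequency bins)")
    return istft(mask * spec, frame_len, hop)


t = np.arange(4096) / 16000.0
mic = 0.3 * np.sin(2 * np.pi * 440.0 * t)
n_frames, n_bins = stft(mic).shape
passthrough = apply_mask(mic, np.ones((n_frames, n_bins)))   # mask of ones
silenced = apply_mask(mic, np.zeros((n_frames, n_bins)))     # mask of zeros
```

A mask of ones leaves the interior of the signal unchanged and a mask of zeros silences it; a trained model would output values between these extremes per time-frequency bin.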
According to an exemplary embodiment of the present disclosure, inputting a first microphone signal and a reference signal into an echo cancellation model to obtain a second microphone signal includes: respectively acquiring frequency domain signals of a first microphone signal and a reference signal; carrying out logarithmic processing on the frequency domain signal to obtain a processed frequency domain signal; and inputting the processed frequency domain signal into an echo cancellation model to obtain a second microphone signal. According to the embodiment, before the model is input, the frequency domain signals of the first microphone signal and the reference signal are obtained, the logarithm of the frequency domain signals is taken, and the processed frequency domain signals are input into the model, so that part of processing operation of the model can be reduced, and the calculation amount of the model is reduced.
For example, the first microphone signal and the reference signal may each be subjected to a short-time Fourier transform (STFT) to be converted into frequency-domain power spectrum signals (i.e., the frequency-domain signals), and the logarithm of the frequency-domain power spectrum signals is then taken to obtain the processed frequency-domain signals.
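A minimal NumPy sketch of this preprocessing follows. The frame size, hop size, the small constant added before the logarithm, and the choice to concatenate the per-signal features along the frequency axis are all assumptions for illustration; the disclosure does not state how the features are arranged before entering the model.

```python
import numpy as np

def log_power_spectrum(x, n_fft=512, hop=128, eps=1e-10):
    """STFT -> power spectrum -> natural log; returns (frames, bins)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return np.log(power + eps)  # eps avoids log(0) on silent frames

def model_input(mic, ref, linear_aec=None):
    """Stack log power spectra of the model inputs along the frequency axis.

    `linear_aec` is the optional third microphone signal; the concatenation
    layout is a hypothetical choice for this sketch.
    """
    signals = [mic, ref] if linear_aec is None else [mic, linear_aec, ref]
    return np.concatenate([log_power_spectrum(s) for s in signals], axis=1)
```

Computing the log power spectrum outside the model, as the text describes, means the network only sees compact real-valued features rather than raw waveforms.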
Returning to fig. 2, in step S203, a target enhanced speech signal is obtained based on the second microphone signal.
According to an exemplary embodiment of the present disclosure, deriving a target enhanced speech signal based on the second microphone signal comprises: taking the second microphone signal as the target enhanced speech signal; or performing gain processing on the second microphone signal to obtain the target enhanced speech signal. The target enhanced speech signal is the speech signal of the speaker, which is transmitted to the opposite terminal. According to this embodiment, using the second microphone signal directly as the target enhanced speech signal simplifies the acquisition process.
For example, to perform gain processing on the second microphone signal, automatic gain control (AGC) may be used to obtain the speech signal of the speaker, which is then transmitted to the opposite terminal.
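The disclosure names AGC without describing a particular algorithm. The sketch below is one simple frame-based variant, given only as an assumption: each frame's gain is steered towards a target RMS level, capped at a maximum gain and smoothed over time. All parameter names and default values are hypothetical.

```python
import numpy as np

def simple_agc(x, target_rms=0.1, frame=160, max_gain=10.0, smooth=0.9):
    """Frame-based automatic gain control (illustrative sketch).

    The gain moves towards target_rms / frame_rms, is limited to max_gain,
    and is smoothed across frames to avoid abrupt level jumps.
    """
    out = np.empty(len(x), dtype=float)
    gain = 1.0
    for start in range(0, len(x), frame):
        chunk = x[start:start + frame]
        rms = np.sqrt(np.mean(chunk ** 2)) + 1e-12
        desired = min(target_rms / rms, max_gain)
        gain = smooth * gain + (1.0 - smooth) * desired
        out[start:start + frame] = gain * chunk
    return out
```

A production AGC would typically also hold the gain during silence so that background noise is not amplified; that refinement is omitted here for brevity.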
To facilitate understanding of the above embodiments, the system is described below with reference to fig. 5, which is a schematic diagram of a microphone processing system according to an exemplary embodiment. As shown in fig. 5, a plurality of microphone signals are collected from the conference microphone hardware and subjected to conventional linear AEC processing, NLP processing, and conventional noise reduction (NS) processing, yielding a plurality of microphone signals from which echo and noise have been removed. Next, the microphone signal with the highest signal-to-noise ratio is selected from these processed signals, and the index of the corresponding microphone is obtained. Then, the original microphone signal (i.e., the first microphone signal in the above embodiments), the linear AEC output signal (i.e., the first microphone signal from which part of the unwanted signal has been removed), and the reference signal corresponding to this index are found and input into the deep-learning-based echo cancellation module, Deep AEC, to obtain an enhanced microphone signal. Finally, AGC is added, forming a conference microphone processing system with low computational complexity. In the Deep AEC module, an STFT is first applied to the original microphone signal, the linear AEC output signal and the reference signal corresponding to the index to obtain frequency-domain power spectrum signals, and the logarithm of these signals is taken to obtain the input I_16 shown in fig. 4. I_16 is input into a convolution network and then passed through a GRU network and a Dense network to obtain an amplitude spectrum mask that removes echo and noise simultaneously; the mask is multiplied by the original microphone signal corresponding to the index, and an inverse short-time Fourier transform is performed to obtain the enhanced signal.
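The selection step can be sketched as follows. The disclosure does not say how the signal-to-noise ratio is measured, so the estimator below (ratio of the 90th to the 10th percentile of frame energies) is purely an assumption for illustration, as are the function names.

```python
import numpy as np

def estimate_snr_db(x, frame=160):
    """Crude SNR estimate: loud-frame energy over quiet-frame energy, in dB.

    Assumption: the quietest frames approximate the noise floor and the
    loudest frames approximate speech plus noise.
    """
    n = len(x) // frame
    energy = np.mean(x[:n * frame].reshape(n, frame) ** 2, axis=1) + 1e-12
    noise = np.percentile(energy, 10)
    speech = np.percentile(energy, 90)
    return 10.0 * np.log10(speech / noise)

def select_microphone(raw, enhanced):
    """Pick the index whose conventionally enhanced signal has the best SNR.

    raw:      list of original microphone signals
    enhanced: list of the same signals after linear AEC, NLP and NS
    Returns (index, original signal, enhanced signal) for that microphone.
    """
    scores = [estimate_snr_db(e) for e in enhanced]
    index = int(np.argmax(scores))
    return index, raw[index], enhanced[index]
```

Note that the score is computed on the enhanced signals while the signal passed on to the deep model is the original one at the same index, mirroring the lookup described above.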
The embodiment of the disclosure selects the microphone signal with the highest signal-to-noise ratio using low-computation conventional signal processing, namely linear AEC, NLP and NS, and inputs that microphone signal, the corresponding linear AEC output signal and the reference signal into the deep-learning-based Deep AEC to obtain a high-quality enhanced microphone signal with echo and noise removed; after AGC is added, this forms a conference microphone processing system with low computational complexity.
In conclusion, the present disclosure can, on the one hand, improve the signal-to-noise ratio and obtain high-quality speech in noisy environments and, on the other hand, use deep learning for echo cancellation, so that the echo cancellation effect is greatly improved.
Fig. 6 is a block diagram illustrating a speech signal enhancement apparatus according to an exemplary embodiment. Referring to fig. 6, the apparatus includes:
a microphone signal acquisition unit 60 configured to select a first microphone signal from the at least one original microphone signal in dependence on the audio quality data of the at least one original microphone signal; an enhancement processing unit 62 configured to input the first microphone signal and a reference signal into an echo cancellation model to obtain a second microphone signal, where the reference signal is an original signal corresponding to a predetermined audio signal in the first microphone signal, and the predetermined audio signal is an audio signal played by a player; an enhanced speech signal obtaining unit 64 configured to obtain a target enhanced speech signal based on the second microphone signal.
According to an exemplary embodiment of the present disclosure, the microphone signal acquisition unit 60 is further configured to acquire at least one raw microphone signal; respectively carrying out voice enhancement processing on at least one original microphone signal to obtain at least one enhanced microphone signal; a target enhancement microphone signal is selected from the at least one enhancement microphone signal based on the audio quality data for the at least one enhancement microphone signal, and an original microphone signal corresponding to the target enhancement microphone signal is determined as the first microphone signal.
According to an exemplary embodiment of the present disclosure, the speech enhancement process includes at least one of a linear echo cancellation process, a non-linear echo cancellation process, and a noise reduction process.
According to an exemplary embodiment of the disclosure, the enhancement processing unit 62 is further configured to input the first microphone signal, a third microphone signal and a reference signal into the echo cancellation model to obtain the second microphone signal, wherein the third microphone signal is a microphone signal of the first microphone signal after being subjected to the speech enhancement processing.
According to an exemplary embodiment of the present disclosure, the third microphone signal is a microphone signal of the first microphone signal after linear echo cancellation processing.
According to an exemplary embodiment of the present disclosure, the enhanced speech signal acquisition unit 64 is further configured to take the second microphone signal as the target enhanced speech signal; or gain processing is carried out on the second microphone signal to obtain the target enhanced voice signal.
According to an exemplary embodiment of the present disclosure, the enhancement processing unit 62 is further configured to obtain frequency domain signals of the first microphone signal and the reference signal, respectively; carrying out logarithmic processing on the frequency domain signal to obtain a processed frequency domain signal; and inputting the processed frequency domain signal into an echo cancellation model to obtain a second microphone signal.
According to an embodiment of the present disclosure, an electronic device may be provided. Fig. 7 is a block diagram of an electronic device 700 according to an embodiment of the present disclosure. The electronic device 700 includes at least one memory 701 and at least one processor 702, the at least one memory 701 storing a set of computer-executable instructions that, when executed by the at least one processor 702, perform the speech signal enhancement method according to an embodiment of the present disclosure.
By way of example, the electronic device 700 may be a personal computer (PC), tablet device, personal digital assistant, smartphone, or other device capable of executing the set of instructions described above. The electronic device 700 need not be a single electronic device, but can be any collection of devices or circuits that can execute the above instructions (or sets of instructions) individually or in combination. The electronic device 700 may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces locally or remotely (e.g., via wireless transmission).
In the electronic device 700, the processor 702 may include a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a programmable logic device, a special purpose processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, processor 702 may also include an analog processor, a digital processor, a microprocessor, a multi-core processor, a processor array, a network processor, or the like.
The processor 702 may execute instructions or code stored in memory, where the memory 701 may also store data. The instructions and data may also be transmitted or received over a network via a network interface device, which may employ any known transmission protocol.
The memory 701 may be integrated with the processor 702, for example, by having RAM or flash memory disposed within an integrated circuit microprocessor or the like. Further, memory 701 may comprise a stand-alone device, such as an external disk drive, storage array, or any other storage device usable by a database system. The memory 701 and the processor 702 may be operatively coupled or may communicate with each other, e.g., through I/O ports, network connections, etc., such that the processor 702 can read files stored in the memory 701.
In addition, the electronic device 700 may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the electronic device may be connected to each other via a bus and/or a network.
According to an embodiment of the present disclosure, there may also be provided a computer-readable storage medium, wherein when the instructions in the computer-readable storage medium are executed by at least one processor, the at least one processor is caused to perform the speech signal enhancement method of the embodiment of the present disclosure. Examples of the computer-readable storage medium herein include: read-only memory (ROM), programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random-access memory (RAM), dynamic random-access memory (DRAM), static random-access memory (SRAM), flash memory, non-volatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or optical disk memory, hard disk drives (HDDs), solid-state drives (SSDs), card-type memory (such as a multimedia card, a Secure Digital (SD) card, or an extreme digital (XD) card), magnetic tape, floppy disk, magneto-optical data storage, hard disk, solid-state disk, and any other device configured to store a computer program and any associated data, data files and data structures and to provide them to a processor or computer so that the computer program can be executed. The computer program in the computer-readable storage medium described above can be run in an environment deployed in a computer apparatus, such as a client, a host, a proxy device, a server, and the like. Further, in one example, the computer program and any associated data, data files, and data structures are distributed across a networked computer system such that they are stored, accessed, and executed in a distributed fashion by one or more processors or computers.
According to an embodiment of the present disclosure, there is provided a computer program product comprising computer instructions which, when executed by a processor, implement the speech signal enhancement method of an embodiment of the present disclosure.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (11)

1. A method for speech signal enhancement, comprising:
selecting a first microphone signal from at least one original microphone signal according to audio quality data of the at least one original microphone signal;
inputting the first microphone signal and a reference signal into an echo cancellation model to obtain a second microphone signal, wherein the reference signal is an original signal corresponding to a predetermined audio signal in the first microphone signal, and the predetermined audio signal is an audio signal played by a player;
obtaining a target enhanced speech signal based on the second microphone signal.
2. The speech signal enhancement method of claim 1 wherein selecting a first microphone signal from at least one original microphone signal based on audio quality data for the at least one original microphone signal comprises:
acquiring the at least one original microphone signal;
respectively carrying out voice enhancement processing on the at least one original microphone signal to obtain at least one enhanced microphone signal;
selecting a target enhancement microphone signal from the at least one enhancement microphone signal according to the audio quality data of the at least one enhancement microphone signal; and
determining an original microphone signal corresponding to the target enhancement microphone signal as the first microphone signal.
3. The speech signal enhancement method of claim 2 wherein the speech enhancement processing comprises at least one of linear echo cancellation processing, non-linear echo cancellation processing, and noise reduction processing.
4. The speech signal enhancement method of claim 1 wherein inputting the first microphone signal and the reference signal into an echo cancellation model to obtain a second microphone signal comprises:
inputting the first microphone signal, a third microphone signal and the reference signal into the echo cancellation model to obtain the second microphone signal, wherein the third microphone signal is the microphone signal obtained after the first microphone signal is subjected to voice enhancement processing.
5. The speech signal enhancement method of claim 4 wherein the third microphone signal is a microphone signal after the first microphone signal has been subjected to linear echo cancellation processing.
6. The speech signal enhancement method of claim 1, wherein said deriving a target enhanced speech signal based on the second microphone signal comprises:
taking the second microphone signal as the target enhanced speech signal; or
performing gain processing on the second microphone signal to obtain the target enhanced speech signal.
7. The speech signal enhancement method of claim 1 wherein inputting the first microphone signal and the reference signal into an echo cancellation model to obtain a second microphone signal comprises:
respectively acquiring frequency domain signals of the first microphone signal and the reference signal;
carrying out logarithmic processing on the frequency domain signal to obtain a processed frequency domain signal;
and inputting the processed frequency domain signal into the echo cancellation model to obtain a second microphone signal.
8. A speech signal enhancement apparatus, comprising:
a microphone signal acquisition unit configured to select a first microphone signal from at least one original microphone signal in dependence on audio quality data of the at least one original microphone signal;
an enhancement processing unit configured to input the first microphone signal and a reference signal into an echo cancellation model to obtain a second microphone signal, wherein the reference signal is an original signal corresponding to a predetermined audio signal in the first microphone signal, and the predetermined audio signal is an audio signal played by a player;
an enhanced speech signal acquisition unit configured to derive a target enhanced speech signal based on the second microphone signal.
9. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the speech signal enhancement method of any one of claims 1 to 7.
10. A computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by at least one processor, cause the at least one processor to perform the method for speech signal enhancement of any of claims 1-7.
11. A computer program product comprising computer instructions, characterized in that the computer instructions, when executed by a processor, implement the speech signal enhancement method according to any one of claims 1 to 7.
CN202211110479.5A 2022-09-13 2022-09-13 Voice signal enhancement method and device Pending CN115472176A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211110479.5A CN115472176A (en) 2022-09-13 2022-09-13 Voice signal enhancement method and device

Publications (1)

Publication Number Publication Date
CN115472176A true CN115472176A (en) 2022-12-13

Family

ID=84371218

Country Status (1)

Country Link
CN (1) CN115472176A (en)

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination