CN113744762B - Signal-to-noise ratio determining method and device, electronic equipment and storage medium - Google Patents

Info

Publication number
CN113744762B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110908512.8A
Other languages
Chinese (zh)
Other versions
CN113744762A (en)
Inventor
郝一亚
阮良
陈功
王志强
胡林艳
陈丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Netease Zhiqi Technology Co Ltd
Original Assignee
Hangzhou Netease Zhiqi Technology Co Ltd

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 characterised by the type of extracted parameters
    • G10L25/18 characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G10L25/21 characterised by the type of extracted parameters, the extracted parameters being power information
    • G10L25/45 characterised by the type of analysis window
    • G10L25/51 specially adapted for particular use, for comparison or discrimination
    • G10L25/60 specially adapted for particular use, for comparison or discrimination, for measuring the quality of voice signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Telephone Function (AREA)
  • Noise Elimination (AREA)

Abstract

The disclosure relates to the technical field of communication, and discloses a signal-to-noise ratio determining method, a device, electronic equipment and a storage medium, wherein the method comprises the following steps: after converting a reference time domain signal of a first audio frame into a reference frequency domain signal of a second audio frame, determining the voice category of each frequency point in the reference frequency domain signal; the reference time domain signal is obtained by framing a noiseless reference signal based on a preset framing mode; after converting the noisy time domain signal of the first audio frame into a noisy frequency domain signal of the second audio frame, determining the voice energy information and the non-voice energy information of the noisy frequency domain signal based on the voice category of each frequency point in the reference frequency domain signal of the corresponding frame; the noisy time domain signal is obtained by framing a noisy signal matched with the reference signal based on a preset framing mode; and determining the signal-to-noise ratio of the noisy frequency domain signal based on the speech energy information and the non-speech energy information of the noisy frequency domain signal.

Description

Signal-to-noise ratio determining method and device, electronic equipment and storage medium
Technical Field
The disclosure relates to the technical field of communication, and in particular relates to a signal-to-noise ratio determining method, a signal-to-noise ratio determining device, electronic equipment and a storage medium.
Background
The signal-to-noise ratio (SNR), the ratio of the normal sound signal to the noise signal, is one of the important parameters in communication. The SNR needs to be calculated when measuring the impact of a processing system on noise, testing the performance of a speech noise reduction module, and so on.
In the related art, speech segments and non-speech segments are divided in the time domain from a reference signal; the divided speech segments are used as the speech segments of the matched noisy signal, and the divided non-speech segments are used as the non-speech segments of the noisy signal; the SNR is then determined from the energy value of the speech segments and the energy value of the non-speech segments in the noisy signal.
However, when non-stationary noise occurs in the noisy signal, it is difficult to accurately determine the SNR in the above manner.
Disclosure of Invention
The present disclosure provides a signal-to-noise ratio determining method, apparatus, electronic device, and storage medium for accurately determining SNR.
In a first aspect, an embodiment of the present disclosure provides a signal-to-noise ratio determining method, including:
after converting a reference time domain signal of a first audio frame into a reference frequency domain signal of a second audio frame, determining the voice category of each frequency point in the reference frequency domain signal; wherein the speech class characterizes speech or non-speech; the reference time domain signal is obtained by framing a noiseless reference signal based on a preset framing mode; the first audio frame represents an audio frame of a time domain type, and the second audio frame represents an audio frame of a frequency domain type;
After converting the noisy time domain signal of the first audio frame into a noisy frequency domain signal of the second audio frame, determining the voice energy information and the non-voice energy information of the noisy frequency domain signal based on the voice category of each frequency point in the reference frequency domain signal of the corresponding frame; the noisy time domain signal is obtained by framing a noisy signal matched with the reference signal based on the preset framing mode;
and determining the signal-to-noise ratio of the noisy frequency domain signal based on the speech energy information and the non-speech energy information of the noisy frequency domain signal.
In some optional embodiments, determining the voice class of each frequency point in the reference frequency domain signal includes:
determining the energy value of a corresponding frequency point aiming at any frequency point of the reference frequency domain signal;
if the energy value of the corresponding frequency point is smaller than the preset energy, determining that the voice class of the corresponding frequency point represents non-voice; otherwise, determining the voice category of the corresponding frequency point to represent voice.
In some optional embodiments, determining the speech energy information and the non-speech energy information of the noisy frequency-domain signal based on the speech class of each frequency point in the reference frequency-domain signal of the corresponding frame includes:
determining, as the voice energy information of the noisy frequency domain signal, the sum of the energy information in the noisy frequency domain signal at all frequency points that represent voice in the reference frequency domain signal of the corresponding frame; and
determining, as the non-voice energy information of the noisy frequency domain signal, the sum of the energy information in the noisy frequency domain signal at all frequency points that represent non-voice in the reference frequency domain signal of the corresponding frame.
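The two sums above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `split_energy` is hypothetical, and the energy of a frequency point is assumed here to be its squared magnitude, which the text does not specify.

```python
def split_energy(noisy_bins, is_speech):
    """Sum the energy of one noisy frame over speech and non-speech frequency points.

    noisy_bins: complex spectrum of one noisy frame.
    is_speech:  per-bin labels taken from the reference frame with the same index
                (True = voice, False = non-voice).
    Assumption: the energy of a bin is its squared magnitude |Y|^2.
    """
    if len(noisy_bins) != len(is_speech):
        raise ValueError("spectrum and labels must have the same length")
    speech_energy = 0.0
    non_speech_energy = 0.0
    for value, speech in zip(noisy_bins, is_speech):
        energy = abs(value) ** 2
        if speech:
            speech_energy += energy
        else:
            non_speech_energy += energy
    return speech_energy, non_speech_energy
```

Because the labels come from the noiseless reference frame, any noise energy that falls on a non-voice frequency point is counted as non-voice energy even when the frame as a whole contains speech.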
In some alternative embodiments, determining the signal-to-noise ratio of the noisy frequency-domain signal based on the speech energy information and the non-speech energy information of the noisy frequency-domain signal comprises:
selecting larger first energy information from the voice energy information of the noisy frequency domain signal and preset energy information, and selecting larger second energy information from the non-voice energy information of the noisy frequency domain signal and the preset energy information;
and determining the signal-to-noise ratio of the noisy frequency domain signal based on the ratio of the first energy information to the second energy information.
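A minimal sketch of this floored ratio is given below. The function name `frame_snr_db`, the default floor of 1e-10, and the conversion of the ratio to decibels with 10·lg are assumptions for illustration; the text only states that the larger of each energy and a preset energy is selected and that the SNR is based on their ratio.

```python
import math


def frame_snr_db(speech_energy, non_speech_energy, preset_energy=1e-10):
    """Return the SNR of one frame in dB from its two energy sums.

    Each energy is replaced by the preset energy when it is smaller, so the
    ratio stays finite for frames with no voice or no non-voice energy.
    Assumption: energies are linear powers, hence 10 * log10 of the ratio.
    """
    first = max(speech_energy, preset_energy)
    second = max(non_speech_energy, preset_energy)
    return 10.0 * math.log10(first / second)
```

Selecting the larger value avoids a division by zero when a frame has no non-voice energy, and avoids the logarithm of zero when it has no voice energy.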
In some alternative embodiments, converting the reference time domain signal of the first audio frame to the reference frequency domain signal of the second audio frame comprises:
Windowing is carried out on the reference time domain signal to obtain a windowed reference signal in the time domain;
and carrying out Fourier transform on the windowed reference signal to obtain a reference frequency domain signal on a frequency domain.
In some alternative embodiments, the method further comprises:
time-domain aligning the noisy signal with the reference signal.
In some alternative embodiments, time-domain aligning the noisy signal with the reference signal comprises:
determining a time difference between the noisy signal and the reference signal based on a signal cross-correlation;
shifting the noisy signal, based on the time difference, so that it is aligned with the reference signal in the time domain.
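The alignment step can be sketched as a brute-force cross-correlation search. The names `estimate_lag` and `align`, the `max_lag` search bound, and the zero padding are assumptions for illustration; the text does not say how the cross-correlation is computed or how the ends of the shifted signal are filled.

```python
def estimate_lag(reference, noisy, max_lag):
    """Return the lag d (in samples) that maximises sum_n reference[n] * noisy[n + d].

    A positive result means the noisy signal is delayed relative to the reference.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n, r in enumerate(reference):
            m = n + lag
            if 0 <= m < len(noisy):
                score += r * noisy[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def align(noisy, lag):
    """Shift a list of samples by -lag, zero-padding so the length is unchanged."""
    if lag > 0:
        return noisy[lag:] + [0.0] * lag
    if lag < 0:
        return [0.0] * (-lag) + noisy[:lag]
    return list(noisy)
```

The direct search costs on the order of `len(reference) * max_lag` multiplications, which is acceptable for a sketch; a longer recording would more likely use an FFT-based correlation.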
In a second aspect, an embodiment of the present disclosure provides a signal-to-noise ratio determining apparatus, the apparatus including: a voice class determining module, a signal conversion module, an energy information determining module and a signal-to-noise ratio determining module;
the voice class determining module is used for determining the voice class of each frequency point in the reference frequency domain signal after the signal converting module converts the reference time domain signal of the first audio frame into the reference frequency domain signal of the second audio frame; wherein the speech class characterizes speech or non-speech; the reference time domain signal is obtained by framing a noiseless reference signal based on a preset framing mode; the first audio frame represents an audio frame of a time domain type, and the second audio frame represents an audio frame of a frequency domain type;
The energy information determining module is used for determining the voice energy information and the non-voice energy information of the noisy frequency domain signal based on the voice category of each frequency point in the reference frequency domain signal of the corresponding frame after the signal converting module converts the noisy time domain signal of the first audio frame into the noisy frequency domain signal of the second audio frame; the noisy time domain signal is obtained by framing a noisy signal matched with the reference signal based on the preset framing mode;
the signal-to-noise ratio determining module is used for determining the signal-to-noise ratio of the noisy frequency domain signal based on the voice energy information and the non-voice energy information of the noisy frequency domain signal.
In some optional embodiments, the voice class determination module is specifically configured to:
determining the energy value of a corresponding frequency point aiming at any frequency point of the reference frequency domain signal;
if the energy value of the corresponding frequency point is smaller than the preset energy, determining that the voice class of the corresponding frequency point represents non-voice; otherwise, determining the voice category of the corresponding frequency point to represent voice.
In some alternative embodiments, the energy information determining module is specifically configured to:
determining, as the voice energy information of the noisy frequency domain signal, the sum of the energy information in the noisy frequency domain signal at all frequency points that represent voice in the reference frequency domain signal of the corresponding frame; and
determining, as the non-voice energy information of the noisy frequency domain signal, the sum of the energy information in the noisy frequency domain signal at all frequency points that represent non-voice in the reference frequency domain signal of the corresponding frame.
In some optional embodiments, the signal-to-noise ratio determining module is specifically configured to:
selecting larger first energy information from the voice energy information of the noisy frequency domain signal and preset energy information, and selecting larger second energy information from the non-voice energy information of the noisy frequency domain signal and the preset energy information;
and determining the signal-to-noise ratio of the noisy frequency domain signal based on the ratio of the first energy information to the second energy information.
In some alternative embodiments, the signal conversion module is specifically configured to:
windowing is carried out on the reference time domain signal to obtain a windowed reference signal in the time domain;
and carrying out Fourier transform on the windowed reference signal to obtain a reference frequency domain signal on a frequency domain.
In some optional embodiments, the apparatus further includes a framing processing module for:
time-domain aligning the noisy signal with the reference signal.
In some optional embodiments, the framing processing module is specifically configured to perform:
determining a time difference between the noisy signal and the reference signal based on a signal cross-correlation;
shifting the noisy signal, based on the time difference, so that it is aligned with the reference signal in the time domain.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including at least one processor and at least one memory, where the memory stores a computer program that, when executed by the processor, causes the processor to perform the signal-to-noise ratio determining method according to any one of the first aspect.
In a fourth aspect, an embodiment of the present disclosure provides a storage medium storing a computer program executable by an electronic device, which when run on the electronic device, causes the electronic device to perform a signal-to-noise ratio determination method according to any one of the first aspects.
The signal-to-noise ratio determining method, the signal-to-noise ratio determining device, the electronic equipment and the storage medium provided by the embodiment of the disclosure have the following beneficial effects:
The embodiment of the disclosure frames the reference signal and converts each frame into a reference frequency domain signal, and frames the noisy signal and converts each frame into a noisy frequency domain signal. It determines whether the voice class of each frequency point in the reference frequency domain signal represents voice or non-voice, and uses the voice classes of the frequency points in the reference frequency domain signal of the corresponding frame to distinguish non-voice from voice in the noisy frequency domain signal. Even if non-stationary noise is superimposed on the voice of the noisy signal in the time domain, the non-stationary noise can be distinguished from the voice frequency point by frequency point in the frequency domain, so that the voice energy information and the non-voice energy information of the noisy frequency domain signal, and thus the signal-to-noise ratio of each frame of the noisy frequency domain signal, can be determined accurately.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings that are needed in the description of the embodiments will be briefly described below, it will be apparent that the drawings in the following description are only some embodiments of the present disclosure, and that other drawings may be obtained according to these drawings without inventive effort to a person of ordinary skill in the art.
FIG. 1 is a schematic diagram of a speech segment and a non-speech segment provided by an embodiment of the present disclosure;
FIG. 2 is a flowchart of a first signal-to-noise ratio determining method according to an embodiment of the present disclosure;
FIG. 3 is a flowchart of a second signal-to-noise ratio determining method according to an embodiment of the present disclosure;
FIG. 4A is a schematic diagram of a noisy signal provided by an embodiment of the present disclosure;
FIG. 4B is a signal-to-noise ratio diagram of a frame boundary for a noisy signal provided by an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a signal-to-noise ratio determining apparatus provided by an embodiment of the present disclosure;
FIG. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure;
FIG. 7 is a schematic diagram of a program product provided by an embodiment of the present disclosure.
Detailed Description
For the purpose of promoting an understanding of the principles and advantages of the disclosure, reference will now be made in detail to the drawings, in which it is apparent that the embodiments described are only some, but not all embodiments of the disclosure. Based on the embodiments in this disclosure, all other embodiments that a person of ordinary skill in the art would obtain without making any inventive effort are within the scope of protection of this disclosure.
The terms "first," "second," and the like in this disclosure are used for distinguishing between similar objects and not for describing a particular sequential or chronological order. For example, "first audio frame" does not refer to the order of the reference time domain signal in the reference signal or the order of the noisy time domain signal in the noisy signal; it indicates that the reference time domain signal and the noisy time domain signal are audio frames of the time domain type. Likewise, "second audio frame" indicates that the reference frequency domain signal and the noisy frequency domain signal are audio frames of the frequency domain type.
In the description of the present disclosure, unless otherwise indicated, the meaning of "a plurality" is two or more.
The term "reference signal" in this disclosure refers to a clean speech signal that is free of noise;
the "noisy signal" is a signal obtained by superimposing noise in the matched "reference signal".
The SNR needs to be calculated when measuring the impact of a processing system on noise, testing the performance of a speech noise reduction module, and so on. In the related art, speech segments and non-speech segments are divided in the time domain from a reference signal; the divided speech segments are used as the speech segments of the matched noisy signal, and the divided non-speech segments are used as the non-speech segments of the noisy signal; the SNR is determined based on the energy value of the speech segments and the energy value of the non-speech segments (noise segments) in the noisy signal. Referring to FIG. 1, A1, A3 and A5 in the reference signal are speech segments, and A2, A4 and A6 are non-speech segments; correspondingly, $A1^{\#}$, $A3^{\#}$ and $A5^{\#}$ in the noisy signal are speech segments, and $A2^{\#}$, $A4^{\#}$ and $A6^{\#}$ are non-speech segments. The energy value of $A1^{\#}$ is denoted as $V1^{\#}$, the energy value of $A2^{\#}$ as $V2^{\#}$, the energy value of $A3^{\#}$ as $V3^{\#}$, the energy value of $A4^{\#}$ as $V4^{\#}$, the energy value of $A5^{\#}$ as $V5^{\#}$, and the energy value of $A6^{\#}$ as $V6^{\#}$. The energy value of the speech segments in the noisy signal is $V_S = V1^{\#} + V3^{\#} + V5^{\#}$; the energy value of the non-speech segments in the noisy signal is $V_N = V2^{\#} + V4^{\#} + V6^{\#}$.
However, non-stationary noise may occur in the noisy signal. For example, if non-stationary noise is superimposed on speech segment $A3^{\#}$, the energy value of that noise is not added to the energy value of the non-speech segments (noise segments) but is erroneously added to the energy value of the speech segment ($V3^{\#}$). The calculated speech segment energy value ($V_S$) is then too large and the non-speech segment energy value ($V_N$) too small, so the resulting SNR is too large. Therefore, it is difficult to accurately determine the SNR in the above manner.
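The bias described above can be checked with a small numeric sketch. All segment energies below are hypothetical values chosen for illustration, and the ratio $V_S / V_N$ is expressed in decibels with 10·lg, which is an assumption.

```python
import math


def segment_snr_db(speech_energies, non_speech_energies):
    """Time-domain related-art SNR: 10 * log10(V_S / V_N) over whole segments."""
    v_s = sum(speech_energies)      # V_S = V1# + V3# + V5#
    v_n = sum(non_speech_energies)  # V_N = V2# + V4# + V6#
    return 10.0 * math.log10(v_s / v_n)


# Hypothetical segment energies without any non-stationary noise.
baseline = segment_snr_db([4.0, 5.0, 3.0], [0.4, 0.5, 0.3])

# The same signal with a noise burst of energy 6.0 landing inside segment A3#:
# the burst is counted in V3#, so V_S grows while V_N stays unchanged.
with_burst = segment_snr_db([4.0, 5.0 + 6.0, 3.0], [0.4, 0.5, 0.3])
```

With these values the reported SNR rises after the burst is added, although the signal has become noisier, which is the error the time-domain segmentation cannot avoid.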
In view of this, an embodiment of the present disclosure provides a signal-to-noise ratio determining method, apparatus, electronic device, and storage medium. The embodiment frames the reference signal and converts each frame into a reference frequency domain signal, and frames the noisy signal and converts each frame into a noisy frequency domain signal. It determines whether the voice class of each frequency point in the reference frequency domain signal represents voice or non-voice, and uses the voice classes of the frequency points in the reference frequency domain signal of the corresponding frame to distinguish non-voice from voice in the noisy frequency domain signal. Even if non-stationary noise is superimposed on the voice of the noisy signal in the time domain, the non-stationary noise can be distinguished from the voice frequency point by frequency point in the frequency domain, so that the voice energy information and the non-voice energy information of the noisy frequency domain signal, and thus the signal-to-noise ratio of each frame of the noisy frequency domain signal, can be determined accurately.
Having described the basic principles of the present disclosure, the following will describe the technical solutions of the present disclosure and how the technical solutions of the present disclosure solve the above technical problems with reference to the drawings and the specific embodiments. The following embodiments may be combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments.
The embodiment of the disclosure provides a flow chart of a first signal-to-noise ratio determining method, as shown in fig. 2, comprising the following steps:
step S201: after converting the reference time domain signal of the first audio frame into the reference frequency domain signal of the second audio frame, determining the voice category of each frequency point in the reference frequency domain signal.
Wherein the speech class characterizes speech or non-speech; the reference time domain signal is obtained by framing a noiseless reference signal based on a preset framing mode; the first audio frame characterizes an audio frame of a time domain type and the second audio frame characterizes an audio frame of a frequency domain type.
As described above, when non-stationary noise occurs in a noisy signal, analysis in the time domain cannot accurately determine SNR. Based on this, the present embodiment transfers the noisy signal and the reference signal to the frequency domain for analysis.
Because an audio signal is stable over a short time, the reference signal is first framed based on a preset framing mode to obtain M frames of reference time domain signals, and the noisy signal is framed using the same preset framing mode to obtain M frames of noisy time domain signals. The preset framing mode is not particularly limited in the present disclosure; for example, the frame length may be 10 ms, 20 ms, or 30 ms.
Each frame of reference time domain signal and each frame of noisy time domain signal obtained by framing are relatively stable signals in the time domain, so that the signals can be converted into signals in the frequency domain.
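A minimal framing sketch is shown below. The function name `split_frames`, the use of non-overlapping frames, and the dropping of a trailing partial frame are assumptions for illustration; the disclosure leaves the framing mode open.

```python
def split_frames(signal, sample_rate, frame_ms=20):
    """Split a sample sequence into consecutive, non-overlapping frames.

    Assumptions: frames do not overlap, and a trailing partial frame is dropped.
    Applying the same call to the reference and to the aligned noisy signal of
    equal length yields the same number of frames M, so frame k of one matches
    frame k of the other.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    count = len(signal) // frame_len
    return [signal[i * frame_len:(i + 1) * frame_len] for i in range(count)]
```

For example, at a 16 kHz sample rate a 20 ms frame holds 320 samples.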
Step S202: after the noisy time domain signal of the first audio frame is converted into the noisy frequency domain signal of the second audio frame, the voice energy information and the non-voice energy information of the noisy frequency domain signal are determined based on the voice category of each frequency point in the reference frequency domain signal of the corresponding frame.
The noisy time domain signal is obtained by framing the noisy signal matched with the reference signal based on the preset framing mode.
In this embodiment, the order of each noisy time domain signal within the noisy signal (i.e., its frame index) is the same as the order of the reference time domain signal of the corresponding frame within the reference signal. For example, the frame corresponding to the kth noisy time domain signal (the frame whose order in the noisy signal is k) is the kth reference time domain signal (the frame whose order in the reference signal is k); that is, the reference frequency domain signal of the frame corresponding to the kth noisy frequency domain signal is the kth reference frequency domain signal.
By the steps, the voice category of each frequency point in the reference frequency domain signal is determined, so that the voice frequency point and the non-voice frequency point (the non-voice frequency point comprises the frequency point of non-stationary noise) of the noisy frequency domain signal of the corresponding frame can be distinguished, and further the voice energy information and the non-voice energy information of the noisy frequency domain signal are determined.
Step S203: and determining the signal-to-noise ratio of the noisy frequency domain signal based on the speech energy information and the non-speech energy information of the noisy frequency domain signal.
According to the above scheme, the reference signal is framed and each frame is converted into a reference frequency domain signal, and the noisy signal is framed and each frame is converted into a noisy frequency domain signal. Whether the voice class of each frequency point in the reference frequency domain signal represents voice or non-voice is determined, and the voice classes of the frequency points in the reference frequency domain signal of the corresponding frame are used to distinguish non-voice from voice in the noisy frequency domain signal. Even if non-stationary noise is superimposed on the voice of the noisy signal in the time domain, the non-stationary noise can be distinguished from the voice frequency point by frequency point in the frequency domain, so that the voice energy information and the non-voice energy information of the noisy frequency domain signal, and thus the signal-to-noise ratio of each frame of the noisy frequency domain signal, can be determined accurately.
The specific implementation manner of converting the signal in the time domain into the signal in the frequency domain in the step S201 is not limited, and the conversion of any reference time domain signal into the reference frequency domain signal may be implemented by the following way:
windowing is carried out on the reference time domain signal to obtain a windowed reference signal in the time domain;
and carrying out Fourier transform (such as short-time Fourier transform) on the windowed reference signal to obtain a reference frequency domain signal in a frequency domain.
Taking the kth frame reference time domain signal as an example:
$$\tilde{x}(k, n) = x(k, n) \cdot h(L)$$
wherein $x(k, n)$ is the kth frame reference time domain signal; $\tilde{x}(k, n)$ is the kth frame windowed reference signal; $h(L)$ is a window function, and $L$ represents the window length;
$$X(k, \omega_i) = \mathrm{STFT}(\tilde{x}(k, n))$$
wherein STFT is the short-time Fourier transform, and $X(k, \omega_i)$ is the kth frame reference frequency domain signal; $i = 1, 2, \ldots, N$, where $N$ is the total number of frequency points contained in the kth frame reference frequency domain signal.
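The windowing and transform of a single frame can be sketched as below. The Hann window, the direct evaluation of the discrete Fourier transform, and the names `hann` and `frame_spectrum` are assumptions for illustration; the disclosure does not fix the window type or the transform implementation. The returned list is indexed from 0, whereas the text numbers frequency points from 1.

```python
import cmath
import math


def hann(length):
    """Symmetric Hann window of the given length (assumed window type)."""
    if length == 1:
        return [1.0]
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def frame_spectrum(frame):
    """Window one time-domain frame and return its one-sided DFT.

    Bins 0 .. L // 2 are returned, where L is the frame (window) length.
    The direct O(L^2) sum is used for clarity rather than speed.
    """
    length = len(frame)
    windowed = [x * w for x, w in zip(frame, hann(length))]
    bins = length // 2 + 1
    return [
        sum(windowed[n] * cmath.exp(-2j * math.pi * i * n / length)
            for n in range(length))
        for i in range(bins)
    ]
```

The same function would be applied to the kth reference frame and to the kth noisy frame, so that their frequency points correspond one to one.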
Similarly, in the step S202, any of the noisy time-domain signals is converted into a noisy frequency-domain signal, which may be implemented by:
windowing the noisy time domain signal to obtain a windowed noisy signal in the time domain;
and carrying out Fourier transform (such as short-time Fourier transform) on the windowed noisy signal to obtain a noisy frequency domain signal in a frequency domain.
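Illustratively, the windowing and Fourier transform steps above can be sketched as follows. This is a minimal sketch and not the implementation of the disclosure: the Hann window, the direct DFT loop (an FFT library would normally be used instead), and the names `hann_window` and `frame_to_spectrum` are assumptions made for illustration, and the frequency points are indexed from 0 rather than from 1. The same function can be applied to a reference frame or to a noisy frame.

```python
import cmath
import math


def hann_window(length):
    """Periodic Hann window of the given length (an assumed window choice)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / length) for n in range(length)]


def frame_to_spectrum(frame, window):
    """Window one time domain frame, then take its DFT.

    Returns the non-negative-frequency bins, i = 0 .. len(frame) // 2.
    """
    windowed = [x * h for x, h in zip(frame, window)]
    size = len(windowed)
    spectrum = []
    for i in range(size // 2 + 1):
        acc = 0j
        for n, value in enumerate(windowed):
            acc += value * cmath.exp(-2j * math.pi * i * n / size)
        spectrum.append(acc)
    return spectrum


# Usage: a 16-sample frame holding a pure tone centred on frequency point 2.
frame = [math.cos(2.0 * math.pi * 2 * n / 16) for n in range(16)]
spectrum = frame_to_spectrum(frame, hann_window(16))
peak_bin = max(range(len(spectrum)), key=lambda i: abs(spectrum[i]))
```

In this usage example the largest magnitude falls on the frequency point of the tone, with smaller leakage into the two neighbouring frequency points caused by the window.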
Because the reference signal contains no noise, only speech and silence, the energy value of a frequency point that characterizes speech is large, and the energy value of a frequency point that characterizes non-speech (silence) is small. Based on this, determining the speech class of each frequency point in the reference frequency domain signal in step S201 may be implemented in, but is not limited to, the following way:
determining the energy value of a corresponding frequency point aiming at any frequency point of the reference frequency domain signal;
if the energy value of the corresponding frequency point is smaller than the preset energy, determining that the voice class of the corresponding frequency point represents non-voice; otherwise, determining the voice category of the corresponding frequency point to represent voice.
In practice, a signal-to-noise ratio expressed as a sound pressure level, i.e. $20 \lg(s/n)$, may be employed, since it is closer to the perception of the human ear. Accordingly, the energy value of each frequency point can be determined by:
$X_e(k, \omega_i) = 20 \lg\left(|X(k, \omega_i)|\right)$, where $X(k, \omega_i)$ is the k-th frame reference frequency domain signal, and $X_e(k, \omega_i)$ is the energy value of frequency point $\omega_i$ in the k-th frame reference frequency domain signal.
If $X_e(k, \omega_i) < \beta$, the speech class of frequency point $\omega_i$ is determined to characterize non-speech; if $X_e(k, \omega_i) \geq \beta$, the speech class of frequency point $\omega_i$ is determined to characterize speech; where $\beta$ is the preset energy, whose specific value can be set according to the actual application scenario.
The implementation manner of determining the energy value of each frequency point is merely illustrative, and the disclosure is not limited thereto.
As described above, frequency points that characterize speech have large energy values, and frequency points that characterize non-speech (silence) have small energy values. Therefore, after the energy value of a frequency point is determined, it is compared with the preset energy: if the energy value is smaller than the preset energy, the speech class of that frequency point can be accurately determined to characterize non-speech; otherwise, it can be accurately determined to characterize speech.
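Illustratively, the threshold comparison described above can be sketched as follows. This is a sketch under stated assumptions: the function name `classify_bins`, the example threshold of 0 dB, and the small guard value that avoids taking the logarithm of zero are hypothetical and are not specified by the disclosure.

```python
import math


def classify_bins(reference_spectrum, beta):
    """Return 1 (speech) or 0 (non-speech) for each reference frequency point."""
    flags = []
    for value in reference_spectrum:
        magnitude = max(abs(value), 1e-12)  # assumed guard against log10(0)
        energy_db = 20.0 * math.log10(magnitude)  # X_e = 20 * lg(|X|)
        flags.append(1 if energy_db >= beta else 0)
    return flags


# Usage: four frequency points of one reference frame, hypothetical beta of 0 dB.
flags = classify_bins([10.0 + 0j, 0.001 + 0j, 0j, 3 + 4j], beta=0.0)
```

The resulting list plays the role of the per-frequency-point speech class that is later applied to the noisy frequency domain signal of the same frame.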
The above-mentioned determination of the speech energy information and the non-speech energy information of the noisy frequency-domain signal in step S202 may be achieved by, but not limited to, the following ways:
determining, as the speech energy information of the noisy frequency domain signal, the sum of the energy information of the noisy frequency domain signal at all frequency points that characterize speech in the reference frequency domain signal of the corresponding frame; and
determining, as the non-speech energy information of the noisy frequency domain signal, the sum of the energy information of the noisy frequency domain signal at all frequency points that characterize non-speech in the reference frequency domain signal of the corresponding frame.
Taking the k-th frame noisy frequency domain signal as an example:
where $E_N(k)$ is the non-speech energy information of the k-th frame noisy frequency domain signal; $E_S(k)$ is the speech energy information of the k-th frame noisy frequency domain signal; $S(k, \omega_i)$ is the k-th frame noisy frequency domain signal; $\mathrm{VAD}(k, \omega_i)$ is 0 if the speech class of frequency point $\omega_i$ characterizes non-speech, and 1 if the speech class of frequency point $\omega_i$ characterizes speech.
The above examples are only one possible implementation of determining non-speech energy information and speech energy information, and the disclosure is not limited thereto.
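Illustratively, the two sums can be sketched as a single pass over the frequency points of one frame. This is a sketch only: the squared magnitude is assumed as the per-frequency-point energy measure, and the function name `split_energy` is hypothetical.

```python
def split_energy(noisy_spectrum, vad_flags):
    """Sum the noisy energy separately over speech and non-speech frequency points.

    vad_flags[i] is 1 where the reference frame marks frequency point i as
    speech and 0 where it marks it as non-speech.
    """
    speech_energy = 0.0
    non_speech_energy = 0.0
    for value, flag in zip(noisy_spectrum, vad_flags):
        energy = abs(value) ** 2  # assumed energy measure: squared magnitude
        if flag == 1:
            speech_energy += energy
        else:
            non_speech_energy += energy
    return speech_energy, non_speech_energy


# Usage: three frequency points, of which only the first is marked as speech.
e_s, e_n = split_energy([3 + 4j, 1 + 0j, 0 + 2j], [1, 0, 0])
```

Because the flags come from the noiseless reference frame, noise energy that overlaps speech in time but sits at other frequency points is still counted as non-speech energy.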
Determining the signal-to-noise ratio of the noisy frequency domain signal in step S203, based on the speech energy information and the non-speech energy information of the noisy frequency domain signal, may be implemented in, but is not limited to, the following way:
selecting larger first energy information from the voice energy information of the noisy frequency domain signal and preset energy information, and selecting larger second energy information from the non-voice energy information of the noisy frequency domain signal and the preset energy information;
and determining the signal-to-noise ratio of the noisy frequency domain signal based on the ratio of the first energy information to the second energy information.
The preset energy information may be set according to an actual application scenario, which is not specifically limited in the disclosure.
Taking preset energy information of 1e-06 (i.e. 0.000001) as an example, the signal-to-noise ratio of the k-th frame noisy frequency domain signal can be determined by the following method:
where $\mathrm{SNR}(k)$ is the signal-to-noise ratio of the k-th frame noisy frequency domain signal (i.e., the signal-to-noise ratio of the k-th frame noisy time domain signal in the noisy signal); $E_S(k)$ is the speech energy information of the k-th frame noisy frequency domain signal, and $\max[E_S(k), 10^{-6}]$ is the first energy information; $E_N(k)$ is the non-speech energy information of the k-th frame noisy frequency domain signal, and $\max[E_N(k), 10^{-6}]$ is the second energy information.
The above example is only one possible implementation of determining the signal-to-noise ratio of the noisy frequency domain signal, and the disclosure is not limited thereto.
The larger first energy information is selected from the speech energy information and the preset energy information, and the larger second energy information is selected from the non-speech energy information and the preset energy information. This avoids an inaccurate signal-to-noise ratio caused by the speech energy information or the non-speech energy information being too small (for example, a denominator close to zero), so that the frame-level signal-to-noise ratio can be accurately determined based on the ratio of the first energy information to the second energy information.
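Illustratively, the flooring and ratio steps can be sketched as follows. The decibel scaling of 10 times the base-10 logarithm is an assumption that suits energies measured as squared magnitudes; the disclosure only requires that the result be based on the ratio of the first energy information to the second. The function name `frame_snr_db` is hypothetical.

```python
import math


def frame_snr_db(speech_energy, non_speech_energy, preset=1e-06):
    """Frame-level SNR from floored speech and non-speech energies."""
    first = max(speech_energy, preset)       # first energy information
    second = max(non_speech_energy, preset)  # second energy information
    return 10.0 * math.log10(first / second)  # assumed dB scaling


# Usage: an ordinary frame, a noise-free frame, and a completely silent frame.
snr_mixed = frame_snr_db(25.0, 5.0)
snr_clean = frame_snr_db(1.0, 0.0)
snr_silent = frame_snr_db(0.0, 0.0)
```

Without the preset floor, the noise-free frame would divide by zero and the silent frame would evaluate the logarithm of an undefined ratio; with the floor, both produce finite values.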
The embodiment of the disclosure provides a flowchart of a second signal-to-noise ratio determining method, as shown in fig. 3, including the following steps:
Step S301: the noisy signal is time-domain aligned with the reference signal.
In implementation, there may be a time difference between the noisy signal and the reference signal; for example, the noisy signal may lag the reference signal by 0.0002 s. If the two signals are framed directly, the noisy frequency domain signal does not match the reference frequency domain signal of the corresponding frame, so the speech class of a frequency point in the reference frequency domain signal cannot be used as the speech class of the same frequency point in the noisy frequency domain signal of the corresponding frame, and the determined speech energy information and non-speech energy information of the noisy frequency domain signal are not accurate enough. Therefore, the two signals need to be aligned in the time domain before they are framed.
In some alternative embodiments, the noisy signal may be aligned with the reference signal in the time domain by:
determining a time difference between the noisy signal and the reference signal based on a signal cross-correlation;
the noisy signal is moved to be aligned with the reference signal in the time domain based on the time difference value.
Illustratively, the time difference between the noisy signal and the reference signal is calculated by the following formula:
$\mathrm{Delay} = \arg\max_n\,[(s \star x)(n)] - \lfloor (\mathrm{len}(s) + \mathrm{len}(x) - 1)/2 \rfloor$; where Delay is the time difference between the noisy signal $s$ and the reference signal $x$, and $\star$ denotes the cross-correlation operation.
After Delay is obtained, the noisy signal can be moved, and finally the noisy signal aligned with the reference signal is obtained. It can be appreciated that the subsequent steps are all based on the aligned noisy signals, and the specific implementation is similar to that of the embodiment of fig. 2, and the detailed description of this embodiment is omitted.
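Illustratively, the delay estimation and the shift can be sketched as follows. This is a sketch under stated assumptions: it searches the lag directly instead of subtracting an index offset from the position of the correlation peak (for signals of equal length the two give the same lag), it uses a plain double loop rather than an FFT-based correlation, and the names `cross_correlation_delay` and `align` are hypothetical.

```python
def cross_correlation_delay(noisy, reference):
    """Estimate by how many samples the noisy signal lags the reference."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-(len(reference) - 1), len(noisy)):
        score = 0.0
        for n, ref_value in enumerate(reference):
            m = n + lag
            if 0 <= m < len(noisy):
                score += noisy[m] * ref_value
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def align(noisy, delay):
    """Shift the noisy signal by `delay` samples, zero-padding to keep its length."""
    if delay > 0:
        return noisy[delay:] + [0.0] * delay
    if delay < 0:
        return [0.0] * (-delay) + noisy[:delay]
    return list(noisy)


# Usage: the noisy signal is the reference delayed by two samples.
reference = [0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0]
noisy = [0.0, 0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0]
delay = cross_correlation_delay(noisy, reference)
aligned = align(noisy, delay)
```

After the shift, frame k of the aligned noisy signal covers the same time span as frame k of the reference signal, which is what allows the reference speech classes to be reused.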
Step S302: after converting the reference time domain signal of the first audio frame into the reference frequency domain signal of the second audio frame, determining the voice category of each frequency point in the reference frequency domain signal.
Step S303: after the noisy time domain signal of the first audio frame is converted into the noisy frequency domain signal of the second audio frame, the voice energy information and the non-voice energy information of the noisy frequency domain signal are determined based on the voice category of each frequency point in the reference frequency domain signal of the corresponding frame.
Step S304: and determining the signal-to-noise ratio of the noisy frequency domain signal based on the speech energy information and the non-speech energy information of the noisy frequency domain signal.
According to this scheme, the noisy signal and the reference signal are aligned in the time domain, so that each noisy frequency domain signal matches the reference frequency domain signal of the corresponding frame. The speech class of each frequency point in the reference frequency domain signal can then be used directly as the speech class of the same frequency point in the noisy frequency domain signal of the corresponding frame, so that the speech energy information and the non-speech energy information of the noisy frequency domain signal are determined more accurately.
FIG. 4A is a schematic diagram of a noisy signal provided in this embodiment, and FIG. 4B is a schematic diagram of the frame-level signal-to-noise ratio of the noisy signal of FIG. 4A obtained by the above signal-to-noise ratio determining method. The frame order in FIG. 4B refers to the order of each noisy time domain signal within the noisy signal; for example, the frame order of the k-th frame noisy time domain signal is k.
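Illustratively, a frame-level signal-to-noise ratio sequence of the kind plotted against frame order can be produced by chaining the steps of the method. The following is a compact sketch, not the implementation of the disclosure: non-overlapping frames, a Hann window, a direct DFT, squared-magnitude energies, 10 times the base-10 logarithm as the output scale, and all function names are assumptions made for illustration, and the two input signals are assumed to be already aligned.

```python
import cmath
import math


def split_frames(signal, frame_len):
    """Cut a signal into consecutive, non-overlapping frames (assumed framing mode)."""
    count = len(signal) // frame_len
    return [signal[k * frame_len:(k + 1) * frame_len] for k in range(count)]


def spectrum(frame):
    """Hann-window a frame and return its non-negative-frequency DFT bins."""
    size = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2.0 * math.pi * n / size))
                for n, x in enumerate(frame)]
    return [sum(v * cmath.exp(-2j * math.pi * i * n / size)
                for n, v in enumerate(windowed))
            for i in range(size // 2 + 1)]


def frame_level_snr(reference, noisy, frame_len, beta, preset=1e-06):
    """Return one SNR value (in dB) per frame of the aligned noisy signal."""
    results = []
    for ref_frame, noisy_frame in zip(split_frames(reference, frame_len),
                                      split_frames(noisy, frame_len)):
        speech = non_speech = 0.0
        for ref_bin, noisy_bin in zip(spectrum(ref_frame), spectrum(noisy_frame)):
            ref_db = 20.0 * math.log10(max(abs(ref_bin), 1e-12))
            energy = abs(noisy_bin) ** 2
            if ref_db >= beta:   # reference marks this frequency point as speech
                speech += energy
            else:                # reference marks it as non-speech
                non_speech += energy
        results.append(10.0 * math.log10(max(speech, preset) / max(non_speech, preset)))
    return results


# Usage: one frame of "speech" (a tone) followed by one silent frame, with a
# weaker tone at another frequency acting as noise throughout.
tone = [math.cos(2.0 * math.pi * 2 * n / 16) for n in range(16)]
reference = tone + [0.0] * 16
noise = [0.1 * math.cos(2.0 * math.pi * 6 * n / 16) for n in range(32)]
noisy = [r + v for r, v in zip(reference, noise)]
snrs = frame_level_snr(reference, noisy, frame_len=16, beta=0.0)
```

In this usage example the first frame yields a positive signal-to-noise ratio, while the second frame, which contains only noise, yields a strongly negative value bounded by the preset energy information.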
Based on the same inventive concept, the embodiment of the disclosure also provides a signal-to-noise ratio determining device, which can inherit the descriptions of the foregoing method embodiments. Based on the foregoing embodiments, fig. 5 is a schematic structural diagram of a signal-to-noise ratio determining apparatus according to an embodiment of the disclosure, where the signal-to-noise ratio determining apparatus specifically includes: a voice class determination module 501, a signal conversion module 502, an energy information determination module 503, and a signal to noise ratio determination module 504;
The speech class determining module 501 is configured to determine a speech class of each frequency point in the reference frequency domain signal after the signal converting module 502 converts the reference time domain signal of the first audio frame into the reference frequency domain signal of the second audio frame; wherein the speech class characterizes speech or non-speech; the reference time domain signal is obtained by framing a noiseless reference signal based on a preset framing mode; the first audio frame represents an audio frame of a time domain type, and the second audio frame represents an audio frame of a frequency domain type;
the energy information determining module 503 is configured to determine, after the signal converting module 502 converts the noisy time-domain signal of the first audio frame into a noisy frequency-domain signal of the second audio frame, speech energy information and non-speech energy information of the noisy frequency-domain signal based on the speech category of each frequency point in the reference frequency-domain signal of the corresponding frame; the noisy time domain signal is obtained by framing a noisy signal matched with the reference signal based on the preset framing mode;
the signal-to-noise ratio determining module 504 is configured to determine a signal-to-noise ratio of the noisy frequency domain signal based on the speech energy information and the non-speech energy information of the noisy frequency domain signal.
In some alternative embodiments, the voice class determination module 501 is specifically configured to:
determining the energy value of a corresponding frequency point aiming at any frequency point of the reference frequency domain signal;
if the energy value of the corresponding frequency point is smaller than the preset energy, determining that the voice class of the corresponding frequency point represents non-voice; otherwise, determining the voice category of the corresponding frequency point to represent voice.
In some alternative embodiments, the energy information determining module 503 is specifically configured to:
determining, as the speech energy information of the noisy frequency domain signal, the sum of the energy information of the noisy frequency domain signal at all frequency points that characterize speech in the reference frequency domain signal of the corresponding frame; and
determining, as the non-speech energy information of the noisy frequency domain signal, the sum of the energy information of the noisy frequency domain signal at all frequency points that characterize non-speech in the reference frequency domain signal of the corresponding frame.
In some alternative embodiments, the signal-to-noise ratio determining module 504 is specifically configured to:
selecting larger first energy information from the voice energy information of the noisy frequency domain signal and preset energy information, and selecting larger second energy information from the non-voice energy information of the noisy frequency domain signal and the preset energy information;
and determining the signal-to-noise ratio of the noisy frequency domain signal based on the ratio of the first energy information to the second energy information.
In some alternative embodiments, the signal conversion module 502 is specifically configured to:
windowing is carried out on the reference time domain signal to obtain a windowed reference signal in the time domain;
and carrying out Fourier transform on the windowed reference signal to obtain a reference frequency domain signal on a frequency domain.
In some optional embodiments, the apparatus further includes a framing processing module 505 configured to:
the noisy signal is time-domain aligned with the reference signal.
In some optional embodiments, the framing processing module 505 is specifically configured to:
determining a time difference between the noisy signal and the reference signal based on a signal cross-correlation;
the noisy signal is moved to be aligned with the reference signal in the time domain based on the time difference value.
Since the signal-to-noise ratio determining apparatus corresponds to the method in the embodiments of the present disclosure and solves the problem on a similar principle, the implementation of the apparatus may refer to the implementation of the method, and repeated details are omitted.
An electronic device 600 according to such an embodiment of the present disclosure is described below with reference to fig. 6. The electronic device shown in fig. 6 is merely an example, and does not impose any limitation on the functionality and scope of use of the embodiments of the present disclosure.
As shown in fig. 6, the electronic device 600 is embodied in the form of a general purpose computing device. Components of electronic device 600 may include, but are not limited to: at least one processor 601, at least one memory 602, a bus 603 connecting the different system components, including the memory 602 and the processor 601.
Bus 603 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, a processor, and a local bus using any of a variety of bus architectures.
The memory 602 may include readable media in the form of volatile memory such as Random Access Memory (RAM) 6021 and/or cache memory 6022, and may further include Read Only Memory (ROM) 6023.
Memory 602 may also include a program/utility 6025 having a set (at least one) of program modules 6024, such program modules 6024 include, but are not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
The electronic device 600 may also communicate with one or more external devices 604 (e.g., keyboard, pointing device, etc.), one or more devices that enable a user to interact with the electronic device 600, and/or any devices (e.g., routers, modems, etc.) that enable the electronic device 600 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 605. Also, the electronic device 600 may communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet, through a network adapter 606. As shown, the network adapter 606 communicates with other modules for the electronic device 600 over the bus 603. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with electronic device 600, including, but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
In the disclosed embodiment, the memory 602 stores a computer program that, when executed by the processor 601, causes the processor 601 to perform:
After converting a reference time domain signal of a first audio frame into a reference frequency domain signal of a second audio frame, determining the voice category of each frequency point in the reference frequency domain signal; wherein the speech class characterizes speech or non-speech; the reference time domain signal is obtained by framing a noiseless reference signal based on a preset framing mode; the first audio frame represents an audio frame of a time domain type, and the second audio frame represents an audio frame of a frequency domain type;
after converting the noisy time domain signal of the first audio frame into a noisy frequency domain signal of the second audio frame, determining the voice energy information and the non-voice energy information of the noisy frequency domain signal based on the voice category of each frequency point in the reference frequency domain signal of the corresponding frame; the noisy time domain signal is obtained by framing a noisy signal matched with the reference signal based on the preset framing mode;
and determining the signal-to-noise ratio of the noisy frequency domain signal based on the speech energy information and the non-speech energy information of the noisy frequency domain signal.
In some alternative embodiments, the processor specifically performs:
determining the energy value of a corresponding frequency point aiming at any frequency point of the reference frequency domain signal;
If the energy value of the corresponding frequency point is smaller than the preset energy, determining that the voice class of the corresponding frequency point represents non-voice; otherwise, determining the voice category of the corresponding frequency point to represent voice.
In some alternative embodiments, the processor specifically performs:
determining, as the speech energy information of the noisy frequency domain signal, the sum of the energy information of the noisy frequency domain signal at all frequency points that characterize speech in the reference frequency domain signal of the corresponding frame; and
determining, as the non-speech energy information of the noisy frequency domain signal, the sum of the energy information of the noisy frequency domain signal at all frequency points that characterize non-speech in the reference frequency domain signal of the corresponding frame.
In some alternative embodiments, the processor specifically performs:
selecting larger first energy information from the voice energy information of the noisy frequency domain signal and preset energy information, and selecting larger second energy information from the non-voice energy information of the noisy frequency domain signal and the preset energy information;
and determining the signal-to-noise ratio of the noisy frequency domain signal based on the ratio of the first energy information to the second energy information.
In some alternative embodiments, the processor specifically performs:
windowing is carried out on the reference time domain signal to obtain a windowed reference signal in the time domain;
and carrying out Fourier transform on the windowed reference signal to obtain a reference frequency domain signal on a frequency domain.
In some alternative embodiments, the processor further performs:
the noisy signal is time-domain aligned with the reference signal.
In some alternative embodiments, the processor specifically performs:
determining a time difference between the noisy signal and the reference signal based on a signal cross-correlation;
the noisy signal is moved to be aligned with the reference signal in the time domain based on the time difference value.
Since the electronic device corresponds to the method in the embodiments of the present disclosure and solves the problem on a similar principle, the implementation of the electronic device may refer to the implementation of the method, and repeated details are omitted.
In some possible implementations, aspects of the disclosure may also be implemented in the form of a program product comprising program code for causing a processor of an electronic device to carry out the steps of any one of the signal-to-noise ratio determination methods described above, when the program product is run on the electronic device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium would include the following: an electrical connection having one or more wires, a portable disk, a hard disk, random Access Memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
As shown in fig. 7, a program product 700 in accordance with an embodiment of the present disclosure is described that may employ a portable compact disc read-only memory (CD-ROM) and include program code and may be run on an electronic device. However, the program product of the present disclosure is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the electronic device, partly on the electronic device, as a stand-alone software package, partly on the electronic device and partly on a remote device or entirely on the remote device. In the case of remote devices, the remote device may be connected to the electronic device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).
It should be noted that while several modules or sub-modules of the system are mentioned in the detailed description above, such partitioning is merely exemplary and not mandatory. Indeed, the features and functions of two or more modules described above may be embodied in one module in accordance with embodiments of the present disclosure. Conversely, the features and functions of one module described above may be further divided into a plurality of modules to be embodied.
Furthermore, while the operations of the various modules of the disclosed system are depicted in a particular order in the drawings, this is not required to either imply that the operations must be performed in that particular order or that all of the illustrated operations be performed in order to achieve desirable results. Additionally or alternatively, certain operations may be omitted, multiple operations combined into one operation execution, and/or one operation decomposed into multiple operation executions.
While the spirit and principles of the present disclosure have been described with reference to several particular embodiments, it is to be understood that this disclosure is not limited to the particular embodiments disclosed nor does it imply that features in these aspects are not to be combined to benefit from this division, which is done for convenience of description only. The disclosure is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (14)

1. A signal-to-noise ratio determination method, the method comprising:
after converting a reference time domain signal of a first audio frame into a reference frequency domain signal of a second audio frame, determining the voice category of each frequency point in the reference frequency domain signal; wherein the speech class characterizes speech or non-speech; the reference time domain signal is obtained by framing a noiseless reference signal based on a preset framing mode; the first audio frame represents an audio frame of a time domain type, and the second audio frame represents an audio frame of a frequency domain type;
after converting the noisy time domain signal of the first audio frame into a noisy frequency domain signal of the second audio frame, determining the voice energy information and the non-voice energy information of the noisy frequency domain signal based on the voice category of each frequency point in the reference frequency domain signal of the corresponding frame; the noisy time domain signal is obtained by framing a noisy signal matched with the reference signal based on the preset framing mode;
determining a signal-to-noise ratio of the noisy frequency domain signal based on the speech energy information and the non-speech energy information of the noisy frequency domain signal;
determining a signal-to-noise ratio of the noisy frequency-domain signal based on the speech energy information and the non-speech energy information of the noisy frequency-domain signal, comprising:
Selecting larger first energy information from the voice energy information of the noisy frequency domain signal and preset energy information, and selecting larger second energy information from the non-voice energy information of the noisy frequency domain signal and the preset energy information;
and determining the signal-to-noise ratio of the noisy frequency domain signal based on the ratio of the first energy information to the second energy information.
2. The method of claim 1, wherein determining the voice class for each frequency point in the reference frequency domain signal comprises:
determining the energy value of a corresponding frequency point aiming at any frequency point of the reference frequency domain signal;
if the energy value of the corresponding frequency point is smaller than the preset energy, determining that the voice class of the corresponding frequency point represents non-voice; otherwise, determining the voice category of the corresponding frequency point to represent voice.
3. The method of claim 1, wherein determining speech energy information and non-speech energy information for the noisy frequency-domain signal based on the speech class for each frequency point in the reference frequency-domain signal for the corresponding frame comprises:
determining, as the speech energy information of the noisy frequency domain signal, the sum of the energy information of the noisy frequency domain signal at all frequency points that characterize speech in the reference frequency domain signal of the corresponding frame; and
determining, as the non-speech energy information of the noisy frequency domain signal, the sum of the energy information of the noisy frequency domain signal at all frequency points that characterize non-speech in the reference frequency domain signal of the corresponding frame.
4. The method of claim 1, wherein converting the reference time domain signal of the first audio frame to the reference frequency domain signal of the second audio frame comprises:
windowing is carried out on the reference time domain signal to obtain a windowed reference signal in the time domain;
and carrying out Fourier transform on the windowed reference signal to obtain a reference frequency domain signal on a frequency domain.
5. The method according to any one of claims 1 to 4, further comprising:
the noisy signal is time-domain aligned with the reference signal.
6. The method of claim 5, wherein time-domain aligning the noisy signal with the reference signal comprises:
determining a time difference between the noisy signal and the reference signal based on a signal cross-correlation;
the noisy signal is moved to be aligned with the reference signal in the time domain based on the time difference value.
7. A signal-to-noise ratio determining apparatus, the apparatus comprising: a speech class determining module, a signal conversion module, an energy information determining module and a signal-to-noise ratio determining module;
the speech class determining module is configured to determine the speech class of each frequency point in the reference frequency domain signal after the signal conversion module converts the reference time domain signal of the first audio frame into the reference frequency domain signal of the second audio frame; wherein the speech class characterizes speech or non-speech; the reference time domain signal is obtained by framing a noiseless reference signal based on a preset framing mode; the first audio frame represents an audio frame of a time domain type, and the second audio frame represents an audio frame of a frequency domain type;
the energy information determining module is configured to determine the speech energy information and the non-speech energy information of the noisy frequency domain signal based on the speech class of each frequency point in the reference frequency domain signal of the corresponding frame after the signal conversion module converts the noisy time domain signal of the first audio frame into the noisy frequency domain signal of the second audio frame; the noisy time domain signal is obtained by framing a noisy signal matched with the reference signal based on the preset framing mode;
the signal-to-noise ratio determining module is configured to determine the signal-to-noise ratio of the noisy frequency domain signal based on the speech energy information and the non-speech energy information of the noisy frequency domain signal;
the signal-to-noise ratio determining module is specifically configured to:
select, as first energy information, the larger of the speech energy information of the noisy frequency domain signal and preset energy information, and select, as second energy information, the larger of the non-speech energy information of the noisy frequency domain signal and the preset energy information; and
determine the signal-to-noise ratio of the noisy frequency domain signal based on the ratio of the first energy information to the second energy information.
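The flooring step recited for the signal-to-noise ratio determining module can be sketched as follows. Expressing the result in decibels, the default floor of `1e-10`, and the name `frame_snr_db` are assumptions; the claim states only that the signal-to-noise ratio is based on the ratio of the two floored energies.

```python
import math


def frame_snr_db(speech_energy: float,
                 non_speech_energy: float,
                 preset_energy: float = 1e-10) -> float:
    """SNR of one noisy frame in dB, with both energies floored.

    Taking the larger of each energy and the preset energy keeps the ratio
    finite when a frame has no speech bins or no non-speech bins.
    """
    first = max(speech_energy, preset_energy)       # floored speech energy
    second = max(non_speech_energy, preset_energy)  # floored non-speech energy
    return 10.0 * math.log10(first / second)


snr = frame_snr_db(4.5, 0.75)    # energy ratio of 6
silent = frame_snr_db(0.0, 0.0)  # both energies floored to the same value
```

Without the floor, a frame whose non-speech energy is zero would cause a division by zero, and a frame whose speech energy is zero would produce a logarithm of zero.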
8. The apparatus according to claim 7, wherein the speech class determination module is specifically configured to:
for any frequency point of the reference frequency domain signal, determine the energy value of that frequency point;
if the energy value of the frequency point is smaller than the preset energy, determine that the speech class of the frequency point characterizes non-speech; otherwise, determine that the speech class of the frequency point characterizes speech.
9. The apparatus of claim 8, wherein the energy information determination module is specifically configured to:
determine, as the speech energy information of the noisy frequency domain signal, the sum of the energy information in the noisy frequency domain signal over all frequency points that characterize speech in the reference frequency domain signal of the corresponding frame; and
determine, as the non-speech energy information of the noisy frequency domain signal, the sum of the energy information in the noisy frequency domain signal over all frequency points that characterize non-speech in the reference frequency domain signal of the corresponding frame.
10. The apparatus of claim 7, wherein the signal conversion module is specifically configured to:
window the reference time domain signal to obtain a windowed reference signal in the time domain; and
perform a Fourier transform on the windowed reference signal to obtain the reference frequency domain signal in the frequency domain.
11. The apparatus according to any one of claims 7 to 10, further comprising a framing processing module configured to:
align the noisy signal with the reference signal in the time domain.
12. The apparatus according to claim 11, wherein the framing processing module is specifically configured to:
determine a time difference between the noisy signal and the reference signal based on signal cross-correlation; and
shift the noisy signal based on the time difference so that it is aligned with the reference signal in the time domain.
13. An electronic device comprising at least one processor and at least one memory, wherein the memory stores a computer program that, when executed by the processor, causes the processor to perform the method of any of claims 1-6.
14. A storage medium storing a computer program executable by an electronic device, which when run on the electronic device causes the electronic device to perform the method of any one of claims 1 to 6.
CN202110908512.8A 2021-08-09 2021-08-09 Signal-to-noise ratio determining method and device, electronic equipment and storage medium Active CN113744762B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110908512.8A CN113744762B (en) 2021-08-09 2021-08-09 Signal-to-noise ratio determining method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110908512.8A CN113744762B (en) 2021-08-09 2021-08-09 Signal-to-noise ratio determining method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113744762A CN113744762A (en) 2021-12-03
CN113744762B CN113744762B (en) 2023-10-27

Family

ID=78730440

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110908512.8A Active CN113744762B (en) 2021-08-09 2021-08-09 Signal-to-noise ratio determining method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113744762B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114051259B (en) * 2022-01-13 2022-04-15 高拓讯达(北京)科技有限公司 Detection method and detection device for wifi signal

Citations (7)

Publication number Priority date Publication date Assignee Title
CN101599274A (en) * 2009-06-26 2009-12-09 瑞声声学科技(深圳)有限公司 Voice enhancement method
CN103871421A (en) * 2014-03-21 2014-06-18 厦门莱亚特医疗器械有限公司 Self-adaptive denoising method and system based on sub-band noise analysis
US9437212B1 (en) * 2013-12-16 2016-09-06 Marvell International Ltd. Systems and methods for suppressing noise in an audio signal for subbands in a frequency domain based on a closed-form solution
CN108010539A (en) * 2017-12-05 2018-05-08 广州势必可赢网络科技有限公司 Voice quality evaluation method and device based on voice activation detection
CN108630221A (en) * 2017-03-24 2018-10-09 现代自动车株式会社 Audio Signal Quality Enhancement Based on Quantized SNR Analysis and Adaptive Wiener Filtering
CN109801646A (en) * 2019-01-31 2019-05-24 北京嘉楠捷思信息技术有限公司 Voice endpoint detection method and device based on fusion features
CN112863535A (en) * 2021-01-05 2021-05-28 中国科学院声学研究所 Residual echo and noise elimination method and device

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US20070237341A1 (en) * 2006-04-05 2007-10-11 Creative Technology Ltd Frequency domain noise attenuation utilizing two transducers

Patent Citations (7)

Publication number Priority date Publication date Assignee Title
CN101599274A (en) * 2009-06-26 2009-12-09 瑞声声学科技(深圳)有限公司 Voice enhancement method
US9437212B1 (en) * 2013-12-16 2016-09-06 Marvell International Ltd. Systems and methods for suppressing noise in an audio signal for subbands in a frequency domain based on a closed-form solution
CN103871421A (en) * 2014-03-21 2014-06-18 厦门莱亚特医疗器械有限公司 Self-adaptive denoising method and system based on sub-band noise analysis
CN108630221A (en) * 2017-03-24 2018-10-09 现代自动车株式会社 Audio Signal Quality Enhancement Based on Quantized SNR Analysis and Adaptive Wiener Filtering
CN108010539A (en) * 2017-12-05 2018-05-08 广州势必可赢网络科技有限公司 Voice quality evaluation method and device based on voice activation detection
CN109801646A (en) * 2019-01-31 2019-05-24 北京嘉楠捷思信息技术有限公司 Voice endpoint detection method and device based on fusion features
CN112863535A (en) * 2021-01-05 2021-05-28 中国科学院声学研究所 Residual echo and noise elimination method and device

Also Published As

Publication number Publication date
CN113744762A (en) 2021-12-03

Similar Documents

Publication Publication Date Title
CN108615535B (en) Voice enhancement method and device, intelligent voice equipment and computer equipment
WO2021000597A1 (en) Voice signal processing method and device, terminal, and storage medium
CN108564963B (en) Method and apparatus for enhancing voice
US11475907B2 (en) Method and device of denoising voice signal
KR20160125984A (en) Systems and methods for speaker dictionary based speech modeling
JP2011203759A (en) Method and apparatus for multi-sensory speech enhancement
BRPI0612668A2 (en) multisensory speech using a speech state model
US11308946B2 (en) Methods and apparatus for ASR with embedded noise reduction
JP6374120B2 (en) System and method for speech restoration
US9484044B1 (en) Voice enhancement and/or speech features extraction on noisy audio signals using successively refined transforms
US9530434B1 (en) Reducing octave errors during pitch determination for noisy audio signals
US20130253920A1 (en) Method and apparatus for robust speaker and speech recognition
US9208794B1 (en) Providing sound models of an input signal using continuous and/or linear fitting
CN113744762B (en) Signal-to-noise ratio determining method and device, electronic equipment and storage medium
CN113763974B (en) Packet loss compensation method and device, electronic equipment and storage medium
CN114530160A (en) Model training method, echo cancellation method, system, device and storage medium
CN114596870A (en) Real-time audio processing method and device, computer storage medium and electronic equipment
BR112014009647B1 (en) NOISE Attenuation APPLIANCE AND NOISE Attenuation METHOD
CN113035216B (en) Microphone array voice enhancement method and related equipment
US20150162014A1 (en) Systems and methods for enhancing an audio signal
US10540990B2 (en) Processing of speech signals
CN115101097A (en) Voice signal processing method and device, electronic equipment and storage medium
CN111326166B (en) Voice processing method and device, computer readable storage medium and electronic equipment
WO2021217750A1 (en) Method and system for eliminating channel difference in voice interaction, electronic device, and medium
EP3669356B1 (en) Low complexity detection of voiced speech and pitch estimation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant