WO2017177782A1 - Speech signal cascade processing method, terminal, and computer-readable storage medium - Google Patents

Speech signal cascade processing method, terminal, and computer-readable storage medium

Info

Publication number
WO2017177782A1
WO2017177782A1 (PCT/CN2017/076653)
Authority
WO
WIPO (PCT)
Prior art keywords
signal
voice signal
speech signal
voice
speech
Prior art date
Application number
PCT/CN2017/076653
Other languages
English (en)
French (fr)
Inventor
梁俊斌
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司 filed Critical 腾讯科技(深圳)有限公司
Priority to EP17781758.2A priority Critical patent/EP3444819B1/en
Publication of WO2017177782A1 publication Critical patent/WO2017177782A1/zh
Priority to US16/001,736 priority patent/US10832696B2/en
Priority to US17/076,656 priority patent/US11605394B2/en


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 Processing in the frequency domain
    • G10L21/0316 Speech enhancement by changing the amplitude
    • G10L21/0324 Details of processing therefor
    • G10L21/0364 Speech enhancement by changing the amplitude for improving intelligibility
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02 Analysis-synthesis using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L19/04 Analysis-synthesis using predictive techniques
    • G10L19/26 Pre-filtering or post-filtering
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/06 Extracted parameters being correlation coefficients
    • G10L25/09 Extracted parameters being zero crossing rates
    • G10L25/21 Extracted parameters being power information
    • G10L25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 Specially adapted for comparison or discrimination
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/90 Pitch determination of speech signals

Definitions

  • The present invention relates to the field of audio data processing, and in particular to a speech signal cascade processing method, a terminal, and a non-volatile computer-readable storage medium.
  • VOIP (Voice over Internet Protocol)
  • PSTN (Public Switched Telephone Network)
  • IP phones on the Internet interoperate with fixed-line phones on the PSTN, and with mobile phones on wireless networks.
  • Different networks use different speech codecs.
  • The wireless GSM (Global System for Mobile Communication) network uses AMR-NB coding,
  • fixed-line phones use G.711 coding,
  • and IP phones use G.729 and other codecs.
  • The speech coding formats supported by the network terminals are inconsistent, which inevitably leads to multiple encoding/decoding passes on the call link,
  • the purpose being to let terminals on different networks interoperate by voice after cascaded encoding and decoding.
  • A speech signal cascade processing method, a terminal, and a non-volatile computer-readable storage medium are provided.
  • A speech signal cascade processing method includes:
  • if the speech signal is a first characteristic signal, performing pre-emphasis filtering on the first characteristic signal with first pre-emphasis filter coefficients to obtain a first pre-enhanced speech signal; if the speech signal is a second characteristic signal, performing pre-emphasis filtering on the second characteristic signal with second pre-emphasis filter coefficients to obtain a second pre-enhanced speech signal;
  • outputting the first pre-enhanced speech signal or the second pre-enhanced speech signal, so that cascade encoding and decoding is performed according to the first or second pre-enhanced speech signal.
  • A terminal includes a memory and a processor, the memory storing computer-readable instructions which, when executed by the processor, cause the processor to perform the following steps:
  • if the speech signal is a first characteristic signal, performing pre-emphasis filtering on the first characteristic signal with first pre-emphasis filter coefficients to obtain a first pre-enhanced speech signal;
  • if the speech signal is a second characteristic signal, performing pre-emphasis filtering on the second characteristic signal with second pre-emphasis filter coefficients to obtain a second pre-enhanced speech signal;
  • outputting the first pre-enhanced speech signal or the second pre-enhanced speech signal, so that cascade encoding and decoding is performed according to the first or second pre-enhanced speech signal.
  • One or more non-volatile computer-readable storage media containing computer-executable instructions which, when executed by one or more processors, cause the processors to perform the following steps:
  • if the speech signal is a first characteristic signal, performing pre-emphasis filtering on the first characteristic signal with first pre-emphasis filter coefficients to obtain a first pre-enhanced speech signal;
  • if the speech signal is a second characteristic signal, performing pre-emphasis filtering on the second characteristic signal with second pre-emphasis filter coefficients to obtain a second pre-enhanced speech signal;
  • outputting the first pre-enhanced speech signal or the second pre-enhanced speech signal, so that cascade encoding and decoding is performed according to the first or second pre-enhanced speech signal.
  • FIG. 1 is a schematic diagram of the application environment of a speech signal cascade processing method in an embodiment;
  • FIG. 2 is a schematic diagram of the internal structure of a terminal in an embodiment;
  • FIG. 3A is a schematic diagram of the frequency energy impairment of a first characteristic signal after cascade encoding and decoding in an embodiment;
  • FIG. 3B is a schematic diagram of the frequency energy impairment of a second characteristic signal after cascade encoding and decoding in an embodiment;
  • FIG. 4 is a flowchart of a speech signal cascade processing method in an embodiment;
  • FIG. 5 is a flowchart of obtaining the first pre-emphasis filter coefficients and the second pre-emphasis filter coefficients by offline training on the training samples in an audio training set;
  • FIG. 6 illustrates acquiring the pitch period of the speech signal in an embodiment;
  • FIG. 7 is a schematic diagram of the principle of three-level clipping;
  • FIG. 8 is a schematic diagram of the pitch-period computation result for a segment of speech;
  • FIG. 9 is a schematic diagram of enhancing the speech input signal of an online call with the pre-emphasis filter coefficients obtained by offline training, in an embodiment;
  • FIG. 10 is a schematic diagram of the cascaded codec signal after pre-enhancement;
  • FIG. 11 is a schematic comparison of the signal spectrum of the unenhanced cascaded codec signal and that of the enhanced cascaded codec signal;
  • FIG. 12 is a schematic comparison of the mid-to-high-frequency portions of the signal spectra of the unenhanced and the enhanced cascaded codec signals;
  • FIG. 13 is a structural block diagram of a speech signal cascade processing apparatus in an embodiment;
  • FIG. 14 is a structural block diagram of a speech signal cascade processing apparatus in another embodiment;
  • FIG. 15 is a schematic diagram of the internal structure of a training module in an embodiment;
  • FIG. 16 is a structural block diagram of a speech signal cascade processing apparatus in another embodiment.
  • For example, a first client may be referred to as a second client, and similarly,
  • a second client may be referred to as a first client, without departing from the scope of the present invention.
  • Both the first client and the second client are clients, but they are not the same client.
  • FIG. 1 is a schematic diagram of an application environment of a voice signal cascade processing method in an embodiment.
  • the application environment includes a first terminal 110, a first network 120, a second network 130, and a second terminal 140.
  • The first terminal 110 receives the speech signal, which, after codec processing by the first network 120 and the second network 130, is received by the second terminal 140.
  • The first terminal 110 performs feature recognition on the speech signal: if the speech signal is a first characteristic signal, it performs pre-emphasis filtering on the first characteristic signal with first pre-emphasis filter coefficients to obtain a first pre-enhanced speech signal; if the speech signal is a second characteristic signal, it performs pre-emphasis filtering on the second characteristic signal with second pre-emphasis filter coefficients to obtain a second pre-enhanced speech signal.
  • The first or second pre-enhanced speech signal is output and undergoes cascade encoding and decoding through the first network 120 and the second network 130, yielding a pre-enhanced cascaded codec signal; the second terminal 140 receives this pre-enhanced cascaded codec signal, and the received signal has high intelligibility.
  • The first terminal 110 also receives speech signals sent by the second terminal 140 through the second network 130 and the first network 120, and likewise applies pre-emphasis filtering to them.
  • FIG. 2 is a schematic diagram showing the internal structure of a terminal in an embodiment.
  • the terminal includes a processor, a storage medium, a memory, a network interface, a sound collecting device, and a speaker connected through a system bus.
  • The storage medium of the terminal stores an operating system and computer-readable instructions which, when executed, cause the processor to perform the steps of a speech signal cascade processing method.
  • The processor provides computing and control capability and supports the operation of the entire terminal. The processor is used to perform a speech signal cascade processing method that includes: acquiring a speech signal; performing feature recognition on the speech signal; if the speech signal is a first characteristic signal, performing pre-emphasis filtering on the first characteristic signal with first pre-emphasis filter coefficients to obtain a first pre-enhanced speech signal; if the speech signal is a second characteristic signal, performing pre-emphasis filtering on the second characteristic signal with second pre-emphasis filter coefficients to obtain a second pre-enhanced speech signal; and outputting the first or second pre-enhanced speech signal, so that cascade encoding and decoding
  • is performed according to it.
  • The terminal can be a telephone, a mobile phone, a tablet computer, or a personal digital assistant capable of making network calls. Those skilled in the art will understand that the structure shown in FIG. 2 is only a block diagram of the part of the structure related to the solution of the present application and does not limit the terminals to which the solution applies;
  • a specific terminal may include more or fewer components than shown in the figure, or combine certain components, or have a different arrangement of components.
  • The key component of speech intelligibility is the mid-to-high-frequency energy of the speech signal.
  • The first characteristic signal has a low fundamental frequency (generally within 125 Hz (Hertz)); its main energy is concentrated in the low and mid frequencies (below 1000 Hz)
  • and its mid-to-high-frequency content (above 1000 Hz) is small. The second characteristic signal has a higher fundamental frequency (generally above 125 Hz) and more mid-to-high-frequency content than the first characteristic signal, as shown in FIG. 3A and FIG. 3B.
  • After cascade encoding and decoding, the frequency energy of both the first and the second characteristic signal is impaired. Because the proportion of high frequencies in the first characteristic signal is low, its mid-to-high-frequency energy becomes even lower after cascaded encoding and decoding, so its speech intelligibility is greatly affected and the listener perceives the sound as blurry and hard to make out.
  • The second characteristic signal also loses mid-to-high-frequency energy, but after cascade encoding enough mid-to-high-frequency energy remains to achieve reasonably good intelligibility.
  • Take, for example, speech synthesized with CELP (Code Excited Linear Prediction), a codec model
  • whose criterion is minimum perceptual distortion of speech.
  • Because the spectral energy of the first characteristic signal is very unevenly distributed, with most energy in the low and mid frequencies, the encoding process mainly keeps the low- and mid-frequency distortion smallest, while the mid-to-high frequencies, which carry a small share of the energy, suffer relatively large distortion.
  • By contrast, the spectral energy distribution of the second characteristic signal is more balanced, with substantial mid-to-high-frequency content,
  • so the energy loss of its mid-to-high-frequency components after encoding and decoding is relatively low. That is, the intelligibility degradation of the first and the second characteristic signal after cascade encoding and decoding differs markedly.
  • In FIG. 3A, the solid curve is the original first characteristic signal and the dashed curve is the signal after cascade encoding and decoding.
  • In FIG. 3B, the solid curve is the original second characteristic signal and the dashed curve is the signal after cascade encoding and decoding.
  • In FIG. 3A and FIG. 3B, the abscissa is frequency
  • and the ordinate is energy,
  • given as a normalized energy value; normalization is relative to the maximum peak of the first or second characteristic signal.
  • The first characteristic signal may be a male voice signal,
  • and the second characteristic signal may be a female voice signal.
  • A speech signal cascade processing method, which runs on the terminal of FIG. 1, includes:
  • Step 402: Acquire a speech signal.
  • The speech signal is the speech identified in the input original speech signal.
  • The terminal acquires the original speech signal after the cascade codec processing and recognizes the speech signal within the original speech signal.
  • The cascaded codec chain depends on the actual link the original speech signal traverses.
  • For example, when an IP phone supporting G.729A interoperates with a GSM mobile phone, the cascaded codec chain can be G.729A encoding + G.729 decoding + AMR-NB encoding + AMR-NB decoding.
  • Speech intelligibility refers to the degree to which the listener hears clearly and understands the speaker's verbal content.
  • Step 404: Perform feature recognition on the speech signal.
  • Performing feature recognition on the speech signal includes: acquiring the pitch period of the speech signal, and determining whether the pitch period of the speech signal is greater than a preset period value; if so, the speech signal is a first characteristic signal; if not, it is a second characteristic signal.
  • The frequency of vocal-cord vibration is called the fundamental frequency, and the corresponding period is called the pitch period.
  • The preset period value can be set as needed, for example 60 samples: if the pitch period of the speech signal is greater than 60 samples, the speech signal is the first characteristic signal; if it is less than or equal to 60 samples, it is the second characteristic signal.
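A minimal Python sketch of this decision rule, assuming the pitch period has already been estimated in samples; the 60-sample threshold follows the example above, and the function name is illustrative:

```python
def classify_by_pitch_period(pitch_period_samples, threshold=60):
    """Classify a frame by its pitch period in samples.

    A pitch period above `threshold` (low fundamental frequency) marks
    the first characteristic signal; otherwise the second. The 60-sample
    default follows the example threshold given in the text.
    """
    if pitch_period_samples > threshold:
        return "first"   # low fundamental frequency signal
    return "second"      # higher fundamental frequency signal
```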
  • Step 406: If the speech signal is the first characteristic signal, perform pre-emphasis filtering on the first characteristic signal with the first pre-emphasis filter coefficients to obtain the first pre-enhanced speech signal.
  • Step 408: If the speech signal is the second characteristic signal, perform pre-emphasis filtering on the second characteristic signal with the second pre-emphasis filter coefficients to obtain the second pre-enhanced speech signal.
  • the first characteristic signal and the second characteristic signal may be speech signals in different frequency bands.
  • Step 410: Output the first pre-enhanced speech signal or the second pre-enhanced speech signal, so that cascade encoding and decoding is performed according to the first or second pre-enhanced speech signal.
  • The above speech signal cascade processing method performs feature recognition on the speech signal, applies pre-emphasis filtering with the first pre-emphasis filter coefficients to the first characteristic signal and with the second pre-emphasis filter coefficients to the second characteristic signal,
  • and passes the pre-enhanced speech through cascade encoding and decoding, so the receiving party can hear the speech information more clearly; this improves the intelligibility of the speech signal after cascade encoding and decoding.
  • Because the first and the second characteristic signal are each filtered with their own coefficients, the enhancement filtering is more targeted and more accurate.
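As a sketch of how this routing might look in code (assuming SciPy is available; the coefficient arrays stand for the offline-trained first and second pre-emphasis filter coefficient sets described below, and the 60-sample threshold follows the feature-recognition example above):

```python
from scipy.signal import lfilter

def pre_enhance(frame, pitch_period, coeffs_first, coeffs_second,
                threshold=60):
    """Route a speech frame to the matching pre-emphasis FIR filter.

    coeffs_first / coeffs_second are the offline-trained FIR coefficient
    sets (a0..am) for the first and second characteristic signals.
    """
    b = coeffs_first if pitch_period > threshold else coeffs_second
    # FIR filtering: y[n] = a0*x[n] + a1*x[n-1] + ... + am*x[n-m]
    return lfilter(b, [1.0], frame)
```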
  • In an embodiment, before acquiring the speech signal, the speech signal cascade processing method further includes: acquiring an input original audio signal; detecting whether the original audio signal is a speech signal or a non-speech signal; if the original audio signal is a speech signal, acquiring the speech signal; if the original audio signal is a non-speech signal, applying high-pass filtering to the non-speech signal.
  • The audio signal is judged to be a speech signal or a non-speech signal by VAD.
  • High-pass filtering is applied to non-speech to reduce signal noise.
  • In an embodiment, before acquiring the speech signal, the speech signal cascade processing method further includes: performing offline training on the training samples in an audio training set to obtain the first pre-emphasis filter coefficients and the second pre-emphasis filter coefficients.
  • The training samples in the audio training set may be speech signals that are recorded or selected from the network.
  • the step of performing offline training according to the training samples in the audio training set to obtain the first pre-emphasis filter coefficient and the second pre-emphasis filter coefficient includes:
  • Step 502 Acquire a sample speech signal from the audio training set, where the sample speech signal is a first feature sample speech signal or a second feature sample speech signal.
  • an audio training set is pre-established, and the audio training set includes a plurality of first feature sample speech signals and second feature sample speech signals.
  • the first feature sample speech signal and the second feature sample speech signal in the audio training set exist independently.
  • the first feature sample speech signal and the second feature sample speech signal are sample speech signals of different characteristic signals.
  • The method further includes: determining whether the sample speech signal is a speech signal; if so, performing simulated cascade codec processing on the sample speech signal to obtain a degraded speech signal; if not, re-acquiring a sample speech signal from the audio training set.
  • VAD (Voice Activity Detection) is used to determine whether the sample speech signal is a speech signal.
  • VAD is a speech detection algorithm that estimates speech based on energy, zero-crossing rate, noise-floor estimation, and the like.
  • The steps of determining whether the sample speech signal is a speech signal include (a1) to (a5):
  • (a5) according to the energy threshold and the zero-crossing-rate threshold, the active-speech start point and the active-speech end point are obtained from the active and inactive speech of (a4).
  • The VAD detection method may be the double-threshold method or a speech detection method based on the autocorrelation maximum.
  • The double-threshold detection process includes:
  • (b5) if the mute length is less than the configured maximum mute length, the speech has not yet ended and is still in the speech segment;
  • if the length of the speech is less than the minimum noise length, the speech is considered too short and is treated as noise at this point, and the signal is judged to be in the mute segment; otherwise the speech enters the end segment.
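The sketch below illustrates the double-threshold idea in Python on pre-framed audio. The state machine and all threshold and length parameters are illustrative assumptions, not values specified in this document, and the end-of-stream case is left unhandled:

```python
import numpy as np

def double_threshold_vad(frames, e_high, e_low, z_high, z_low,
                         max_mute=10, min_speech=5):
    """Double-threshold VAD sketch: returns (start, end) frame indices.

    States: 0 = mute, 1 = transition, 2 = speech. A segment ends once
    max_mute quiet frames accumulate; segments shorter than min_speech
    frames are treated as noise and discarded.
    """
    segments, state, start, speech_len, mute_run = [], 0, 0, 0, 0
    for i, f in enumerate(frames):
        f = np.asarray(f, float)
        energy = np.sum(f ** 2)                          # short-time energy
        zcr = np.mean(np.abs(np.diff(np.sign(f)))) / 2   # zero-crossing rate
        if state < 2:                                    # mute or transition
            if energy > e_high or zcr > z_high:
                state, start, speech_len, mute_run = 2, i, 1, 0
            elif energy > e_low or zcr > z_low:
                state = 1
            else:
                state = 0
        else:                                            # inside speech
            if energy > e_low or zcr > z_low:
                speech_len, mute_run = speech_len + 1, 0
            else:
                mute_run += 1
                if mute_run >= max_mute:                 # speech has ended
                    if speech_len >= min_speech:
                        segments.append((start, i - mute_run))
                    state, speech_len, mute_run = 0, 0, 0
    return segments
```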
  • Step 504: Perform simulated cascade codec processing on the sample speech signal to obtain a degraded speech signal.
  • Simulated cascade encoding/decoding reproduces the actual link that the original speech signal passes through.
  • For example, for an IP phone supporting G.729A interoperating with a GSM mobile phone, the simulated cascade codec can be G.729A encoding + G.729 decoding + AMR-NB encoding + AMR-NB decoding.
  • The degraded speech signal is obtained after the sample speech signal undergoes this offline cascade codec processing.
  • Step 506: Obtain the energy attenuation values of the degraded speech signal relative to the sample speech signal at the different frequency points, and use the energy attenuation values as the frequency-point energy compensation values.
  • Subtracting the energy value of the degraded speech signal from the energy value of the sample speech signal at each frequency point gives the energy attenuation value at that frequency point; this attenuation value is the energy compensation value needed later for that frequency point.
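A Python sketch of this per-frequency-point subtraction, assuming the "energy value" at each frequency point is taken as the magnitude spectrum in dB (the document does not fix the exact energy measure):

```python
import numpy as np

def frequency_energy_compensation(sample, degraded, n_fft=512):
    """Per-frequency-point energy attenuation (compensation) values.

    Subtracts the degraded signal's dB magnitude spectrum from the
    sample signal's; averaging over a training set happens elsewhere.
    """
    eps = 1e-12
    s = np.abs(np.fft.rfft(sample, n_fft))
    d = np.abs(np.fft.rfft(degraded, n_fft))
    return 20.0 * (np.log10(s + eps) - np.log10(d + eps))
```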
  • Step 508: Average the frequency-point energy compensation values corresponding to the first characteristic signals in the audio training set to obtain the average energy compensation values of the first characteristic signal at the different frequency points, and average the frequency-point energy compensation values corresponding to the second characteristic signals in the audio training set to obtain the average energy compensation values of the second characteristic signal at the different frequency points.
  • Step 510: Perform filter fitting according to the average energy compensation values of the first characteristic signal at the different frequency points to obtain the first pre-emphasis filter coefficients, and perform filter fitting according to the average energy compensation values of the second characteristic signal at the different frequency points
  • to obtain the second pre-emphasis filter coefficients.
  • Taking the average energy compensation values of the first characteristic signal at the different frequency points as the target, an adaptive filter-fitting method is used to fit them and obtain a set of first pre-emphasis filter coefficients; likewise, taking the average energy compensation values of the second characteristic signal at the different frequency points as the target, an adaptive filter-fitting method is used to fit them and obtain a set of second pre-emphasis filter coefficients.
  • The pre-emphasis filter may be an FIR (Finite Impulse Response) filter: y[n] = a0*x[n] + a1*x[n-1] + ... + am*x[n-m].
  • The FIR filter's pre-emphasis coefficients a0 to am can be computed with MATLAB's fir2 function; b = fir2(n, f, m) designs a multiband FIR filter with an arbitrary response.
  • The filter's amplitude-frequency characteristic is determined by the vector pair f and m, where f is the normalized frequency vector, m is the amplitude at the corresponding frequency points, and n is the order of the filter.
  • The energy compensation value of each frequency point is taken as m and fed into fir2 to compute b.
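For reference, SciPy's `scipy.signal.firwin2` plays the role of MATLAB's `fir2` (with numtaps = n + 1). A sketch under stated assumptions: an 8 kHz sampling rate, a filter order of 64, compensation values in dB converted to linear amplitude, and frequency points lying strictly inside (0, fs/2); none of these values are taken from this document:

```python
import numpy as np
from scipy.signal import firwin2

def fit_pre_emphasis_fir(freqs_hz, comp_db, fs=8000, order=64):
    """Fit FIR coefficients a0..am to an average compensation curve.

    Mirrors MATLAB's b = fir2(n, f, m): f is the normalized frequency
    grid (0..1, Nyquist = 1) and m the desired amplitude per point.
    Assumes freqs_hz lies strictly between 0 and fs/2.
    """
    f = np.asarray(freqs_hz, float) / (fs / 2.0)   # normalize to Nyquist
    m = 10.0 ** (np.asarray(comp_db, float) / 20)  # dB -> linear gain
    # firwin2 requires the grid to start at 0 and end at 1
    f = np.concatenate(([0.0], f, [1.0]))
    m = np.concatenate(([m[0]], m, [m[-1]]))
    return firwin2(order + 1, f, m)                # FIR taps a0..am
```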
  • Offline training thus yields the first and second pre-emphasis filter coefficients; obtaining them accurately offline facilitates the subsequent online filtering that produces the enhanced speech signal and effectively improves the intelligibility of the speech signal after cascade encoding and decoding.
  • Acquiring the pitch period of the speech signal includes:
  • Step 602: Band-pass filter the speech signal.
  • Band-pass filtering of the speech signal may use an 80 Hz-1500 Hz filter or a 60-1000 Hz band-pass filter, and is not limited to these; the band-pass frequency range is set according to specific needs.
  • Step 604: Pre-emphasize the band-pass filtered speech signal.
  • Pre-emphasis refers to boosting the high-frequency components of the input signal at the transmitting end.
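A common first-order realization of such pre-emphasis is y[n] = x[n] - α·x[n-1]. The sketch below uses α = 0.97, a conventional default rather than a value given in this document:

```python
import numpy as np

def pre_emphasize(x, alpha=0.97):
    """First-order pre-emphasis boosting high-frequency components."""
    x = np.asarray(x, float)
    return np.concatenate(([x[0]], x[1:] - alpha * x[:-1]))
```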
  • Step 606: Divide the speech signal into frames by sliding a rectangular window; each frame window is a first number of sampling points long, and the window shifts by a second number of sampling points per frame.
  • The window length of the rectangular window is the first number of sampling points.
  • The first number of sampling points may be 280 points
  • and the second number may be 80 points,
  • though the first and second numbers of sampling points are not limited to these values.
  • 80 points correspond to 10 ms (milliseconds) of data, so with an 80-point shift each frame introduces 10 ms of new data into the computation.
  • Step 608: Apply three-level clipping to each frame of the signal.
  • Three-level clipping sets positive and negative thresholds and outputs 1 if the sample value is greater than the positive threshold, -1 if the sample value is less than the negative threshold, and 0 otherwise.
  • With positive threshold C
  • and negative threshold -C: if a sample value exceeds C, 1 is output; if it is below -C, -1 is output; otherwise 0 is output.
  • Three-level clipping of each frame yields t(i), where i ranges from 1 to 280.
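A direct Python rendering of this clipping rule; how the threshold c is chosen (often a fraction of the frame's peak amplitude) is not specified in this document:

```python
import numpy as np

def three_level_clip(frame, c):
    """Three-level clipping: +1 above +c, -1 below -c, 0 otherwise."""
    frame = np.asarray(frame, float)
    t = np.zeros(len(frame), dtype=int)
    t[frame > c] = 1
    t[frame < -c] = -1
    return t
```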
  • Step 610: Compute the autocorrelation values over the sampling points within each frame.
  • For the sampling points in each frame, the autocorrelation value is computed as the product of two factors divided by the product of their respective square roots.
  • The formula for calculating the autocorrelation value is:

    r(k) = \frac{\sum_{l=1}^{121} t(l)\, t(k+l-1)}{\sqrt{\left(\sum_{l=1}^{121} t(l)^{2}\right)\left(\sum_{l=1}^{121} t(k+l-1)^{2}\right)}}

  • where r(k) is the autocorrelation value
  • and t(k+l-1) is the three-level clipping result at index (k+l-1);
  • k takes values 20 to 160, the conventional pitch-period search range.
  • Converted to fundamental frequency, this search range is 8000/20 to 8000/160, i.e., the 50 Hz to 400 Hz range, the normal fundamental-frequency range of the human voice; values of k outside 20 to 160 can be regarded as outside the normal human fundamental-frequency range and need not be computed, saving computation time. Since the maximum of k is 160 and the maximum of l is 121, the maximum index of t is 160 + 121 - 1 = 280, which is why i in the three-level clipping runs up to 280.
  • Step 612: Take the index corresponding to the largest autocorrelation value in each frame as the pitch period of that frame.
  • By computing the autocorrelation values within each frame, the index corresponding to the largest autocorrelation value is obtained and used as the pitch period of the frame.
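Putting steps 608 to 612 together, a Python sketch of the normalized autocorrelation search over k = 20..160, using the r(k) formula reconstructed above on a 280-sample clipped frame (8 kHz):

```python
import numpy as np

def pitch_period(t, k_min=20, k_max=160, window=121):
    """Pitch period of one three-level-clipped frame t (280 samples).

    Computes r(k) = sum(t[l]*t[k+l-1]) / sqrt(sum(t[l]^2) *
    sum(t[k+l-1]^2)) for l = 1..121 and returns the k with the largest
    r(k); returns 0 when no positive correlation is found.
    """
    t = np.asarray(t, float)
    head = t[:window]                      # t(1)..t(121), 1-based
    e_head = np.sum(head ** 2)
    best_k, best_r = 0, 0.0
    for k in range(k_min, k_max + 1):
        lag = t[k - 1:k - 1 + window]      # t(k)..t(k+120)
        denom = np.sqrt(e_head * np.sum(lag ** 2))
        r = np.dot(head, lag) / denom if denom > 0 else 0.0
        if r > best_r:
            best_k, best_r = k, r
    return best_k
```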
  • steps 602 and 604 may be omitted.
  • FIG. 8 is a schematic diagram of the pitch-period computation result for a segment of speech.
  • In the first plot, the abscissa is the index of the sampling point and the ordinate is the sample value, i.e., the amplitude of the sampling point; the sample values vary, being large at some sampling points and small at others.
  • In the second plot, the abscissa is the frame number
  • and the ordinate is the pitch period value; the pitch period is computed for speech frames,
  • and the pitch period of non-speech frames defaults to 0.
  • the above-described speech signal concatenation processing method will be described below in conjunction with specific embodiments.
  • the first feature signal is a male voice and the second feature signal is a female voice.
  • the voice signal cascade processing method includes an offline training portion and an online processing portion.
  • the offline training section includes:
  • Step (c2): VAD judges whether the sample speech signal is a speech signal; if so, step (c3) is performed; if not, the process returns to (c1).
  • The sample speech signal is passed through the several codec stages that the actual link would traverse; for example, for an IP phone supporting G.729A interoperating with a GSM mobile phone, the simulated cascade codec can be G.729A encoding + G.729 decoding + AMR-NB encoding + AMR-NB decoding.
  • The degraded speech signal is obtained after the sample speech signal undergoes this offline cascade codec processing.
  • Subtracting the energy value of the degraded speech signal from that of the sample speech signal at each frequency point gives the energy attenuation value at that frequency point; this attenuation value is the energy compensation value needed later.
  • The frequency-point energy compensation values corresponding to the male voices in the male/female voice training set are averaged to obtain the male voice's average energy compensation values at the different frequency points, and those corresponding to the female voices are averaged to obtain the female voice's average energy compensation values at the different frequency points.
  • An adaptive filter-fitting method is used to fit the male voice's average energy compensation values to obtain a set of male-voice pre-emphasis filter coefficients.
  • An adaptive filter-fitting method is used to fit the female voice's average energy compensation values to obtain a set of female-voice pre-emphasis filter coefficients.
  • the online processing section includes:
  • Step (d2): VAD detects whether the input is a speech signal; if so, step (d3) is performed; if not, step (d6) is performed.
  • Step (d3): It is judged whether the speech signal is a male voice or a female voice; if male, step (d4) is performed; if female, step (d5) is performed.
  • The above speech-intelligibility improvement method applies high-pass filtering to non-speech to reduce signal noise; it recognizes the speech signal as a male or a female voice signal, applies the offline-trained male-voice pre-emphasis filter coefficients to male signals, and applies the offline-trained female-voice pre-emphasis filter coefficients to female signals.
  • FIG. 10 is a schematic diagram of the cascaded codec signal after pre-enhancement.
  • The first plot is the original signal,
  • the second plot is the cascaded codec signal,
  • and the third plot is the cascaded codec signal after pre-emphasis filtering. The pre-enhanced cascaded codec signal has more energy than the plain cascaded codec signal, sounds clearer and more understandable, and improves speech intelligibility.
  • FIG. 11 compares the signal spectrum of the unenhanced cascaded codec signal with that of the enhanced cascaded codec signal.
  • The curve is the spectrum of the unenhanced cascaded codec signal and the dots are the spectrum of the enhanced cascaded codec signal; the abscissa is frequency and the ordinate is absolute energy.
  • After enhancement, the spectral intensity of the signal increases and intelligibility improves.
  • FIG. 12 compares the mid-to-high-frequency portions of the spectra of the unenhanced and the enhanced cascaded codec signals.
  • The curve is the spectrum of the unenhanced cascaded codec signal and the dots are the spectrum of the enhanced one; the abscissa is frequency and the ordinate is absolute energy. After enhancement the spectral intensity increases; with pre-emphasis applied to the mid-to-high frequencies, the signal energy there is stronger and intelligibility improves.
  • FIG. 13 is a block diagram showing the structure of a speech signal cascade processing apparatus in an embodiment.
  • A speech signal cascade processing apparatus includes a speech signal acquisition module 1302, an identification module 1304, a first signal enhancement module 1306, a second signal enhancement module 1308, and an output module 1310, wherein:
  • the voice signal acquisition module 1302 is configured to acquire a voice signal.
  • the identification module 1304 is configured to perform feature recognition on the voice signal.
  • the first signal enhancement module 1306 is configured to, if the speech signal is the first characteristic signal, perform pre-emphasis filtering on the first characteristic signal with the first pre-emphasis filter coefficients to obtain the first pre-enhanced speech signal;
  • the second signal enhancement module 1308 is configured to, if the speech signal is the second characteristic signal, perform pre-emphasis filtering on the second characteristic signal with the second pre-emphasis filter coefficients to obtain the second pre-enhanced speech signal;
  • the output module 1310 is configured to output the first pre-enhanced speech signal or the second pre-enhanced speech signal,
  • so that cascade encoding and decoding is performed according to the first or second pre-enhanced speech signal.
  • The above speech signal cascade processing apparatus performs feature recognition on the speech signal, applies pre-emphasis filtering with the first pre-emphasis filter coefficients to the first characteristic signal and with the second pre-emphasis filter coefficients to the second characteristic signal,
  • and passes the pre-enhanced speech through cascade encoding and decoding, so the receiving party can hear the speech information more clearly; this improves the intelligibility of the speech signal after cascade encoding and decoding.
  • Because the first and the second characteristic signal are each filtered with their own coefficients, the enhancement filtering is more targeted and more accurate.
  • FIG. 14 is a block diagram showing the structure of a speech signal cascade processing apparatus in another embodiment.
  • A speech signal cascade processing apparatus includes, in addition to a speech signal acquisition module 1302, an identification module 1304, a first signal enhancement module 1306, a second signal enhancement module 1308, and an output module 1310, a training module 1312.
  • The training module 1312 is configured to perform offline training on the training samples in the audio training set, before the speech signal is acquired, to obtain the first pre-emphasis filter coefficients and the second pre-emphasis filter coefficients.
  • FIG. 15 is a schematic diagram of the internal structure of a training module in one embodiment.
  • The training module 1312 includes a selection unit 1502, a simulated cascade codec unit 1504, an energy compensation value acquisition unit 1506, an average energy compensation value acquisition unit 1508, and a filter coefficient acquisition unit 1510.
  • The selection unit 1502 is configured to acquire a sample speech signal from the audio training set, the sample speech signal being a first characteristic sample speech signal or a second characteristic sample speech signal.
  • The simulated cascade codec unit 1504 is configured to perform simulated cascade codec processing on the sample speech signal to obtain a degraded speech signal.
  • The energy compensation value acquisition unit 1506 is configured to obtain the energy attenuation values of the degraded speech signal relative to the sample speech signal at the different frequency points and use them as the frequency-point energy compensation values.
  • The average energy compensation value acquisition unit 1508 is configured to average the frequency-point energy compensation values corresponding to the first characteristic signals in the audio training set to obtain the average energy compensation values of the first characteristic signal at the different frequency points, and to average those corresponding to the second characteristic signals to obtain the average energy compensation values of the second characteristic signal at the different frequency points.
  • The filter coefficient acquisition unit 1510 is configured to perform filter fitting according to the average energy compensation values of the first characteristic signal at the different frequency points to obtain the first pre-emphasis filter coefficients, and according to the average energy compensation values of the second characteristic signal at the different frequency points
  • to obtain the second pre-emphasis filter coefficients.
  • Offline training thus yields the first and second pre-emphasis filter coefficients accurately, which facilitates the subsequent online filtering that produces the enhanced speech signal and effectively improves the intelligibility of the speech signal after cascade encoding and decoding.
  • The identification module 1304 is further configured to acquire the pitch period of the speech signal and to determine whether the pitch period is greater than the preset period value; if so, the speech signal is the first characteristic signal; if not, it is the second characteristic signal.
  • The identification module 1304 is further configured to: divide the speech signal into frames with a sliding rectangular window, each frame window being a first number of sampling points long and shifting by a second number of sampling points per frame; apply three-level clipping to each frame; compute autocorrelation values over the sampling points within each frame; and take the index of the largest autocorrelation value in each frame as that frame's pitch period.
  • The identification module 1304 is further configured to band-pass filter the speech signal and to pre-emphasize the band-pass filtered speech signal before the speech signal is divided into frames with the sliding rectangular window.
  • FIG. 16 is a block diagram showing the structure of a speech signal cascade processing apparatus in another embodiment.
  • A speech signal cascade processing apparatus includes, in addition to a speech signal acquisition module 1302, an identification module 1304, a first signal enhancement module 1306, a second signal enhancement module 1308, and an output module 1310, an original signal acquisition module 1314, a detection module 1316, and a filtering module 1318.
  • the original signal acquisition module 1314 is configured to acquire the input original audio signal.
  • the detecting module 1316 is configured to detect that the original audio signal is a voice signal or a non-speech signal.
  • the voice signal acquisition module 1302 is further configured to acquire a voice signal if the original audio signal is a voice signal.
  • The filtering module 1318 is configured to apply high-pass filtering to the non-speech signal if the original audio signal is a non-speech signal.
  • The above speech signal cascade processing apparatus applies high-pass filtering to non-speech, reducing signal noise; it performs feature recognition on the speech signal, applies pre-emphasis filtering with the first pre-emphasis filter coefficients to the first characteristic signal
  • and with the second pre-emphasis filter coefficients to the second characteristic signal, and passes the pre-enhanced speech through cascade encoding and decoding, so the receiving party can hear the speech information more clearly, improving the intelligibility of the speech signal after cascade encoding and decoding.
  • Using the corresponding filter coefficients for each characteristic signal makes the enhancement filtering more targeted and more accurate.
  • A speech signal cascade processing apparatus may include any possible combination of the speech signal acquisition module 1302, the identification module 1304, the first signal enhancement module 1306, the second signal enhancement module 1308, the output module 1310, the training module 1312, the original signal acquisition module 1314, the detection module 1316, and the filtering module 1318.
  • the storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), or the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Telephonic Communication Services (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Telephone Function (AREA)

Abstract

A speech signal cascade processing method includes: acquiring a speech signal (402); performing feature recognition on the speech signal (404); if the speech signal is a first characteristic signal, performing pre-emphasis filtering on the first characteristic signal with first pre-emphasis filter coefficients to obtain a first pre-enhanced speech signal (406); if the speech signal is a second characteristic signal, performing pre-emphasis filtering on the second characteristic signal with second pre-emphasis filter coefficients to obtain a second pre-enhanced speech signal (408); and outputting the first pre-enhanced speech signal or the second pre-enhanced speech signal, so that cascade encoding and decoding is performed according to the first or second pre-enhanced speech signal (410).

Description

Speech signal cascade processing method, terminal, and computer-readable storage medium
This application claims priority to Chinese Patent Application No. 201610235392.9, entitled "Speech signal cascade processing method and apparatus" and filed with the Chinese Patent Office on April 15, 2016, the entire contents of which are incorporated herein by reference.
Technical Field
The present invention relates to the field of audio data processing, and in particular to a speech signal cascade processing method, a terminal, and a non-volatile computer-readable storage medium.
Background
With the spread of VOIP (Voice over Internet Protocol) services, applications that bridge different networks are increasingly common, such as Internet IP phones interoperating with fixed-line phones on the PSTN (Public Switched Telephone Network), and IP phones interoperating with mobile phones on wireless networks. Different networks use different speech codecs: the wireless GSM (Global System for Mobile Communication) network uses AMR-NB coding, fixed-line phones use G.711 coding, and IP phones use G.729 and other codecs. Because the speech coding formats supported by the network terminals are inconsistent, the call link inevitably involves multiple encoding and decoding passes, the purpose being to let terminals on different networks interoperate by voice after cascaded encoding and decoding. However, the vast majority of speech codecs in use today are lossy, so every encoding/decoding pass degrades speech quality, and the more cascaded passes there are, the more severe the degradation; as a result, the two parties cannot hear each other's speech clearly, i.e., speech intelligibility drops.
Summary
According to various embodiments of the present application, a speech signal cascade processing method, a terminal, and a non-volatile computer-readable storage medium are provided.
A speech signal cascade processing method includes:
acquiring a speech signal;
performing feature recognition on the speech signal;
if the speech signal is a first characteristic signal, performing pre-emphasis filtering on the first characteristic signal with first pre-emphasis filter coefficients to obtain a first pre-enhanced speech signal; if the speech signal is a second characteristic signal, performing pre-emphasis filtering on the second characteristic signal with second pre-emphasis filter coefficients to obtain a second pre-enhanced speech signal; and
outputting the first pre-enhanced speech signal or the second pre-enhanced speech signal, so that cascade encoding and decoding is performed according to the first pre-enhanced speech signal or the second pre-enhanced speech signal.
A terminal includes a memory and a processor, the memory storing computer-readable instructions which, when executed by the processor, cause the processor to perform the following steps:
acquiring a speech signal;
performing feature recognition on the speech signal;
if the speech signal is a first characteristic signal, performing pre-emphasis filtering on the first characteristic signal with first pre-emphasis filter coefficients to obtain a first pre-enhanced speech signal;
if the speech signal is a second characteristic signal, performing pre-emphasis filtering on the second characteristic signal with second pre-emphasis filter coefficients to obtain a second pre-enhanced speech signal; and
outputting the first pre-enhanced speech signal or the second pre-enhanced speech signal, so that cascade encoding and decoding is performed according to the first pre-enhanced speech signal or the second pre-enhanced speech signal.
One or more non-volatile computer-readable storage media containing computer-executable instructions which, when executed by one or more processors, cause the processors to perform the following steps:
acquiring a speech signal;
performing feature recognition on the speech signal;
if the speech signal is a first characteristic signal, performing pre-emphasis filtering on the first characteristic signal with first pre-emphasis filter coefficients to obtain a first pre-enhanced speech signal;
if the speech signal is a second characteristic signal, performing pre-emphasis filtering on the second characteristic signal with second pre-emphasis filter coefficients to obtain a second pre-enhanced speech signal; and
outputting the first pre-enhanced speech signal or the second pre-enhanced speech signal, so that cascade encoding and decoding is performed according to the first pre-enhanced speech signal or the second pre-enhanced speech signal.
The details of one or more embodiments of the invention are set forth in the accompanying drawings and the description below. Other features, objects, and advantages of the invention will become apparent from the specification, the drawings, and the claims.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the following briefly introduces the accompanying drawings required for describing the embodiments or the prior art. Apparently, the accompanying drawings in the following description show only some embodiments of the present invention, and a person of ordinary skill in the art may still derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of the application environment of a speech signal cascade processing method in an embodiment;
FIG. 2 is a schematic diagram of the internal structure of a terminal in an embodiment;
FIG. 3A is a schematic diagram of the frequency energy impairment of the first characteristic signal after cascade encoding and decoding in an embodiment;
FIG. 3B is a schematic diagram of the frequency energy impairment of the second characteristic signal after cascade encoding and decoding in an embodiment;
FIG. 4 is a flowchart of a speech signal cascade processing method in an embodiment;
FIG. 5 is a flowchart of obtaining the first pre-emphasis filter coefficients and the second pre-emphasis filter coefficients by offline training on the training samples in an audio training set;
FIG. 6 illustrates acquiring the pitch period of the speech signal in an embodiment;
FIG. 7 is a schematic diagram of the principle of three-level clipping;
FIG. 8 is a schematic diagram of the pitch-period computation result for a segment of speech;
FIG. 9 is a schematic diagram of enhancing the speech input signal of an online call with the pre-emphasis filter coefficients obtained by offline training, in an embodiment;
FIG. 10 is a schematic diagram of the cascaded codec signal after pre-enhancement;
FIG. 11 is a schematic comparison of the signal spectrum of the unenhanced cascaded codec signal and that of the enhanced cascaded codec signal;
FIG. 12 is a schematic comparison of the mid-to-high-frequency portions of the signal spectra of the unenhanced and the enhanced cascaded codec signals;
FIG. 13 is a structural block diagram of a speech signal cascade processing apparatus in an embodiment;
FIG. 14 is a structural block diagram of a speech signal cascade processing apparatus in another embodiment;
FIG. 15 is a schematic diagram of the internal structure of a training module in an embodiment;
FIG. 16 is a structural block diagram of a speech signal cascade processing apparatus in another embodiment.
Detailed Description
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are intended only to explain the present invention and not to limit it.
It will be understood that the terms "first", "second", and the like used in the present invention may be used herein to describe various elements, but these elements are not limited by these terms; the terms are used only to distinguish one element from another. For example, without departing from the scope of the present invention, a first client may be referred to as a second client, and similarly a second client may be referred to as a first client. The first client and the second client are both clients, but they are not the same client.
FIG. 1 is a schematic diagram of the application environment of a speech signal cascade processing method in an embodiment. As shown in FIG. 1, the application environment includes a first terminal 110, a first network 120, a second network 130, and a second terminal 140. The first terminal 110 receives a speech signal, which, after codec processing by the first network 120 and the second network 130, is received by the second terminal 140. The first terminal 110 performs feature recognition on the speech signal: if the speech signal is a first characteristic signal, it performs pre-emphasis filtering on the first characteristic signal with first pre-emphasis filter coefficients to obtain a first pre-enhanced speech signal; if the speech signal is a second characteristic signal, it performs pre-emphasis filtering on the second characteristic signal with second pre-emphasis filter coefficients to obtain a second pre-enhanced speech signal. It outputs the first pre-enhanced speech signal or the second pre-enhanced speech signal, which undergoes cascade encoding and decoding through the first network 120 and the second network 130 to yield a pre-enhanced cascaded codec signal; the second terminal 140 receives the pre-enhanced cascaded codec signal, and the received signal has high intelligibility. The first terminal 110 also receives speech signals sent by the second terminal 140 through the second network 130 and the first network 120, and likewise applies pre-emphasis filtering to the received speech signals.
FIG. 2 is a schematic diagram of the internal structure of a terminal in an embodiment. As shown in FIG. 2, the terminal includes a processor, a storage medium, a memory, a network interface, a sound collecting device, and a speaker connected through a system bus. The storage medium of the terminal stores an operating system and computer-readable instructions which, when executed, cause the processor to perform the steps of a speech signal cascade processing method. The processor provides computing and control capability and supports the operation of the entire terminal; it is used to perform a speech signal cascade processing method that includes: acquiring a speech signal; performing feature recognition on the speech signal; if the speech signal is a first characteristic signal, performing pre-emphasis filtering on the first characteristic signal with first pre-emphasis filter coefficients to obtain a first pre-enhanced speech signal; if the speech signal is a second characteristic signal, performing pre-emphasis filtering on the second characteristic signal with second pre-emphasis filter coefficients to obtain a second pre-enhanced speech signal; and outputting the first or second pre-enhanced speech signal, so that cascade encoding and decoding is performed according to it. The terminal can be a telephone, a mobile phone, a tablet computer, or a personal digital assistant capable of making network calls. Those skilled in the art will understand that the structure shown in FIG. 2 is only a block diagram of the part of the structure related to the solution of the present application and does not limit the terminals to which the solution applies; a specific terminal may include more or fewer components than shown in the figure, or combine certain components, or have a different arrangement of components.
After cascade encoding and decoding, the mid-to-high frequencies of a speech signal are clearly impaired, and the intelligibility of the first and the second characteristic signal is affected to different degrees, because the key component of speech intelligibility is the mid-to-high-frequency energy of the speech signal. The first characteristic signal has a low fundamental frequency (generally within 125 Hz (Hertz)); its main energy is concentrated in the low and mid frequencies (below 1000 Hz), and its mid-to-high-frequency content (above 1000 Hz) is small. The second characteristic signal has a higher fundamental frequency (generally above 125 Hz) and more mid-to-high-frequency content than the first. As shown in FIG. 3A and FIG. 3B, after cascade encoding and decoding the frequency energy of both the first and the second characteristic signal is impaired; because the proportion of high frequencies in the first characteristic signal is low, its mid-to-high-frequency energy becomes even lower after cascaded encoding and decoding, so its intelligibility is greatly affected and the listener perceives the sound as blurry and hard to make out, while the second characteristic signal, although it also loses mid-to-high-frequency energy, retains enough of it after cascade encoding to achieve reasonably good intelligibility. In terms of speech codec principles, take speech synthesized by CELP (Code Excited Linear Prediction), a codec model whose criterion is minimum perceptual distortion: because the spectral energy distribution of the first characteristic signal is very uneven, with most energy in the low and mid frequencies, the encoding process mainly keeps the low- and mid-frequency distortion smallest, while the mid-to-high frequencies, which carry a small share of the energy, suffer relatively large distortion; by contrast, the spectral energy distribution of the second characteristic signal is more balanced, with substantial mid-to-high-frequency content, so the energy loss of its mid-to-high-frequency components after encoding and decoding is relatively low. That is, the intelligibility degradation of the first and the second characteristic signal after cascade encoding and decoding differs markedly. In FIG. 3A the solid curve is the original first characteristic signal and the dashed curve is the signal after cascade encoding and decoding; in FIG. 3B the solid curve is the original second characteristic signal and the dashed curve is the signal after cascade encoding and decoding. In FIG. 3A and FIG. 3B the abscissa is frequency and the ordinate is a normalized energy value; normalization is relative to the maximum peak of the first or second characteristic signal. The first characteristic signal may be a male voice signal, and the second characteristic signal may be a female voice signal.
FIG. 4 is a flowchart of a speech signal cascade processing method in an embodiment. As shown in FIG. 4, a speech signal cascade processing method, running on the terminal of FIG. 1, includes:
Step 402: acquire a speech signal.
In this embodiment, the speech signal is the speech identified in the input original speech signal. The terminal acquires the original speech signal after the cascade codec processing and recognizes the speech signal within it. The cascaded codec chain depends on the actual link the original speech signal traverses; for example, when an IP phone supporting G.729A interoperates with a GSM mobile phone, the cascaded codec chain can be G.729A encoding + G.729 decoding + AMR-NB encoding + AMR-NB decoding.
Speech intelligibility refers to the degree to which the listener hears clearly and understands the speaker's verbal content.
Step 404: perform feature recognition on the speech signal.
In this embodiment, performing feature recognition on the speech signal includes: acquiring the pitch period of the speech signal, and determining whether the pitch period of the speech signal is greater than a preset period value; if so, the speech signal is the first characteristic signal; if not, it is the second characteristic signal.
Specifically, the frequency of vocal-cord vibration is called the fundamental frequency, and the corresponding period is called the pitch period. The preset period value can be set as needed, for example 60 samples: if the pitch period of the speech signal is greater than 60 samples, the speech signal is the first characteristic signal; if it is less than or equal to 60 samples, it is the second characteristic signal.
Step 406: if the speech signal is the first characteristic signal, perform pre-emphasis filtering on the first characteristic signal with the first pre-emphasis filter coefficients to obtain the first pre-enhanced speech signal.
Step 408: if the speech signal is the second characteristic signal, perform pre-emphasis filtering on the second characteristic signal with the second pre-emphasis filter coefficients to obtain the second pre-enhanced speech signal.
The first characteristic signal and the second characteristic signal may be speech signals in different frequency ranges.
Step 410: output the first pre-enhanced speech signal or the second pre-enhanced speech signal, so that cascade encoding and decoding is performed according to the first or second pre-enhanced speech signal.
In the above speech signal cascade processing method, feature recognition is performed on the speech signal, pre-emphasis filtering with the first pre-emphasis filter coefficients is applied to the first characteristic signal and pre-emphasis filtering with the second pre-emphasis filter coefficients is applied to the second characteristic signal, and the pre-enhanced speech is passed through cascade encoding and decoding, so the receiving party can hear the speech information more clearly, improving the intelligibility of the speech signal after cascade encoding and decoding; since the first and the second characteristic signal are each enhanced with their own filter coefficients, the filtering is more targeted and more accurate.
In an embodiment, before acquiring the speech signal, the above speech signal cascade processing method further includes: acquiring an input original audio signal; detecting whether the original audio signal is a speech signal or a non-speech signal; if the original audio signal is a speech signal, acquiring the speech signal; and if the original audio signal is a non-speech signal, applying high-pass filtering to the non-speech signal.
In this embodiment, VAD is used to judge whether the signal is a speech signal or a non-speech signal.
High-pass filtering is applied to non-speech to reduce signal noise.
In an embodiment, before acquiring the speech signal, the speech signal cascade processing method further includes: performing offline training on the training samples in an audio training set to obtain the first pre-emphasis filter coefficients and the second pre-emphasis filter coefficients.
In this embodiment, the training samples in the audio training set may be speech signals that are recorded or selected from the network.
As shown in FIG. 5, in an embodiment, the step of obtaining the first and second pre-emphasis filter coefficients by offline training on the training samples in the audio training set includes:
Step 502: acquire a sample speech signal from the audio training set, the sample speech signal being a first characteristic sample speech signal or a second characteristic sample speech signal.
In this embodiment, an audio training set is established in advance; it contains multiple first characteristic sample speech signals and second characteristic sample speech signals, which exist independently in the set. The first characteristic sample speech signals and the second characteristic sample speech signals are sample speech signals of different characteristic signals.
After step 502, the method further includes: judging whether the sample speech signal is a speech signal; if so, performing simulated cascade codec processing on the sample speech signal to obtain a degraded speech signal; if not, re-acquiring a sample speech signal from the audio training set.
In this embodiment, VAD (Voice Activity Detection) is used to judge whether the sample speech signal is a speech signal. VAD is a speech detection algorithm that estimates speech based on energy, zero-crossing rate, noise-floor estimation, and the like.
The steps of judging whether the sample speech signal is a speech signal include (a1) to (a5):
(a1) receive continuous speech and obtain frames from it;
(a2) compute the energy of the frames and derive an energy threshold from these energies;
(a3) compute the zero-crossing rate of each acquired frame and derive a zero-crossing-rate threshold from these rates;
(a4) use linear regression, taking the energies obtained in (a2) and the zero-crossing rates obtained in (a3) as input parameters, to judge whether each frame is active speech or inactive speech;
(a5) according to the energy threshold and the zero-crossing-rate threshold, obtain the active-speech start point and the active-speech end point from the active and inactive speech of (a4).
The VAD detection method may be the double-threshold method or a speech detection method based on the autocorrelation maximum.
The double-threshold detection process includes:
(b1) at the start, apply pre-emphasis and framing, dividing the speech signal into frames;
(b2) set initialization parameters, including the maximum mute length, the short-time energy thresholds, and the short-time zero-crossing-rate thresholds;
(b3) when the speech is in the mute or transition segment: if the short-time energy of the speech signal is greater than the high energy threshold, or its short-time zero-crossing rate is greater than the high zero-crossing-rate threshold, the signal is confirmed to enter the speech segment; if the short-time energy is greater than the low energy threshold or the zero-crossing rate is greater than the low zero-crossing-rate threshold, the speech is in the transition segment; otherwise the speech remains in the mute segment;
(b4) when the speech signal is in the speech segment: if the short-time energy or the short-time zero-crossing rate is greater than its low threshold, the speech signal remains in the speech segment;
(b5) if the mute length is less than the configured maximum mute length, the speech has not yet ended and is still in the speech segment; if the length of the speech is less than the minimum noise length, the speech is considered too short and is treated as noise, and the signal is judged to be in the mute segment; otherwise the speech enters the end segment.
Step 504: perform simulated cascade codec processing on the sample speech signal to obtain a degraded speech signal.
Simulated cascade encoding/decoding reproduces the actual link that the original speech signal passes through; for example, for an IP phone supporting G.729A interoperating with a GSM mobile phone, the simulated cascade codec can be G.729A encoding + G.729 decoding + AMR-NB encoding + AMR-NB decoding. The degraded speech signal is obtained after the sample speech signal undergoes this offline cascade codec processing.
Step 506: obtain the energy attenuation values of the degraded speech signal relative to the sample speech signal at the different frequency points, and use the energy attenuation values as the frequency-point energy compensation values.
Specifically, subtracting the energy value of the degraded speech signal from the energy value of the sample speech signal at each frequency point gives the energy attenuation value at that frequency point; this attenuation value is the energy compensation value needed later for that frequency point.
Step 508: average the frequency-point energy compensation values corresponding to the first characteristic signals in the audio training set to obtain the average energy compensation values of the first characteristic signal at the different frequency points, and average the frequency-point energy compensation values corresponding to the second characteristic signals in the audio training set to obtain the average energy compensation values of the second characteristic signal at the different frequency points.
Specifically, all the energy compensation values of the first characteristic signals in the audio training set are averaged to obtain the first characteristic signal's average energy compensation values at the different frequency points, and all the energy compensation values of the second characteristic signals are averaged to obtain the second characteristic signal's average energy compensation values at the different frequency points.
Step 510: perform filter fitting according to the average energy compensation values of the first characteristic signal at the different frequency points to obtain the first pre-emphasis filter coefficients, and perform filter fitting according to the average energy compensation values of the second characteristic signal at the different frequency points to obtain the second pre-emphasis filter coefficients.
In this embodiment, taking the average energy compensation values of the first characteristic signal at the different frequency points as the target, an adaptive filter-fitting method is used to fit them and obtain a set of first pre-emphasis filter coefficients; taking the average energy compensation values of the second characteristic signal at the different frequency points as the target, an adaptive filter-fitting method is used to fit them and obtain a set of second pre-emphasis filter coefficients.
The pre-emphasis filter may be an FIR (Finite Impulse Response) filter: y[n] = a0*x[n] + a1*x[n-1] + ... + am*x[n-m].
The FIR filter's pre-emphasis coefficients a0 to am can be computed with MATLAB's fir2 function. The function b = fir2(n, f, m) designs a multiband FIR filter with an arbitrary response; its amplitude-frequency characteristic is determined by the vector pair f and m, where f is the normalized frequency vector, m is the amplitude at the corresponding frequency points, and n is the order of the filter. In this embodiment, the energy compensation value of each frequency point is taken as m and fed into fir2 to compute b.
Offline training thus yields the first and second pre-emphasis filter coefficients; obtaining them accurately offline facilitates the subsequent online filtering that produces the enhanced speech signal and effectively improves the intelligibility of the speech signal after cascade encoding and decoding.
如图6所示,在一个实施例中,该获取该语音信号的基音周期包括:
步骤602,对该语音信号进行带通滤波。
本实施例中,对语音信号进行带通滤波可采用80Hz~1500Hz的滤波器进行滤波,也可采用60~1000Hz的带通滤波器进行滤波等,不限于此。也就是带通滤波的频率范围根据具体需求设置。
步骤604,将该带通滤波后的语音信号进行预加重处理。
本实施例中,预加重是指发送端对输入信号高频分量的提升。
步骤606,对该语音信号以矩形窗进行平移分帧,每帧窗长第一采样点数,每帧平移第二采样点数。
本实施例中,矩形窗的窗长为第一采样点数,第一采样点数可为280点,第二采样点可为80点,第一采样点数和第二采样点数不限于此。80点对应的是10ms(毫秒)数据,采用80点平移,则是每帧都会引入10ms的新数据进行计算。
步骤608,对每帧信号进行三电平削波处理。
本实施例中,三电平削波处理,如设定正负阈值,如果样点值大于正阈值则输出1,如果样点值小于负阈值则输出-1,其余情况输出为0。
如图7所示,正阈值为C,负阈值为-C,若样点值超过正阈值C,则输出1,样点值小于负阈值-C,则输出-1,其余输出为0。
对每帧信号进行三电平削波处理得到t(i),其中,i取值范围为1~280。
步骤610,对每帧内采样点计算自相关值。
本实施例中,每帧内采样点计算自相关值为两个因子的积除以各自的开方根的乘积。计算自相关值的公式为:
Figure PCTCN2017076653-appb-000001
其中,r(k)为自相关值,t(k+l-1)为对应的(k+l-1)的三电平削波处理的结果,k取值为20至160是常规的基音周期搜索范围,若对换为基频则为8000/20~8000/160,即50Hz~400Hz范围,即人声正常基频范围,k超出20~160可认为非人类正常声音基频范围,可不用计算,节省计算时间。
因k最大值为160,l的最大值为121,则t的最大范围为160+121-1=280,故三电平削波中i的最大值为280。
Step 612: Use the index corresponding to the maximum autocorrelation value within each frame as the pitch period of that frame.

In this embodiment, by computing the autocorrelation values within each frame, the index corresponding to the frame's maximum autocorrelation value is obtained and used as the frame's pitch period.
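Putting steps 606 to 612 together for one 280-sample frame, a sketch of the search over k = 20..160 using the normalized autocorrelation above (zero-based indexing shifts t(l) to t[l-1]):

```python
import numpy as np

def pitch_period_of_frame(t):
    """Pitch period of one frame from its three-level-clipped samples t.

    t has 280 entries (the window length above); the returned k maximizes
    the normalized autocorrelation over the search range 20..160.
    """
    t = np.asarray(t, dtype=np.float64)
    a = t[:121]                          # t(l), l = 1..121
    a_norm = np.sqrt(np.sum(a * a))
    best_k, best_r = 0, -np.inf
    for k in range(20, 161):
        b = t[k - 1:k + 120]             # t(k+l-1), l = 1..121
        denom = a_norm * np.sqrt(np.sum(b * b))
        r = np.sum(a * b) / denom if denom > 0.0 else 0.0
        if r > best_r:
            best_k, best_r = k, r
    return best_k
```

Frames whose maximum normalized autocorrelation stays low can be treated as unvoiced and assigned the default pitch period 0, consistent with the non-speech frames in FIG. 8.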
In other embodiments, step 602 and step 604 may be omitted.

FIG. 8 is a schematic diagram of the pitch periods computed for a speech segment. As shown in FIG. 8, in the first plot the horizontal axis is the sample index and the vertical axis is the sample value, i.e., the amplitude of the sample; the sample values vary, some larger and some smaller. In the second plot the horizontal axis is the frame index and the vertical axis is the pitch period; the pitch period is computed for speech frames, while the pitch period of non-speech frames defaults to 0.
The speech signal cascade processing method is described below with reference to a specific embodiment. As shown in FIG. 9, taking the first characteristic signal to be a male voice and the second characteristic signal to be a female voice, the speech signal cascade processing method includes an offline training part and an online processing part. The offline training part includes:

(c1) acquiring a sample speech signal from the male/female voice training set;
(c2)vad判决样本语音信号是否为语音信号,若是,则执行步骤(c3),若否,则返回(c2)。
(c3) if it is a speech signal, performing simulated cascaded encoding and decoding on the sample speech signal to obtain a degraded speech signal;

The sample speech signal is passed through the multiple encoding and decoding stages that the actual link would impose. For example, when an IP phone supporting G.729A interworks with a GSM handset, the simulated cascade may be G.729A encoding + G.729 decoding + AMR-NB encoding + AMR-NB decoding. Applying this offline cascaded encoding and decoding to the sample speech signal yields the degraded speech signal.

(c4) computing the energy attenuation value at each frequency point, which serves as the energy compensation value;

Specifically, at each frequency point, subtracting the energy value of the degraded speech signal from the energy value of the sample speech signal gives the energy attenuation value at that frequency point, which is the compensation value subsequently needed there.

(c5) computing the average frequency-point energy compensation values for male voices and for female voices separately;

The frequency-point compensation values corresponding to the male voices in the male/female training set are averaged to obtain the male voice's average compensation values at the different frequency points, and the frequency-point compensation values corresponding to the female voices are averaged to obtain the female voice's average compensation values at the different frequency points.

(c6) computing the male pre-enhancement filter coefficients and the female pre-enhancement filter coefficients.

Taking the male voice's average energy compensation values at the different frequency points as the target, an adaptive filter-fitting approach is applied to obtain one set of male pre-enhancement filter coefficients; taking the female voice's average energy compensation values as the target, the same approach is applied to obtain one set of female pre-enhancement filter coefficients.
The online processing part includes:

(d1) speech signal input;
(d2) VAD detects whether the input is a speech signal; if so, step (d3) is performed; if not, step (d6) is performed;
(d3) deciding whether the speech signal is a male voice or a female voice; if male, performing step (d4); if female, performing step (d5);

(d4) invoking the offline-trained male pre-enhancement filter coefficients to perform pre-enhancement filtering on the male speech signal, obtaining an enhanced speech signal;

(d5) invoking the offline-trained female pre-enhancement filter coefficients to perform pre-enhancement filtering on the female speech signal, obtaining an enhanced speech signal;

(d6) performing high-pass filtering on the non-speech signal, obtaining the enhanced output.

In the foregoing speech intelligibility improvement method, high-pass filtering the non-speech signal reduces its noise. By recognizing whether the speech signal is a male voice signal or a female voice signal, male voice signals are pre-enhancement filtered with the offline-trained male coefficients and female voice signals with the offline-trained female coefficients. Enhancement filtering male and female signals each with their own corresponding coefficients improves the intelligibility of the speech signal, and because male and female voices are processed separately, the filtering is more targeted and more accurate.
FIG. 10 is a schematic diagram of a cascaded codec signal and of the same signal after pre-enhancement. As shown in FIG. 10, the first plot is the original signal, the second is the signal after cascaded encoding and decoding, and the third is the cascaded codec signal after pre-enhancement filtering. The pre-enhanced cascaded codec signal carries more energy than the plain cascaded codec signal and sounds clearer and easier to understand, improving speech intelligibility.

FIG. 11 compares the spectrum of the cascaded codec signal without enhancement with the spectrum of the enhanced cascaded codec signal. As shown in FIG. 11, the curve is the spectrum without enhancement and the dots are the enhanced spectrum; the horizontal axis is frequency and the vertical axis is absolute energy. After enhancement the spectrum is stronger and intelligibility is improved.

FIG. 12 compares the mid-to-high-frequency portions of the unenhanced and enhanced cascaded codec spectra. The curve is the spectrum without enhancement and the dots are the enhanced spectrum; the horizontal axis is frequency and the vertical axis is absolute energy. After pre-enhancement the mid-to-high-frequency energy of the signal is stronger, improving intelligibility.
FIG. 13 is a structural block diagram of a speech signal cascade processing apparatus in one embodiment. As shown in FIG. 13, a speech signal cascade processing apparatus includes a speech signal acquiring module 1302, a recognition module 1304, a first signal enhancement module 1306, a second signal enhancement module 1308, and an output module 1310, where:

The speech signal acquiring module 1302 is configured to acquire a speech signal.

The recognition module 1304 is configured to perform characteristic recognition on the speech signal.

The first signal enhancement module 1306 is configured to: if the speech signal is a first characteristic signal, perform pre-enhancement filtering on the first characteristic signal using the first pre-enhancement filter coefficients to obtain a first pre-enhanced speech signal.

The second signal enhancement module 1308 is configured to: if the speech signal is a second characteristic signal, perform pre-enhancement filtering on the second characteristic signal using the second pre-enhancement filter coefficients to obtain a second pre-enhanced speech signal.

The output module 1310 is configured to output the first pre-enhanced speech signal or the second pre-enhanced speech signal, so that cascaded encoding and decoding is performed according to the first or second pre-enhanced speech signal.

In the foregoing speech signal cascade processing apparatus, characteristic recognition is performed on the speech signal: the first characteristic signal is pre-enhancement filtered with the first pre-enhancement filter coefficients, the second characteristic signal is pre-enhancement filtered with the second pre-enhancement filter coefficients, and the pre-enhanced speech then undergoes cascaded encoding and decoding. The receiving party can hear the speech information more clearly, so the intelligibility of the speech signal after cascaded encoding and decoding is improved; and because the first and second characteristic signals are each filtered with their own corresponding coefficients, the enhancement filtering is more targeted and more accurate.
FIG. 14 is a structural block diagram of a speech signal cascade processing apparatus in another embodiment. As shown in FIG. 14, in addition to the speech signal acquiring module 1302, the recognition module 1304, the first signal enhancement module 1306, the second signal enhancement module 1308, and the output module 1310, the apparatus further includes a training module 1312.

The training module 1312 is configured to, before the speech signal is acquired, perform offline training according to the training samples in the audio training set to obtain the first pre-enhancement filter coefficients and the second pre-enhancement filter coefficients.

FIG. 15 is a schematic diagram of the internal structure of the training module in one embodiment. As shown in FIG. 15, the training module 1312 includes a selection unit 1502, a simulated cascaded codec unit 1504, an energy compensation value acquiring unit 1506, an average energy compensation value acquiring unit 1508, and a filter coefficient acquiring unit 1510.
The selection unit 1502 is configured to acquire a sample speech signal from the audio training set, the sample speech signal being a first characteristic sample speech signal or a second characteristic sample speech signal.

The simulated cascaded codec unit 1504 is configured to perform simulated cascaded encoding and decoding on the sample speech signal to obtain a degraded speech signal.

The energy compensation value acquiring unit 1506 is configured to acquire the energy attenuation values of the degraded speech signal relative to the sample speech signal at the different frequency points and to use the energy attenuation values as the frequency-point energy compensation values.

The average energy compensation value acquiring unit 1508 is configured to average the frequency-point energy compensation values corresponding to the first characteristic signals in the audio training set to obtain the first characteristic signal's average energy compensation values at the different frequency points, and to average the frequency-point energy compensation values corresponding to the second characteristic signals in the audio training set to obtain the second characteristic signal's average energy compensation values at the different frequency points.

The filter coefficient acquiring unit 1510 is configured to perform filter fitting according to the first characteristic signal's average energy compensation values at the different frequency points to obtain the first pre-enhancement filter coefficients, and to perform filter fitting according to the second characteristic signal's average energy compensation values at the different frequency points to obtain the second pre-enhancement filter coefficients.

The first and second pre-enhancement filter coefficients are obtained by the offline training above; offline training yields them accurately and makes the subsequent online filtering that produces the enhanced speech signal convenient, effectively improving the intelligibility of the speech signal after cascaded encoding and decoding.
In one embodiment, the recognition module 1304 is further configured to acquire the pitch period of the speech signal and to determine whether the pitch period is greater than a preset period value; if so, the speech signal is a first characteristic signal; if not, the speech signal is a second characteristic signal.

Further, the recognition module 1304 is further configured to split the speech signal into frames by shifting a rectangular window, the window length of each frame being a first number of sampling points and each shift being a second number of sampling points; to perform three-level clipping on each frame of the signal; to compute the autocorrelation values over the sampling points within each frame; and to use the index of the maximum autocorrelation value in each frame as the frame's pitch period.

Further, the recognition module 1304 is further configured to band-pass filter the speech signal and to pre-emphasize the band-pass-filtered speech signal before the framing with the shifted rectangular window is performed.
FIG. 16 is a structural block diagram of a speech signal cascade processing apparatus in another embodiment. As shown in FIG. 16, in addition to the speech signal acquiring module 1302, the recognition module 1304, the first signal enhancement module 1306, the second signal enhancement module 1308, and the output module 1310, the apparatus further includes an original signal acquiring module 1314, a detection module 1316, and a filtering module 1318.

The original signal acquiring module 1314 is configured to acquire an input original audio signal.

The detection module 1316 is configured to detect whether the original audio signal is a speech signal or a non-speech signal.

The speech signal acquiring module 1302 is further configured to acquire the speech signal if the original audio signal is a speech signal.

The filtering module 1318 is configured to perform high-pass filtering on the non-speech signal if the original audio signal is a non-speech signal.

In the foregoing speech signal cascade processing apparatus, the non-speech signal is high-pass filtered to reduce its noise. Characteristic recognition is performed on the speech signal: the first characteristic signal is pre-enhancement filtered with the first pre-enhancement filter coefficients, the second characteristic signal is pre-enhancement filtered with the second pre-enhancement filter coefficients, and the pre-enhanced speech undergoes cascaded encoding and decoding. The receiving party can hear the speech information more clearly, improving the intelligibility of the speech signal after cascaded encoding and decoding; and because the first and second characteristic signals are each filtered with their own corresponding coefficients, the enhancement filtering is more targeted and more accurate.

In other embodiments, a speech signal cascade processing apparatus may include any possible combination of the speech signal acquiring module 1302, the recognition module 1304, the first signal enhancement module 1306, the second signal enhancement module 1308, the output module 1310, the training module 1312, the original signal acquiring module 1314, the detection module 1316, and the filtering module 1318.
A person of ordinary skill in the art will understand that all or part of the processes in the methods of the embodiments above may be implemented by a computer program instructing related hardware. The program may be stored in a non-volatile computer-readable storage medium, and when executed may include the processes of the method embodiments above. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), or the like.

The embodiments above express only several implementations of the present invention, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the patent. It should be noted that a person of ordinary skill in the art may make variations and improvements without departing from the concept of the present invention, and these all fall within the protection scope of the present invention. The protection scope of this patent shall therefore be subject to the appended claims.

Claims (18)

  1. A speech signal cascade processing method, comprising:
    acquiring a speech signal;
    performing characteristic recognition on the speech signal;
    if the speech signal is a first characteristic signal, performing pre-enhancement filtering on the first characteristic signal using first pre-enhancement filter coefficients to obtain a first pre-enhanced speech signal;
    if the speech signal is a second characteristic signal, performing pre-enhancement filtering on the second characteristic signal using second pre-enhancement filter coefficients to obtain a second pre-enhanced speech signal; and
    outputting the first pre-enhanced speech signal or the second pre-enhanced speech signal, so that cascaded encoding and decoding is performed according to the first pre-enhanced speech signal or the second pre-enhanced speech signal.
  2. The method according to claim 1, wherein before the acquiring a speech signal, the method further comprises:
    performing offline training according to training samples in an audio training set to obtain the first pre-enhancement filter coefficients and the second pre-enhancement filter coefficients, comprising:
    acquiring a sample speech signal from the audio training set, the sample speech signal being a first characteristic sample speech signal or a second characteristic sample speech signal;
    performing simulated cascaded encoding and decoding on the sample speech signal to obtain a degraded speech signal;
    acquiring energy attenuation values of the degraded speech signal relative to the sample speech signal at different frequency points, and using the energy attenuation values as frequency-point energy compensation values;
    averaging the frequency-point energy compensation values corresponding to first characteristic signals in the audio training set to obtain average energy compensation values of the first characteristic signal at the different frequency points, and averaging the frequency-point energy compensation values corresponding to second characteristic signals in the audio training set to obtain average energy compensation values of the second characteristic signal at the different frequency points; and
    performing filter fitting according to the average energy compensation values of the first characteristic signal at the different frequency points to obtain the first pre-enhancement filter coefficients, and performing filter fitting according to the average energy compensation values of the second characteristic signal at the different frequency points to obtain the second pre-enhancement filter coefficients.
  3. The method according to claim 1, wherein the performing characteristic recognition on the speech signal comprises:
    acquiring a pitch period of the speech signal; and
    determining whether the pitch period of the speech signal is greater than a preset period value; if so, the speech signal is a first characteristic signal; if not, the speech signal is a second characteristic signal.
  4. The method according to claim 3, wherein the acquiring a pitch period of the speech signal comprises:
    splitting the speech signal into frames by shifting a rectangular window, a window length of each frame being a first number of sampling points and each frame shift being a second number of sampling points;
    performing three-level clipping on each frame of the signal;
    computing autocorrelation values over the sampling points within each frame; and
    using the index corresponding to the maximum autocorrelation value in each frame as the pitch period of the frame.
  5. The method according to claim 4, wherein before the splitting the speech signal into frames by shifting a rectangular window, a window length of each frame being a first number of sampling points and each frame shift being a second number of sampling points, the acquiring a pitch period of the speech signal further comprises:
    performing band-pass filtering on the speech signal; and
    performing pre-emphasis on the band-pass-filtered speech signal.
  6. The method according to claim 1, wherein before the step of acquiring a speech signal, the method further comprises:
    acquiring an input original audio signal;
    detecting whether the original audio signal is a speech signal or a non-speech signal;
    if the original audio signal is a speech signal, performing the step of acquiring a speech signal; and
    if the original audio signal is a non-speech signal, performing high-pass filtering on the non-speech signal.
  7. A terminal, comprising a memory and a processor, the memory storing computer-readable instructions that, when executed by the processor, cause the processor to perform the following steps:
    acquiring a speech signal;
    performing characteristic recognition on the speech signal;
    if the speech signal is a first characteristic signal, performing pre-enhancement filtering on the first characteristic signal using first pre-enhancement filter coefficients to obtain a first pre-enhanced speech signal;
    if the speech signal is a second characteristic signal, performing pre-enhancement filtering on the second characteristic signal using second pre-enhancement filter coefficients to obtain a second pre-enhanced speech signal; and
    outputting the first pre-enhanced speech signal or the second pre-enhanced speech signal, so that cascaded encoding and decoding is performed according to the first pre-enhanced speech signal or the second pre-enhanced speech signal.
  8. The terminal according to claim 7, wherein before the acquiring a speech signal, the processor is further configured to perform the following steps:
    performing offline training according to training samples in an audio training set to obtain the first pre-enhancement filter coefficients and the second pre-enhancement filter coefficients, comprising:
    acquiring a sample speech signal from the audio training set, the sample speech signal being a first characteristic sample speech signal or a second characteristic sample speech signal;
    performing simulated cascaded encoding and decoding on the sample speech signal to obtain a degraded speech signal;
    acquiring energy attenuation values of the degraded speech signal relative to the sample speech signal at different frequency points, and using the energy attenuation values as frequency-point energy compensation values;
    averaging the frequency-point energy compensation values corresponding to first characteristic signals in the audio training set to obtain average energy compensation values of the first characteristic signal at the different frequency points, and averaging the frequency-point energy compensation values corresponding to second characteristic signals in the audio training set to obtain average energy compensation values of the second characteristic signal at the different frequency points; and
    performing filter fitting according to the average energy compensation values of the first characteristic signal at the different frequency points to obtain the first pre-enhancement filter coefficients, and performing filter fitting according to the average energy compensation values of the second characteristic signal at the different frequency points to obtain the second pre-enhancement filter coefficients.
  9. The terminal according to claim 7, wherein the performing characteristic recognition on the speech signal comprises:
    acquiring a pitch period of the speech signal; and
    determining whether the pitch period of the speech signal is greater than a preset period value; if so, the speech signal is a first characteristic signal; if not, the speech signal is a second characteristic signal.
  10. The terminal according to claim 9, wherein the acquiring a pitch period of the speech signal comprises:
    splitting the speech signal into frames by shifting a rectangular window, a window length of each frame being a first number of sampling points and each frame shift being a second number of sampling points;
    performing three-level clipping on each frame of the signal;
    computing autocorrelation values over the sampling points within each frame; and
    using the index corresponding to the maximum autocorrelation value in each frame as the pitch period of the frame.
  11. The terminal according to claim 10, wherein before the splitting the speech signal into frames by shifting a rectangular window, a window length of each frame being a first number of sampling points and each frame shift being a second number of sampling points, the acquiring a pitch period of the speech signal further comprises:
    performing band-pass filtering on the speech signal; and
    performing pre-emphasis on the band-pass-filtered speech signal.
  12. The terminal according to claim 7, wherein before the step of acquiring a speech signal, the processor is further configured to perform the following steps:
    acquiring an input original audio signal;
    detecting whether the original audio signal is a speech signal or a non-speech signal;
    if the original audio signal is a speech signal, performing the step of acquiring a speech signal; and
    if the original audio signal is a non-speech signal, performing high-pass filtering on the non-speech signal.
  13. One or more non-volatile computer-readable storage media storing computer-executable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps:
    acquiring a speech signal;
    performing characteristic recognition on the speech signal;
    if the speech signal is a first characteristic signal, performing pre-enhancement filtering on the first characteristic signal using first pre-enhancement filter coefficients to obtain a first pre-enhanced speech signal;
    if the speech signal is a second characteristic signal, performing pre-enhancement filtering on the second characteristic signal using second pre-enhancement filter coefficients to obtain a second pre-enhanced speech signal; and
    outputting the first pre-enhanced speech signal or the second pre-enhanced speech signal, so that cascaded encoding and decoding is performed according to the first pre-enhanced speech signal or the second pre-enhanced speech signal.
  14. The non-volatile computer-readable storage media according to claim 13, wherein before the acquiring a speech signal, the processors are further configured to perform the following steps:
    performing offline training according to training samples in an audio training set to obtain the first pre-enhancement filter coefficients and the second pre-enhancement filter coefficients, comprising:
    acquiring a sample speech signal from the audio training set, the sample speech signal being a first characteristic sample speech signal or a second characteristic sample speech signal;
    performing simulated cascaded encoding and decoding on the sample speech signal to obtain a degraded speech signal;
    acquiring energy attenuation values of the degraded speech signal relative to the sample speech signal at different frequency points, and using the energy attenuation values as frequency-point energy compensation values;
    averaging the frequency-point energy compensation values corresponding to first characteristic signals in the audio training set to obtain average energy compensation values of the first characteristic signal at the different frequency points, and averaging the frequency-point energy compensation values corresponding to second characteristic signals in the audio training set to obtain average energy compensation values of the second characteristic signal at the different frequency points; and
    performing filter fitting according to the average energy compensation values of the first characteristic signal at the different frequency points to obtain the first pre-enhancement filter coefficients, and performing filter fitting according to the average energy compensation values of the second characteristic signal at the different frequency points to obtain the second pre-enhancement filter coefficients.
  15. The non-volatile computer-readable storage media according to claim 13, wherein the performing characteristic recognition on the speech signal comprises:
    acquiring a pitch period of the speech signal; and
    determining whether the pitch period of the speech signal is greater than a preset period value; if so, the speech signal is a first characteristic signal; if not, the speech signal is a second characteristic signal.
  16. The non-volatile computer-readable storage media according to claim 15, wherein the acquiring a pitch period of the speech signal comprises:
    splitting the speech signal into frames by shifting a rectangular window, a window length of each frame being a first number of sampling points and each frame shift being a second number of sampling points;
    performing three-level clipping on each frame of the signal;
    computing autocorrelation values over the sampling points within each frame; and
    using the index corresponding to the maximum autocorrelation value in each frame as the pitch period of the frame.
  17. The non-volatile computer-readable storage media according to claim 16, wherein before the splitting the speech signal into frames by shifting a rectangular window, a window length of each frame being a first number of sampling points and each frame shift being a second number of sampling points, the acquiring a pitch period of the speech signal further comprises:
    performing band-pass filtering on the speech signal; and
    performing pre-emphasis on the band-pass-filtered speech signal.
  18. The non-volatile computer-readable storage media according to claim 13, wherein before the step of acquiring a speech signal, the processors are further configured to perform the following steps:
    acquiring an input original audio signal;
    detecting whether the original audio signal is a speech signal or a non-speech signal;
    if the original audio signal is a speech signal, performing the step of acquiring a speech signal; and
    if the original audio signal is a non-speech signal, performing high-pass filtering on the non-speech signal.
PCT/CN2017/076653 2016-04-15 2017-03-14 Speech signal cascade processing method, terminal and computer-readable storage medium WO2017177782A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
EP17781758.2A EP3444819B1 (en) 2016-04-15 2017-03-14 Voice signal cascade processing method and terminal, and computer readable storage medium
US16/001,736 US10832696B2 (en) 2016-04-15 2018-06-06 Speech signal cascade processing method, terminal, and computer-readable storage medium
US17/076,656 US11605394B2 (en) 2016-04-15 2020-10-21 Speech signal cascade processing method, terminal, and computer-readable storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610235392.9A CN105913854B (zh) 2016-04-15 2016-04-15 语音信号级联处理方法和装置
CN201610235392.9 2016-04-15

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/001,736 Continuation-In-Part US10832696B2 (en) 2016-04-15 2018-06-06 Speech signal cascade processing method, terminal, and computer-readable storage medium

Publications (1)

Publication Number Publication Date
WO2017177782A1 (zh) 2017-10-19




Also Published As

Publication number Publication date
US10832696B2 (en) 2020-11-10
US11605394B2 (en) 2023-03-14
CN105913854A (zh) 2016-08-31
EP3444819A1 (en) 2019-02-20
US20180286422A1 (en) 2018-10-04
EP3444819B1 (en) 2021-08-11
US20210035596A1 (en) 2021-02-04
CN105913854B (zh) 2020-10-23
EP3444819A4 (en) 2019-04-24


Legal Events

NENP Non-entry into the national phase (Ref country code: DE)
WWE WIPO information: entry into national phase (Ref document number: 2017781758; Country of ref document: EP)
ENP Entry into the national phase (Ref document number: 2017781758; Country of ref document: EP; Effective date: 20181115)
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 17781758; Country of ref document: EP; Kind code of ref document: A1)