CN107919136B - Digital voice sampling frequency estimation method based on Gaussian mixture model - Google Patents


Info

Publication number
CN107919136B
CN107919136B CN201711112810.6A
Authority
CN
China
Prior art keywords
sampling frequency
voice
training
sampling
speech
Prior art date
Legal status
Active
Application number
CN201711112810.6A
Other languages
Chinese (zh)
Other versions
CN107919136A (en
Inventor
吕勇
Current Assignee
Hohai University HHU
Original Assignee
Hohai University HHU
Priority date
Filing date
Publication date
Application filed by Hohai University HHU filed Critical Hohai University HHU
Priority to CN201711112810.6A priority Critical patent/CN107919136B/en
Publication of CN107919136A publication Critical patent/CN107919136A/en
Application granted granted Critical
Publication of CN107919136B publication Critical patent/CN107919136B/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 characterised by the analysis technique
    • G10L25/03 characterised by the type of extracted parameters
    • G10L25/18 the extracted parameters being spectral information of each sub-band
    • G10L25/24 the extracted parameters being the cepstrum

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Transmission Systems Not Characterized By The Medium Used For Transmission (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention discloses a digital voice sampling frequency estimation method based on a Gaussian mixture model (GMM). First, a GMM is generated by training on high-sampling-rate digital speech; then the low-sampling-rate input speech to be estimated is interpolated to raise its sampling frequency; finally, the GMM performs a probability calculation on the interpolated digital speech, and the interpolation multiple is adjusted according to the result so that the GMM output probability reaches its maximum, thereby yielding the sampling frequency of the input speech. The invention can identify the sampling frequency of unknown digital speech and reduce the system performance degradation caused by sampling-frequency mismatch.

Description

Digital voice sampling frequency estimation method based on Gaussian mixture model
Technical Field
The invention belongs to the field of voice processing, and particularly relates to a voice processing method for estimating the sampling frequency of input voice by using a Gaussian mixture model generated by high-sampling-rate digital voice training.
Background
Speech is a basic means by which humans exchange information and is also one of the most convenient and effective human-computer interaction tools, especially when the user is in motion. Digital speech offers high accuracy and easy storage and transmission, but different digital systems differ in computational performance, access speed, storage space, battery capacity and application, and therefore use different sampling frequencies. If the sampling frequency of the input speech does not match that of the digital system, the performance of the speech processing system degrades. It is therefore necessary to transform the input speech so that its sampling frequency matches the digital system, thereby enhancing the practical applicability of the speech processing system.
If the sampling frequency of the input speech is known, one only needs to compute the ratio of this frequency to the system sampling frequency and then interpolate or decimate the input speech so that its sampling frequency matches the system's. In some applications, however, the sampling frequency of the input speech is unknown; for example, when monitoring audio on a network, a captured segment of digital speech may arrive with an unknown sampling frequency.
Disclosure of Invention
The purpose of the invention is as follows: to address the problems in the prior art, the invention provides a digital speech sampling frequency estimation method based on a Gaussian mixture model (GMM). In this method, a GMM is first generated by training on high-sampling-rate digital speech; the low-sampling-rate input speech to be estimated is then interpolated to raise its sampling frequency; finally, the GMM performs a probability calculation on the interpolated digital speech, and the interpolation multiple is adjusted according to the result so that the GMM output probability reaches its maximum, thereby yielding the sampling frequency of the input speech.
The method comprises the following specific steps:
(1) sampling the training speech at 48 kHz, windowing and framing it, extracting cepstrum features, and training a Gaussian mixture model with the feature vectors of all speech units;
(2) interpolating the low-sampling-rate input speech to be estimated (i.e., speech whose sampling frequency is below 48 kHz) to raise its sampling frequency;
(3) inputting the interpolated digital speech into the GMM and calculating the output probability;
(4) repeating steps (2) and (3) for all interpolation multiples and recording each output probability;
(5) comparing the output probabilities corresponding to all interpolation multiples; the interpolation multiple corresponding to the maximum output probability is the ratio of the training-speech sampling frequency to the input-speech sampling frequency.
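The five steps above can be sketched as a small search loop. This is an illustrative sketch, not the patented implementation: `featurize` is a placeholder name for the cepstrum front end, the GMM parameters are assumed toy values, and only integer interpolation multiples are tried.

```python
import numpy as np

def upsample(x, d):
    """Step (2): zero-insertion interpolation by an integer factor d."""
    y = np.zeros(len(x) * d)
    y[::d] = x
    return y

def gmm_avg_loglik(features, weights, means, variances):
    """Step (3): average per-frame log p(o_t) under a diagonal-covariance GMM."""
    diff = features[:, None, :] - means[None, :, :]            # (T, M, D)
    log_n = -0.5 * np.sum(diff ** 2 / variances + np.log(2 * np.pi * variances), axis=2)
    log_wn = np.log(weights) + log_n                           # (T, M)
    m = log_wn.max(axis=1)                                     # log-sum-exp for stability
    return float(np.mean(m + np.log(np.exp(log_wn - m[:, None]).sum(axis=1))))

def estimate_multiple(x, candidates, featurize, gmm_params):
    """Steps (4)-(5): score every interpolation multiple, keep the most likely."""
    scores = {d: gmm_avg_loglik(featurize(upsample(x, d)), *gmm_params)
              for d in candidates}
    return max(scores, key=scores.get), scores
```

The winning multiple equals the ratio of the training-speech sampling frequency to the input-speech sampling frequency, so dividing 48 kHz by it gives the estimate.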
Drawings
Fig. 1 is an overall framework of a digital speech sampling frequency estimation system based on a gaussian mixture model, which mainly comprises a model training module, a signal interpolation module, an interpolation multiple control module and a frequency estimation module.
Detailed Description
The present invention is further illustrated by the following examples, which are purely exemplary and do not limit the scope of the invention; various equivalent modifications of the invention will occur to those skilled in the art upon reading the present disclosure and fall within the scope of the appended claims.
As shown in fig. 1, the method for estimating the sampling frequency of digital speech based on the Gaussian mixture model mainly comprises model training, signal interpolation, interpolation multiple control and frequency estimation modules. Specific embodiments of these main modules are described in detail below:
1. model training
First, the training speech is sampled at 48 kHz, windowed and framed, and a fast Fourier transform is applied to each frame of the speech signal to obtain its magnitude spectrum. Mel filtering is then applied to the magnitude spectrum of each frame and the logarithm is taken, yielding the cepstrum feature parameters of the training speech. Finally, a Gaussian mixture model is trained with the feature vectors of all speech units:
p(o_t) = Σ_{m=1}^{M} c_m N(o_t; μ_m, Σ_m)
where o_t denotes the cepstrum feature vector of the t-th frame of training speech, and c_m, μ_m and Σ_m denote the mixing coefficient, mean vector and covariance matrix of the m-th Gaussian component of the GMM, respectively.
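A minimal sketch of this front end (framing, Hamming window, FFT magnitude, mel filterbank, log, then a DCT-II to cepstral coefficients) follows. Frame length, hop size, filter count and coefficient count are assumed values; the patent does not fix them.

```python
import numpy as np

def mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def imel(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters spaced evenly on the mel scale."""
    pts = imel(np.linspace(mel(0.0), mel(fs / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb

def cepstral_features(x, fs=48000, frame=1024, hop=512, n_filters=26, n_ceps=13):
    """Frame + window, magnitude spectrum, mel filtering, log, then DCT-II."""
    win = np.hamming(frame)
    frames = np.array([x[i:i + frame] * win
                       for i in range(0, len(x) - frame + 1, hop)])
    mag = np.abs(np.fft.rfft(frames, frame))                 # magnitude spectrum
    logmel = np.log(mag @ mel_filterbank(n_filters, frame, fs).T + 1e-10)
    k = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * k + 1) / (2 * n_filters)))
    return logmel @ dct.T                                    # cepstral coefficients
```

The resulting (frames × coefficients) matrix is what the GMM would be trained on.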
2. Signal interpolation
Because the Gaussian mixture model is trained on high-sampling-rate speech, the sampling frequency of the input digital speech can be assumed to be lower than that of the training speech; the input speech is therefore interpolated so that its sampling frequency matches the GMM.
Let the interpolation multiple be D_i. The input digital speech x(n) is interpolated to obtain x_i(n):
x_i(n) = x(n / D_i), if n is an integer multiple of D_i; x_i(n) = 0, otherwise.
The sampling frequency of the interpolated digital speech x_i(n) is D_i times that of the original input digital speech x(n).
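The interpolation above can be written directly as zero insertion; a practical implementation would follow it with a low-pass (anti-imaging) filter, which this sketch omits.

```python
import numpy as np

def upsample_zero_insert(x, d):
    """x_i(n) = x(n / d) when d divides n, and 0 otherwise; fs grows by factor d."""
    x = np.asarray(x, dtype=float)
    y = np.zeros(len(x) * d)
    y[::d] = x
    return y
```

For example, `upsample_zero_insert([1, 2], 3)` yields `[1, 0, 0, 2, 0, 0]`.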
3. Interpolation multiple control
The ratios D_1, D_2, …, D_i, …, D_N of the training-speech sampling frequency to a set of common speech sampling frequencies f_1, f_2, …, f_i, …, f_N are used as the initial interpolation multiples.
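For a 48 kHz training model, the initial multiples are just ratios against common rates. The specific rate set below is an assumption for illustration; the patent does not enumerate which rates are tried.

```python
f0 = 48000                                         # training-speech sampling frequency
common_rates = [8000, 12000, 16000, 24000, 32000]  # assumed typical speech/audio rates
multiples = [f0 / f for f in common_rates]         # D_i = f0 / f_i
print(multiples)                                   # [6.0, 4.0, 3.0, 2.0, 1.5]
```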
4. Frequency estimation
The digital speech after each interpolation is input into the Gaussian mixture model and its output probability is calculated. The output probabilities corresponding to all interpolation multiples are compared to determine the interpolation multiple D̂ that maximizes the output probability, and the interpolation multiple is then fine-tuned near D̂ so that the GMM output probability reaches its maximum. If the interpolation multiple at this point is denoted D̃, the sampling frequency f̂ of the original input speech can be estimated as:
f̂ = f_0 / D̃
where f_0 is the sampling frequency of the high-sampling-rate training speech, taken here as 48 kHz.
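The fine-tuning step and the final estimate can be sketched as follows. Fractional multiples are handled here with linear-interpolation resampling, one simple choice (the patent does not specify a resampler), and the grid radius and step are assumed values.

```python
import numpy as np

def fine_grid(d_best, radius=0.25, step=0.05):
    """Fractional multiples near the best coarse multiple."""
    return np.arange(d_best - radius, d_best + radius + step / 2, step)

def resample_fractional(x, d):
    """Resample x by a (possibly fractional) factor d via linear interpolation."""
    t = np.arange(int(len(x) * d)) / d
    return np.interp(t, np.arange(len(x)), x)

def estimated_input_frequency(f0, d_tilde):
    """f_hat = f0 / D_tilde: sampling frequency of the original input speech."""
    return f0 / d_tilde
```

With f0 = 48 kHz and a best multiple of 3.0, the estimate is 16 kHz.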
Once the sampling frequency of the input speech has been estimated, the input speech can be resampled to the sampling frequency of the target system and then fed into the target system for processing.

Claims (5)

1. A digital voice sampling frequency estimation method based on a Gaussian mixture model, characterized in that: first, a GMM is generated by training on high-sampling-rate digital speech; then the low-sampling-rate input speech to be estimated is interpolated to raise its sampling frequency; finally, the GMM performs a probability calculation on the interpolated digital speech, and the interpolation multiple is adjusted according to the result so that the GMM output probability reaches its maximum, thereby obtaining the sampling frequency of the input speech.
2. The method for estimating the sampling frequency of the digital voice based on the Gaussian mixture model as claimed in claim 1, specifically comprising:
(1) sampling training voice at 48kHz, windowing and framing the training voice, extracting cepstrum characteristics, and training by using the characteristic vectors of all voice units to generate a Gaussian mixture model;
(2) interpolating the input voice with low sampling rate to be estimated, and increasing the sampling frequency;
(3) inputting the interpolated digital speech into the GMM and calculating the output probability;
(4) repeating steps (2) and (3) for all interpolation multiples and recording each output probability;
(5) comparing the output probabilities corresponding to all interpolation multiples; the interpolation multiple corresponding to the maximum output probability is the ratio of the training-speech sampling frequency to the input-speech sampling frequency.
3. The method for estimating the sampling frequency of the digital speech based on the Gaussian mixture model as claimed in claim 2, wherein the Gaussian mixture model generated in the step (1) is:
p(o_t) = Σ_{m=1}^{M} c_m N(o_t; μ_m, Σ_m)
where o_t denotes the cepstrum feature vector of the t-th frame of training speech, and c_m, μ_m and Σ_m denote the mixing coefficient, mean vector and covariance matrix of the m-th Gaussian component of the GMM, respectively.
4. The method of claim 2, wherein the interpolation is performed on the low-sampling-rate input speech to be estimated:
let the interpolation multiple be D_i; the input digital speech x(n) is interpolated to obtain x_i(n):
x_i(n) = x(n / D_i), if n is an integer multiple of D_i; x_i(n) = 0, otherwise.
The sampling frequency of the interpolated digital speech x_i(n) is D_i times that of the original input digital speech x(n).
5. The method of claim 4, wherein the ratios D_1, D_2, …, D_i, …, D_N of the training-speech sampling frequency to a set of common speech sampling frequencies f_1, f_2, …, f_i, …, f_N are used as the initial interpolation multiples.
CN201711112810.6A 2017-11-13 2017-11-13 Digital voice sampling frequency estimation method based on Gaussian mixture model Active CN107919136B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711112810.6A CN107919136B (en) 2017-11-13 2017-11-13 Digital voice sampling frequency estimation method based on Gaussian mixture model


Publications (2)

Publication Number Publication Date
CN107919136A CN107919136A (en) 2018-04-17
CN107919136B true CN107919136B (en) 2021-07-09

Family

ID=61896270

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711112810.6A Active CN107919136B (en) 2017-11-13 2017-11-13 Digital voice sampling frequency estimation method based on Gaussian mixture model

Country Status (1)

Country Link
CN (1) CN107919136B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109459612A (en) * 2019-01-09 2019-03-12 上海艾为电子技术股份有限公司 The detection method and device of the sample frequency of digital audio and video signals
CN111341302B (en) * 2020-03-02 2023-10-31 苏宁云计算有限公司 Voice stream sampling rate determining method and device

Citations (2)

Publication number Priority date Publication date Assignee Title
US8639502B1 (en) * 2009-02-16 2014-01-28 Arrowhead Center, Inc. Speaker model-based speech enhancement system
CN107204840A (en) * 2017-07-31 2017-09-26 电子科技大学 Sinusoidal signal frequency method of estimation based on DFT and iteration correction

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
EP3998606B8 (en) * 2009-10-21 2022-12-07 Dolby International AB Oversampling in a combined transposer filter bank


Non-Patent Citations (3)

Title
"Improved AdaBoost Algorithm Using VQMAP for Speaker Identification"; Haiyang Wu et al.; 2010 International Conference on Electrical and Control Engineering; 20101231; Full text *
"Smooth interpolation of Gaussian mixture models"; Petr Zelinka et al.; 2009 19th International Conference Radioelektronika; 20091231; Full text *
"VTS feature compensation based on two-layer GMM structure for robust speech recognition"; Lin Zhou et al.; 2016 8th International Conference on Wireless Communications & Signal Processing (WCSP); 20161231; Full text *


Similar Documents

Publication Publication Date Title
Li et al. Two heads are better than one: A two-stage complex spectral mapping approach for monaural speech enhancement
Zhang et al. Deep learning for environmentally robust speech recognition: An overview of recent developments
CN110634497B (en) Noise reduction method and device, terminal equipment and storage medium
US9666183B2 (en) Deep neural net based filter prediction for audio event classification and extraction
US8370139B2 (en) Feature-vector compensating apparatus, feature-vector compensating method, and computer program product
CN108335694B (en) Far-field environment noise processing method, device, equipment and storage medium
Lin et al. Speech enhancement using multi-stage self-attentive temporal convolutional networks
WO2021114733A1 (en) Noise suppression method for processing at different frequency bands, and system thereof
CN110088835B (en) Blind source separation using similarity measures
CN103903612A (en) Method for performing real-time digital speech recognition
CN110047478B (en) Multi-channel speech recognition acoustic modeling method and device based on spatial feature compensation
CN110556125B (en) Feature extraction method and device based on voice signal and computer storage medium
CN109192200A (en) A kind of audio recognition method
CN106373559B (en) Robust feature extraction method based on log-spectrum signal-to-noise ratio weighting
EP4189677B1 (en) Noise reduction using machine learning
Ganapathy Multivariate autoregressive spectrogram modeling for noisy speech recognition
CN113053400B (en) Training method of audio signal noise reduction model, audio signal noise reduction method and equipment
Phapatanaburi et al. Noise robust voice activity detection using joint phase and magnitude based feature enhancement
CN107919136B (en) Digital voice sampling frequency estimation method based on Gaussian mixture model
CN110998723A (en) Signal processing device using neural network, signal processing method using neural network, and signal processing program
CN112992190B (en) Audio signal processing method and device, electronic equipment and storage medium
KR102220964B1 (en) Method and device for audio recognition
Loweimi et al. Robust Source-Filter Separation of Speech Signal in the Phase Domain.
WO2023226572A1 (en) Feature representation extraction method and apparatus, device, medium and program product
US20230116052A1 (en) Array geometry agnostic multi-channel personalized speech enhancement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant