CN107919136B - Digital voice sampling frequency estimation method based on Gaussian mixture model
- Publication number: CN107919136B (application CN201711112810.6A)
- Authority: CN (China)
- Prior art keywords: sampling frequency, voice, training, sampling, speech
- Prior art date: 2017-11-13
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
- G10L25/24—Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
Abstract
The invention discloses a Gaussian mixture model (GMM)-based digital voice sampling frequency estimation method, which comprises the steps of: firstly, generating a GMM by training on high-sampling-rate digital voice; then interpolating the low-sampling-rate input voice to be estimated to increase its sampling frequency; and finally, carrying out probability calculation on the interpolated digital voice with the GMM and adjusting the interpolation multiple according to the calculation result until the output probability of the GMM reaches its maximum, thereby obtaining the sampling frequency of the input voice. The invention can identify the sampling frequency of unknown digital voice and reduce the system performance degradation caused by sampling frequency mismatch.
Description
Technical Field
The invention belongs to the field of voice processing, and particularly relates to a voice processing method for estimating the sampling frequency of input voice by using a Gaussian mixture model generated by high-sampling-rate digital voice training.
Background
Speech is a basic means of human communication and the most convenient and effective human-computer interaction tool while on the move. Digital speech has the advantages of high accuracy and easy storage and transmission, but different digital systems differ in computational performance, access speed, storage space, battery capacity and application, and therefore use different sampling frequencies. If the sampling frequency of the input speech does not match that of the digital system, the performance of the speech processing system degrades. It is therefore necessary to transform the input speech so that its sampling frequency matches the digital system, thereby enhancing the practical applicability of the speech processing system.
If the sampling frequency of the input speech is known, it suffices to calculate its ratio to the system sampling frequency and then interpolate or decimate the input speech so that the two are consistent. However, in some applications the sampling frequency of the input speech is unknown. For example, when monitoring audio on a network, a captured segment of digital speech may arrive with an unknown sampling frequency.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the problems in the prior art, the invention provides a digital speech sampling frequency estimation method based on a Gaussian Mixture Model (GMM). In the method, firstly, a GMM is generated by high sampling rate digital voice training; then interpolating the input voice with low sampling rate to be estimated, and increasing the sampling frequency; and finally, carrying out probability calculation on the interpolated digital voice by using the GMM, and adjusting the interpolation multiple according to the calculation result to enable the output probability of the GMM to reach the maximum value, thereby obtaining the sampling frequency of the input voice.
The method comprises the following specific steps:
(1) sampling training voice at 48kHz, windowing and framing the training voice, extracting cepstrum characteristics, and training by using the characteristic vectors of all voice units to generate a Gaussian mixture model;
(2) interpolating low-sampling-rate input voice to be estimated (namely voice with the sampling frequency lower than 48 kHz) to improve the sampling frequency;
(3) inputting the interpolated digital voice into GMM, and calculating the output probability;
(4) repeating (2) and (3) for all interpolation multiples, and recording the output probability of each time;
(5) comparing the output probabilities corresponding to all the interpolation multiples; the interpolation multiple corresponding to the maximum output probability is the ratio of the training-speech sampling frequency to the input-speech sampling frequency.
Drawings
Fig. 1 is an overall framework of a digital speech sampling frequency estimation system based on a gaussian mixture model, which mainly comprises a model training module, a signal interpolation module, an interpolation multiple control module and a frequency estimation module.
Detailed Description
The present invention is further illustrated by the following examples, which are intended to be purely exemplary and are not intended to limit the scope of the invention, as various equivalent modifications of the invention will occur to those skilled in the art upon reading the present disclosure and fall within the scope of the appended claims.
As shown in fig. 1, the digital speech sampling frequency estimation method based on the Gaussian mixture model mainly comprises the model training, signal interpolation, interpolation multiple control and frequency estimation modules. The specific embodiment of each main module is described in detail below:
1. model training
Firstly, sampling training voice at 48kHz, windowing, framing, and carrying out fast Fourier transform on each frame of voice signal to obtain an amplitude spectrum of each frame of signal; then, performing Mel filtering on the magnitude spectrum of each frame of signal, and taking the logarithm to obtain the cepstrum characteristic parameter of the training voice; and finally, training by using the feature vectors of all the phonetic units to generate a Gaussian mixture model:
$$p(o_t) = \sum_{m=1}^{M} c_m\,\mathcal{N}(o_t;\,\mu_m,\,\Sigma_m)$$

where $o_t$ denotes the cepstrum feature vector of the $t$-th frame of training speech, and $c_m$, $\mu_m$ and $\Sigma_m$ respectively denote the mixing coefficient, mean vector and covariance matrix of the $m$-th Gaussian component in the GMM.
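The density above can be evaluated directly with NumPy. The sketch below is a minimal illustration with made-up parameters (not the patent's trained model): it computes the average per-frame log-likelihood of cepstral feature vectors under a diagonal-covariance GMM, which serves as the "output probability" used later to score each interpolation multiple:

```python
import numpy as np

def gmm_avg_loglik(feats, weights, means, variances):
    """Average per-frame log-likelihood log p(o_t) under a
    diagonal-covariance Gaussian mixture model.

    feats:     (T, D) cepstral feature vectors, one row per frame
    weights:   (M,)   mixing coefficients c_m, summing to 1
    means:     (M, D) mean vectors mu_m
    variances: (M, D) diagonal covariances Sigma_m
    """
    T, D = feats.shape
    # log N(o_t; mu_m, Sigma_m) for every (frame, component) pair
    diff = feats[:, None, :] - means[None, :, :]                 # (T, M, D)
    log_norm = -0.5 * (D * np.log(2 * np.pi)
                       + np.sum(np.log(variances), axis=1))      # (M,)
    log_exp = -0.5 * np.sum(diff**2 / variances[None, :, :], axis=2)  # (T, M)
    log_comp = np.log(weights)[None, :] + log_norm[None, :] + log_exp
    # log-sum-exp over components, then average over frames
    m = log_comp.max(axis=1, keepdims=True)
    log_p = m[:, 0] + np.log(np.sum(np.exp(log_comp - m), axis=1))
    return float(np.mean(log_p))

# Toy check: a single standard-normal component in 2-D, evaluated at its mean
feats = np.zeros((3, 2))
ll = gmm_avg_loglik(feats, np.array([1.0]), np.zeros((1, 2)), np.ones((1, 2)))
print(ll)  # -log(2*pi) for a 2-D standard normal at the mean
```

In practice the parameters would come from EM training on 48 kHz cepstral features, as described above.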
2. Signal interpolation
Because the gaussian mixture model is trained by training speech with a high sampling rate, the sampling frequency of the input digital speech can be considered to be lower than that of the training speech, and the input speech is interpolated to match the sampling frequency with the GMM.
Let the interpolation multiple be $D_i$. The input digital speech $x(n)$ is interpolated to obtain $x_i(n)$:

$$x_i(n) = \begin{cases} x(n/D_i), & n = 0,\ \pm D_i,\ \pm 2D_i,\ \dots \\ 0, & \text{otherwise} \end{cases}$$

The sampling frequency of the interpolated digital speech $x_i(n)$ is $D_i$ times that of the original input digital speech $x(n)$.
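As a concrete illustration, the sketch below raises the sampling rate of a signal by an integer multiple. The patent's exact interpolation filter is not specified here, so linear interpolation via NumPy's `np.interp` is used as a simple stand-in for zero-insertion followed by lowpass filtering:

```python
import numpy as np

def interpolate(x, d):
    """Raise the sampling rate of x by an integer factor d.

    Linear interpolation stands in for a proper lowpass interpolation
    filter; output sample k sits at position k/d on the original time axis.
    """
    n = np.arange(len(x))
    n_new = np.arange(len(x) * d) / d
    return np.interp(n_new, n, x)

x = np.sin(2 * np.pi * 5 * np.arange(80) / 80.0)  # toy 80-sample signal
xi = interpolate(x, 3)
print(len(xi))  # 240: the sampling frequency is raised 3-fold
```

Every third output sample coincides with an original sample, so the original signal is preserved exactly at those positions.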
3. Interpolation multiple control
The ratios $D_1, D_2, \dots, D_i, \dots, D_N$ of the training-speech sampling frequency to a set of common speech sampling frequencies $f_1, f_2, \dots, f_i, \dots, f_N$ are used as the initial interpolation multiples.
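For a 48 kHz training rate these initial multiples can be tabulated directly. The list of "common" rates below is illustrative; the patent does not enumerate them:

```python
# Candidate interpolation multiples D_i = f0 / f_i for a set of
# common speech sampling rates (the rate list is an assumption).
F0 = 48000
common_rates = [8000, 12000, 16000, 24000, 32000, 48000]
multiples = [F0 / f for f in common_rates]
print(multiples)  # [6.0, 4.0, 3.0, 2.0, 1.5, 1.0]
```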
4. Frequency estimation
The digital speech after each interpolation is input into the Gaussian mixture model and its output probability is calculated. The output probabilities corresponding to all the interpolation multiples are compared to determine the interpolation multiple $\hat{D}$ that maximizes the output probability, and the interpolation multiple is then fine-tuned near $\hat{D}$ so that the output probability of the GMM reaches its maximum. If the interpolation multiple at this point is denoted $D^{*}$, the sampling frequency $\hat{f}_s$ of the original input speech can be estimated as:

$$\hat{f}_s = \frac{f_0}{D^{*}}$$

where $f_0$ is the sampling frequency of the high-rate training speech, here taken to be 48 kHz.
After the sampling frequency of the input speech has been estimated, the input speech can be resampled to match the sampling frequency of the target system and then fed into the target system for processing.
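Putting the modules together, the estimation loop reduces to an argmax over candidate multiples followed by the ratio $f_0 / D$. In this minimal sketch the scoring function is a stand-in for "interpolate, extract cepstra, evaluate the GMM"; it is rigged to peak at a multiple of 3 (i.e. a 16 kHz input) purely for demonstration:

```python
F0 = 48000  # sampling frequency of the high-rate training speech

def gmm_score(speech, multiple):
    # Stand-in for: interpolate `speech` by `multiple`, extract cepstral
    # features, and return the GMM average log-likelihood. Rigged to peak
    # at an assumed true multiple of 3 for demonstration only.
    true_multiple = 3.0
    return -abs(multiple - true_multiple)

candidates = [6.0, 4.0, 3.0, 2.0, 1.5, 1.0]
speech = None  # placeholder for the input digital speech signal
best = max(candidates, key=lambda d: gmm_score(speech, d))
estimated_fs = F0 / best
print(estimated_fs)  # 16000.0
```

A real implementation would replace `gmm_score` with the interpolation and GMM-likelihood routines of the preceding modules, then fine-tune the multiple around `best` as described above.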
Claims (5)
1. A digital voice sampling frequency estimation method based on a Gaussian mixture model is characterized in that firstly, a GMM is generated by high sampling rate digital voice training; then interpolating the input voice with low sampling rate to be estimated, and increasing the sampling frequency; and finally, carrying out probability calculation on the interpolated digital voice by using the GMM, and adjusting the interpolation multiple according to the calculation result to enable the output probability of the GMM to reach the maximum value, thereby obtaining the sampling frequency of the input voice.
2. The method for estimating the sampling frequency of the digital voice based on the Gaussian mixture model as claimed in claim 1, specifically comprising:
(1) sampling training voice at 48kHz, windowing and framing the training voice, extracting cepstrum characteristics, and training by using the characteristic vectors of all voice units to generate a Gaussian mixture model;
(2) interpolating the input voice with low sampling rate to be estimated, and increasing the sampling frequency;
(3) inputting the interpolated digital voice into GMM, and calculating the output probability;
(4) repeating (2) and (3) for all interpolation multiples, and recording the output probability of each time;
(5) comparing the output probabilities corresponding to all the interpolation multiples; the interpolation multiple corresponding to the maximum output probability is the ratio of the training-speech sampling frequency to the input-speech sampling frequency.
3. The method for estimating the sampling frequency of the digital speech based on the Gaussian mixture model as claimed in claim 2, wherein the Gaussian mixture model generated in the step (1) is:
$$p(o_t) = \sum_{m=1}^{M} c_m\,\mathcal{N}(o_t;\,\mu_m,\,\Sigma_m)$$

where $o_t$ denotes the cepstrum feature vector of the $t$-th frame of training speech, and $c_m$, $\mu_m$ and $\Sigma_m$ respectively denote the mixing coefficient, mean vector and covariance matrix of the $m$-th Gaussian component in the GMM.
4. The method of claim 2, wherein the low-sampling-rate input speech to be estimated is interpolated as follows: let the interpolation multiple be $D_i$; the input digital speech $x(n)$ is interpolated to obtain $x_i(n)$:

$$x_i(n) = \begin{cases} x(n/D_i), & n = 0,\ \pm D_i,\ \pm 2D_i,\ \dots \\ 0, & \text{otherwise} \end{cases}$$

The sampling frequency of the interpolated digital speech $x_i(n)$ is $D_i$ times that of the original input digital speech $x(n)$.
5. The method of claim 4, wherein the ratios $D_1, D_2, \dots, D_i, \dots, D_N$ of the training-speech sampling frequency to a set of common speech sampling frequencies $f_1, f_2, \dots, f_i, \dots, f_N$ are used as the initial interpolation multiples.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201711112810.6A | 2017-11-13 | 2017-11-13 | Digital voice sampling frequency estimation method based on Gaussian mixture model |
Publications (2)

| Publication Number | Publication Date |
|---|---|
| CN107919136A | 2018-04-17 |
| CN107919136B | 2021-07-09 |
Family

ID=61896270

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201711112810.6A (Active) | Digital voice sampling frequency estimation method based on Gaussian mixture model | 2017-11-13 | 2017-11-13 |

Country Status (1)

| Country | Link |
|---|---|
| CN | CN107919136B |
Families Citing this family (2)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN109459612A | 2019-01-09 | 2019-03-12 | 上海艾为电子技术股份有限公司 | Detection method and device for the sampling frequency of digital audio signals |
| CN111341302B | 2020-03-02 | 2023-10-31 | 苏宁云计算有限公司 | Voice stream sampling rate determining method and device |
Citations (2)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US8639502B1 | 2009-02-16 | 2014-01-28 | Arrowhead Center, Inc. | Speaker model-based speech enhancement system |
| CN107204840A | 2017-07-31 | 2017-09-26 | 电子科技大学 | Sinusoidal signal frequency estimation method based on DFT and iterative correction |
Family Cites Families (1)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| EP3998606B8 | 2009-10-21 | 2022-12-07 | Dolby International AB | Oversampling in a combined transposer filter bank |
Non-Patent Citations (3)

- Haiyang Wu et al., "Improved AdaBoost Algorithm Using VQMAP for Speaker Identification", 2010 International Conference on Electrical and Control Engineering, 2010.
- Petr Zelinka et al., "Smooth interpolation of Gaussian mixture models", 2009 19th International Conference Radioelektronika, 2009.
- Lin Zhou et al., "VTS feature compensation based on two-layer GMM structure for robust speech recognition", 2016 8th International Conference on Wireless Communications & Signal Processing (WCSP), 2016.
Legal Events

| Date | Code | Title |
|---|---|---|
| | PB01 | Publication |
| | SE01 | Entry into force of request for substantive examination |
| | GR01 | Patent grant |