CN116230018A - Synthetic voice quality evaluation method for voice synthesis system - Google Patents
- Publication number
- CN116230018A (application CN202310151042.4A)
- Authority
- CN
- China
- Prior art keywords
- voice
- speech
- quality
- synthesized
- prediction model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/60—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02P—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
- Y02P90/00—Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
- Y02P90/30—Computing systems specially adapted for manufacturing
Abstract
The invention discloses a synthesized speech quality evaluation method for a speech synthesis system, which comprises the following steps: creating a synthesized speech quality dataset that contains speech produced by a speech synthesis system, running a perceptual quality evaluation test on the speech in the dataset to obtain data labels, and using the labels as scores that measure speech quality; creating a speech quality score prediction model to predict the quality of synthesized speech; and inputting the synthesized speech under test into the speech quality score prediction model, which is trained on the synthesized speech quality dataset, so that it outputs a predicted quality score for the synthesized speech. By evaluating the quality of the speech produced by a speech synthesis system, building a synthesized speech quality dataset with a deep-learning-based method, and training a speech quality score prediction model on that dataset, the model can predict the quality score of the system's synthesized speech.
Description
Technical Field
The present invention relates to the field of speech quality assessment, and in particular, to a method for assessing the quality of synthesized speech in a speech synthesis system.
Background
Assessing speech quality is important for measuring and improving speech synthesis systems. However, current evaluation based on human listening tests has a high labor cost and a certain amount of error, because each listener's perception and preferences differ: for example, how acceptable they find different timbres, accents, and speech rates varies, which affects the objectivity and accuracy of the evaluation result. Some other speech-related tasks have fairly objective evaluation metrics, such as word error rate in automatic speech recognition and speaker error rate in speaker recognition, which can be used directly to measure and optimize those tasks, but no such metric applies to speech synthesis. Most deep-learning-based speech synthesis systems train and optimize using the maximum likelihood (or minimum error) of acoustic parameters and phoneme durations. Although these criteria are usable during training, they cannot serve as indicators of synthesized speech quality: when synthesizing from text outside the system's dataset, there is no reference speech from which to compute the acoustic parameters and phoneme durations, so the criteria cannot be evaluated; moreover, they do not reflect the listener's subjective perceptual assessment of the synthesized speech.
Objective evaluation methods, such as mel-frequency cepstral distance, are often used in the field of voice conversion to measure the quality of converted speech, but these indexes mainly measure the distortion of acoustic parameters and cannot capture the listener's subjective impression well. Perceptual Evaluation of Speech Quality (PESQ), published by the ITU Telecommunication Standardization Sector, is often used in industry to evaluate speech quality; however, it requires a high-quality reference speech. This limitation makes it inapplicable to synthesized speech, which usually has no corresponding original recording, and its result also does not reflect the naturalness of the synthesized speech.
By categorizing listeners' subjective impressions and specifying a graded rating criterion, subjective methods such as the mean opinion score (MOS) quantify listeners' subjective evaluation of speech as grades, which measure the naturalness of the speech and the listeners' subjective impressions well. However, MOS scores must be collected through subjective speech quality assessment tests, which demand substantial time and labor when there are many audio samples, because multiple listeners must hear and score every utterance to yield a reasonably objective perceptual quality result.
Among deep-learning-based synthesized speech quality assessment methods, Quality-Net, a model based on a bidirectional long short-term memory network (BLSTM), provides effective non-intrusive evaluation for speech enhancement, with predicted scores highly correlated with PESQ scores. However, as a speech enhancement indicator, the PESQ score does not measure the naturalness of synthesized speech when the goal is purely to predict the quality of a speech synthesis system's output. MOSNet adopts a model based on a convolutional neural network and a bidirectional long short-term memory network and can predict converted speech quality; it performs well at predicting system-level quality, but its main goal is to evaluate voice conversion systems.
Therefore, it is necessary to explore a synthetic speech quality assessment method for a speech synthesis system.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a synthetic voice quality evaluation method for a voice synthesis system.
In order to solve the technical problems, the invention provides the following technical scheme:
the invention discloses a synthesized speech quality evaluation method for a speech synthesis system, which comprises the following steps:
creating a synthesized speech quality dataset that contains speech produced by a speech synthesis system, running a perceptual quality evaluation test on the speech in the dataset to obtain data labels, and using the labels as scores that measure speech quality;
creating a speech quality score prediction model to predict the quality of synthesized speech;
and inputting the synthesized speech under test into the speech quality score prediction model, which is trained on the synthesized speech quality dataset, so that it outputs a predicted quality score for the synthesized speech.
As a preferred technical scheme of the invention, the text given to the speech synthesis system includes both long and short sentences; the synthesized speech serves as the audio samples of the synthesized speech quality dataset, and quality evaluation of the synthesized speech yields mean opinion score (MOS) values that are used as the data labels.
As a preferred technical scheme of the invention, the synthesized speech uses the voices of several different speakers, and the MOS score used as a data label is the average of the scores given to a single utterance.
As a preferred technical scheme of the invention, the speech quality score prediction model is based on a convolutional neural network and a self-attention mechanism. Its input is features extracted from the speech synthesized by a speech synthesis system, and its output is the predicted MOS score of that speech; after the model is trained on the synthesized speech quality dataset, it predicts the perceptual quality of synthesized speech.
As a preferred technical scheme of the invention, the synthesized speech must be processed before being input into the speech quality score prediction model. The processing steps are: pre-emphasis, framing, windowing, and short-time Fourier transform. Pre-emphasis uses a first-order high-pass filter H(z) = 1 - μz^(-1), where μ takes a value between 0.9 and 1.0. Framing allows characteristic parameters to be extracted from the speech signal for subsequent processing; the extracted parameters include the short-time energy and average amplitude, the short-time average zero-crossing rate, the short-time autocorrelation function, and the short-time average magnitude difference function, and the frame length is 10-30 ms. The window functions used for windowing include the Hamming window and the rectangular window. The short-time Fourier transform applies the Fourier transform to each frame of the signal to obtain its spectrum.
As a preferred embodiment of the present invention, the energy spectrum of the synthesized speech is passed through a mel filter bank to extract features.
Compared with the prior art, the invention has the following beneficial effects:
according to the invention, the quality evaluation is carried out on the synthesized voice of the voice synthesis system, a synthesized voice quality dataset is established by using a deep learning-based method, and a voice quality score prediction model is trained by using the dataset, so that the voice quality score prediction model can predict the quality score of the synthesized voice of the voice synthesis system.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate the invention and together with the embodiments of the invention, serve to explain the invention. In the drawings:
FIG. 1 is an overall flow chart of the present invention;
FIG. 2 is a schematic illustration of a speech quality score prediction model of the present invention.
Detailed Description
The preferred embodiments of the present invention will be described below with reference to the accompanying drawings, it being understood that the preferred embodiments described herein are for illustration and explanation of the present invention only, and are not intended to limit the present invention.
Example 1
As shown in FIGS. 1-2, the present invention provides a synthesized speech quality assessment method for a speech synthesis system, comprising:
creating a synthesized speech quality dataset that contains speech produced by a speech synthesis system, running a perceptual quality evaluation test on the speech in the dataset to obtain data labels, and using the labels as scores that measure speech quality;
creating a speech quality score prediction model to predict the quality of synthesized speech;
and inputting the synthesized speech under test into the speech quality score prediction model, which is trained on the synthesized speech quality dataset, so that it outputs a predicted quality score for the synthesized speech.
Further, the text given to the speech synthesis system includes both long and short sentences; the synthesized speech serves as the audio samples of the synthesized speech quality dataset, and quality evaluation of the synthesized speech yields MOS scores that are used as the data labels.
Furthermore, the synthesized speech uses the voices of several different speakers, and the MOS score used as a data label is the average of the scores given to a single utterance. To ensure data diversity, some of the speech synthesis systems produce speech of better quality while others produce speech of poorer quality.
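The labeling scheme described above (one MOS label per utterance, averaged across individual listener scores) reduces to a simple mean; the function name below is illustrative, not taken from the patent:

```python
import numpy as np

def mos_label(listener_scores):
    """MOS label for one utterance: the mean of the individual
    listener scores (each typically an integer grade from 1 to 5)."""
    return float(np.mean(listener_scores))
```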
Further, the speech quality score prediction model is based on a convolutional neural network and a self-attention mechanism. Its input is features extracted from the speech synthesized by a speech synthesis system, and its output is the predicted MOS score of that speech; after the model is trained on the synthesized speech quality dataset, it predicts the perceptual quality of synthesized speech.
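The patent does not spell out the model internals beyond naming a convolutional neural network and a self-attention mechanism. As an illustration of the attention step only, here is a minimal numpy sketch of scaled dot-product self-attention over a sequence of frame features; real models apply learned query/key/value projections, which are omitted here:

```python
import numpy as np

def self_attention(X):
    """Scaled dot-product self-attention over a (frames, features)
    matrix, with identity Q/K/V projections for illustration."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)          # pairwise frame similarities
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)     # softmax over frames
    return w @ X                           # attention-weighted features
```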
Further, the synthesized speech is processed before being input into the speech quality score prediction model. The processing steps are: pre-emphasis, framing, windowing, and short-time Fourier transform.
Pre-emphasis uses a first-order high-pass filter H(z) = 1 - μz^(-1), where μ takes a value between 0.9 and 1.0 and is set to 0.96 here; this boosts the high-frequency part of the signal and highlights the high-frequency formants;
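In the time domain the filter above is y[n] = x[n] - μx[n-1]; a direct sketch (function name illustrative):

```python
import numpy as np

def pre_emphasis(x, mu=0.96):
    """First-order high-pass filter H(z) = 1 - mu*z^(-1):
    y[n] = x[n] - mu*x[n-1], boosting the high-frequency part."""
    return np.append(x[0], x[1:] - mu * x[:-1])
```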
The speech signal is analyzed frame by frame to extract characteristic parameters for subsequent processing; the extracted parameters include the short-time energy and average amplitude, the short-time average zero-crossing rate, the short-time autocorrelation function, and the short-time average magnitude difference function. The frame length is 10-30 ms: X consecutive sample points are grouped into one frame, each frame is 256 sample points long, adjacent frames overlap by S sample points with S approximately 1/3 to 1/2 of X, and the sampling frequency is 8 kHz;
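A minimal framing sketch using the embodiment's example values (256-sample frames with roughly half-frame overlap); the function name and default hop are illustrative:

```python
import numpy as np

def frame_signal(signal, frame_len=256, hop=128):
    """Split a 1-D signal into overlapping frames of frame_len samples,
    advancing by hop samples (hop = frame_len // 2 gives 50% overlap)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx]
```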
The window functions used for windowing include the Hamming window and the rectangular window. The windowing operation is S'(n) = S(n) × w(n), where for an N-point frame the Hamming window is w(n) = 0.54 - 0.46 cos(2πn/(N-1)), 0 ≤ n ≤ N-1, and the rectangular window is w(n) = 1, 0 ≤ n ≤ N-1. The rectangular window has higher spectral resolution but more severe spectral leakage, while the Hamming window has a smoother low-pass characteristic and better reflects the frequency characteristics of the short-time signal, so different window functions can be chosen for different conditions during speech analysis;
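The Hamming window formula above can be written out directly and checked against numpy's built-in implementation, which uses the same 0.54/0.46 coefficients:

```python
import numpy as np

def hamming(N):
    """Hamming window w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1)), 0 <= n <= N-1."""
    n = np.arange(N)
    return 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))
```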
The short-time Fourier transform applies the Fourier transform to each frame of the signal to obtain its spectrum; the DFT of an N-point time-domain sequence x[n] is defined as X[k] = Σ_{n=0}^{N-1} x[n] e^(-j2πkn/N), k = 0, 1, ..., N-1;
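Per-frame spectra follow from the DFT definition above; numpy's FFT computes it efficiently, and `rfft` keeps only the non-negative-frequency bins of a real signal (a sketch, not the patent's code):

```python
import numpy as np

def frame_spectrum(frame):
    """Magnitude spectrum of one (windowed) frame via the DFT
    X[k] = sum_n x[n]*exp(-j*2*pi*k*n/N); rfft keeps bins 0..N/2."""
    return np.abs(np.fft.rfft(frame))
```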
further, the energy spectrum produced by the synthesized speech is characterized via a mel filter bank.
Finally, it should be noted that the foregoing description covers only preferred embodiments of the present invention and is not intended to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in those embodiments or replace some of their technical features with equivalents. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Claims (6)
1. A synthetic speech quality assessment method for a speech synthesis system, comprising:
creating a synthesized speech quality dataset that contains speech produced by a speech synthesis system, running a perceptual quality evaluation test on the speech in the dataset to obtain data labels, and using the labels as scores that measure speech quality;
creating a speech quality score prediction model to predict the quality of synthesized speech;
and inputting the synthesized speech under test into the speech quality score prediction model, which is trained on the synthesized speech quality dataset, so that it outputs a predicted quality score for the synthesized speech.
2. The method according to claim 1, wherein the text given to the speech synthesis system includes both long and short sentences, the synthesized speech serves as the audio samples of the synthesized speech quality dataset, and quality evaluation of the synthesized speech yields MOS scores that are used as the data labels.
3. The method according to claim 2, wherein the synthesized speech uses the voices of several different speakers, and the MOS score used as a data label is the average of the scores given to a single utterance.
4. The method according to claim 2, wherein the speech quality score prediction model is based on a convolutional neural network and a self-attention mechanism, the input of the model is features extracted from the speech synthesized by the speech synthesis system, the output of the model is the predicted MOS score of the synthesized speech, and after the model is trained on the synthesized speech quality dataset, it predicts the perceptual quality of synthesized speech.
5. The method according to claim 4, wherein the synthesized speech is processed before being input into the speech quality score prediction model, the processing steps comprising: pre-emphasis, framing, windowing, and short-time Fourier transform; the pre-emphasis employs a first-order high-pass filter H(z) = 1 - μz^(-1), where μ takes a value between 0.9 and 1.0; framing allows characteristic parameters to be extracted from the speech signal for subsequent processing, the extracted parameters including the short-time energy and average amplitude, the short-time average zero-crossing rate, the short-time autocorrelation function, and the short-time average magnitude difference function, with a frame length of 10-30 ms; the window functions used for windowing include the Hamming window and the rectangular window; and the short-time Fourier transform applies the Fourier transform to each frame of the signal to obtain its spectrum.
6. The method according to claim 5, wherein the energy spectrum of the synthesized speech is passed through a mel filter bank to extract features.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310151042.4A CN116230018A (en) | 2023-02-22 | 2023-02-22 | Synthetic voice quality evaluation method for voice synthesis system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310151042.4A CN116230018A (en) | 2023-02-22 | 2023-02-22 | Synthetic voice quality evaluation method for voice synthesis system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116230018A true CN116230018A (en) | 2023-06-06 |
Family
ID=86578123
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310151042.4A Pending CN116230018A (en) | 2023-02-22 | 2023-02-22 | Synthetic voice quality evaluation method for voice synthesis system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116230018A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116504245A (en) * | 2023-06-26 | 2023-07-28 | 凯泰铭科技(北京)有限公司 | Method and system for compiling rules by voice |
CN116504245B (en) * | 2023-06-26 | 2023-09-22 | 凯泰铭科技(北京)有限公司 | Method and system for compiling rules by voice |
CN118645085A (en) * | 2024-08-16 | 2024-09-13 | 罗普特科技集团股份有限公司 | Method and system for evaluating voice quality of mobile perception terminal based on deep learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108900725B (en) | Voiceprint recognition method and device, terminal equipment and storage medium | |
Kingsbury et al. | Robust speech recognition using the modulation spectrogram | |
EP1083542B1 (en) | A method and apparatus for speech detection | |
Zahorian et al. | A spectral/temporal method for robust fundamental frequency tracking | |
AU2007210334B2 (en) | Non-intrusive signal quality assessment | |
CN109034046B (en) | Method for automatically identifying foreign matters in electric energy meter based on acoustic detection | |
CN116230018A (en) | Synthetic voice quality evaluation method for voice synthesis system | |
Nwe et al. | Detection of stress and emotion in speech using traditional and FFT based log energy features | |
Fraile et al. | Automatic detection of laryngeal pathologies in records of sustained vowels by means of mel-frequency cepstral coefficient parameters and differentiation of patients by sex | |
EP0708958A1 (en) | Multi-language speech recognition system | |
Rendón et al. | Automatic detection of hypernasality in children | |
Dubey et al. | Non-intrusive speech quality assessment using several combinations of auditory features | |
Dubuisson et al. | On the use of the correlation between acoustic descriptors for the normal/pathological voices discrimination | |
Chandrashekar et al. | Breathiness indices for classification of dysarthria based on type and speech intelligibility | |
AU2021101586A4 (en) | A System and a Method for Non-Intrusive Speech Quality and Intelligibility Evaluation Measures using FLANN Model | |
Chandrashekar et al. | Region based prediction and score combination for automatic intelligibility assessment of dysarthric speech | |
Dubey et al. | Hypernasality Severity Detection Using Constant Q Cepstral Coefficients. | |
Kąkol et al. | Improving objective speech quality indicators in noise conditions | |
CN111091816B (en) | Data processing system and method based on voice evaluation | |
Slaney et al. | Pitch-gesture modeling using subband autocorrelation change detection. | |
Hinterleitner et al. | Comparison of approaches for instrumentally predicting the quality of text-to-speech systems: Data from Blizzard Challenges 2008 and 2009 | |
Kumar et al. | Speech quality evaluation for different pitch detection algorithms in LPC speech analysis–synthesis system | |
Merzougui et al. | Diagnosing Spasmodic Dysphonia with the Power of AI | |
CN113808595B (en) | Voice conversion method and device from source speaker to target speaker | |
Gayathri et al. | Identification of voice pathology from temporal and cepstral features for vowel 'a' low intonation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||