CN116230018A - Synthetic voice quality evaluation method for voice synthesis system - Google Patents


Info

Publication number
CN116230018A
CN116230018A
Authority
CN
China
Prior art keywords
voice, speech, quality, synthesized, prediction model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310151042.4A
Other languages
Chinese (zh)
Inventor
陈紫东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN202310151042.4A priority Critical patent/CN116230018A/en
Publication of CN116230018A publication Critical patent/CN116230018A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/60: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00: Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30: Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a synthesized speech quality evaluation method for a speech synthesis system, which comprises the following steps: creating a synthesized speech quality dataset, i.e. building a dataset containing speech synthesized by speech synthesis systems, running a perceptual quality evaluation test on the speech in the dataset to obtain data labels, and using the labels as scores that measure speech quality; creating a speech quality score prediction model to predict scores for the synthesized speech; and training the prediction model on the synthesized speech quality dataset, inputting the synthesized speech under test into the model, and outputting a predicted quality score. The method evaluates the quality of speech synthesized by a speech synthesis system: a synthesized speech quality dataset is built, and a deep-learning-based speech quality score prediction model is trained on that dataset, so that the model can predict the quality score of speech synthesized by a speech synthesis system.

Description

Synthetic voice quality evaluation method for voice synthesis system
Technical Field
The present invention relates to the field of speech quality assessment, and in particular, to a method for assessing the quality of synthesized speech in a speech synthesis system.
Background
Speech quality assessment is important for measuring and improving speech synthesis systems. However, current evaluation based on human listening tests carries a high labor cost and a certain amount of error, because every listener's perception and preferences differ; for example, tolerance for different timbres, accents, and speaking rates affects the objectivity and accuracy of the result. On the other hand, some other speech tasks have fairly objective evaluation metrics, such as the word error rate in automatic speech recognition and the speaker error rate in speaker recognition, which can directly serve as measurement and optimization targets for those tasks, but no such metric applies to speech synthesis. Most deep-learning-based speech synthesis systems are trained and optimized by maximizing the likelihood (or minimizing the error) of acoustic parameters and phoneme durations. Although these criteria are usable during training, they cannot serve as indicators of synthesized speech quality: when synthesizing text outside the system's training dataset, no reference speech exists from which to compute the acoustic parameters and phoneme durations, so the criteria become unavailable; moreover, they do not reflect listeners' subjective perception of the synthesized speech.
Objective evaluation methods such as mel-cepstral distance are often used in the voice conversion field to measure the quality of converted speech, but these metrics mainly measure the distortion of acoustic parameters and cannot capture the listener's subjective impression well. Perceptual Evaluation of Speech Quality (PESQ), published by the ITU Telecommunication Standardization Sector, is often used in industrial applications to evaluate speech quality; however, it requires a high-quality reference signal. This limitation prevents its direct application to synthesized speech, which usually has no corresponding original recording, and the PESQ result does not capture the naturalness of synthesized speech.
By categorizing listeners' subjective impressions and specifying a graded rating standard, subjective evaluation methods such as the Mean Opinion Score (MOS) quantify listeners' judgments of speech as grades, which measure both the naturalness of the speech and the listener's subjective impression well. However, MOS scores must be collected through subjective listening tests, which demand considerable time and labor when there are many audio samples, because every utterance must be rated by multiple listeners before the averaged result yields an objective perceptual quality assessment.
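To make the MOS computation concrete, the following Python sketch averages per-listener ratings into per-utterance MOS labels; the rating matrix here is invented purely for illustration and is not data from the patent:

    import numpy as np

    # Hypothetical ratings on a 1-5 scale: rows are listeners, columns are utterances.
    ratings = np.array([
        [4, 3, 5, 2],
        [5, 3, 4, 2],
        [4, 4, 4, 3],
    ])

    # The MOS of each utterance is the mean over listeners; these means become the data labels.
    mos_per_utterance = ratings.mean(axis=0)
    print(mos_per_utterance)  # [4.33... 3.33... 4.33... 2.33...]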
Among deep-learning-based synthesized speech quality evaluation methods, Quality-Net, a model based on a bidirectional long short-term memory network (Bi-LSTM), provides effective non-intrusive evaluation for speech enhancement, with prediction scores highly correlated with PESQ scores. However, PESQ is a speech-enhancement metric and does not measure the naturalness of synthesized speech, so it is unsuitable for purely predicting the quality of speech produced by a synthesis system. MOSNet adopts a model combining a convolutional neural network with a bidirectional long short-term memory network and can predict converted speech quality. It performs well at predicting system-level quality, but the main goal of that work is to evaluate voice conversion systems.
Therefore, it is necessary to explore a synthetic speech quality assessment method for a speech synthesis system.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a synthetic voice quality evaluation method for a voice synthesis system.
In order to solve the technical problems, the invention provides the following technical scheme:
The invention discloses a synthesized speech quality evaluation method for a speech synthesis system, which comprises the following steps:
creating a synthesized speech quality dataset: build a dataset containing speech synthesized by speech synthesis systems, run a perceptual quality evaluation test on the speech in the dataset to obtain data labels, and use the labels as scores that measure speech quality;
creating a speech quality score prediction model to predict scores for the synthesized speech;
and training the prediction model on the synthesized speech quality dataset, inputting the synthesized speech under test into the model, and outputting a predicted quality score.
As a preferred technical scheme of the invention, the text used by the speech synthesis system includes both long and short sentences; the synthesized speech serves as the audio samples of a synthesized speech quality dataset, and quality evaluation of the synthesized speech yields MOS (Mean Opinion Score) values, which are used as the data labels.
As a preferred technical scheme of the invention, the synthesized speech uses the voices of several different speakers, and the MOS score used as a data label is the average of the individual listener scores.
As a preferred technical scheme of the invention, the speech quality score prediction model is based on a convolutional neural network and a self-attention mechanism; the input of the model is features extracted from the speech synthesized by a speech synthesis system, the output is the predicted MOS score for the synthesized speech, and after the model is trained on the synthesized speech quality dataset it predicts the perceptual quality of synthesized speech.
As a preferred technical scheme of the invention, the synthesized speech is processed before being input into the speech quality score prediction model. The processing steps are pre-emphasis, framing, windowing, and short-time Fourier transform. Pre-emphasis uses a first-order high-pass filter H(z) = 1 - μz^(-1), where μ takes a value between 0.9 and 1.0. Framing divides the speech signal so that characteristic parameters can be analyzed and extracted for subsequent processing; the extracted parameters include the short-time energy and average amplitude of the speech, the short-time average zero-crossing rate, the short-time autocorrelation function, and the short-time average magnitude difference function, and the frame length is 10-30 ms. The window functions used for windowing include the Hamming window and the rectangular window, and the short-time Fourier transform applies a Fourier transform to each frame of the signal to obtain its spectrum.
As a preferred embodiment of the present invention, the energy spectrum of the synthesized speech is converted into features by a mel filter bank.
Compared with the prior art, the invention has the following beneficial effects:
according to the invention, the quality evaluation is carried out on the synthesized voice of the voice synthesis system, a synthesized voice quality dataset is established by using a deep learning-based method, and a voice quality score prediction model is trained by using the dataset, so that the voice quality score prediction model can predict the quality score of the synthesized voice of the voice synthesis system.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate the invention and together with the embodiments of the invention, serve to explain the invention. In the drawings:
FIG. 1 is an overall flow chart of the present invention;
FIG. 2 is a schematic illustration of the speech quality score prediction model of the present invention.
Detailed Description
The preferred embodiments of the present invention will be described below with reference to the accompanying drawings, it being understood that the preferred embodiments described herein are for illustration and explanation of the present invention only, and are not intended to limit the present invention.
Example 1
As shown in fig. 1-2, the present invention provides a synthetic speech quality assessment method for a speech synthesis system, comprising:
creating a synthesized speech quality dataset: build a dataset containing speech synthesized by speech synthesis systems, run a perceptual quality evaluation test on the speech in the dataset to obtain data labels, and use the labels as scores that measure speech quality;
creating a speech quality score prediction model to predict scores for the synthesized speech;
and training the prediction model on the synthesized speech quality dataset, inputting the synthesized speech under test into the model, and outputting a predicted quality score.
Further, the text used by the speech synthesis system includes both long and short sentences; the synthesized speech serves as the audio samples of a synthesized speech quality dataset, and quality evaluation of the synthesized speech yields MOS scores, which are used as the data labels.
Furthermore, the synthesized speech uses the voices of several different speakers, and the MOS score used as a data label is the average of the individual listener scores. To ensure the diversity of the data, some of the speech synthesis systems produce speech of better quality while others produce speech of poorer quality.
Further, the speech quality score prediction model is based on a convolutional neural network and a self-attention mechanism; the input of the model is features extracted from the speech synthesized by a speech synthesis system, and the output is the predicted MOS score for the synthesized speech. After the model is trained on the synthesized speech quality dataset, it predicts the perceptual quality of synthesized speech.
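As one way to realize such a model, the following PyTorch sketch combines a small convolutional front-end with a self-attention encoder and regresses an utterance-level MOS score; all layer sizes (channel counts, model dimension, number of attention heads, 40 mel bands) are assumptions for illustration and are not specified by the patent:

    import torch
    import torch.nn as nn

    class MOSPredictor(nn.Module):
        def __init__(self, n_mels=40, d_model=256, n_heads=4):
            super().__init__()
            # Convolutional front-end over the (time, mel) plane.
            self.cnn = nn.Sequential(
                nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            )
            self.proj = nn.Linear(32 * n_mels, d_model)
            # Self-attention across time frames.
            self.encoder = nn.TransformerEncoderLayer(
                d_model=d_model, nhead=n_heads, batch_first=True)
            self.head = nn.Linear(d_model, 1)

        def forward(self, mel):                 # mel: (batch, time, n_mels)
            x = self.cnn(mel.unsqueeze(1))      # (batch, 32, time, n_mels)
            b, c, t, f = x.shape
            x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
            x = self.encoder(self.proj(x))      # attention over frames
            return self.head(x.mean(dim=1)).squeeze(-1)  # utterance-level score

    model = MOSPredictor()
    fake_mel = torch.randn(2, 120, 40)          # two utterances, 120 frames of 40 mel bands
    print(model(fake_mel).shape)                # torch.Size([2])

In training, the predicted scores would be regressed onto the dataset's MOS labels, for example with a mean-squared-error loss.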
Further, the synthesized speech is processed before being input into the speech quality score prediction model. The processing steps are as follows: pre-emphasis, framing, windowing, and short-time Fourier transform.
Pre-emphasis uses a first-order high-pass filter H(z) = 1 - μz^(-1), where μ takes a value between 0.9 and 1.0 and is set to 0.96 here; it boosts the high-frequency part of the signal and highlights the high-frequency formants.
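A minimal sketch of this pre-emphasis filter in Python, assuming the stated value μ = 0.96:

    import numpy as np

    def pre_emphasis(x, mu=0.96):
        # y[n] = x[n] - mu * x[n-1], i.e. the first-order high-pass filter H(z) = 1 - mu*z^(-1).
        return np.append(x[0], x[1:] - mu * x[:-1])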
Framing divides the speech signal so that characteristic parameters can be analyzed and extracted for subsequent processing. The extracted parameters include the short-time energy and average amplitude of the speech, the short-time average zero-crossing rate, the short-time autocorrelation function, and the short-time average magnitude difference function. The frame length is 10-30 ms: X consecutive sampling points are grouped into one frame, each frame is 256 sampling points long, adjacent frames overlap by S sampling points, where S is about 1/3 to 1/2 of X, and the sampling frequency is 8 kHz.
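The framing step can be sketched as follows, using a 256-point frame at 8 kHz with a half-frame overlap (128 points, within the stated 1/3 to 1/2 range); the signal is assumed long enough to hold at least one frame:

    import numpy as np

    def frame_signal(x, frame_len=256, overlap=128):
        # Slide a 256-point window with hop = frame_len - overlap = 128 points.
        hop = frame_len - overlap
        n_frames = 1 + (len(x) - frame_len) // hop
        idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
        return x[idx]  # shape: (n_frames, frame_len)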
The window functions used for windowing include the Hamming window and the rectangular window; the windowing operation is s'(n) = s(n) × w(n). The Hamming window is

w(n) = 0.54 - 0.46 cos(2πn / (N - 1)) for 0 ≤ n ≤ N - 1, and 0 otherwise;

the rectangular window is

w(n) = 1 for 0 ≤ n ≤ N - 1, and 0 otherwise.

The rectangular window has higher spectral resolution but more severe spectral leakage, while the Hamming window has a smoother low-pass characteristic and reflects the frequency characteristics of the short-time signal more faithfully; different window functions can be selected for different situations during speech analysis.
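A sketch of the two window functions and the windowing operation s'(n) = s(n)w(n), for a 256-point frame:

    import numpy as np

    N = 256
    n = np.arange(N)
    hamming = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))  # Hamming window w(n)
    rect = np.ones(N)                                        # rectangular window w(n)

    frame = np.random.randn(N)   # one illustrative frame s(n)
    windowed = frame * hamming   # s'(n) = s(n) * w(n)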
The short-time Fourier transform applies a Fourier transform to each frame of the signal to obtain its spectrum; the N-point DFT of a time-domain sequence x[n] is defined as

X[k] = Σ_{n=0}^{N-1} x[n] e^(-j2πkn/N), k = 0, 1, ..., N - 1.
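The per-frame spectrum follows directly from this definition; in practice an FFT computes the DFT, as in this sketch:

    import numpy as np

    frame = np.random.randn(256)    # one windowed frame x[n]
    spectrum = np.fft.rfft(frame)   # one-sided DFT X[k], k = 0..128
    power = np.abs(spectrum) ** 2   # energy spectrum passed on to the mel filter bank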
Further, the energy spectrum of the synthesized speech is converted into features via a mel filter bank.
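A minimal sketch of the mel filter bank step, assuming the 8 kHz sampling rate and 256-point frames from above; the choice of 40 mel bands is an assumption, not a value from the patent:

    import numpy as np
    import librosa

    mel_fb = librosa.filters.mel(sr=8000, n_fft=256, n_mels=40)   # shape (40, 129)

    power = np.abs(np.fft.rfft(np.random.randn(256))) ** 2        # energy spectrum of one frame
    mel_energy = mel_fb @ power                                   # 40 mel-filtered energies
    log_mel = np.log(mel_energy + 1e-10)                          # log-mel feature for the model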
Finally, it should be noted that the foregoing description covers only preferred embodiments of the present invention and is not intended to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described therein or substitute equivalents for some of their technical features. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall be included in its scope of protection.

Claims (6)

1. A synthetic speech quality assessment method for a speech synthesis system, comprising:
creating a synthesized speech quality dataset: building a dataset containing speech synthesized by speech synthesis systems, running a perceptual quality evaluation test on the speech in the dataset to obtain data labels, and using the labels as scores that measure speech quality;
creating a speech quality score prediction model to predict scores for the synthesized speech;
and training the prediction model on the synthesized speech quality dataset, inputting the synthesized speech under test into the model, and outputting a predicted quality score.
2. The method for evaluating synthesized speech quality for a speech synthesis system according to claim 1, wherein the text used by the speech synthesis system includes both long and short sentences, the synthesized speech serves as the audio samples of a synthesized speech quality dataset, and quality evaluation of the synthesized speech yields MOS scores that are used as the data labels.
3. The method for evaluating synthesized speech quality for a speech synthesis system according to claim 2, wherein the synthesized speech uses the voices of several different speakers, and the MOS score used as a data label is the average of the individual listener scores.
4. The method for evaluating synthesized speech quality for a speech synthesis system according to claim 2, wherein the speech quality score prediction model is based on a convolutional neural network and a self-attention mechanism, the input of the model is features extracted from the speech synthesized by the speech synthesis system, the output of the model is the predicted MOS score for the synthesized speech, and after the model is trained on the synthesized speech quality dataset it predicts the perceptual quality of synthesized speech.
5. The method for evaluating synthesized speech quality for a speech synthesis system according to claim 4, wherein the synthesized speech is processed before being input into the speech quality score prediction model, the processing steps being pre-emphasis, framing, windowing, and short-time Fourier transform, wherein the pre-emphasis uses a first-order high-pass filter H(z) = 1 - μz^(-1) in which μ takes a value between 0.9 and 1.0; the framing divides the speech signal so that characteristic parameters can be analyzed and extracted for subsequent processing, the extracted parameters including the short-time energy and average amplitude of the speech, the short-time average zero-crossing rate, the short-time autocorrelation function, and the short-time average magnitude difference function, with a frame length of 10-30 ms; the window functions used for windowing include the Hamming window and the rectangular window; and the short-time Fourier transform applies a Fourier transform to each frame of the signal to obtain its spectrum.
6. The method for evaluating synthesized speech quality for a speech synthesis system according to claim 5, wherein the energy spectrum of the synthesized speech is converted into features via a mel filter bank.
CN202310151042.4A 2023-02-22 2023-02-22 Synthetic voice quality evaluation method for voice synthesis system Pending CN116230018A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310151042.4A CN116230018A (en) 2023-02-22 2023-02-22 Synthetic voice quality evaluation method for voice synthesis system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310151042.4A CN116230018A (en) 2023-02-22 2023-02-22 Synthetic voice quality evaluation method for voice synthesis system

Publications (1)

Publication Number Publication Date
CN116230018A true CN116230018A (en) 2023-06-06

Family

ID=86578123

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310151042.4A Pending CN116230018A (en) 2023-02-22 2023-02-22 Synthetic voice quality evaluation method for voice synthesis system

Country Status (1)

Country Link
CN (1) CN116230018A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116504245A (en) * 2023-06-26 2023-07-28 凯泰铭科技(北京)有限公司 Method and system for compiling rules by voice
CN116504245B (en) * 2023-06-26 2023-09-22 凯泰铭科技(北京)有限公司 Method and system for compiling rules by voice


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination