CN116230018A - Synthetic voice quality evaluation method for voice synthesis system - Google Patents


Info

Publication number
CN116230018A
CN116230018A
Authority
CN
China
Prior art keywords
voice, speech, quality, synthesized, prediction model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310151042.4A
Other languages
Chinese (zh)
Inventor
陈紫东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN202310151042.4A priority Critical patent/CN116230018A/en
Publication of CN116230018A publication Critical patent/CN116230018A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/60: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00: Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30: Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a synthesized speech quality evaluation method for a speech synthesis system, which comprises the following steps: creating a synthesized speech quality dataset, i.e. building a dataset containing speech synthesized by speech synthesis systems, running a perceptual quality evaluation test on the speech in the dataset to obtain data labels, and using the labels as scores that measure speech quality; creating a speech quality score prediction model to predict scores for the synthesized speech; and training the prediction model on the synthesized speech quality dataset, inputting the synthesized speech under test into the model, and outputting a predicted quality score. The method evaluates the quality of speech synthesized by a speech synthesis system: a synthesized speech quality dataset is built, and a deep-learning-based speech quality score prediction model is trained on that dataset, so that the model can predict the quality score of speech synthesized by a speech synthesis system.

Description

Synthetic voice quality evaluation method for voice synthesis system
Technical Field
The present invention relates to the field of speech quality assessment, and in particular, to a method for assessing the quality of synthesized speech in a speech synthesis system.
Background
Speech quality assessment is important for measuring and improving speech synthesis systems. However, current evaluation based on human listening tests carries a high labor cost and a certain amount of error, because every listener's perception and preferences differ; for example, tolerance for different timbres, accents, and speaking rates affects the objectivity and accuracy of the result. On the other hand, some other speech tasks have fairly objective evaluation metrics, such as the word error rate in automatic speech recognition and the speaker error rate in speaker recognition, which can directly serve as measurement and optimization targets for those tasks, but no such metric applies to speech synthesis. Most deep-learning-based speech synthesis systems are trained and optimized by maximizing the likelihood (or minimizing the error) of acoustic parameters and phoneme durations. Although these criteria are usable during training, they cannot serve as indicators of synthesized speech quality: when synthesizing text outside the system's training dataset, no reference speech exists from which to compute the acoustic parameters and phoneme durations, so the criteria become unavailable; moreover, they do not reflect listeners' subjective perception of the synthesized speech.
Objective evaluation methods such as mel-cepstral distance are often used in the voice conversion field to measure the quality of converted speech, but these metrics mainly measure the distortion of acoustic parameters and cannot capture the listener's subjective impression well. Perceptual Evaluation of Speech Quality (PESQ), published by the ITU Telecommunication Standardization Sector, is often used in industrial applications to evaluate speech quality; however, it requires a high-quality reference signal. This limitation prevents its direct application to synthesized speech, which usually has no corresponding original recording, and the PESQ result does not capture the naturalness of synthesized speech.
By categorizing listeners' subjective impressions and specifying a graded rating standard, subjective evaluation methods such as the Mean Opinion Score (MOS) quantify listeners' judgments of speech as grades, which measure both the naturalness of the speech and the listener's subjective impression well. However, MOS scores must be collected through subjective listening tests, which demand considerable time and labor when there are many audio samples, because every utterance must be rated by multiple listeners before the averaged result yields an objective perceptual quality assessment.
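To make the MOS computation concrete, the following Python sketch averages per-listener ratings into per-utterance MOS labels; the rating matrix here is invented purely for illustration and is not data from the patent:

    import numpy as np

    # Hypothetical ratings on a 1-5 scale: rows are listeners, columns are utterances.
    ratings = np.array([
        [4, 3, 5, 2],
        [5, 3, 4, 2],
        [4, 4, 4, 3],
    ])

    # The MOS of each utterance is the mean over listeners; these means become the data labels.
    mos_per_utterance = ratings.mean(axis=0)
    print(mos_per_utterance)  # [4.33... 3.33... 4.33... 2.33...]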
Among deep-learning-based synthesized speech quality evaluation methods, Quality-Net, a model based on a bidirectional long short-term memory network (Bi-LSTM), provides effective non-intrusive evaluation for speech enhancement, with prediction scores highly correlated with PESQ scores. However, PESQ is a speech-enhancement metric and does not measure the naturalness of synthesized speech, so it is unsuitable for purely predicting the quality of speech produced by a synthesis system. MOSNet adopts a model combining a convolutional neural network with a bidirectional long short-term memory network and can predict converted speech quality. It performs well at predicting system-level quality, but the main goal of that work is to evaluate voice conversion systems.
Therefore, it is necessary to explore a synthetic speech quality assessment method for a speech synthesis system.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a synthetic voice quality evaluation method for a voice synthesis system.
In order to solve the technical problems, the invention provides the following technical scheme:
The invention discloses a synthesized speech quality evaluation method for a speech synthesis system, which comprises the following steps:
creating a synthesized speech quality dataset: build a dataset containing speech synthesized by speech synthesis systems, run a perceptual quality evaluation test on the speech in the dataset to obtain data labels, and use the labels as scores that measure speech quality;
creating a speech quality score prediction model to predict scores for the synthesized speech;
and training the prediction model on the synthesized speech quality dataset, inputting the synthesized speech under test into the model, and outputting a predicted quality score.
As a preferred technical scheme of the invention, the text used by the speech synthesis system includes both long and short sentences; the synthesized speech serves as the audio samples of a synthesized speech quality dataset, and quality evaluation of the synthesized speech yields MOS (Mean Opinion Score) values, which are used as the data labels.
As a preferred technical scheme of the invention, the synthesized speech uses the voices of several different speakers, and the MOS score used as a data label is the average of the individual listener scores.
As a preferred technical scheme of the invention, the speech quality score prediction model is based on a convolutional neural network and a self-attention mechanism; the input of the model is features extracted from the speech synthesized by a speech synthesis system, the output is the predicted MOS score for the synthesized speech, and after the model is trained on the synthesized speech quality dataset it predicts the perceptual quality of synthesized speech.
As a preferred technical scheme of the invention, the synthesized speech is processed before being input into the speech quality score prediction model. The processing steps are pre-emphasis, framing, windowing, and short-time Fourier transform. Pre-emphasis uses a first-order high-pass filter H(z) = 1 - μz^(-1), where μ takes a value between 0.9 and 1.0. Framing divides the speech signal so that characteristic parameters can be analyzed and extracted for subsequent processing; the extracted parameters include the short-time energy and average amplitude of the speech, the short-time average zero-crossing rate, the short-time autocorrelation function, and the short-time average magnitude difference function, and the frame length is 10-30 ms. The window functions used for windowing include the Hamming window and the rectangular window, and the short-time Fourier transform applies a Fourier transform to each frame of the signal to obtain its spectrum.
As a preferred embodiment of the present invention, the energy spectrum of the synthesized speech is converted into features by a mel filter bank.
Compared with the prior art, the invention has the following beneficial effects:
according to the invention, the quality evaluation is carried out on the synthesized voice of the voice synthesis system, a synthesized voice quality dataset is established by using a deep learning-based method, and a voice quality score prediction model is trained by using the dataset, so that the voice quality score prediction model can predict the quality score of the synthesized voice of the voice synthesis system.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate the invention and together with the embodiments of the invention, serve to explain the invention. In the drawings:
FIG. 1 is an overall flow chart of the present invention;
FIG. 2 is a schematic illustration of the speech quality score prediction model of the present invention.
Detailed Description
The preferred embodiments of the present invention will be described below with reference to the accompanying drawings, it being understood that the preferred embodiments described herein are for illustration and explanation of the present invention only, and are not intended to limit the present invention.
Example 1
As shown in fig. 1-2, the present invention provides a synthetic speech quality assessment method for a speech synthesis system, comprising:
creating a synthesized speech quality dataset: build a dataset containing speech synthesized by speech synthesis systems, run a perceptual quality evaluation test on the speech in the dataset to obtain data labels, and use the labels as scores that measure speech quality;
creating a speech quality score prediction model to predict scores for the synthesized speech;
and training the prediction model on the synthesized speech quality dataset, inputting the synthesized speech under test into the model, and outputting a predicted quality score.
Further, the text used by the speech synthesis system includes both long and short sentences; the synthesized speech serves as the audio samples of a synthesized speech quality dataset, and quality evaluation of the synthesized speech yields MOS scores, which are used as the data labels.
Furthermore, the synthesized speech uses the voices of several different speakers, and the MOS score used as a data label is the average of the individual listener scores. To ensure the diversity of the data, some of the speech synthesis systems produce speech of better quality while others produce speech of poorer quality.
Further, the speech quality score prediction model is based on a convolutional neural network and a self-attention mechanism; the input of the model is features extracted from the speech synthesized by a speech synthesis system, and the output is the predicted MOS score for the synthesized speech. After the model is trained on the synthesized speech quality dataset, it predicts the perceptual quality of synthesized speech.
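As one way to realize such a model, the following PyTorch sketch combines a small convolutional front-end with a self-attention encoder and regresses an utterance-level MOS score; all layer sizes (channel counts, model dimension, number of attention heads, 40 mel bands) are assumptions for illustration and are not specified by the patent:

    import torch
    import torch.nn as nn

    class MOSPredictor(nn.Module):
        def __init__(self, n_mels=40, d_model=256, n_heads=4):
            super().__init__()
            # Convolutional front-end over the (time, mel) plane.
            self.cnn = nn.Sequential(
                nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            )
            self.proj = nn.Linear(32 * n_mels, d_model)
            # Self-attention across time frames.
            self.encoder = nn.TransformerEncoderLayer(
                d_model=d_model, nhead=n_heads, batch_first=True)
            self.head = nn.Linear(d_model, 1)

        def forward(self, mel):                 # mel: (batch, time, n_mels)
            x = self.cnn(mel.unsqueeze(1))      # (batch, 32, time, n_mels)
            b, c, t, f = x.shape
            x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
            x = self.encoder(self.proj(x))      # attention over frames
            return self.head(x.mean(dim=1)).squeeze(-1)  # utterance-level score

    model = MOSPredictor()
    fake_mel = torch.randn(2, 120, 40)          # two utterances, 120 frames of 40 mel bands
    print(model(fake_mel).shape)                # torch.Size([2])

In training, the predicted scores would be regressed onto the dataset's MOS labels, for example with a mean-squared-error loss.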
Further, the synthesized speech is processed before being input into the speech quality score prediction model. The processing steps are as follows: pre-emphasis, framing, windowing, and short-time Fourier transform.
Pre-emphasis uses a first-order high-pass filter H(z) = 1 - μz^(-1), where μ takes a value between 0.9 and 1.0 and is set to 0.96 here; it boosts the high-frequency part of the signal and highlights the high-frequency formants.
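A minimal sketch of this pre-emphasis filter in Python, assuming the stated value μ = 0.96:

    import numpy as np

    def pre_emphasis(x, mu=0.96):
        # y[n] = x[n] - mu * x[n-1], i.e. the first-order high-pass filter H(z) = 1 - mu*z^(-1).
        return np.append(x[0], x[1:] - mu * x[:-1])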
Framing divides the speech signal so that characteristic parameters can be analyzed and extracted for subsequent processing. The extracted parameters include the short-time energy and average amplitude of the speech, the short-time average zero-crossing rate, the short-time autocorrelation function, and the short-time average magnitude difference function. The frame length is 10-30 ms: X consecutive sampling points are grouped into one frame, each frame is 256 sampling points long, adjacent frames overlap by S sampling points, where S is about 1/3 to 1/2 of X, and the sampling frequency is 8 kHz.
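The framing step can be sketched as follows, using a 256-point frame at 8 kHz with a half-frame overlap (128 points, within the stated 1/3 to 1/2 range); the signal is assumed long enough to hold at least one frame:

    import numpy as np

    def frame_signal(x, frame_len=256, overlap=128):
        # Slide a 256-point window with hop = frame_len - overlap = 128 points.
        hop = frame_len - overlap
        n_frames = 1 + (len(x) - frame_len) // hop
        idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
        return x[idx]  # shape: (n_frames, frame_len)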
The window functions used for windowing include the Hamming window and the rectangular window; the windowing operation is s'(n) = s(n) × w(n). The Hamming window is

w(n) = 0.54 - 0.46 cos(2πn / (N - 1)) for 0 ≤ n ≤ N - 1, and 0 otherwise;

the rectangular window is

w(n) = 1 for 0 ≤ n ≤ N - 1, and 0 otherwise.

The rectangular window has higher spectral resolution but more severe spectral leakage, while the Hamming window has a smoother low-pass characteristic and reflects the frequency characteristics of the short-time signal more faithfully; different window functions can be selected for different situations during speech analysis.
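A sketch of the two window functions and the windowing operation s'(n) = s(n)w(n), for a 256-point frame:

    import numpy as np

    N = 256
    n = np.arange(N)
    hamming = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))  # Hamming window w(n)
    rect = np.ones(N)                                        # rectangular window w(n)

    frame = np.random.randn(N)   # one illustrative frame s(n)
    windowed = frame * hamming   # s'(n) = s(n) * w(n)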
The short-time Fourier transform applies a Fourier transform to each frame of the signal to obtain its spectrum; the N-point DFT of a time-domain sequence x[n] is defined as

X[k] = Σ_{n=0}^{N-1} x[n] e^(-j2πkn/N), k = 0, 1, ..., N - 1.
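The per-frame spectrum follows directly from this definition; in practice an FFT computes the DFT, as in this sketch:

    import numpy as np

    frame = np.random.randn(256)    # one windowed frame x[n]
    spectrum = np.fft.rfft(frame)   # one-sided DFT X[k], k = 0..128
    power = np.abs(spectrum) ** 2   # energy spectrum passed on to the mel filter bank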
Further, the energy spectrum of the synthesized speech is converted into features via a mel filter bank.
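A minimal sketch of the mel filter bank step, assuming the 8 kHz sampling rate and 256-point frames from above; the choice of 40 mel bands is an assumption, not a value from the patent:

    import numpy as np
    import librosa

    mel_fb = librosa.filters.mel(sr=8000, n_fft=256, n_mels=40)   # shape (40, 129)

    power = np.abs(np.fft.rfft(np.random.randn(256))) ** 2        # energy spectrum of one frame
    mel_energy = mel_fb @ power                                   # 40 mel-filtered energies
    log_mel = np.log(mel_energy + 1e-10)                          # log-mel feature for the model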
Finally, it should be noted that the foregoing description covers only preferred embodiments of the present invention and is not intended to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described therein or substitute equivalents for some of their technical features. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall be included in its scope of protection.

Claims (6)

1. A synthetic speech quality assessment method for a speech synthesis system, comprising:
creating a synthesized speech quality dataset: building a dataset containing speech synthesized by speech synthesis systems, running a perceptual quality evaluation test on the speech in the dataset to obtain data labels, and using the labels as scores that measure speech quality;
creating a speech quality score prediction model to predict scores for the synthesized speech;
and training the prediction model on the synthesized speech quality dataset, inputting the synthesized speech under test into the model, and outputting a predicted quality score.
2. The method for evaluating synthesized speech quality for a speech synthesis system according to claim 1, wherein the text used by the speech synthesis system includes both long and short sentences, the synthesized speech serves as the audio samples of a synthesized speech quality dataset, and quality evaluation of the synthesized speech yields MOS scores that are used as the data labels.
3. The method for evaluating synthesized speech quality for a speech synthesis system according to claim 2, wherein the synthesized speech uses the voices of several different speakers, and the MOS score used as a data label is the average of the individual listener scores.
4. The method for evaluating synthesized speech quality for a speech synthesis system according to claim 2, wherein the speech quality score prediction model is based on a convolutional neural network and a self-attention mechanism, the input of the model is features extracted from the speech synthesized by the speech synthesis system, the output of the model is the predicted MOS score for the synthesized speech, and after the model is trained on the synthesized speech quality dataset it predicts the perceptual quality of synthesized speech.
5. The method for evaluating synthesized speech quality for a speech synthesis system according to claim 4, wherein the synthesized speech is processed before being input into the speech quality score prediction model, the processing steps being pre-emphasis, framing, windowing, and short-time Fourier transform, wherein the pre-emphasis uses a first-order high-pass filter H(z) = 1 - μz^(-1) in which μ takes a value between 0.9 and 1.0; the framing divides the speech signal so that characteristic parameters can be analyzed and extracted for subsequent processing, the extracted parameters including the short-time energy and average amplitude of the speech, the short-time average zero-crossing rate, the short-time autocorrelation function, and the short-time average magnitude difference function, with a frame length of 10-30 ms; the window functions used for windowing include the Hamming window and the rectangular window; and the short-time Fourier transform applies a Fourier transform to each frame of the signal to obtain its spectrum.
6. The method for evaluating synthesized speech quality for a speech synthesis system according to claim 5, wherein the energy spectrum of the synthesized speech is converted into features via a mel filter bank.
CN202310151042.4A 2023-02-22 2023-02-22 Synthetic voice quality evaluation method for voice synthesis system Pending CN116230018A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310151042.4A CN116230018A (en) 2023-02-22 2023-02-22 Synthetic voice quality evaluation method for voice synthesis system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310151042.4A CN116230018A (en) 2023-02-22 2023-02-22 Synthetic voice quality evaluation method for voice synthesis system

Publications (1)

Publication Number Publication Date
CN116230018A true CN116230018A (en) 2023-06-06

Family

ID=86578123

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310151042.4A Pending CN116230018A (en) 2023-02-22 2023-02-22 Synthetic voice quality evaluation method for voice synthesis system

Country Status (1)

Country Link
CN (1) CN116230018A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116504245A (en) * 2023-06-26 2023-07-28 凯泰铭科技(北京)有限公司 Method and system for compiling rules by voice
CN116504245B (en) * 2023-06-26 2023-09-22 凯泰铭科技(北京)有限公司 Method and system for compiling rules by voice


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination