GB2423903A - Assessing the subjective quality of TTS systems which accounts for variations between synthesised and original speech - Google Patents


Info

Publication number: GB2423903A
Authority: GB (United Kingdom)
Prior art keywords: tts, speech, text, error, model
Legal status: Granted
Application number: GB0504546A
Other versions: GB2423903B (en), GB0504546D0 (en)
Inventor: Tina Louise Burrows
Current Assignee: Toshiba Europe Ltd
Original Assignee: Toshiba Research Europe Ltd
Application filed by Toshiba Research Europe Ltd
Priority to GB0504546A (patent GB2423903B)
Publication of GB0504546D0
Publication of GB2423903A
Application granted; publication of GB2423903B
Current legal status: Expired - Fee Related

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/69: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for evaluating synthetic or decoded voice signals
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04: Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Abstract

An assessment system for assessing a Text-to-Speech (TTS) system comprising: a database comprising a plurality of test texts, associated natural speech data and time-aligned annotations; a perceptual model for converting speech data (waveform and annotation) to a representation that is perceptually meaningful for comparison; an error model for comparing, for each test text, the perceptually meaningful data generated by the perceptual model for speech data from a TTS system and corresponding perceptually meaningful data generated by the perceptual model for natural speech data from the database in order to derive objective error measurements of the TTS system; and a cognitive model capable of mapping the objective error measurements of the TTS system derived by the error model to a subjective measurement of the quality of the TTS system.

Description

M&C Folio: GBP91503

Method and Apparatus for Assessing Text-to-Speech Synthesis Systems

This invention relates to a method and apparatus for testing and assessing text-to-speech (TTS) synthesis systems.
In a TTS system a speech waveform for a desired text or utterance is generated from text alone and the TTS system has to model the cognitive processing that humans perform to generate suitable intonation for an utterance prior to speaking.
The general form of a text-to-speech synthesis system is shown in Figure 1. A TTS system 1 generally comprises a text processing component 3 which analyses the text to be synthesised and decides how it is to be generated as synthesised speech and a waveform generation component 5 which generates the actual speech waveform samples based on the output of the text processing component.
Text 7 to be synthesised by the TTS system is first passed to a text analysis stage 11 which performs a number of functions including text normalisation 11a (e.g. the expansion of abbreviations and digits into word strings), lexical analysis 11b (in which the text is converted into a 'pronunciation', a "phone" sequence that represents the sounds to be produced, and positions of lexical stress are identified) and syntactic/semantic analysis 11c (in which the syntactic/semantic structure and the form and function of each word in the text are decided).
The next stage in the system is the generation of a suitable prosody for the synthesised speech from the features derived by the text analysis stage. The prosody generation stage 13 includes a number of sub-stages (13a-13e) to calculate the following: a prosodic phrasing for the utterance 13a, the location and duration of pauses within the utterance 13b, which words are to be accented within the utterance 13c, the duration of each phone in the phone sequence 13d and a target pitch contour for the synthesised speech 13e.
During the prosody generation stage, the system derives an internal representation 20 of the original text which is passed to the waveform generation component 5 of the system for generation of the actual samples of the synthesised speech waveform.
The internal representation comprises all the information needed by the waveform generation component (such as speaking rate and volume, the phone sequence and their durations, the position and duration of pauses and the target pitch contour), and may additionally include details of the lexical analysis, syntactic/semantic analysis, prosodic phrase structure and word accenting that are output from each of the sub-stages 11b, 11c, 13a and 13c. That part of the internal representation that could alternatively have been derived by human annotation of the speech waveform and text is referred to as the 'TTS annotation' (syntactic/semantic analysis, phone sequence and durations, pause locations and durations, prosodic phrasing, word accenting, and pitch contour).
The perceived quality of the synthesised speech is affected by how well the text analysis, prosody generation and waveform generation components perform in some perceptual sense.
Currently the only way to determine the perceived quality of speech synthesised by a TTS system is to carry out subjective tests using human listeners in order to assess various aspects of the quality of the synthesised speech, such as naturalness and intelligibility. While such assessments provide useful information for researchers, system developers and customers choosing a TTS system for a particular application, they are costly and time consuming. Furthermore, there is no standardised method for comparative benchmarking of TTS systems.
The perceptual quality of speech that has been distorted by transmission through a speech codec, over a telephone or mobile network or using Voice over IP has been considered in the paper by Rix, Beerends, Hollier and Hekstra titled "PESQ - the new ITU standard for end-to-end speech quality assessment" [AES 109th Convention, Los Angeles, 22-25 September 2000]. The assessment solution adopted in this paper has been to define objective measures from which the subjective speech quality can be predicted. However, in the coding or network scenario of this paper, the assessment method is able to compare an undistorted version of real speech with a distorted version of the same speech that has the same intonation but which has been distorted by coding or transmission.
By contrast in a TTS system the speech waveform is generated from text and the intonation is predicted by the system and may differ from that of the original speech.
The above assessment method does not therefore provide a complete method of assessing the quality of the TTS system, since it does not account for these variations between synthesised and original speech.
It is an object of the present invention to provide a method and apparatus for assessing the subjective quality of TTS systems which substantially overcomes or mitigates the above mentioned problems.
Accordingly, in a first aspect of the present invention, there is provided an assessment system for assessing a Text-to-Speech (TTS) system comprising: i) a database comprising a plurality of test texts, associated natural speech waveforms for the plurality of test texts and time-aligned annotations of the natural speech waveforms relating to the plurality of test texts; ii) a perceptual model arranged in use, for both speech data output by the TTS and natural speech data from the database, to align speech waveforms and their associated annotations and to map speech waveforms to a loudness scale; iii) an error model for comparing, for each test text, data generated by the perceptual model from the TTS output and corresponding data derived by the perceptual model from the natural speech waveforms and their annotations from the database in order to derive objective error measurements for the quality of the TTS system; and iv) a cognitive model capable of mapping the objective error measurements for the quality of the TTS system derived by the error model to a subjective measurement of the quality of the TTS system.
Correspondingly, in a second aspect of the present invention, there is provided a method of assessing a Text-to-Speech (TTS) system comprising the steps of: i) receiving speech data including speech waveforms and associated annotations generated by a TTS system from a set of test texts and also including natural speech data from a database, the database comprising a plurality of test texts, associated natural speech waveforms for the plurality of test texts and time-aligned annotations of the natural speech waveforms relating to the plurality of test texts; ii) for both speech data output by the TTS and natural speech data from the database, aligning the speech waveforms and their associated annotations and mapping speech waveforms to a loudness scale; iii) comparing, for each test text, the data derived in step (ii) from the speech data output by the TTS system with corresponding data derived in step (ii) from the natural speech data in order to derive objective error measurements for the quality of the TTS system; and iv) mapping the objective error measurements for the quality of the TTS system to a subjective measurement of the quality of the TTS system.
The TTS assessment system of the present invention objectively compares perceptual data derived from the natural speech data in the database and perceptual data derived from the speech data output by the TTS system (synthesised speech and TTS annotations) in order to assess the errors within the TTS system.
The annotations in the database may conveniently include all or some of the following: time-aligned phone sequence and phone durations, location of pauses and their durations, pitch contour, prosodic phrasing and word accenting information (for example as represented by the ToBI tone and break tier), and details of lexical analysis, syntactic/semantic structure.
Preferably the perceptual model converts speech data (waveforms and annotations) into a form (the perceptual data) that can be used by the error model to derive error measures that are perceptually significant to human listeners. The perceptual data output by the perceptual model may include all or some of the following for both natural and synthesised speech, depending on the areas of performance to be assessed by the error model: waveforms represented as Sone-scale loudness densities, time-aligned pitch contours, phone sequence and durations, location and duration of pauses, prosodic phrasing and word accenting, syntactic/semantic structure.
Preferably the perceptual model incorporates steps to align the listening levels (power), to apply a filter to model the generic characteristics of headphones used in subjective evaluations, to convert the speech waveforms into sequences of analysis frames and to align the waveforms and pitch contours in time based on the phonetic transcriptions, together with an auditory transform step which converts the time-domain analysis frames to Sone-scale loudness densities via a frequency warping to the Bark scale.
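By way of illustration only, the pre-processing steps described above might be sketched as follows. The function names, the 512-sample frame length (32 ms at a 16 kHz sampling rate) and the particular Bark approximation are assumptions for this sketch, not details taken from the patent:

```python
import numpy as np

def level_align(x, target_power=1.0):
    """Scale a waveform so its average power matches a fixed target level."""
    x = np.asarray(x, dtype=float)
    return x * np.sqrt(target_power / np.mean(x ** 2))

def frame_signal(x, frame_len=512, hop=256):
    """Split a waveform into overlapping analysis frames (50% overlap)."""
    x = np.asarray(x, dtype=float)
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])

def hz_to_bark(f):
    """One common approximation of the Hz-to-Bark frequency warping."""
    f = np.asarray(f, dtype=float)
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)
```

A full implementation would also apply the headphone filter and convert Bark-band energies to Sone-scale loudness densities; those stages are omitted here.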
Preferably the database comprises waveforms of natural speech for each of a set of test texts spoken by the same speaker(s) as the TTS voice(s) and also time-aligned annotations of the natural speech waveforms that correspond to the form of annotation output by the TTS system when generating speech.
However, where this is not the case the perceptual model may additionally include steps to accommodate the fact that different TTS systems may use different voices by additionally/optionally including a step of voice conversion in the auditory transform applied to the natural speech in order to map the voice speech characteristics (such as mean pitch, pitch range and differences in formant positions in voiced phones) of the speaker(s) of the natural speech in the database to that of the speaker(s) on which the TTS voice(s) was trained.
The perceptual model may optionally/additionally include a set of annotation mappings to accommodate the fact that different TTS systems may use different internal representations, for example different phone sets, and different representations of syntactic structure, prosodic phrasing and accenting. Phone mappings are applied before the time alignment stage.
The perceptual model may additionally/optionally include a complexity mapping to modify the granularity of the annotations to a simplified form to be used by the error model (depending on what level of complexity is perceptually relevant), for example the number of levels of prosodic phrasing to be considered by the error model.
Preferably the error model compares those parts of the perceptual data that are relevant to the part or parts of the TTS system that are being assessed (for example one or more of text analysis, prosody generation or waveform generation) and does so in a way that is perceptually significant to human listeners.
The error model may include a waveform error component to compare Sone-scale loudness densities. The waveform error component can use the difference between synthesised and natural speech loudness densities as a measure of the audible error. The differences (in each time-frequency cell) may be passed through an auditory masking array to account for the fact that small differences in loudness are inaudible in the presence of loud signals. The resulting disturbance density may further be aggregated in time, and symmetrically and asymmetrically in frequency, for each synthesised utterance, to create two disturbances, one symmetric and one asymmetric. The symmetric and asymmetric disturbances account for the fact that distortion introduced by the addition of frequency components is more noticeable than distortion introduced by the removal of frequency components. For TTS assessment, the two disturbances are additionally averaged over all the test utterances to give two objective error measures (waveform errors) which are passed on to the cognitive model. The waveform errors account for differences in spectral characteristics between the synthesised and natural waveforms.
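A minimal sketch of how the symmetric and asymmetric disturbances could be computed from a pair of loudness densities. The simple threshold rule controlled by `mask_frac` is a stand-in assumption for the auditory masking array, not the patent's actual masking model:

```python
import numpy as np

def disturbances(nat_loud, syn_loud, mask_frac=0.25):
    """Symmetric and asymmetric disturbances between loudness densities.

    nat_loud, syn_loud: arrays of shape (frames, bands) holding Sone-scale
    loudness densities for time-aligned natural and synthesised speech.
    Differences smaller than mask_frac * local loudness are treated as
    inaudible (a crude stand-in for an auditory masking array).
    """
    nat_loud = np.asarray(nat_loud, dtype=float)
    syn_loud = np.asarray(syn_loud, dtype=float)
    diff = syn_loud - nat_loud
    mask = mask_frac * np.minimum(nat_loud, syn_loud)
    audible = np.sign(diff) * np.maximum(np.abs(diff) - mask, 0.0)
    # Symmetric disturbance: every audible difference counts.
    d_sym = float(np.mean(np.abs(audible)))
    # Asymmetric disturbance: only added energy counts, reflecting that
    # added frequency components are more noticeable than removed ones.
    d_asym = float(np.mean(np.maximum(audible, 0.0)))
    return d_sym, d_asym
```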
In one variant of the invention the database may supply the TTS system under test with text and its own annotations. This effectively means that only the waveform generation capabilities of the TTS system are assessed (the text analysis and prosody generation components of the TTS system are not used in this variant as the relevant information has been supplied by the database itself).
The error model may alternatively or additionally include a prosody error component to assess differences between the TTS generated prosody and that of the natural speech, and to account for general prosodic preferences at a more symbolic level (for example a listener's general tolerance to insertion/deletion of pauses) based on differences in annotations. The prosody error model preferably calculates some or all of the following objective error measures: * An objective measure of the differences in pitch contours may be calculated as the mean squared error between time-aligned pitch contours of the synthesised and natural speech utterances, summed over all test utterances.
* An objective measure of gross errors in phone durations may be calculated as the mean squared error of phone durations summed over all test utterances.
* Differences in prosodic phrasing may be measured objectively as F-score and/or accuracy of phrase break assignment, calculated over all test utterances.
* Differences in accenting may be represented objectively by F-score and/or accuracy of accent tag assignment calculated over all test utterances.
* Differences in pause position may be measured objectively by F-score and/or accuracy of assignment of pauses to word junctures.
* Differences in pause durations may be assessed objectively by mean squared error, summed over all correctly placed pauses for all test utterances.
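The pitch-contour and phrase-break measures in the list above can be sketched briefly. The function names and the set-of-junctures representation of phrase breaks are illustrative assumptions:

```python
import numpy as np

def pitch_mse(f0_syn, f0_nat):
    """Mean squared error between time-aligned pitch contours."""
    f0_syn = np.asarray(f0_syn, dtype=float)
    f0_nat = np.asarray(f0_nat, dtype=float)
    return float(np.mean((f0_syn - f0_nat) ** 2))

def break_f_score(pred_breaks, ref_breaks):
    """F-score of phrase-break assignment.

    pred_breaks / ref_breaks are sets of word-juncture indices that carry
    a phrase break in the TTS output and the reference annotation.
    """
    tp = len(pred_breaks & ref_breaks)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_breaks)
    recall = tp / len(ref_breaks)
    return 2.0 * precision * recall / (precision + recall)
```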
The error model may alternatively or additionally include a text analysis error component to assess differences between the text analysis derived by the TTS system and the annotations in the database. The form of the objective error measurement(s) will depend on the form of the syntactic and semantic analysis used. For example, an objective measurement of errors in part-of-speech tagging might be the accuracy calculated as the percentage of words in the test texts for which part-of-speech tags are correctly assigned.
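The part-of-speech tagging accuracy mentioned above amounts to a simple percentage; a sketch, assuming parallel lists of predicted and reference tags (names are illustrative):

```python
def pos_accuracy(pred_tags, ref_tags):
    """Percentage of words whose part-of-speech tag matches the reference."""
    correct = sum(p == r for p, r in zip(pred_tags, ref_tags))
    return 100.0 * correct / len(ref_tags)
```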
Many uses of TTS systems will involve TTS-generated speech being transmitted over a network, for example a telephone or mobile network or over the Internet. Therefore the perceptual model may also additionally include network-related components to account for the distortion of synthesised speech caused by its transmission across a network.
Such components include: an additional time-alignment stage using envelope-based delay estimation, performed after an initial alignment of natural and distorted synthesised speech based on the phone sequences; equalisation of the Bark spectrum of the natural speech to compensate for the frequency response of the network; and gain equalisation of the degraded synthesised speech to compensate for the time-varying gain of the network, prior to conversion of the Bark spectrum to the Sone-scale loudness.
Further uses of TTS systems will involve TTS-generated speech being used in noisy environments, for example in in-car navigation systems. The effect of noise on the perceived quality of synthesised speech may be conveniently assessed by addition of representative noise to the synthesised speech waveform prior to passing it to the perceptual model.
The cognitive model may conveniently comprise a linear or non-linear mapping function that maps the objective error measurements derived by the error model to a subjective measurement of the overall quality of the TTS system. An example of such a mapping function would be a third order polynomial regression model arranged to minimise mean-squared error.
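As a sketch of such a mapping, a third order polynomial can be fitted by least squares with `np.polyfit`. For simplicity this maps a single objective error measure to a MOS-style score clipped to the 1-5 range, whereas the cognitive model described above may combine several error measures; the names and the clipping step are assumptions:

```python
import numpy as np

def fit_cognitive_model(errors, mos, degree=3):
    """Fit a polynomial mapping an objective error measure to MOS scores.

    np.polyfit performs a least-squares fit, i.e. it minimises the
    mean squared error over the training pairs.
    """
    return np.polyfit(np.asarray(errors, dtype=float),
                      np.asarray(mos, dtype=float), degree)

def predict_mos(coeffs, error):
    """Evaluate the fitted polynomial and clip to the 1 (bad) - 5 (excellent) ACR range."""
    return float(np.clip(np.polyval(coeffs, error), 1.0, 5.0))
```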
Alternatively, the cognitive model may comprise a neural network. Still further, the cognitive model may be any function/means capable of mapping an input value to an output.
The cognitive model may conveniently be trained to provide a subjective measurement of the overall quality of synthesised speech, for example as represented by a Mean Opinion Score (MOS), using an absolute category rating (ACR) of 1 (bad) to 5 (excellent).
Alternatively, the cognitive model may be trained to provide a subjective measurement of other performance measures, such as naturalness, intelligibility, listening effort.
As a further alternative, the cognitive model may be trained to provide a set of subjective measurements of different aspects of speech quality (such as naturalness and intelligibility) which are further combined by another trained mapping to provide a subjective measurement of overall speech quality.
As a further alternative, the cognitive model may conveniently be trained to provide subjective measurements of the quality of one or more of the components (text analysis, prosody generation, waveform generation) of the TTS system, which are further combined by another trained mapping to provide a subjective measurement of the overall speech quality. Such a model represents how the quality of the various components of the TTS system contributes to the overall quality. Alternatively, the cognitive model may be trained to provide a subjective measurement of the performance of just one of the TTS components by omitting the contribution from other components from the final mapping in the cognitive model that combines the individual component contributions.
As a further alternative, the cognitive model may conveniently provide an assessment of the quality of the waveform generation component alone by substituting the TTS annotations calculated by the TTS system and used as input to the waveform generation component by the ideal reference annotations from the database, thus zeroing the objective error measures due to other components.
The TTS assessment system and method of the first and second aspects of the present invention may conveniently be used to select a suitable TTS engine from a number of candidate TTS engines. Selection can be made on the basis of the TTS system that maximises the perceived subjective speech quality.
Therefore in a third aspect of the present invention there is provided a method of selecting a TTS system from a plurality of available TTS systems for use in a predetermined environment comprising the steps of: i) assessing each of the plurality of TTS systems using a TTS assessment system according to the first aspect of the present invention; and ii) selecting a TTS system based on the subjective quality measurement derived by the TTS assessment system.
The TTS assessment system and method of the first and second aspects of the present invention may also conveniently be used to benchmark TTS systems.
Therefore in a fourth aspect of the present invention there is provided a method of benchmarking a plurality of TTS systems comprising the steps of: i) assessing each of the plurality of TTS systems using a TTS assessment system according to any of claims 1-20; ii) rating each TTS system based on the subjective quality measurement derived by the TTS assessment system.
It is noted that benchmarking may conveniently be performed by training various TTS systems on a standardised database.
Alternatively, the provision of the annotation mapping and voice conversion components noted above in the assessment system would conveniently allow TTS systems to be benchmarked even if they have been trained on different databases.
The invention may be embodied in computer software, as a computer program product for loading directly into the internal memory of a digital computer, comprising software code portions for performing the steps of the method described above when said product is run on a computer.
In a fifth aspect the invention comprises a computer program product stored on a computer usable medium, comprising: computer-readable program means for causing a computer to compare TTS speech data generated by a TTS system from a test text with corresponding natural speech data in order to derive an objective error measurement of the TTS system and for causing the computer to map the objective error measurement of the TTS system to a subjective measurement of the quality of the TTS system.
The computer program product may be embodied on any suitable carrier readable by a suitable computer input device, such as CD-ROM, optically readable marks, magnetic media, punched card, or on an electromagnetic or optical signal.
In a sixth aspect of the present invention there is provided an assessment system for assessing a Text-to-Speech (TTS) system comprising: i) a database comprising a plurality of test texts, associated natural speech utterances for the plurality of test texts and time-aligned annotations of the natural speech waveforms relating to the plurality of test texts; ii) a perceptual model for converting speech waveforms and their associated annotations obtained from both the database and the TTS under assessment into perceptually meaningful representations; iii) an error model for comparing, for each test text, perceptual data generated by the perceptual model from the TTS output and corresponding perceptual data derived from the natural speech waveforms and their annotations from the database in order to derive objective error measurements for the quality of the TTS system; and iv) a cognitive model capable of mapping the objective error measurements for the quality of the TTS system derived by the error model to a subjective measurement of the quality of the TTS system.
The present invention will now be described with reference to the following non-limiting preferred embodiments in which:

Figure 1 shows a schematic illustration of a text-to-speech (TTS) synthesis system
Figure 2 shows a schematic illustration of a TTS assessment system in accordance with the present invention
Figure 3 shows a schematic illustration of the perceptual model of the TTS assessment system described in Figure 2
Figure 4 shows an alternative embodiment of part of the TTS assessment system described in Figure 2
Figure 5 shows a further alternative embodiment of part of the TTS assessment system described in Figure 2
Figure 6 shows a schematic illustration of a TTS assessment system in accordance with the present invention as used in a network environment
Figure 7 shows a schematic illustration of a TTS assessment system in accordance with the present invention as used in a noisy environment
Figure 8 shows assessment of the waveform generation component of a TTS system using the TTS assessment system described in Figure 2.
In the following description like numerals are used to denote like features between the Figures.
A text-to-speech assessment system in accordance with the present invention is shown in Figure 2. In Figure 2, a TTS assessment system 30 comprises a perceptual model 40, an error model 32, a database 34 of standard test texts 7, associated natural speech waveforms and time-aligned annotations, a TTS system under test 36 and a cognitive model 38.
In use the perceptual model 40 receives data from both the database 34 and the TTS system 36 under test. The perceptual model converts the speech data from the TTS system and database (synthesised speech waveform 48 and natural speech waveform 54; and TTS annotation 53 and database annotation 55) to perceptually meaningful representations (perceptual data 42) which are used by the error model 32 to calculate objective error measurements 46. The objective error measurements are sent to a cognitive model 38 which predicts the subjective quality 44 of the speech synthesised by the TTS system 36.
Specifically, the assessment of the TTS system 36 comprises the following steps: i) the TTS system 36 synthesises speech for the same test texts that are stored in the database; ii) for the set of test texts, perceptually significant data representations 42 are derived for the natural speech waveforms and time-aligned annotations in the database 34 and also for the synthesised speech waveforms and annotations produced by the TTS system 36, and these data are supplied to the error model 32; iii) the error model 32 objectively compares the perceptual data derived from the natural speech data in the database and that derived from the synthesised speech data output from the TTS system in order to identify errors between the corresponding information; this error information is averaged over some or all of the available test texts in order to provide objective error measurements 46 for the TTS system 36 under test; iv) the objective error measurements determined by the error model are supplied to the cognitive model 38, which is capable of mapping the derived objective measurements of error into a predicted subjective assessment of the synthesised speech 44.
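The four steps above can be sketched as a simple orchestration loop. Every name here is a hypothetical placeholder for the corresponding model, not an identifier from the patent:

```python
def assess_tts(tts, database, perceptual, error_model, cognitive):
    """Hypothetical orchestration of the four assessment steps.

    All arguments are placeholder callables standing in for the models
    described in the text: tts(text) -> (waveform, annotation),
    perceptual(waveform, annotation) -> perceptual data,
    error_model(p_synth, p_natural) -> objective error,
    cognitive(mean_error) -> predicted subjective score.
    """
    objective_errors = []
    for text, natural_wave, natural_annotation in database:
        # Step (i): synthesise speech for the same test text.
        synth_wave, synth_annotation = tts(text)
        # Step (ii): derive perceptually significant representations.
        p_synth = perceptual(synth_wave, synth_annotation)
        p_natural = perceptual(natural_wave, natural_annotation)
        # Step (iii): objectively compare the two representations.
        objective_errors.append(error_model(p_synth, p_natural))
    # Average the error information over all available test texts.
    mean_error = sum(objective_errors) / len(objective_errors)
    # Step (iv): map objective error to a predicted subjective score.
    return cognitive(mean_error)
```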
The database 34 comprises a set of standard test texts and associated natural speech for a variety of different speakers, male and female. Prior to inclusion within the database each test text and associated speech waveform is annotated such that the information required by the perceptual model 40 is available for storage within the database.
Generally, the natural speech waveforms will have undergone automatic (i.e. machine) annotation, in which the time-aligned phonetic transcription, location of pauses and pitch contour are determined, followed by some human correction of the results. Further hand annotation of, for example, the prosodic phrasing will also have been carried out. In addition, the text is subject to automatic syntactic/semantic analysis and hand correction.
The database 34 will therefore include the original test texts, natural speech waveforms and time-aligned annotations. In the simplest embodiment, the 'voices' of the natural speech in the database are the same as those in the TTS system and the annotations in the database match the form of the annotations that are available from the TTS system (e.g. the same phone set, type of syntactic analysis, prosodic phrase structure and accenting annotation). This might be the case, for example, in a simple benchmarking scenario where a number of TTS systems from different competitors are trained on the same speech database, but is unlikely in practice for comparing TTS systems in general.
Additional steps are needed to accommodate the differences in annotations and voices, which would otherwise lead to gross errors in the predicted speech quality.
The error model 32 depicted in Figure 2 comprises three sub-models (32a, 32b, 32c) which allow the text analysis, prosody generation and waveform generation components of the TTS system to be objectively assessed.
The components of the perceptual model 40 are shown in Figure 3. The perceptual model converts speech data (waveforms 48, 54, and annotations 53, 55) into a form (the perceptual data 42) that can be used by the error model to derive error measures that are perceptually significant to human listeners. The perceptual model contains a number of pre-processing components 40a, b, c, e, f that operate on the speech waveforms 48, 54 to align and equalise them prior to calculating an auditory transform 40d.
First, the level alignment component 40a aligns the listening levels of the natural and synthesised speech to a constant target power level based on the gains calculated from the average power in filtered versions of the natural and synthesised speech.
In the next stage 40b, the signals are filtered with a filter that approximates the general characteristics of the headphones used in perceptual evaluation experiments. The same filter can be used regardless of the actual filter used in the original subjective evaluations as it is assumed the objective measures are relatively insensitive to the actual filtering.
The next stage in the perceptual model is time-alignment 40c of the natural 54 and synthesised 48 speech waveforms. The auditory transform component 40d requires two sequences of analysis frames created from the natural and synthesised speech waveforms, which are time-aligned. Due to differences in the overall duration of natural and synthesised speech waveforms (caused by differences in phone and pause durations and insertion and deletion of pauses by the TTS system), the number of frames derived from the natural speech and the number of frames derived from the synthesised speech are not usually equal.
The purpose of the time-alignment component is to account for the insertion and deletion of pauses and to align the frames of the synthesised speech to those of the corresponding natural speech. Delays in frame alignment correspond to instances where frames of the synthesised speech will be duplicated during analysis. To accommodate insertion and deletion of pauses by the TTS system, periods of silence corresponding to pauses in either waveform are ignored for purposes of the time alignment and auditory transform and only active speech frames are used. The errors due to pauses are assessed via the prosody error model.
The first step in time-alignment is to convert the active parts of each speech waveform into a sequence of analysis frames in time, typically using a frame size of 32ms with a 50% overlap.
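This framing step can be sketched as follows, assuming for illustration a 16 kHz sampling rate (the frame size of 32 ms and 50% overlap are as stated above):

```python
import numpy as np

def frame_signal(x, sample_rate, frame_ms=32, overlap=0.5):
    """Split active speech into overlapping analysis frames (default 32 ms, 50% overlap)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(frame_len * (1 - overlap))
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])
```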
The next step in time-alignment is to locate corresponding phone boundaries in the synthesised and natural speech using the time-aligned phone sequences in the annotations 53, 55. The phone boundaries are then used to align frames in the synthesised speech associated with a particular phone with the frames in the natural speech associated with that phone by selecting the alignment that maximises the cross-correlation between the energy envelopes of the two frames. When a frame of synthesised speech straddles a phone boundary in either the natural speech or synthesised speech, the straddling/next frame for the next phone in the natural speech should also be considered in the alignment. Using the frame alignment, the pitch contours are also aligned.
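The energy-envelope alignment can be sketched as below; the search window `max_shift` and the plain dot-product score are illustrative choices, not taken from the patent:

```python
import numpy as np

def best_shift(nat_energy, syn_energy, max_shift=3):
    """Return the frame offset of the synthesised energy envelope that best
    matches the natural one, i.e. the offset maximising the cross-correlation."""
    best_s, best_score = 0, -np.inf
    for s in range(-max_shift, max_shift + 1):
        if s >= 0:
            a, b = nat_energy[:len(nat_energy) - s], syn_energy[s:]
        else:
            a, b = nat_energy[-s:], syn_energy[:len(syn_energy) + s]
        m = min(len(a), len(b))
        if m == 0:
            continue
        score = float(np.dot(a[:m], b[:m]))
        if score > best_score:
            best_score, best_s = score, s
    return best_s
```

A positive result means the synthesised envelope lags the natural one by that many frames.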
A difference in speaking rate is also accommodated by the time-alignment, since differences in speaking rate are manifested as differences in phone and pause durations, and insertion and/or deletion of pauses and prosodic phrase boundaries.
To accommodate differences in annotations between TTS and database, such as the phone set, syntactic, prosodic phrasing and accenting representations, a set of annotation mappings 40e can optionally be included in the perceptual model. Phone mappings are applied before the time-alignment stage 40c.
A complexity mapping stage 40f can optionally be included in the perceptual model to modify the granularity of the annotation output from the TTS systems to a simplified form to be used by the error model (depending on the level of complexity that is determined to be perceptually significant), for example the number of levels of prosodic phrasing or type of accent that are distinguished by the system.
The auditory transform 40d models the psychophysical processes in the human auditory system by converting the time-domain speech frames into a representation of the perceived loudness in time and frequency. It is calculated for active speech frames only and frames corresponding to pauses are ignored. The auditory transform involves the steps of calculation of the short-term FFT for each frame, conversion to a Bark frequency scale and finally conversion to Sone-scale loudness densities.
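A simplified sketch of the three steps (short-term FFT, conversion to the Bark scale, loudness compression) follows; the Bark formula is Zwicker's standard approximation, and the 0.23 power-law compression is a stand-in for a full Sone-scale loudness conversion, not the patent's exact method:

```python
import numpy as np

def bark(f):
    # Zwicker's approximation of the Bark frequency scale
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def auditory_transform(frame, sample_rate, n_bands=24):
    """Short-term FFT -> Bark-band powers -> compressed loudness densities."""
    windowed = frame * np.hanning(len(frame))
    power = np.abs(np.fft.rfft(windowed)) ** 2
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sample_rate)
    bands = np.minimum(bark(freqs).astype(int), n_bands - 1)
    bark_power = np.zeros(n_bands)
    np.add.at(bark_power, bands, power)   # group FFT bins into Bark bands
    # Stevens-style power-law compression as a stand-in for Sone conversion
    return (bark_power / (bark_power.max() + 1e-12)) ** 0.23
```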
To accommodate differences in voices, for example differences in formant positions in voiced phones and differences in baseline pitch and pitch range, the auditory transform for the natural speech may additionally/optionally include a step of voice conversion to map the voice characteristics of the speaker(s) of the natural speech (such as mean pitch, pitch range and differences in formant positions in voiced phones) in the database to that of the speaker(s) on which the TTS voice was trained.
The perceptual model 40 outputs perceptual data 42 which may comprise some or all of the following information:
i. waveform information in the form of Sone-scale loudness densities;
ii. prosody information in the form of time-aligned pitch contours, phone sequences and phone durations, location and duration of pauses, prosodic phrasing structure and word accenting;
iii. text analysis information including syntactic/semantic structure and lexical information such as position of lexical stress within each word.
For TTS data, (ii) and (iii) are derived from the TTS annotation component 53 of the internal representation 20 referred to above with reference to Figure 1, and for natural speech, (ii) and (iii) are derived from the annotation 55 in the database 34.
As noted above, the error model 32 depicted in Figure 2 comprises three sub-models (32a, 32b, 32c) which allow the text analysis, prosody generation and waveform generation components of the TTS system to be objectively assessed.
It should be noted, however, that in order to obtain a measure of the overall performance of the TTS system under test it is sufficient to only analyse the prosody generation and waveform generation capabilities of the TTS system. It is not necessary to analyse the text analysis capabilities of the TTS system because the results of text analysis are utilised by the prosody generation components, so errors in text analysis are already inherent in the errors in these components.
A text analysis model 32a may however be included in the overall error model 32 in order to more accurately determine how errors in the text analysis component specifically affect the performance of the TTS system.
The various error sub-models (32a, 32b, 32c) may use different assessment metrics in order to determine objective error measurements. For example: i) the waveform error measurements may include the absolute errors between synthesised and natural speech loudness densities, which are modified by an auditory masking array and aggregated symmetrically and asymmetrically in time and frequency over all test utterances to give two objective error measurements.
ii) The prosody error measurements may include some or all of the following: for differences in pitch contours, the mean squared error between the pitch for corresponding time-aligned frames of the synthesised and natural speech utterances, summed over the utterances for all test texts; for differences in phone durations, the mean squared error between phone durations in the synthesised speech and natural speech, summed over all test utterances; for differences in prosodic phrasing, the F-score and/or accuracy of phrase break assignment, calculated over all test utterances; for differences in accenting, the F-score and/or accuracy of accent tag assignment calculated over all test utterances; for differences in pause position, the F-score and/or accuracy of assignment of pauses to word junctures; for differences in pause durations, the mean squared error between pause durations in the synthesised speech and pause durations in the natural speech, summed over all pauses correctly predicted by the TTS system in all test utterances.
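Two of the prosody measures above, the pitch-contour mean squared error and the F-score of break/accent assignment, can be sketched as:

```python
import numpy as np

def pitch_mse(f0_syn, f0_nat):
    """Mean squared error between corresponding time-aligned pitch frames."""
    f0_syn = np.asarray(f0_syn, dtype=float)
    f0_nat = np.asarray(f0_nat, dtype=float)
    return float(np.mean((f0_syn - f0_nat) ** 2))

def f_score(predicted, reference):
    """F-score of, e.g., phrase-break assignment; inputs are sets of word-juncture
    indices where a break (or accent, or pause) was placed."""
    predicted, reference = set(predicted), set(reference)
    true_pos = len(predicted & reference)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(reference)
    return 2.0 * precision * recall / (precision + recall)
```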
iii) Text analysis error measurements may vary depending on the form of the syntactic/semantic analysis carried out by the TTS system and how its results are represented in the annotation. For example, an objective measurement of errors in part-of-speech tagging may include the accuracy of part-of-speech tag assignment calculated as the percentage of all words in the test texts for which part-of-speech tags are correctly assigned.
The above objective error measurements are calculated over the utterances for some or all of the test texts and objectively represent the differences between the speech data output by the TTS system and natural speech data, averaged over a range of test texts.
The output of the error model is finally passed to the cognitive model 38 which converts the objective error measurements from the error model 32 into a subjective measurement of the quality of the TTS system.
In order for the cognitive model 38 to be able to map objective error measurements to subjective quality measures the cognitive model first requires training with data from real subjective evaluations.
Training is achieved by calculating objective error measurements using the error model for a range of different TTS systems and then comparing the measurements with actual subjective test results from a group of human subjects.
The form of the subjective measurement can vary depending on which aspect of quality the assessment system is being trained to predict. For example, the system may be trained to predict the overall quality of the synthesised speech represented as a Mean Opinion Score (MOS) for the test texts, measured on an absolute category rating (ACR) of 1 (bad) to 5 (excellent). Alternatively, the cognitive model may be trained to provide a subjective measurement of other performance measures, such as naturalness, intelligibility, listening effort.
The cognitive system may map objective error measurements derived for TTS systems under test to predictions of the subjective quality of the synthesised speech produced by these systems by means of a non-linear mapping function such as third-order polynomial regression. Alternatively, the model may comprise a neural network or other mapping function.
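A minimal sketch of the third-order polynomial mapping follows; the training pairs of (objective error measurement, MOS) are invented toy values purely for illustration:

```python
import numpy as np

# Toy training data: objective error measurement -> subjective MOS (invented values)
obj_errors = np.array([0.1, 0.4, 0.8, 1.2, 1.6])
mos_scores = np.array([4.6, 4.0, 3.1, 2.2, 1.5])

# Third-order polynomial regression fitted by least squares
coeffs = np.polyfit(obj_errors, mos_scores, deg=3)

def predict_mos(error_measurement):
    """Map an objective error measurement to a predicted MOS on the 1 (bad) to 5
    (excellent) ACR scale, clipping to the valid range."""
    return float(np.clip(np.polyval(coeffs, error_measurement), 1.0, 5.0))
```

In practice the regression would be fitted on multiple objective error measurements against real subjective test scores, as described in the training procedure above.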
Figure 4 shows an alternative arrangement for the cognitive model. As described previously in relation to Figure 2 the error model 32 comprises three sub-models 32a, 32b and 32c (respectively a text analysis error model, a prosodic error model and a waveform error model). In this arrangement the cognitive model comprises three quality sub-models - text analysis quality model 38a, prosodic quality model 38b and waveform quality model 38c. The sub-models relate the objective error measurements for specific components of the TTS system to the same aspect(s) of quality as assessed by the overall "system" quality predictor 38d. The prosodic quality sub-model 38b may alternatively be further sub-divided to represent specific aspects of prosody, such as the quality of prosodic phrasing, accenting, and pitch contour. An overall "system" quality measurement 44 is calculated by the system quality predictor 38d, another trained mapping which combines the individual quality assessments from the sub-models 38a-c.
In this arrangement of the cognitive model, instead of combining all objective error measurements to produce a subjective measurement of overall quality, the objective error measurements from the output of each of the error sub-models 32a-c are passed to corresponding quality sub-models 38a-c in the cognitive model 38.
The cognitive model of Figure 4 allows the assessment of the quality of individual components of the TTS system under test. For example, the quality of the prosody generation or waveform generation components may be assessed individually. The final mapping provides an indication of the relative importance of the subjective quality of individual components of the TTS system to the subjective quality of the overall system.
In order for the quality sub-models 38a-c to be able to map objective error measurements from the error sub-models to subjective quality measurements for specific components of the TTS system, further training data is required in addition to that required to train the version of the cognitive model depicted in Figure 2. The additional training data comprises real subjective evaluations relating to the individual speech components (for example subjective tests from human assessors will be required that relate the performance of the prosody generation component, as expressed by the prosody objective error measurements, to subjective evaluations of the quality of the prosody, or particular aspects of it, such as prosodic phrasing).
In a similar manner to that described above, training is achieved by using the error model to calculate objective error measurements for individual components of the TTS system for a range of different TTS systems and then comparing the measurements derived from the error sub-models with actual subjective test results from a group of human subjects. Additionally, the final mapping of the subjective measurements from each of the quality sub-models to a subjective measurement of the overall quality is derived by comparing the subjective test results for individual TTS components with the subjective test results for overall speech quality.
Figure 5 shows a further embodiment for the cognitive model 38 in which the cognitive model comprises a set of quality sub-models which provide subjective measurements of different aspects of speech quality (naturalness 38i and intelligibility 38n are illustrated here, but other aspects of speech quality, such as listening effort, may additionally/alternatively be included). The cognitive model further combines the quality sub-models by another trained mapping 38e to provide a subjective measurement of overall speech quality. The final mapping provides an indication of the relative importance of different aspects of subjective speech quality (such as naturalness and intelligibility) to the subjective measurement of overall quality 44. In use, the output of the error model 32 is provided to the sub-models 38i and 38n and predictions of subjective measurements of the specific aspects of quality (naturalness, i.e. does the synthesised speech sound natural, and intelligibility, i.e. can the synthesised speech be understood, are illustrated) are derived which are combined by the overall quality predictor 38e to give a subjective measurement of overall speech quality 44.
Once again additional training material for the cognitive model is required, in the form of additional subjective tests (for example represented by MOS values) which relate to subjective assessment of the specific aspects of speech quality included in the cognitive model, such as naturalness and intelligibility.
In a similar manner to that described above, training is achieved by using the error model to calculate objective error measurements for individual components of the TTS system for a range of different TTS systems and then comparing the measurements derived from the error sub-models with actual subjective test results for naturalness and intelligibility from a group of human subjects. Additionally, the final mapping 38e of the subjective measurements from each of the quality sub-models 38i and 38n to a subjective measurement of the overall quality 44 is derived by comparing the subjective test results for the assessment of the individual aspects of quality with the subjective test results for assessment of overall speech quality.
The cognitive model of Figure 5 will therefore be able to assess the aspects of speech quality represented by the sub-models, such as the naturalness of the TTS system as well as its intelligibility, and also how these aspects contribute to the subjective assessment of the overall quality of synthesised speech.

Figure 6 shows an extension to the assessment system described in Figures 2-4 wherein the synthesised speech is additionally transmitted over a network (e.g. Voice over IP or telephone network). It should be noted however that the network factors considered below could be generalised to include any environmental factors that have a filtering effect upon the synthesised speech and/or affect its timing.
The system described in Figure 6 is generally similar to the assessment system depicted in Figure 2 in that the TTS system 36 generates waveforms for synthesised speech from the test texts and inputs their annotations 53 to an enhanced perceptual model 41.
In this instance, in contrast to Figure 2, the waveforms of synthesised speech 48 from the TTS system 36 do not pass directly to the perceptual model, but are additionally passed through a network 50. The network 50 will generally degrade the waveforms, for example by introducing linear and non-linear distortions, time delays and loss of speech frames (by clipping and packet loss) and these degraded waveforms 52 are also input to an enhanced perceptual model 41, along with the natural speech waveforms and annotations from the database 34 as described in relation to Figure 2.
The enhanced perceptual model 41 includes additional processing steps to compensate for the effects of the network that would degrade the performance of the assessment method. These effects include linear filtering, gain variations, time-varying time delays, clipping and packet loss.
To compensate for additional time delays and frame loss, the time-alignment step 40c is augmented to include a full envelope-based delay estimation after an initial alignment is established from the phone sequences.
To compensate for the effects of linear filtering, an additional step is included in the auditory transform 40d, in which the Bark spectrum per frame of the natural speech is equalised by compensation factors that give an average estimate (in time) of the network transfer function. The compensation factors are calculated using the ratio of the average Bark-scale power spectrum of the distorted synthesised speech to the average Bark-scale power spectrum of the undistorted synthesised speech.
To compensate for time-varying gains, the Bark spectrum per frame of the degraded synthesised speech is equalised by compensation factors calculated using the ratio of the power in the Bark spectrum of the undistorted synthesised speech frame to the power in the Bark spectrum of the distorted synthesised speech frame.
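The linear-filtering compensation described above can be sketched as follows, assuming for illustration that Bark spectra are stored as (frames x bands) arrays:

```python
import numpy as np

def equalise_for_network(nat_bark, syn_clean_bark, syn_degraded_bark, eps=1e-12):
    """Equalise natural-speech Bark spectra by an average (in time) estimate of the
    network transfer function: the ratio of the average Bark-scale power spectrum of
    the distorted synthesised speech to that of the undistorted synthesised speech."""
    transfer = (syn_degraded_bark.mean(axis=0) + eps) / (syn_clean_bark.mean(axis=0) + eps)
    return nat_bark * transfer   # per-band compensation applied to every frame
```

The per-frame gain compensation of the degraded synthesised speech would follow the same ratio pattern, computed frame by frame rather than averaged over time.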
The cognitive model 38 maps objective error measurements to predictions of subjective quality as described before. Training of the cognitive model in this case will have included human derived subjective test results relating to the evaluation of synthesised speech that has been passed through a network.
Figure 7 illustrates an example where the TTS system operates in a noisy environment.
In this case, the environmental effects are purely additive and the additional time-alignment step is not necessary.
Figure 8 shows how the TTS assessment system depicted in Figure 2 may be used to assess only the waveform generation component of the TTS system. In this case, the waveform generation component is fed text and associated annotations direct from the database as opposed to the annotation derived by the text analysis and prosody generation components of the TTS system.

Claims (48)

  1. CLAIMS: 1. An assessment system for assessing a Text-to-Speech (TTS)
    system comprising i) a database comprising a plurality of test texts, associated natural speech waveforms for the plurality of test texts and time-aligned annotations of the natural speech waveforms relating to the plurality of test texts ii) a perceptual model arranged in use, for both speech data output by the TTS and natural speech data from the database, to align speech waveforms and their associated annotations and to map speech waveforms to a loudness scale iii) an error model for comparing, for each test text, data generated by the perceptual model from the TTS output and corresponding data derived by the perceptual model from the natural speech waveforms and their annotations from the database in order to derive objective error measurements for the quality of the TTS system and iv) a cognitive model capable of mapping the objective error measurements for the quality of the TTS system derived by the error model to a subjective measurement of the quality of the TTS system.
  2. 2. An assessment system as claimed in claim 1 wherein the time-aligned annotations of the natural speech waveforms in the database correspond to the form of annotation output by the TTS system under test.
  3. 3. An assessment system as claimed in claim 1 or 2 wherein the database comprises some or all of the following: time-aligned phone sequences and phone durations, location of pauses and their durations, pitch contours, prosodic phrasing and word accenting information, details of lexical analysis, syntactic/semantic structure.
  4. 4. An assessment system as claimed in any preceding claim wherein the perceptual model converts speech data into a representation that is perceptually significant to a human listener.
  5. 5. An assessment system as claimed in any preceding claim wherein the perceptual model outputs, for both natural and synthesised speech, some or all of the following to the error model: waveforms represented as Sonescale loudness densities, time-aligned pitch contours, phone durations, location and duration of pauses, prosodic phrasing and word accenting, syntactic/semantic structure.
  6. 6. An assessment system as claimed in any preceding claim wherein the perceptual model is arranged to convert the speech waveforms into sequences of analysis frames and to align the waveforms and pitch contours in time based on the phonetic transcriptions and energy in each frame and further is arranged to convert the time- domain analysis frames to Sone-scale loudness densities via a frequency warping to the Bark scale.
  7. 7. An assessment system as claimed in any preceding claim wherein the perceptual model further comprises a set of annotation mappings in order to be capable of interacting with a variety of different TTS systems.
  8. 8. An assessment system as claimed in any preceding claim wherein the perceptual model further comprises a complexity mapping component to modify the granularity of the annotations to a simplified form to be used by the error model.
  9. 9. An assessment system as claimed in any preceding claim wherein the error model comprises a waveform error component to compare speech waveforms relating to synthesised and natural speech as output by the perceptual model.
  10. 10. An assessment system as claimed in claim 9 wherein the waveform error component compares synthesised and natural speech loudness densities as a measure of the waveform errors.
  11. 11. An assessment system as claimed in any preceding claim wherein text and annotations from the database are input to the TTS system under test.
  12. 12. An assessment system as claimed in any preceding claim wherein the error model comprises a prosody error component to assess differences between the TTS generated prosody and that of the natural speech.
  13. 13. An assessment system as claimed in claim 12 wherein the prosody error component calculates some or all of the following error measurements: differences in pitch contours as measured by mean square error; errors in phone durations as measured by mean square error; differences in prosodic phrasing as measured by F-score and/or accuracy of phrase break assignment; differences in accenting as represented by F-score and/or accuracy of accent tag assignment; differences in pause position as measured by F-score and/or accuracy of assignment of pauses to word junctures; differences in pause durations assessed by mean squared error.
  14. 14. An assessment system as claimed in any preceding claim wherein the error model comprises a text analysis error component.
  15. 15. An assessment system as claimed in claim 14 wherein the text analysis error component derives an error measurement by measurement of errors in part-of-speech tagging.
  16. 16. An assessment system as claimed in any preceding claim wherein the perceptual model further comprises network-related components to account for the distortion of synthesised speech caused by transmission across a network.
  17. 17. An assessment system as claimed in any preceding claim wherein the cognitive model comprises a linear or non-linear mapping function in order to map objective error measurements derived by the error model to a subjective measurement of the quality of the TTS system.
  18. 18. An assessment system as claimed in claim 17 wherein the non-linear mapping function is a third order polynomial regression model arranged to minimise mean-squared error.
  19. 19. An assessment system as claimed in any preceding claim wherein the cognitive model provides a subjective measurement of the quality of one or more of the following features of the TTS system: text analysis, prosodic phrasing, waveform generation.
  20. 20. An assessment system as claimed in any preceding claim wherein the cognitive model provides a subjective measurement of the naturalness and intelligibility of the TTS system.
  21. 21. A method of assessing a Text-to-Speech (TTS) system comprising the steps of i) receiving speech data including speech waveforms and associated annotations generated by a TTS system from a set of test texts and also including natural speech data from a database, the database comprising a plurality of test texts, associated natural speech waveforms for the plurality of test texts and time-aligned annotations of the natural speech waveforms relating to the plurality of test texts ii) for both speech data output by the TTS and natural speech data from the database, aligning the speech waveforms and their associated annotations and mapping speech waveforms to a loudness scale iii) comparing, for each test text, the data derived in step (ii) from the speech data output by the TTS system with corresponding data derived in step (ii) from the natural speech data in order to derive objective error measurements for the quality of the TTS system iv) mapping the objective error measurements for the quality of the TTS system to a subjective measurement of the quality of the TTS system.
  22. 22. A method of assessing a Text-to-Speech (TTS) system as claimed in claim 21 wherein the time-aligned annotations of the natural speech waveforms in the database correspond to the form of annotation output by the TTS system under test.
  23. 23. A method of assessing a Text-to-Speech (TTS) system as claimed in claim 21 or 22 wherein the database comprises some or all of the following: time-aligned phone sequences and phone durations, location of pauses and their durations, pitch contours, prosodic phrasing and word accenting information, details of lexical analysis, syntactic/semantic structure.
  24. 24. A method of assessing a Text-to-Speech (TTS) system as claimed in any of claims 21 to 23 wherein step (ii) converts speech data into a representation that is perceptually significant to a human listener.
  25. 25. A method of assessing a Text-to-Speech (TTS) system as claimed in any of claims 21 to 24 wherein step (ii), for both natural and synthesised speech, outputs some or all of the following: waveforms represented as Sone-scale loudness densities, time-aligned pitch contours, phone durations, location and duration of pauses, prosodic phrasing and word accenting, syntactic/semantic structure.
  26. 26. A method of assessing a Text-to-Speech (TTS) system as claimed in any of claims 21 to 25 wherein step (ii) converts the speech waveforms into sequences of analysis frames, aligns the waveforms and pitch contours in time based on the phonetic transcriptions and energy in each time frame, and further converts the time-domain analysis frames to Sone-scale loudness densities via a frequency warping to the Bark scale.
  27. 27. A method of assessing a Text-to-Speech (TTS) system as claimed in any of claims 21 to 26 wherein step (ii) comprises an annotation mapping sub-step in order to be capable of interacting with a variety of different TTS systems.
  28. 28. A method of assessing a Text-to-Speech (TTS) system as claimed in any of claims 21 to 27 wherein step (ii) further comprises a complexity mapping sub- step to modify the granularity of the annotations to a simplified form to be used in step (iii).
  29. 29. A method of assessing a Text-to-Speech (TTS) system as claimed in any of claims 21 to 28 wherein step (iii) compares speech waveforms relating to synthesised and natural speech as output by step (ii).
  30. 30. A method of assessing a Text-to-Speech (TTS) system as claimed in claim 29 wherein step (iii) compares synthesised and natural speech loudness densities as a measure of the waveform errors.
  31. 31. A method of assessing a Text-to-Speech (TTS) system as claimed in any of claims 21 to 30 wherein text and annotations from the database are input to the TTS system under test.
  32. 32. A method of assessing a Text-to-Speech (TTS) system as claimed in any of claims 21 to 31 wherein the step (iii) assesses prosody differences between the TTS generated prosody and that of the natural speech.
  33. 33. A method of assessing a Text-to-Speech (TTS) system as claimed in claim 32 wherein some or all of the following prosody error measurements are calculated: differences in pitch contours as measured by mean square error; errors in phone durations as measured by mean square error; differences in prosodic phrasing as measured by F-score and/or accuracy of phrase break assignment; differences in accenting as represented by F-score and/or accuracy of accent tag assignment; differences in pause position as measured by F-score and/or accuracy of assignment of pauses to word junctures; differences in pause durations assessed by mean squared error.
  34. 34. A method of assessing a Text-to-Speech (TTS) system as claimed in any of claims 21 to 33 wherein step (iii) comprises a text analysis error sub-step.
  35. 35. A method of assessing a Text-to-Speech (TTS) system as claimed in claim 34 wherein the text analysis error sub-step derives an error measurement by measurement of errors in part-of-speech tagging.
  36. 36. A method of assessing a Text-to-Speech (TTS) system as claimed in any of claims 21 to 35 wherein step (ii) further comprises a network-related sub-step to account for the distortion of synthesised speech caused by transmission across a network.
  37. A method of assessing a Text-to-Speech (TTS) system as claimed in any of claims 21 to 36 wherein a linear or non-linear mapping function maps objective error measurements to a subjective measurement of the quality of the TTS system.
  38. A method of assessing a Text-to-Speech (TTS) system as claimed in claim 37 wherein the non-linear mapping function is a third order polynomial regression model arranged to minimise mean-squared error.
  39. A method of assessing a Text-to-Speech (TTS) system as claimed in any of claims 21 to 38 wherein a subjective measurement of the quality of one or more of the following features of the TTS system is provided: text analysis, prosodic phrasing, waveform generation.
  40. A method of assessing a Text-to-Speech (TTS) system as claimed in any of claims 21 to 39 wherein the mapping step provides a subjective measurement of the naturalness and intelligibility of the TTS system.
  41. A computer program for performing the steps of any of claims 21-40.
  42. A computer program product directly loadable into the internal memory of a computer, comprising software code portions for performing the steps of claims 21-40 when said product is run on a computer.
  43. A computer program product stored on a computer usable medium, comprising: computer-readable program means for causing a computer to compare TTS speech data generated by a TTS system from a test text with corresponding natural speech data in order to derive an objective error measurement of the TTS system and for causing the computer to map the objective error measurement of the TTS system to a subjective measurement of the quality of the TTS system.
  44. A method of selecting a TTS system from a plurality of available TTS systems for use in a predetermined environment comprising the steps of i) assessing each of the plurality of TTS systems using a TTS assessment system according to any of claims 1 - 20; ii) selecting a TTS system based on the subjective quality measurement derived by the TTS assessment system.
  45. A method of benchmarking a plurality of TTS systems comprising the steps of i) assessing each of the plurality of TTS systems using a TTS assessment system according to any of claims 1 - 20; ii) rating each TTS system based on the subjective quality measurement derived by the TTS assessment system.
  46. A method of benchmarking a plurality of TTS systems as claimed in claim 45 wherein the plurality of TTS systems are trained on a standardised database.
  47. A method of benchmarking a plurality of TTS systems as claimed in claim 45 wherein the perceptual model comprises means to map the voice speech characteristics of the natural speech in the database to that of the training material on which a TTS system was trained and wherein the perceptual model comprises a set of annotation mappings to accommodate differing internal representations between the plurality of TTS systems.
  48. An assessment system for assessing a Text-to-Speech (TTS) system comprising i) a database comprising a plurality of test texts, associated natural speech utterances for the plurality of test texts and time-aligned annotations of the natural speech waveforms relating to the plurality of test texts ii) a perceptual model for converting speech waveforms and their associated annotations obtained from both the database and the TTS under assessment into perceptually meaningful representations iii) an error model for comparing, for each test text, perceptual data generated by the perceptual model from the TTS output and corresponding perceptual data derived from the natural speech waveforms and their annotations from the database in order to derive objective error measurements for the quality of the TTS system and iv) a cognitive model capable of mapping the objective error measurements for the quality of the TTS system derived by the error model to a subjective measurement of the quality of the TTS system.
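The claims do not prescribe an implementation, but the cognitive model of claim 38 — a third order polynomial regression fitted by least squares (i.e. minimising mean-squared error) that maps objective error measurements to a subjective quality score — can be sketched in a few lines. The sketch below uses NumPy's `polyfit`/`polyval`; the function names and all calibration data values are illustrative assumptions, not part of the patent.

```python
import numpy as np

def fit_cognitive_model(objective_errors, subjective_scores, degree=3):
    """Fit a third order polynomial regression by least squares
    (minimising mean-squared error), mapping objective error
    measurements to subjective quality scores (claim 38).
    Returns the polynomial coefficients, highest order first."""
    return np.polyfit(objective_errors, subjective_scores, degree)

def predict_subjective_quality(coeffs, objective_error):
    """Map a new objective error measurement to a predicted
    subjective quality score (e.g. a mean-opinion-score scale)."""
    return np.polyval(coeffs, objective_error)

# Illustrative (fabricated) calibration data: objective error values
# from an error model, paired with MOS-style subjective ratings.
errors = np.array([0.5, 1.0, 2.0, 3.0, 4.0, 5.0])
mos = np.array([4.6, 4.2, 3.5, 2.8, 2.1, 1.5])

coeffs = fit_cognitive_model(errors, mos)
predicted = predict_subjective_quality(coeffs, 2.5)  # score between the 2.0 and 3.0 calibration points
```

In practice such a model would be calibrated once against listening-test scores, then applied to the objective error measurements produced by the error model of claim 48 to yield the subjective measurement without further human listening tests.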
GB0504546A 2005-03-04 2005-03-04 Method and apparatus for assessing text-to-speech synthesis systems Expired - Fee Related GB2423903B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB0504546A GB2423903B (en) 2005-03-04 2005-03-04 Method and apparatus for assessing text-to-speech synthesis systems

Publications (3)

Publication Number Publication Date
GB0504546D0 GB0504546D0 (en) 2005-04-13
GB2423903A true GB2423903A (en) 2006-09-06
GB2423903B GB2423903B (en) 2008-08-13

Family

ID=34451838

Family Applications (1)

Application Number Title Priority Date Filing Date
GB0504546A Expired - Fee Related GB2423903B (en) 2005-03-04 2005-03-04 Method and apparatus for assessing text-to-speech synthesis systems

Country Status (1)

Country Link
GB (1) GB2423903B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111524505A (en) * 2019-02-03 2020-08-11 北京搜狗科技发展有限公司 Voice processing method and device and electronic equipment
CN113223559A (en) * 2021-05-07 2021-08-06 北京有竹居网络技术有限公司 Evaluation method, device and equipment for synthesized voice
CN113409826B (en) * 2021-08-04 2023-09-19 美的集团(上海)有限公司 TTS system performance test method, device, equipment and medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20010051872A1 (en) * 1997-09-16 2001-12-13 Takehiko Kagoshima Clustered patterns for text-to-speech synthesis
US20050071163A1 (en) * 2003-09-26 2005-03-31 International Business Machines Corporation Systems and methods for text-to-speech synthesis using spoken example

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015058386A1 (en) 2013-10-24 2015-04-30 Bayerische Motoren Werke Aktiengesellschaft System and method for text-to-speech performance evaluation
CN105593936A (en) * 2013-10-24 2016-05-18 宝马股份公司 System and method for text-to-speech performance evaluation
CN105593936B (en) * 2013-10-24 2020-10-23 宝马股份公司 System and method for text-to-speech performance evaluation
WO2018081970A1 (en) * 2016-11-03 2018-05-11 Bayerische Motoren Werke Aktiengesellschaft System and method for text-to-speech performance evaluation
US10950256B2 (en) 2016-11-03 2021-03-16 Bayerische Motoren Werke Aktiengesellschaft System and method for text-to-speech performance evaluation
EP3843093A3 (en) * 2020-05-21 2021-10-13 Beijing Baidu Netcom Science And Technology Co. Ltd. Model evaluation method and device, and electronic device

Also Published As

Publication number Publication date
GB2423903B (en) 2008-08-13
GB0504546D0 (en) 2005-04-13

Similar Documents

Publication Publication Date Title
Raitio et al. HMM-based speech synthesis utilizing glottal inverse filtering
Ye et al. High quality voice morphing
Ye et al. Quality-enhanced voice morphing using maximum likelihood transformations
CN102214462B (en) Method and system for estimating pronunciation
Degottex et al. Mixed source model and its adapted vocal tract filter estimate for voice transformation and synthesis
Raitio et al. Synthesis and perception of breathy, normal, and lombard speech in the presence of noise
US20100217584A1 (en) Speech analysis device, speech analysis and synthesis device, correction rule information generation device, speech analysis system, speech analysis method, correction rule information generation method, and program
Brückl et al. Aging female voices: An acoustic and perceptive analysis
GB2423903A (en) Assessing the subjective quality of TTS systems which accounts for variations between synthesised and original speech
Norrenbrock et al. Instrumental assessment of prosodic quality for text-to-speech signals
Falk et al. Towards signal-based instrumental quality diagnosis for text-to-speech systems
Möller et al. Comparison of approaches for instrumentally predicting the quality of text-to-speech systems
Türk et al. A comparison of voice conversion methods for transforming voice quality in emotional speech synthesis.
Jokisch et al. Audio and speech quality survey of the opus codec in web real-time communication
Mahdi et al. Advances in voice quality measurement in modern telecommunications
Hinterleitner Quality of Synthetic Speech
Akanksh et al. Interconversion of emotions in speech using td-psola
Hillenbrand et al. Perception of sinewave vowels
Pulakka Development and evaluation of artificial bandwidth extension methods for narrowband telephone speech
Kąkol et al. Improving objective speech quality indicators in noise conditions
Mohammadi et al. Transmutative voice conversion
Monzo et al. Voice quality modelling for expressive speech synthesis
Raitio et al. Phase perception of the glottal excitation of vocoded speech
EP1962278A1 (en) Method and device for timing synchronisation
JP5245962B2 (en) Speech synthesis apparatus, speech synthesis method, program, and recording medium

Legal Events

Date Code Title Description
PCNP Patent ceased through non-payment of renewal fee

Effective date: 20120304