WO2022229806A1 - Method and device for real-time conversion of a whispered speech into synthetic natural voice - Google Patents


Info

Publication number
WO2022229806A1
Authority
WO
WIPO (PCT)
Prior art keywords
signal
synthetic
audio samples
phonetic
module
Prior art date
Application number
PCT/IB2022/053771
Other languages
French (fr)
Inventor
Aníbal João DE SOUSA FERREIRA
Clara FERREIRA CARDOSO
João Miguel PINTO PEREIRA DA SILVA
Marco António DA MOTA OLIVEIRA
Original Assignee
Universidade Do Porto
Priority date
Filing date
Publication date
Application filed by Universidade Do Porto
Publication of WO2022229806A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use

Definitions

  • Next, the synthetic voicing signal is implanted on the input stream of audio samples in the candidate regions identified in the second step.
  • A calibrated stream of audio samples is then generated by calibrating the phonetically segmented and classified input stream, both in the candidate regions that merge with the synthetic voicing signal and in the coarticulation regions representing transitions between voiced and unvoiced speech.
  • Finally, natural speech (9) is generated by merging the synthetic voicing signal and the calibrated stream of audio samples.
  • A device for real-time conversion of a whispered speech (1) into synthetic natural voice comprises processing means adapted to implement the following modules: a phonetic class detection module (3) programmed to determine a probability distribution for a plurality of phonetic classes characterizing each region of an input stream of audio samples corresponding to a detected whispered speech signal (1);
  • A phonetic-oriented signal segmentation module (4) programmed to perform a dynamic phonetic segmentation and classification of the input stream of audio samples, using the probability distribution of phonetic classes determined by the phonetic class detection module (3), in order to identify candidate regions for implantation of a synthetic voicing signal;
  • A signal modelling and synthesis module (5) programmed to build linguistic models for synthetic voicing, based on the phonetic class segmentation determined by the phonetic-oriented signal segmentation module (4);
  • A synthetic voicing implantation module (6) adapted to receive inputs from the signal modelling and synthesis module (5) and from the phonetic-oriented signal segmentation module (4), and programmed to: i.) generate a synthetic voicing signal, based on the linguistic models built and on the phonetic segmentation and classification of the input stream of audio samples; ii.) configure the magnitude and phase properties of the synthetic voicing signal using frequency-domain synthesis and/or time-domain synthesis; and iii.) implant the synthetic voicing signal on the input stream of audio samples in the candidate regions identified by the phonetic-oriented signal segmentation module (4);
  • A local whispered speech calibration module (7) adapted to receive an input from the phonetic-oriented signal segmentation module (4) and to interact with the synthetic voicing implantation module (6), and programmed to generate a calibrated stream of audio samples by calibrating the phonetically segmented and classified input stream, both in the candidate regions that merge with the synthetic voicing signal and in the coarticulation regions representing transitions between voiced and unvoiced speech; and
  • An adaptive merge of reconstructed speech module (8) adapted to receive inputs from the synthetic voicing implantation module (6), the local whispered speech calibration module (7) and the signal modelling and synthesis module (5), and programmed to generate a natural speech sound by merging the synthetic voicing signal and the calibrated stream of audio samples.
  • A system for real-time conversion of a whispered speech (1) into synthetic natural voice comprises: the device already described; a microphone adapted to collect an input whispered speech signal; and an output audio signal module adapted to drive at least one electroacoustic transducer configured to reproduce a natural speech sound.
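The modular structure above can be summarized as a toy end-to-end pipeline. Every function body below is a trivial illustrative stand-in chosen only to make the data flow between modules (2) to (8) concrete; a spectral-flatness cue stands in for module (3) and a single sinusoid stands in for the synthetic voicing of modules (5) and (6). None of this is the patented algorithm, and all names are our own.

```python
import numpy as np

# Skeleton of the Figure 1 chain (modules 2-8); all bodies are stand-ins.

def analyze(samples):                       # module 2: frequency-domain analysis
    return np.abs(np.fft.rfft(samples))

def class_probs(spectrum):                  # module 3: phonetic class detection
    # Spectral flatness as a crude noise-vs-tone cue (illustrative only).
    flatness = np.exp(np.mean(np.log(spectrum + 1e-12))) / (np.mean(spectrum) + 1e-12)
    return {"fricative": flatness, "vowel": 1.0 - flatness}

def segment_regions(probs):                 # module 4: candidate voicing regions
    return [max(probs, key=probs.get)]

def synthesize(regions, n, fs=16000):       # modules 5-6: model and implant voicing
    t = np.arange(n)
    if "vowel" in regions:
        return np.sin(2 * np.pi * 120.0 / fs * t)  # stand-in periodic voicing
    return np.zeros(n)

def convert_whisper(samples):               # modules 7-8: calibrate and merge
    regions = segment_regions(class_probs(analyze(samples)))
    return 0.5 * samples + 0.5 * synthesize(regions, len(samples))
```

Running a pure tone through the chain classifies it as a voicing candidate, while broadband noise does not, which is the minimal behaviour the module split requires.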
  • The present invention consists of a method and device for the real-time conversion of whispered speech (1) into synthetic natural speech (9). This is achieved by a speaker talking into a microphone, providing a whispered speech signal (1) which is produced by the human phonetic apparatus without the contribution of the vocal folds in the larynx, and by implanting synthetic periodic signal components, in real-time, in selected regions of the whispered signal, thereby restoring the periodic components that are missing as a consequence of inoperative or non-existing vocal folds. This is done in such a fashion as to enhance the linguistic content of the speech, to improve the voice projection, and to convey elements of the sound signature of a specific speaker. In a preferred aspect of the method developed, the following additional sequence of steps is executed prior to the first step, which relates to determining the probability distribution of phonetic classes.
  • An input whispered speech signal (1) is converted into the input stream of audio samples.
  • Converting the input whispered speech signal (1) into the stream of audio samples comprises the step of executing an analog-to-digital conversion of the input whispered speech (1) in order to generate the stream of audio samples.
  • The input stream of audio samples is analysed in the time and frequency domains.
  • This step uses two scales: a first scale that is adapted to capture localized signal features in the time domain; and a second scale that is adapted to capture localized signal features in the frequency domain.
  • The first scale may be characterized by a high time resolution of at most 5.8 ms and a low frequency resolution of at least 86 Hz; and the second scale may be characterized by a high frequency resolution of at most 21.5 Hz and a low time resolution of at least 23.2 ms.
  • In other words, the shorter the analysis interval, the finer the temporal resolution; and the wider the bandwidth per analysis band, the coarser the frequency resolution.
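The four figures quoted above become mutually consistent under one concrete reading, which we state as an assumption since the text gives neither a sampling rate nor a transform: a sampling rate of 22050 Hz, analysis windows of 128 and 512 samples, and a frequency spacing of fs/(2N), as produced, for example, by a twofold zero-padded or odd-frequency transform. A sketch of the arithmetic:

```python
fs = 22050  # assumed sampling rate; the text does not state one

def scales(n_window, fs):
    # Time resolution: window length in ms. Frequency resolution: bin
    # spacing taken as fs/(2*n_window), the reading under which the four
    # figures quoted above become mutually consistent (our assumption).
    return n_window / fs * 1000.0, fs / (2.0 * n_window)

short_scale = scales(128, fs)   # ≈ (5.8 ms, 86.1 Hz): localizes in time
long_scale = scales(512, fs)    # ≈ (23.2 ms, 21.5 Hz): localizes in frequency
```

With a plain N-point FFT the bin spacing would instead be fs/N, so the quoted pairings suggest a design detail (such as zero-padding) that the text does not spell out.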
  • A whisper activity detection is performed by extracting a plurality of signal features from the time and frequency domain analysis of the input stream of audio samples in order to detect a whispered speech signal (1).
  • The plurality of signal features extracted from the time and frequency domain analysis of the input stream of audio samples may relate to the short-term energy gradient, the long-term energy gradient, the spectral phase purity, and the spectral magnitude structure and consistency.
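Two of the named features, the short-term and long-term energy gradients, can be illustrated as follows. The window lengths, the dB scaling and the gradient formulation are our own illustrative choices, since the text does not define the features.

```python
import numpy as np

def energy_gradients(x, fs, short_ms=10, long_ms=200):
    # Moving-average energy in dB at two time scales, then its gradient.
    # Window lengths and the dB/gradient formulation are illustrative.
    def db_energy(win_ms):
        n = max(1, int(fs * win_ms / 1000))
        e = np.convolve(x ** 2, np.ones(n) / n, mode="same")
        return 10.0 * np.log10(e + 1e-12)
    return np.gradient(db_energy(short_ms)), np.gradient(db_energy(long_ms))
```

On a signal that jumps from near-silence to whisper-level noise, the short-term gradient localizes the onset sharply, while the long-term gradient reacts earlier and over a broader span; comparing the two is one plausible way to separate transient events from sustained whisper activity.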
  • The plurality of phonetic classes may relate to "silence", "vowels", "plosives" and "fricatives", as these classes can be used to determine whether a given segmented region of whispered speech should be a candidate for synthetic voicing or not.
  • The linguistic models built for synthetic voicing may relate to at least one of, or a combination of, the following models: fundamental frequency (F0) contour model, spectral envelope model, harmonic phase structure model and energy time dynamics model. These models deal with the most elementary attributes of a voice signal and, as such, they can be used in the configuration of the synthetic voice sound signal, so that it can carry the perception of an intended phoneme.
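As one toy instance of such a model, an F0 contour can be generated by a linear declination rule, a common first-order prosodic approximation. This is only an illustration of what a "fundamental frequency contour model" may output; the start frequency, declination and frame rate are arbitrary values, not the patent's actual model.

```python
import numpy as np

def f0_contour(duration_s, frame_rate=100, f0_start=140.0, declination=-20.0):
    # Linear declination: F0 drifts down by `declination` Hz over the
    # utterance. All parameter values are illustrative.
    t = np.linspace(0.0, 1.0, int(duration_s * frame_rate))
    return f0_start + declination * t
```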
  • In a preferred embodiment, the device of the present invention further comprises an audio signal input module and a time and frequency domain analysis module (2).
  • The signal input module is adapted to convert an input whispered speech signal (1) into the input stream of audio samples.
  • The time and frequency domain analysis module (2) is programmed to analyse the input stream of audio samples generated by the audio signal input module. More particularly, in another embodiment, the time and frequency domain analysis module (2) is adapted to analyse the stream of audio samples in the time and frequency domains using two scales: a first scale adapted to capture localized signal features in the time domain; and a second scale adapted to capture localized signal features in the frequency domain.
  • The first scale may be characterized by a high time resolution of at most 5.8 ms and a low frequency resolution of at least 86 Hz;
  • The second scale is characterized by a high frequency resolution of at most 21.5 Hz and a low time resolution of at least 23.2 ms.
  • Again, the shorter the analysis interval, the finer the temporal resolution; and the wider the bandwidth per analysis band, the coarser the frequency resolution.
  • The phonetic class detection module (3) is further programmed to perform whisper activity detection by extracting a plurality of signal features from the time and frequency domain analysis in order to detect whispered speech (1).
  • The plurality of signal features extracted from the time and frequency domain analysis of the input stream of audio samples may relate to the short-term energy gradient, the long-term energy gradient, the spectral phase purity, and the spectral magnitude structure and consistency.
  • The plurality of phonetic classes may relate to "silence", "vowels", "plosives" and "fricatives".
  • The linguistic models built by the signal modelling and synthesis module (5) may relate to at least one of, or a combination of, the following models: fundamental frequency (F0) contour model, spectral envelope model, harmonic phase structure model and energy time dynamics model.
  • In one embodiment the device is a smartphone. Additionally, in another embodiment, the electroacoustic transducer is a loudspeaker.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)

Abstract

The present invention relates to a method and device for performing real-time conversion of whispered speech into synthetic natural speech, through efficient and compact parametric signal analysis, signal modelling, and signal synthesis techniques. More particularly, according to the present invention, the whispered speech signal, which is produced by the human phonetic apparatus without the contribution of the vocal folds in the larynx, is taken as input and synthetic periodic signal components are implanted in selected regions of the whispered signal, thereby restoring the periodic components that are missing as a consequence of inoperative or non-existing vocal folds. This is done in such a fashion as to enhance the linguistic content of the speech, to improve the voice projection, and to convey elements of the sound signature of a specific speaker.

Description

DESCRIPTION
METHOD AND DEVICE FOR REAL-TIME CONVERSION OF A WHISPERED SPEECH INTO SYNTHETIC NATURAL VOICE

FIELD OF THE INVENTION

The present invention is enclosed in the field of voice rehabilitation and speech enhancement. More particularly, the present invention relates to methods and devices for real-time conversion of whispered speech into synthetic natural speech.

PRIOR ART
The electrolarynx is a battery-operated external medical device that is used to artificially transform the sound of a patient's dysphonic voice, caused for example by the loss of the active voice organs, usually due to cancer of the larynx, into a voice sound that is more audible and intelligible. A voice can be classified as "dysphonic" when there are abnormalities or impairments in one or more of the following parameters of voice: pitch, loudness, quality, and variability. For example, abnormal pitch can be characterized by a voice that is too high or too low, whereas abnormal loudness can be characterized by a voice that is too quiet or too loud. Similarly, a voice that has frequent, inappropriate breaks characterizes abnormal quality, while a voice that is monotone (i.e., very flat), or that fluctuates inappropriately (e.g., diplophonia), characterizes abnormal variability.

The electrolarynx works in a mechanical way, through inducing pharyngeal vibration at a constant fundamental frequency by placing the device on the neck of the patient, close to the throat. The electrolarynx can function either indirectly by contacting the cervical skin, which induces pharyngeal vibrations, or directly through intraoral contact, which induces oral cavity vibrations through the muscles of articulation (e.g., tongue and lips). This technology is old and has hardly evolved since its invention over 100 years ago. As a consequence, the speech produced by such a device is monotonic and robotic, lacking effective pitch control, and therefore generates an unnatural voice.

Solutions exist in the art, such as patent application EP3753018, which discloses a method to convert whispered speech to normal speech through deep learning, so that the converted speech is more robust to interference and more intelligible to the listener.
More specifically, the method comprises the following steps: a first audio signal is received, including first whispered speech; a first plurality of computations is performed on the first audio signal to extract first features; the first features are provided as input to a trained deep neural network (DNN) model to obtain an output of the DNN model comprising second features; and an inverse of the first plurality of computations is performed on the second features to produce a second audio signal corresponding to a non-whispered version of the first whispered speech.
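The pipeline described in EP3753018 (analysis computations, a DNN mapping, and the inverse computations) can be sketched generically as follows. This is not that patent's actual implementation: the STFT with log-magnitude features, the frame sizes, and the reuse of the whispered phase at resynthesis are all assumptions made for the sketch, and the trained DNN is replaced by a caller-supplied function.

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    # Analysis: Hann-windowed frames -> FFT ("first plurality of computations").
    w = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * w for i in range(0, len(x) - n_fft + 1, hop)]
    return np.fft.rfft(np.array(frames), axis=1)

def istft(spec, n_fft=512, hop=128):
    # Synthesis: inverse FFT and overlap-add, normalized by the accumulated
    # window ("inverse of the first plurality of computations").
    frames = np.fft.irfft(spec, n=n_fft, axis=1)
    w = np.hanning(n_fft)
    out = np.zeros(hop * (len(frames) - 1) + n_fft)
    norm = np.zeros_like(out)
    for i, f in enumerate(frames):
        out[i * hop:i * hop + n_fft] += f
        norm[i * hop:i * hop + n_fft] += w
    return out / np.maximum(norm, 1e-8)

def convert(x, model):
    spec = stft(x)
    first_features = np.log1p(np.abs(spec))   # extracted first features
    second_features = model(first_features)   # a trained DNN would map whisper -> voiced
    mag = np.expm1(second_features)           # undo the log compression
    # Reuse the whispered phase for resynthesis (an assumption of this sketch).
    return istft(mag * np.exp(1j * np.angle(spec)))
```

With `model` set to the identity function the chain reconstructs its input almost exactly away from the signal edges, a useful sanity check before plugging in a trained network.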
In order to convert a signal of non-audible murmur, obtained through an in-vivo conduction microphone, into a speech signal that is recognizable to a receiving person with maximum accuracy, patent application US12375491 discloses a speech processing method comprising: a learning step, in which a learning calculation is conducted on a model parameter of a vocal tract feature value conversion model indicating the conversion characteristics of vocal tract acoustic feature values, on the basis of a learning input signal of non-audible murmur recorded by an in-vivo conduction microphone and a corresponding learning output signal of audible whisper recorded by a prescribed microphone, after which the learned model parameter is stored in a prescribed storage means; and a speech conversion step, in which a non-audible speech signal obtained through an in-vivo conduction microphone is converted into a signal of audible whisper, based on the vocal tract feature value conversion model with the learned model parameter obtained through the learning step set thereto.
Patent application JP2002225393 provides a device that observes a speaker's action of speaking intentionally without vibrating the vocal cords, approximately generates the voice signal that would be obtained if the speaker vibrated the vocal cords as usual, according to the observation information, and sends the signal out through a communication line for communication by electronic means. More specifically, a whisper voice is artificially converted into a normal voice signal by a means derived from given pitch control information and a scheduled uniform or scene-adaptive rule applied to the time-modified short-time autocorrelation function of a filtered input signal. Further, a compromise filter is provided which arbitrarily filters the resulting synthetic voice sound signal to suit audibility.
In conclusion, the existing solutions embody two distinct approaches: one involves the complete reconstruction of the voice signal based on learned signal models, while the second uses the whispered speech signal as raw material, generating its missing parts and adding them in order to reconstruct the speech signal.
PROBLEM TO BE SOLVED
The present invention intends to develop a method that artificially transforms the sound of a dysphonic voice into a voice sound that is more audible and intelligible, in order to enhance the linguistic content of the speech, improve projection, and convey elements of the sound signature of a specific speaker. The known methods are unable to achieve the desired precision and quality in generating natural speech because they do not allow dynamic and independent control of the synthetic periodic components, which would make it possible to flexibly calibrate the quality of the signal conversion process.
The present solution intends to innovatively overcome such issues.
DESCRIPTION OF FIGURES
Figure 1 - representation of an embodiment of the processing modules of the device for real-time conversion of a whispered speech into synthetic natural voice according to the invention. The numeric references represent:
1 - Whispered speech signal (input);
2 - Time and frequency domain analysis module;
3 - Phonetic class detection module;
4 - Phonetic-oriented signal segmentation module;
5 - Signal modelling and synthesis module;
6 - Synthetic voicing implantation module;
7 - Local whispered speech calibration module;
8 - Adaptive merge of reconstructed speech module;
9 - Natural speech (output).
SUMMARY OF THE INVENTION
The objective of the present invention is to allow individuals suffering from a temporary or permanent health condition that prevents them from producing a natural speech (i.e., meaning that their oral communication can take place on the basis of whispered speech only), to have a natural and pleasant dialogue.
According to the principles of the present invention, this objective can be achieved by relying on efficient parametric signal representation, processing and signal transformation techniques. More particularly, the whispered speech signal (1), which is produced by the human phonetic apparatus without the contribution of the vocal folds (or vocal cords) in the larynx, is taken as input and is converted into synthetic natural speech (9) in real-time, so that synthetic periodic signal components can be implanted in selected regions of said whispered signal (1), thereby restoring the periodic components that are missing as a consequence of inoperative or non-existing vocal folds. As a consequence, the linguistic content of the speech is enhanced, voice projection is more effective, and the sound signature of a given speaker is audible, personalized, and easily recognizable.
In order to achieve these advantages, the method and device developed rely heavily on advanced signal processing techniques, implementing efficient and compact parametric signal analysis, signal modelling and signal synthesis techniques that allow the independent control of the defining parameters of the synthetic periodic components, thereby making it possible to calibrate the quality of the signal conversion process with a flexibility that is not permitted by competing technologies. It is therefore an object of the present invention to provide a method for real-time conversion of a whispered speech (1) into synthetic natural speech (9). The method comprises the sequence of steps described below.
In a first step of the method, a set of time-based, spectral-based and cross-correlation-based features is extracted; these features determine a probability distribution over a plurality of phonetic classes characterizing each region of an input stream of audio samples corresponding to a detected whispered speech (1).
In a second step, a dynamic phonetic segmentation and classification of the input stream of audio samples is performed, using the probability distribution of phonetic classes, in such a manner as to maximize the plausibility of the estimated phonetic class, in order to identify candidate regions for implantation of a synthetic voicing signal.
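A segmentation that "maximizes the plausibility of an estimated phonetic class" can be sketched as a Viterbi-style search over per-frame class probabilities, with a penalty on every class switch so that isolated low-confidence frames do not fragment the segmentation. The class inventory and penalty value are illustrative assumptions, not the patented procedure.

```python
import numpy as np

CLASSES = ["silence", "vowel", "plosive", "fricative"]  # classes named in the text

def segment(probs, switch_penalty=1.0):
    # Viterbi-style search: pick the class sequence with maximum total
    # log-probability, paying `switch_penalty` (in log units) per change.
    T, K = probs.shape
    logp = np.log(probs + 1e-12)
    score = logp[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        trans = score[:, None] - switch_penalty * (1.0 - np.eye(K))
        back[t] = trans.argmax(axis=0)
        score = trans.max(axis=0) + logp[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return [CLASSES[k] for k in reversed(path)]
```

For example, five frames of strong "vowel" evidence with one ambiguous middle frame come out as a single vowel segment under the default penalty, whereas a zero penalty degenerates to frame-by-frame argmax and lets the middle frame flip.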
Then, in a third step of the method, linguistic models for synthetic voicing are built, based on the phonetic class segmentation executed in the second step.
In a fourth step, a synthetic voicing signal is generated, based on the linguistic models and on the phonetic segmentation and classification of the input stream of audio samples.
In a fifth step, the magnitude and phase properties of the synthetic voicing signal are configured using frequency-domain synthesis and/or time-domain synthesis, in order to allow explicit control of the prosodic rules governing the fundamental frequency (F0) contour, of coarticulation rules, and of signal features that represent idiosyncratic traits of a voice signature. In more detail, the synthetic voicing signal to be implanted over the whispered speech signal (1) has a periodic nature, although variable over time, with a harmonic spectral structure in the frequency domain that is controllable in two respects: its spectral structure of harmonic phase and its spectral structure of harmonic magnitude. The spectral structure of harmonic phase is mainly defined by (i.e., it is a consequence of) the specific pattern of vibration of the speaker's vocal folds and, as such, carries eminently idiosyncratic information. This phase structure is reasonably independent of the linguistic content of the speech produced by the speaker; therefore, it can be programmed as a fixed model, or as several models that are used adaptively depending on the type of phoneme implanted. The spectral structure of harmonic magnitude, on the other hand, conveys the most important linguistic information: above all the meaning of the specific phoneme (for example a given vowel), but also an idiosyncratic component that results from resonant frequencies specific to a given speaker (frequencies that result from the specific, that is, idiosyncratic, dimensions and configurations of the vocal tract and of the tissues that line it).
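The independent control of harmonic magnitude and harmonic phase described above can be sketched as follows. This is an illustrative sum-of-harmonics synthesizer only: `mag_env` (the spectral envelope carrying linguistic content) and `phase_model` (the per-harmonic phase offsets carrying the idiosyncratic signature) are hypothetical callables supplied by the caller.

```python
import numpy as np

def synth_harmonic(f0, dur, sr, mag_env, phase_model):
    """Synthesize a harmonic (voiced) signal in which the harmonic
    magnitudes and the harmonic phases are controlled independently.
    mag_env:     callable, frequency (Hz) -> amplitude (spectral envelope)
    phase_model: callable, harmonic index -> phase offset (radians)"""
    n = int(dur * sr)
    t = np.arange(n) / sr
    nharm = int((sr / 2) // f0)  # harmonics up to Nyquist
    y = np.zeros(n)
    for k in range(1, nharm + 1):
        fk = k * f0
        y += mag_env(fk) * np.cos(2 * np.pi * fk * t + phase_model(k))
    return y / (np.max(np.abs(y)) + 1e-12)  # peak-normalize
```

Because the two controls are separate arguments, the same magnitude envelope can be rendered with different phase models (different voice signatures), and vice versa, which is the flexibility the text emphasizes.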
Given that the phonetic diversity of the phonemes to be synthesized and implanted is dictated by the phonetic diversity of the language, several pre-programmed models can be used adaptively in the synthesis in order to correspond to the cues that may be inferred from the whispered speech. Therefore, a completely innovative aspect of the method developed is that, in the synthesis of the synthetic signal to be implanted, it acts independently both on the harmonic phase and on the harmonic magnitude, before their combination. This is a great advantage, as it provides additional flexibility in the configuration of the synthesized sound, that is, it helps to emphasize the sound signature of a given desired voice.
In a sixth step of the method, the synthetic voicing signal is implanted on the input stream of audio samples in the candidate regions identified in the second step.
In a seventh step, a calibrated stream of audio samples is generated by calibrating the segmented, phonetically classified input stream of audio samples, both in the candidate regions that merge with the synthetic voicing signal and in the coarticulation regions representing a transition between voiced and unvoiced speech.
Finally, in an eighth step of the method developed, natural speech (9) is generated by merging the synthetic voicing signal and the calibrated stream of audio samples.
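The implantation and merging of the sixth to eighth steps may be illustrated, in a deliberately simplified form, by a crossfaded splice: the synthetic voiced segment replaces the whispered samples in a candidate region, with linear fades at the boundaries standing in for the coarticulation-aware calibration and adaptive merge. The fade length is a hypothetical parameter.

```python
import numpy as np

def merge_with_crossfade(whisper, voiced, start, fade=50):
    """Implant `voiced` into `whisper` at sample index `start`, with
    linear crossfades at both boundaries of the implanted region
    (a simple stand-in for the adaptive merge of the eighth step)."""
    out = whisper.copy()
    end = start + len(voiced)
    out[start:end] = voiced
    ramp = np.linspace(0.0, 1.0, fade)
    # fade-in: whispered signal hands over to the synthetic voicing
    out[start:start + fade] = (1 - ramp) * whisper[start:start + fade] + ramp * voiced[:fade]
    # fade-out: synthetic voicing hands back to the whispered signal
    out[end - fade:end] = ramp[::-1] * voiced[-fade:] + (1 - ramp[::-1]) * whisper[end - fade:end]
    return out
```

In the actual method the merge is adaptive and driven by the segmentation and the linguistic models, rather than by fixed linear ramps.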
It is another object of the present invention, a device for real-time conversion of a whispered speech (1) into synthetic natural voice (9). Said device comprises processing means adapted to implement: — A phonetic class detection module (3) programmed to determine a probability distribution for a plurality of phonetic classes characterizing each region of an input stream of audio samples corresponding to a detected whispered speech signal (1);
— A phonetic-oriented signal segmentation module (4) programmed to perform a dynamic segmentation phonetic classification of the input stream of audio samples using the probability distribution of phonetic classes determined by the phonetic class detection module (3) in order to identify candidate regions for implantation of a synthetic voicing signal;
— A signal modelling and synthesis module (5) programmed to build linguistic models for synthetic voicing, based on the phonetic class segmentation determined by the phonetic-oriented signal segmentation module (4);
— A synthetic voicing implantation module (6) adapted to receive an input from the signal modelling and synthesis module (5) and an input from the phonetic-oriented signal segmentation module (4), and programmed to: i.) generate a synthetic voicing signal, based on the linguistic models built and on the dynamic segmentation phonetic classification of the input stream of audio samples, ii.) configure the magnitude and phase properties of the synthetic voicing signal using frequency-domain synthesis and/or time- domain synthesis; and iii.) implant the synthetic voicing signal on the input stream of audio samples in the candidate regions identified by the phonetic- oriented signal segmentation module (4);
— A local whispered speech calibration module (7) adapted to receive an input from the phonetic-oriented segmentation module (4) and to interact with the synthetic voicing implantation module (6), and programmed to generate a calibrated stream of audio samples by calibrating the segmented, phonetically classified input stream of audio samples, both in the candidate regions that merge with the synthetic voicing signal and in the coarticulation regions representing a transition between voiced and unvoiced speech; and — An adaptive merge of reconstructed speech module (8) adapted to receive inputs from the synthetic voicing implantation module (6), the local whispered speech calibration module (7) and the signal modelling and synthesis module (5), and programmed to generate a natural speech sound by merging the synthetic voicing signal and the calibrated stream of audio samples.
It is another object of the present invention, a system for real-time conversion of a whispered speech (1) into synthetic natural voice (9). Said system comprises: the device already described; a microphone adapted to collect an input whispered speech signal; and an output audio signal module adapted to drive at least one electroacoustic transducer configured to reproduce a natural speech sound.
DETAILED DESCRIPTION
The more general and advantageous configurations of the present invention are described in the Summary of the invention. Such configurations are detailed below in accordance with other advantageous and/or preferred embodiments of implementation of the present invention.
The present invention consists of a method and device for the real-time conversion of whispered speech (1) into synthetic natural speech (9). This is achieved by a speaker talking into a microphone, which provides a whispered speech signal (1) produced by the human phonetic apparatus without the contribution of the vocal folds in the larynx, and by implanting synthetic periodic signal components, in real time, in selected regions of the whispered signal, thereby restoring the periodic components missing as a consequence of inoperative or non-existing vocal folds. This is done in such a fashion as to enhance the linguistic content of the speech, to improve the voice projection, and to convey elements of the sound signature of a specific speaker. In a preferred aspect of the method developed, the following additional sequence of steps is executed prior to the first step related to determining the probability distribution of phonetic classes.
In a first additional step, an input whispered speech signal (1) is converted into the input stream of audio samples. In a particular embodiment of the method, converting the input whispered speech signal (1) into the stream of audio samples comprises the step of executing an analog-to-digital conversion of the input whispered speech (1) in order to generate the stream of audio samples.
In a second additional step, the input stream of audio samples is analysed in the time and frequency domains. In one embodiment of the method, this step uses two scales: a first scale that is adapted to capture localized signal features in the time domain; and a second scale that is adapted to capture localized signal features in the frequency domain. More particularly, the first scale may be characterized by a high time resolution of at most 5.8 ms and a low frequency resolution of at least 86 Hz; and the second scale may be characterized by a high frequency resolution of at most 21.5 Hz and a low time resolution of at least 23.2 ms. In this context, the shorter the analysis interval, the higher the temporal resolution; and the wider the bandwidth of an analysis bin, the lower the frequency resolution.
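The two analysis scales can be sketched as two short-time Fourier transforms with different window lengths. The sketch below assumes a 44.1 kHz sampling rate and 2x zero-padded FFTs, under which a 256-sample window corresponds to 5.8 ms with 86 Hz bins and a 1024-sample window to 23.2 ms with 21.5 Hz bins; these are assumptions consistent with, but not stated in, the figures above.

```python
import numpy as np

def two_scale_stft(x, sr=44100):
    """Analyse x at two time-frequency scales (magnitude spectrograms):
    a short window for high time resolution and a long window for high
    frequency resolution, each with a 2x zero-padded FFT."""
    def stft(sig, win, nfft, hop):
        w = np.hanning(win)
        frames = [np.abs(np.fft.rfft(w * sig[i:i + win], nfft))
                  for i in range(0, len(sig) - win + 1, hop)]
        return np.array(frames)
    fine_time = stft(x, 256, 512, 128)    # ~5.8 ms windows, ~86 Hz bins
    fine_freq = stft(x, 1024, 2048, 512)  # ~23.2 ms windows, ~21.5 Hz bins
    return fine_time, fine_freq
```

The first spectrogram localizes transients (e.g. plosive bursts) in time, while the second resolves closely spaced spectral structure, which is what the two-scale analysis is intended to provide.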
In a third additional step of the method, a whisper activity detection is performed by extracting a plurality of signal features from the time and frequency domain analysis of the input stream of audio samples in order to detect a whispered speech signal (1). The plurality of signal features extracted from the time and frequency domain analysis of the input stream of audio samples may relate to short-term energy gradient, long-term energy gradient, spectral phase purity, spectral magnitude structure and consistency.
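A toy version of such a whisper activity detector, using only the energy-gradient idea, is sketched below; the threshold and the running-median noise floor are illustrative assumptions, and the full detector additionally uses the spectral phase and magnitude features listed above.

```python
import numpy as np

def whisper_activity(x, frame_len=256, hop=128, thresh=3.0):
    """Flag frames whose short-term energy exceeds a multiple of the
    long-term energy floor (here, the median frame energy). This is a
    minimal stand-in for the feature-based whisper activity detection."""
    energies = np.array([np.mean(x[i:i + frame_len] ** 2)
                         for i in range(0, len(x) - frame_len + 1, hop)])
    floor = np.median(energies) + 1e-12  # long-term energy estimate
    return energies > thresh * floor
```

Frames dominated by low-level background remain below the adaptive threshold, while frames containing whispered speech energy rise above it.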
In one preferred embodiment of the method, the plurality of phonetic classes may relate to "silence", "vowels", "plosives" and "fricatives", as these classes can be used to determine whether a given segmented region of whispered speech should be a candidate for synthetic voicing or not. In another preferred embodiment of the method, the linguistic models built in the third step of the method may relate to at least one of, or a combination of, the following models: fundamental frequency (F0) contour model, spectral envelope model, harmonic phase structure model and energy time dynamics model. These models deal with the most elementary attributes of a voice signal and, as such, can be used in the configuration of the synthetic voice signal so that it carries the perception of an intended phoneme.
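As one concrete (and purely illustrative) instance of the first of these models, an F0 contour can be generated from two simple prosodic rules: a linear declination across the utterance and a final fall. The starting frequency, declination slope and fall depth below are hypothetical parameters, not values prescribed by the invention.

```python
import numpy as np

def f0_contour(n_frames, f0_start=180.0, decl=-0.3, final_fall=0.85):
    """Minimal F0 contour model: linear declination over the utterance
    plus a multiplicative final fall over the last tenth of the frames."""
    f0 = f0_start + decl * np.arange(n_frames, dtype=float)
    tail = max(1, n_frames // 10)
    f0[-tail:] *= np.linspace(1.0, final_fall, tail)
    return f0
```

Such a contour would drive the fundamental frequency of the synthetic voicing signal frame by frame, giving the reconstructed speech a declarative intonation.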
Regarding the device of the present invention, in a preferred embodiment it further comprises an audio signal input module and a time and frequency domain analysis module (2). The signal input module is adapted to convert an input whispered speech signal (1) into the input stream of audio samples. The time and frequency domain analysis module (2) is programmed to analyse the input stream of audio samples generated by the audio signal input module. More particularly, in another embodiment, the time and frequency domain analysis module (2) is adapted to analyse the stream of audio samples in the time and frequency domains using two scales: a first scale adapted to capture localized signal features in the time domain; and a second scale adapted to capture localized signal features in the frequency domain. The first scale may be characterized by a high time resolution of at most 5.8 ms and a low frequency resolution of at least 86 Hz, and the second scale may be characterized by a high frequency resolution of at most 21.5 Hz and a low time resolution of at least 23.2 ms. In this context, the shorter the analysis interval, the higher the temporal resolution; and the wider the bandwidth of an analysis bin, the lower the frequency resolution.
In another embodiment of the device, the phonetic class detection module (3) is further programmed to perform whisper activity detection by extracting a plurality of signal features from the time and frequency domain analysis in order to detect whispered speech (1). The plurality of signal features extracted from the time and frequency domain analysis of the input stream of audio samples may relate to short-term energy gradient, long-term energy gradient, spectral phase purity, spectral magnitude structure and consistency. In another embodiment of the device, the plurality of phonetic classes may relate to "silence", "vowels", "plosives" and "fricatives".
In another embodiment of the device, the linguistic models built by the signal modelling and synthesis module (5) may relate to at least one of or a combination of the following models: fundamental frequency (F0) contour model, spectral envelope model, harmonic phase structure model and energy time dynamics model.
Regarding the system of the present invention, in one embodiment the device is a smartphone. Additionally, in another embodiment, the electroacoustic transducer is a loudspeaker.
As will be clear to one skilled in the art, the present invention should not be limited to the embodiments described herein, and a number of changes are possible which remain within the scope of the present invention.
Of course, the preferred embodiments shown above are combinable in the different possible forms, the explicit repetition of all such combinations being avoided herein.

Claims

1. Method for real-time conversion of a whispered speech (1) into synthetic natural voice (9) comprising the following steps:
— Determining a probability distribution of a plurality of phonetic classes characterizing each region of an input stream of audio samples corresponding to a detected whispered speech (1);
— Performing a dynamic segmentation phonetic classification of the input stream of audio samples using the probability distribution of phonetic classes, in order to identify candidate regions for implantation of a synthetic voicing signal;
— Building linguistic models for synthetic voicing based on the phonetic class segmentation;
— Generating a synthetic voicing signal, based on the linguistic models and on the dynamic segmentation phonetic classification of the input stream of audio samples;
— Configuring the magnitude and the phase properties of the synthetic voicing signal using frequency-domain synthesis and/or time-domain synthesis;
— Implanting the synthetic voicing signal on the input stream of audio samples in the candidate regions identified;
— Generating a calibrated stream of audio samples by calibrating the segmented, phonetically classified input stream of audio samples, both in the candidate regions that merge with the synthetic voicing signal and in coarticulation regions representing a transition between voiced and unvoiced speech; and
— Generating a natural speech (9) by merging the synthetic voicing signal and the calibrated stream of audio samples.
2. Method according to claim 1, comprising the following steps prior to the step of determining the probability distribution of phonetic classes: — Converting an input whispered speech signal (1) into the input stream of audio samples;
— Analysing the input stream of audio samples in the time and frequency domains; and
— Performing a whisper activity detection by extracting a plurality of signal features from the time and frequency domain analysis of the input stream of audio samples in order to detect a whispered speech signal (1).
3. Method according to claim 2, wherein converting the input whispered speech signal (1) into the stream of audio samples comprises the step of executing an analog-to-digital conversion of the input whispered speech (1) in order to generate the stream of audio samples.
4. Method according to claims 2 or 3, wherein the step of analysing the input stream of audio samples in the time and frequency domains uses two scales: a first scale that is adapted to capture localized signal features in the time domain; and a second scale that is adapted to capture localized signal features in the frequency domain.
5. Method according to claim 4, wherein the first scale is characterized by a high time resolution of at most 5.8 ms and a low frequency resolution of at least 86 Hz; and the second scale is characterized by a high frequency resolution of at most 21.5 Hz and a low time resolution of at least 23.2 ms.
6. Method according to any of the claims 2 to 5, wherein the plurality of signal features extracted from the time and frequency domain analysis of the input stream of audio samples relates to: short-term energy gradient, long-term energy gradient, spectral phase purity, spectral magnitude structure and consistency.
7. Method according to any of the previous claims, wherein the plurality of phonetic classes comprises: "silence", "vowels", "plosives" and "fricatives".
8. Method according to any of the previous claims, wherein the linguistic models built relate to at least one of or a combination of the following models: fundamental frequency (F0) contour model, spectral envelope model, harmonic phase structure model and energy time dynamics model.
9. Device for real-time conversion of a whispered speech into synthetic natural voice comprising processing means adapted to implement:
— A phonetic class detection module (3) programmed to determine a probability distribution for a plurality of phonetic classes characterizing each region of an input stream of audio samples corresponding to a detected whispered speech signal;
— A phonetic-oriented signal segmentation module (4) programmed to perform a dynamic segmentation phonetic classification of the input stream of audio samples using the probability distribution of phonetic classes determined by the phonetic class detection module (3) in order to identify candidate regions for implantation of a synthetic voicing signal;
— A signal modelling and synthesis module (5) programmed to build linguistic models for synthetic voicing, based on the phonetic class segmentation determined by the phonetic-oriented signal segmentation module (4);
— A synthetic voicing implantation module (6) adapted to receive an input from the signal modelling and synthesis module (5) and an input from the phonetic-oriented signal segmentation module (4), and programmed to: i.) generate a synthetic voicing signal, based on the linguistic models built and on the dynamic segmentation phonetic classification of the input stream of audio samples, ii.) configure the magnitude and phase properties of the synthetic voicing signal using frequency-domain synthesis and/or time- domain synthesis; and iii.) implant the synthetic voicing signal on the input stream of audio samples in the candidate regions identified by the phonetic- oriented signal segmentation module (4);
— A local whispered speech calibration module (7) adapted to receive an input from the phonetic-oriented segmentation module (4) and to interact with the synthetic voicing implantation module (6), and programmed to generate a calibrated stream of audio samples by calibrating the segmented, phonetically classified input stream of audio samples, both in the candidate regions that merge with the synthetic voicing signal and in coarticulation regions representing a transition between voiced and unvoiced speech; and — An adaptive merge of reconstructed speech module (8) adapted to receive inputs from the synthetic voicing implantation module (6), the local whispered speech calibration module (7) and the signal modelling and synthesis module (5), and programmed to generate a natural speech (9) sound by merging the synthetic voicing signal and the calibrated stream of audio samples.
10. Device according to claim 9, further comprising:
— an audio signal input module, adapted to convert an input whispered speech signal (1) into the input stream of audio samples; and — a time and frequency domain analysis module (2) programmed to analyse the input stream of audio samples generated by the audio signal input module.
11. Device according to claim 10, wherein the time and frequency domain analysis module (2) is adapted to analyse the stream of audio samples in the time and frequency domains using two scales: a first scale adapted to capture localized signal features in the time domain; and a second scale adapted to capture localized signal features in the frequency domain.
12. Device according to claim 11, wherein the first scale is characterized by a high time resolution of at most 5.8 ms and a low frequency resolution of at least 86 Hz; and the second scale is characterized by a high frequency resolution of at most 21.5 Hz and a low time resolution of at least 23.2 ms.
13. Device according to any of the previous claims 10 to 12, wherein the phonetic class detection module (3) is further programmed to perform whisper activity detection by extracting a plurality of signal features from the time and frequency domain analysis in order to detect whispered speech.
14. Device according to claim 13, wherein the plurality of signal features extracted from the time and frequency domain analysis of the input stream of audio samples relates to: short-term energy gradient, long-term energy gradient, spectral phase purity, spectral magnitude structure and consistency.
15. Device according to any of the previous claims 9 to 14, wherein the plurality of phonetic classes comprises: "silence", "vowels", "plosives" and "fricatives".
16. Device according to any of the previous claims 9 to 15, wherein the linguistic models built relate to at least one of or a combination of the following models: fundamental frequency (F0) contour model, spectral envelope model, harmonic phase structure model and energy time dynamics model.
17. System for real-time conversion of a whispered speech into synthetic natural voice comprising:
— a device according to any of claims 9 to 16;
— a microphone adapted to collect an input whispered speech signal (1); and — an output audio signal module adapted to drive at least one electroacoustic transducer configured to reproduce a natural speech (9) sound.
18. System according to claim 17, wherein the device is a smartphone.
19. System according to claim 17 or 18, wherein the electroacoustic transducer is a loudspeaker.
PCT/IB2022/053771 2021-04-26 2022-04-22 Method and device for real-time conversion of a whispered speech into synthetic natural voice WO2022229806A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
PT11719821 2021-04-26
PT117198 2021-04-26

Publications (1)

Publication: WO2022229806A1 (en), published 2022-11-03


Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002225393A (en) 2001-01-30 2002-08-14 Fuji Photo Film Co Ltd Cleaning method for recording medium and recording apparatus
EP3753018A1 (en) 2018-04-10 2020-12-23 Huawei Technologies Co., Ltd. A method and device for processing whispered speech



