WO1998055991A1

WO1998055991A1 - Method and apparatus for reproducing a recorded voice with alternative performance attributes and temporal properties

Info

Publication number: WO1998055991A1
Application number: PCT/GB1998/001463
Authority: WO
Inventors: Ken Lomax
Original assignee: Isis Innovation Limited
Priority date: 1997-06-02
Filing date: 1998-05-21
Publication date: 1998-12-10
Also published as: EP0986807A1; JP2002502510A; GB9711339D0

Abstract

The voice morpher creates a hybrid performance of a work, such as a song, using two separate encodings of the work. One of the encodings is reproduced with alternative performance attributes and temporal properties taken from the other of the two encodings. The voice morpher is particularly suitable for use with a karaoke machine as it enables the singer to actually sound like the original performer.

Description

METHOD AND APPARATUS FOR REPRODUCING A RECORDED VOICE WITH ALTERNATIVE PERFORMANCE ATTRIBUTES AND

TEMPORAL PROPERTIES

The present invention relates to a method and apparatus for reproducing a recorded voice with alternative performance attributes and temporal properties. The present invention is particularly suited for use as an alternative to a conventional karaoke machine. For the sake of simplicity, the present invention is referred to herein as a voice morpher. Karaoke machines have become increasingly popular both in Japan where they originated and elsewhere. A karaoke machine enables a person, hereafter referred to as the user, to sing along to a backing track of any familiar song, hereafter referred to as the current song. The user sings into a microphone and his/her voice is mixed with the accompaniment before being played out through an amplifier and speakers. The output from the speakers is thus a combination of the pre-recorded accompaniment track and the user's voice. The user has the impression therefore of being a singer in a band. Unfortunately, whilst most people sing reasonably well, there are many singers who wish to sound better or more like the original artist who initially sang the current song. The karaoke machine does not accommodate this problem as the user's own voice is reproduced exactly.

Additionally it is becoming increasingly common in certain circumstances for professional singers to mime to an existing recording when performing. This can be disconcerting to an audience when the singer fails to mime accurately. Indeed, extremely small discrepancies between the movements of a singer's mouth and the sound heard are noticeable and distracting to an audience.

With the present invention, the performance of an original artist singing the current song is analysed and encoded. When the user then sings the same song certain properties of the user's voice, hereafter referred to as vocal attributes, control the manner in which the artist's performance is reproduced in real time. By reproducing the artist's voice under guidance of the vocal attributes of the user's voice, the user is given the impression of singing the current song in the voice of the original artist. The present invention provides apparatus for reproducing a recorded voice with alternative performance attributes and temporal properties comprising template means having an encoding of an artist's performance of a work; vocal attribute means for determining the vocal attributes of a separate performance of the work; a locator for performing temporal mapping between the encoding of the artist's performance and the vocal attributes; and a synthesis device for combining data from the encoding of the artist's performance with one or more of the vocal attributes to produce a hybrid performance of the work. Ideally, the apparatus is arranged to reproduce the recorded voice with alternative performance attributes and temporal properties substantially in real time. That is to say, the vocal attribute means, the locator and the synthesis device are adapted to generate their respective outputs in a time period which is sufficiently short to be unnoticeable. Separately, the present invention provides a vocal encoder for generating an encoding of an artist's performance of a work suitable for use with the above mentioned template means, the vocal encoder comprises sampling means for dividing the artist's vocal performance of the work into a plurality of samples, each of the samples partially overlapping at least one other sample; and an analyser for separately extracting data representative of the vocal performance from each of the plurality of samples and for generating a vox encoding consisting of the extracted data identified with respect to the location of the respective sample in the work. The type of data extracted by the analyser is chosen to enable the substitution or the combining of alternative vocal characteristics so that the data can be used to generate a hybrid performance of the work.

Preferably, the artist's vocal performance is encoded to separately include data on voiced and unvoiced components, the fundamental frequency and associated harmonics of the voiced components, the spectral tilt and the amplitude. Moreover, in a preferred embodiment the template means includes the vocal encoder for encoding the artist's performance of the work. The vocal attribute means may provide data on the presence of voiced or unvoiced components, the fundamental frequency of the voiced components, the spectral tilt and the amplitude. Both the encoding of the artist's performance and the vocal attributes may additionally include separate cue data enabling temporal mapping. In a preferred embodiment the apparatus further includes accompaniment means for storing and reproducing an accompaniment to the work. The synthesis device preferably combines at least one of the following vocal attributes: fundamental frequency, spectral tilt and amplitude with data from the vox encoding of the artist's performance. The present invention provides in a further aspect a method of reproducing a recorded voice with alternative performance attributes and temporal properties comprising determining vocal attributes of a performance of a work; performing temporal mapping between an encoded artist's performance of the work and the vocal attributes; and combining data from the encoding of the artist's performance with one or more of the vocal attributes to produce a hybrid performance of the work.

A feature of this invention is the potential for temporal modification of the encoded vocal performance. This may have at least two other applications in addition to the karaoke application described below. For example, in the case of miming sung performances, the application could be used to improve synchronisation between the artist's pre-recorded voice and lip movements. Secondly the application may be used in the film industry to synchronise voice-overs more precisely with the lip movements of an actor. Although reference is made generally herein to singing, the apparatus and method may be employed for other types of vocal works such as speeches, readings and recitals.

An embodiment of the present invention will now be described by way of example with reference to the accompanying drawing, in which Figure 1 is a schematic diagram of a voice morpher in accordance with the present invention.

The voice morpher will be described with reference to its application as an improved karaoke machine. As shown in Figure 1 , the voice morpher comprises four main components; accompaniment means 14 for storing and re-synthesising musical accompaniments, template means 17 for encoding and storing an artist's vocal rendition of one or more songs, a microphone tracker 10 and a voice synthesiser 20, each of which will be described in detail.

Accompaniment means for storing and re-synthesising musical accompaniments

The accompaniment means 14 for storing and re-synthesising musical accompaniments consists of a MIDI sequencer 15 and a MIDI synthesiser 16. The MIDI sequencer 15 holds data defining accompaniments to one or more songs. When a song has been selected, i.e. the current song, by actuation of appropriate control means, the MIDI sequencer 15 transmits its sequence of MIDI commands to the MIDI synthesiser 16 under guidance of a timer 22, thus causing the accompaniment for the current song to be played through one or more speakers 24. The microphone tracker

When the accompaniment is played through the speakers 24, the user sings along to the accompaniment into a microphone 1 1 . The signal from the microphone is sampled into an input buffer 12 of length typically 500 samples at a sampling rate typically of 16 kHz and a resolution of 16 bits. The precise sampling rate, sampling resolution and length of the input buffer may vary between applications. When the input buffer 12 is full, its data, hereafter termed the input signal, is supplied to a first vocal analyser 13, preferably for immediate analysis. Meanwhile the input buffer 12 is cleared and begins once again to fill with data sampled from the microphone 1 1.

The vocal analyser 13 performs analysis of the input signal to determine the vocal attributes of the input signal. The vocal attributes may include amplitude, voicing characteristics, fundamental frequency and spectral tilt. Reference to spectral tilt is intended as reference to the variation in the intensity of harmonics for a given fundamental frequency. This vocal attribute is characteristic of the strength with which a note is sung. Other vocal attributes may additionally be analysed where necessary. Considering the vocal attributes referred to above in turn, the amplitude is determined by the maximum sample in the input signal. The voicing characteristic may be one of three alternatives; silent, voiced or unvoiced and can be determined using one of several known voicing analysis procedures. Preferably, the voicing characteristic is determined using a zero-crossing and amplitude analysis. This analysis involves the following steps: the maximum sample in the input signal is located, if the magnitude of the maximum sample does not surpass a preset silence threshold, the voicing characteristic is deemed silent; if the silence threshold is surpassed, the number of zero-crossings in the input signal is determined and if the number of zero-crossings surpasses a zero-crossing threshold the input signal is deemed unvoiced, otherwise voiced. When the voicing characteristic of the input signal is deemed to be voiced, the fundamental frequency is derived. The fundamental frequency may be determined using one of several known fundamental frequency estimating procedures, the most common of which is cepstral analysis. Cepstral analysis yields an approximation of the fundamental frequency which may be used to identify the precise location of the fundamental peak in the Fourier transform of the input signal, by means of, for example, a parabolic interpolation. This enables an accurate estimation of the fundamental frequency. The spectral tilt is defined simply as the slope of the best straight-line fit to the Fourier transform of the input signal.

Additional parameters which may be described by the vocal analyser include a description of the user's lip movements by means of a visual analyser (Not shown). Other routines may be used to determine the vocal attributes to those specified above, if appropriate. The analysis procedures described above which are used to determine the vocal attributes can operate at a high speed and can take typically 0.01 seconds to produce an output Thus, the vocal attributes output from the microphone tracker 10 which describe the user's voice can be generated substantially in real time (that is, sufficiently fast that any delay is unnoticeable).

In alternative applications of the voice morpher this speed of analysis may not be necessary. Where the analysis need not be performed in real time, the analyser 13 may perform a more detailed analysis than that described above. Also, data on additional vocal attributes may be generated.

Template means for encoding and storing an artist's vocal rendition of the current song The template means 17 comprises a second vocal analyser 18 and means for storing the resulting encoding 19. A complete vocal track of an artist singing the current song is sampled by the second vocal analyser 18 at a sampling rate typically of 44 kHz and a sampling resolution typically of 16 bits producing a waveform. The precise sampling rate and resolution may vary between applications. When the vocal track is sampled a sequence of analysis windows of duration typically 0.05 seconds and centred typically 0.01 seconds apart are applied to the waveform producing a sequence of short-time analysis waveforms. Thus, each of the analysis windows overlaps with neighbouring analysis windows.

Each short-time analysis waveform is then analysed using established procedures to determine its properties. The analysis can provide, amongst others, descriptions of the voiced component, the unvoiced component, the fundamental frequency, voicing characteristics, amplitude and lip position. The resulting data for each short-time analysis waveform is then stored as a respective analysis frame. The result is a sequence of analysis frames describing the changing voiced component, unvoiced component, voicing characteristic, amplitude and lip positions over the duration of the entire song. The sequence of analysis frames is referred to hereafter as a vox track and is stored in the memory 19. The analysis techniques employed in generating the vox track preferably includes the same techniques as those employed by the microphone tracker albeit at a much higher sampling rate. In this way the substitution or the combining of data from the vox track and the microphone tracker is simplified. However, the vox track includes additional data on the voiced and unvoiced components which is sufficient to enable the artist's performance to be reproduced. For example, for an unvoiced component the envelope function of the sample waveform may be determined by interpolating between local maxima identified along the frequency axis of the Fourier transform. Although the template means 17 is described with both the analyser

18 and the memory 19, where the voice morpher is to be used as a type of karaoke machine the template means 17 may only consist of the memory

19 in which is stored a plurality of encodings of different songs. The memory 19 may employ conventional means for data storage such as laser discs.

In a further alternative, a single analyser may be employed to function as both the analyser for the microphone tracker 10 and the template means 17. The template data may be compressed to maximise the amount of data stored. For example, variable resolution of the data analysis may be employed so that a continuous sound such as silence or a voiced component lasting up to 0.5 seconds could be recorded as a single data entry. Where compression of this nature is employed the need for accurate data on the location of the template data within the song is essential.

The voice synthesiser 20

The voice synthesiser 20 oversees the interaction between the microphone tracker 10, the accompaniment means 14, the stored encoding of the vox track 19 and a locator 21. Upon request by the user the voice synthesiser 20 resets and starts the timer 22. This causes the MIDI sequencer 15 to pass its encoded MIDI signals to the MIDI synthesiser 16 at a rate determined by the timer 22. The output of the MIDI synthesiser is sent to the speakers 24 causing the accompaniment to the current song to be played. Upon hearing the accompaniment, at the appropriate time, the user sings the current song into the microphone 11. The output from the microphone tracker 10 in the form of vocal attributes describing the user's voice which are generated at regular intervals of typically 0.02 seconds, i.e. substantially in real time, are input into the locator 21 of the voice synthesiser 20.

The locator 21 is connected to two output buffers 25, 26 via a synthesis device 23. When either output buffer 25 or 26 is empty (initially both of them), the empty buffer sends a request to the locator 21 for data. In its simplest form the locator 21 queries the timer 22 to determine the current temporal location of the accompaniment and retrieves from the memory 19 the analysis frame within the vox track which lies closest to the current time T, hereafter referred to as the current analysis frame. The locator 21 therefore applies a linear mapping between the temporal location of the accompaniment and the position of the current analysis frame in the vox track.

It is possible, however, that the user may sing words in the song at a slightly different rate to the artist. For example, a vowel may be prolonged or a consonant reached early. To cater for this effect and so improve synchronisation between the user's voice and the artist's vox track, the locator 21 may apply a non-linear mapping between the accompaniment and the vox track. This operates by comparing the vocal attributes input from the microphone tracker 10 and a plurality of neighbouring analysis frames in the vox track about time T. If, for example, the user changes from a voiced to an unvoiced sound earlier than as stored in the vox track, the rate of advance through the vox track may be accelerated. Similarly, if for example the user changes from a voiced to an unvoiced sound later than as forecast by the vox track, the rate of advance through the vox track may be deccelerated. Thus, the rate of advance along the vox track can be controlled by the user's voice. The locator 21 may also use information describing the lip position of the user and those described in the analysis frames of the vox track to improve synchronisation. The current analysis frame selected by the locator 21 is input to the synthesis device 23 which also receives the vocal attributes from the microphone tracker 10. The synthesis device 23 then generates a waveform using a combination of this data. The voiced and unvoiced components of the waveform are shaped by the voiced and unvoiced components received from the selected analysis frames of the vox track and the vocal attributes of the waveform are determined by the vocal attributes received from the microphone tracker 10. This data is interpolated appropriately to ensure a smooth modification of audio properties in the synthesised waveform using known techniques. The waveform generated by the synthesis device 23 is thus a hybrid of the encoded artist's performance and the user's vocal attributes.

The lip position specified in the current analysis frame input from the locator 21 can be used to provide a graphical illustration of the lip movements in a video display 27.

The audio synthesis routines described are established techniques based upon spectral modelling with additive synthesis, although alternative audio synthesis procedures may be used. The waveform generated by the synthesis device 23 is sent to the currently empty output buffer 25 or 26 which is then queued for output to the speakers 24. The length of the output buffers is typically 500 samples though this may vary between applications. Once the output buffer is played it will again send a request to the locator 21 which will in turn cause more data to be synthesised and sent to it. Meanwhile the other output buffer will be playing the data it has received. In this way the waveforms from the synthesis device 23 are supplied to the buffers 25, 26 alternately and are fed from the buffers 25, 26 to the speakers 24 alternately. This process repeats until the entire accompaniment has been played.

In this way the sound emerging from the speakers consists of the accompaniment and a hybrid performance of the current song which retains the vocal timbre of the artist but incorporates the vocal attributes of the user and the user's temporal progression through the song.

Instead of or in addition to the hybrid performance being supplied to speakers, the waveforms generated by the synthesis device may be stored in a memory or recorded.

In addition to the use of the voice morpher to enable a singer to sound like someone else, the voice morpher may be used with a prerecording of the singer's own voice so that in circumstances where a performer wishes to mime to their own music this can be done without the attendant difficulties with lip synch. The voice morpher also has use in the film industry. In many films speech is separately dubbed after filming. This requires the actors to carefully follow filmed lip movements whilst still ensuring all the necessary expression and vocal dynamics are produced. Using the voice morpher, a poor quality vocal recording is taken at the time of filming and a good quality vocal recording separately produced later which closely follows the film but without the demand for exact synchronisation. The later produced good quality recording is then encoded as the desired voice and the poor quality speech recorded during filming is analysed by the microphone tracker. The resultant final recording is a combination of the high quality sound of the later recording with the expression and intensity of the filmed scene. As the voice morpher automatically ensures substantially exact synchronisation, the need for exact dubbing in post production is removed. Furthermore, more detailed analysis of both vocal tracks may be performed as the demand for real- time operation is removed.

The voice morpher may also be used with other forms of vocal performance works such as speeches, readings and recitals. With works such as these where there is no clear accompaniment, alternative means are employed to identify the exact location within the work. This may be in the form of time cues, supplied to the user preferably in a way which does not interfere with the user's performance of the work.

Claims

1. Apparatus for reproducing a recorded voice with alternative performance attributes and temporal properties comprising template means having an encoding of an artist's performance of a work; vocal attribute means for determining the vocal attributes of a separate performance of the work; a locator for performing temporal mapping between the encoding of the artist's performance and the vocal attributes; and a synthesis device for combining data from the encoding of the artist's performance with one or more of the vocal attributes to produce a hybrid performance of the work.

2. Apparatus as claimed in claim 1 , further including accompaniment means for storing and reproducing an accompaniment to the work.

3. Apparatus as claimed in claim 2, wherein the locator is in communication with the accompaniment means whereby the encoding of the artist's performance is mapped to the vocal attributes with reference to the location of the accompaniment in the work.

4. Apparatus as claimed in any one of claims 1 to 3, wherein the vocal attribute means includes a microphone, an input buffer connected to the output of the microphone and a vocal analyser connected to the input buffer for identification of vocal attributes of the sound signals sequentially received into the input buffer.

5. Apparatus as claimed in claim 4, wherein the vocal analyser includes voicing classification means for classifying the voicing characteristic of each output from the buffer as being either voiced, unvoiced or silent.

6. A method of reproducing a recorded voice with alternative performance attributes and temporal properties comprising: determining vocal attributes of a performance of a work; performing temporal mapping between an encoded artist's performance of the work and the vocal attributes; and combining data from the encoding of the artist's performance with one or more of the vocal attributes to produce a hybrid performance of the work.

7. A method as claimed in claim 6, wherein the steps of determining vocal attributes of a performance, performing temporal mapping between an encoded performance and the vocal attributes and combining data from the encoded performance with one or more of the vocal attributes are performed substantially in real time.

8. A method as claimed in claim 7, wherein an accompaniment is provided to cue the performance of the work.

9. A method as claimed in claim 8, wherein the encoded performance includes cue data and the encoded performance and the vocal attributes are mapped by matching the accompaniment and the cue data.

10. A vocal encoding method for generating an encoding of an artist's performance of a work, the method comprising sampling the artist's vocal performance of the work to create a plurality of samples, with each sample partially overlapping at least one other sample; analysing each sample to extract data representative of the vocal performance from each of the plurality of samples; and generating a vox encoding consisting of the extracted data identified with respect to the location of the respective sample in the work.

1 1. A vocal encoding method as claimed in claim 10, wherein each sample is analysed to determine the maximum amplitude of the sample.

12. A vocal encoding method as claimed in either of claims 0 or 1 1 , wherein each sample is analysed to classify the voicing characteristic of the sample as being either voiced, unvoiced or silent.

13. A vocal encoding method as claimed in claim 12, wherein classification of the voicing characteristic is performed using at least one of zero- crossing and amplitude analysis.

14. A vocal encoding method as claimed in either of claims 12 or 13, wherein each sample classified as having a voiced characteristic is analysed to determine the fundamental frequency of the sample.

15. A vocal encoding method as claimed in claim 14, wherein the fundamental frequency of each sample is determined using cepstral analysis.

16. A vocal encoding method as claimed in claim 15, wherein each sample classified as having a voiced characteristic is additionally analysed to determine the spectral tilt of each of the samples by identification of the best straight-line fit to the Fourier transform of each sample.

17. A vocal encoder for generating an encoding of an artist's performance of a work, the vocal encoder comprising sampling means for dividing the artist's vocal performance of the work into a plurality of samples, each of the samples partially overlapping at least one other sample; and an analyser for separately extracting data representative of the vocal performance from each of the plurality of samples and for generating a vox encoding consisting of the extracted data identified with respect to the location of the respective sample in the work.