GB2422755A - Audio signal processing - Google Patents


Info

Publication number
GB2422755A
GB2422755A (application GB0501744A)
Authority
GB
United Kingdom
Prior art keywords
signal
pitch
feature
time
acoustic feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB0501744A
Other versions
GB0501744D0 (en)
Inventor
Phillip Jeffrey Bloom
William John Ellwood
Jonathan Newland
Current Assignee
Synchro Arts Ltd
Original Assignee
Synchro Arts Ltd
Priority date
Filing date
Publication date
Application filed by Synchro Arts Ltd filed Critical Synchro Arts Ltd
Priority to GB0501744A (GB2422755A)
Publication of GB0501744D0
Priority to PCT/GB2006/000262 (WO2006079813A1)
Priority to DE602006018867T (DE602006018867D1)
Priority to AT06709573T (ATE492013T1)
Priority to PL06709573T (PL1849154T3)
Priority to CN2006800034105A (CN101111884B)
Priority to JP2007552713A (JP5143569B2)
Priority to EP06709573A (EP1849154B1)
Priority to ES06709573T (ES2356476T3)
Publication of GB2422755A

Classifications

    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00 Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/02 Editing, e.g. varying the order of information signals recorded on, or reproduced from, record carriers
    • G11B27/022 Electronic editing of analogue information signals, e.g. audio or video signals
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00 Details of electrophonic musical instruments
    • G10H1/36 Accompaniment arrangements
    • G10H1/361 Recording/reproducing of accompaniment for use with an external source, e.g. karaoke systems
    • G10H1/366 Recording/reproducing of accompaniment for use with an external source, e.g. karaoke systems with means for modifying or correcting the external signal, e.g. pitch correction, reverberation, changing a singer's voice
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003 Changing voice quality, e.g. pitch or formants
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00 Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/10 Indexing; Addressing; Timing or synchronising; Measuring tape travel
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00 Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/10 Indexing; Addressing; Timing or synchronising; Measuring tape travel
    • G11B27/11 Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information not detectable on the record carrier
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2210/00 Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/031 Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
    • G10H2210/066 Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal for pitch analysis as part of wider processing for musical purposes, e.g. transcription, musical performance evaluation; Pitch recognition, e.g. in polyphonic sounds; Estimation or use of missing fundamental
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2240/00 Data organisation or data communication aspects, specifically adapted for electrophonic musical tools or instruments
    • G10H2240/325 Synchronizing two or more audio tracks or files according to musical features or musical timings
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2250/00 Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
    • G10H2250/025 Envelope processing of music signals in, e.g. time domain, transform domain or cepstrum domain
    • G10H2250/031 Spectrum envelope processing

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Electrophonic Musical Instruments (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)
  • Diaphragms For Electromechanical Transducers (AREA)
  • Soundproofing, Sound Blocking, And Sound Damping (AREA)

Abstract

A digitised audio signal 310, such as an amateur's singing, and a digital guide audio signal 312 are supplied to a time alignment process 320 that produces a new signal 330 time-aligned to the guide signal. Pitch along the time-aligned new signal 330 and along the guide signal 312 is measured in processes 340 and 345, which supply these measurements to a pitch adjustment calculator 370; this calculates a pitch correction factor C'(Fps) from the measurements and the nearest octave ratio of the signals. A pitch changing process 380 modifies the pitch of the time-aligned new signal 330 to produce a time-aligned and pitch-adjusted new signal 390.
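As a rough illustration of the calculation the abstract describes, the per-frame correction factor might be formed as the ratio of guide pitch to aligned new-signal pitch, folded by the nearest whole-octave ratio. The function name, the zero-means-unvoiced convention and the folding step are assumptions for this sketch, not details taken from the patent:

```python
import numpy as np

def correction_factor(guide_f0, aligned_new_f0):
    """Per-frame pitch correction factor (guide / new), folded by
    the nearest whole-octave ratio so the user's natural register
    is preserved (a hypothetical simplification of C'(Fps))."""
    guide_f0 = np.asarray(guide_f0, dtype=float)
    new_f0 = np.asarray(aligned_new_f0, dtype=float)
    ratio = np.ones_like(new_f0)                  # unity = leave frame alone
    voiced = (guide_f0 > 0) & (new_f0 > 0)        # 0 marks unvoiced frames
    raw = guide_f0[voiced] / new_f0[voiced]
    # fold the raw ratio by the nearest power-of-two (octave) factor
    octave = 2.0 ** np.round(np.log2(raw))
    ratio[voiced] = raw / octave
    return ratio
```

Applying such a factor frame by frame in a pitch shifter would make the new signal's pitch contour track the guide's while keeping the singer's own octave.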

Description

Methods and Apparatus for use in Sound Modification

P. J. Bloom, W. J. Ellwood, J. Newland

The present invention relates to sound modification, and more specifically to solving the problem of modifying one signal based on features in another signal, where the relative timing of specified features in both signals must be established before feature modification can be applied. This invention provides methods and apparatus to automatically modify one or more signal characteristics of a second audio signal to be a function of specified features in a first audio signal, especially in the case in which corresponding features of the first and second audio signals are initially not time-aligned. It has applications in audio processing that require the replacement of one audio signal with another and/or the creation of a new audio signal to be added to others. It also allows complex audio characteristics of a professional performance to be transferred to, and thereby enhance, the audio performance of a less skilled person.
One specific example application is automatically adjusting the pitch of a new audio signal ("New Signal") to follow the pitch of another ("Guide") audio signal. A primary benefit of this invention is that the resulting pitch-modified audio signal can also be automatically synchronized to a Guide track sound recording to provide a time-aligned replacement for the Guide track. Moreover, if time-aligned, the replacement track would consequently have the same lip-sync properties as the Guide audio and can be used to accompany a corresponding moving image. An application of this invention is within a karaoke-style recording and playback system using digitized music videos as the original source. In this system, during playback of the original audio and optional corresponding video, the user's voice is digitized and input to the apparatus (as a new recording). With this invention, a new voice signal is created that is time- and pitch-corrected automatically, such that when the modified voice signal is played back synchronously with the original video, the user's voice can accurately replace one of the original performers' voices in terms of both pitch and time. The playback of the karaoke performance will be even more effective if the original voice track is turned off and replaced by the user's modified voice.
The benefits of this system are firstly that the features of the New audio signal do not initially have to be precisely in time with the original Guide audio signal. Secondly, in many cases, no set of rules for adjustment of the New Signal needs to be pre-defined. For example, if the pitch of the New Signal is to be corrected to match the pitch of the original singer, the acceptable pitch values do not need to be defined or set. Instead, the user's voice will be adjusted to the values that are already present in the original voice recording.
Another benefit of this method is that the New Signal is not restricted to resembling the Guide Signal, nor need it be generated by the same type of acoustic process as the Guide Signal. For example, monotonic speech could be time- and pitch-modified to follow a solo woodwind instrument or a bird chirping. As long as both signals have some time-varying features that can be deemed to be related, this method will be capable of creating an output waveform with appropriately modified properties. Furthermore, features of the New Signal and the Guide Signal may be offset in frequency from one another. For example, the pitch of one signal may be an octave or more apart from that of the other.
It should also be noted that one or both audio signals may be in the ultrasound or infrasound regions.
A further important and novel benefit is that the complex and skilled pitch variations (and, optionally other characteristics) found in the performance of a professional singer can be completely transferred to the voice of a user (e.g. amateur) singer, thereby enhancing many aspects of the user's digitized performance to the professional's level. A further application is in the field of automatic dialogue replacement or ADR in which this invention could be utilized to enhance a studio-recorded performance by modifying characteristics such as pitch, energy level and prosodic features to match or follow those of the original actor's guide signal recorded on set with the image.
In addition, the system is flexible in the range of processes that can be applied. For example, in the case of pitch adjusting, further pitch changing functions, such as time-aligned harmony generation, can be introduced as functions of the pitch adjustment function to create alternative output signals. Additionally, one measured feature in the Guide Signal can be mapped by an arbitrary function to control another entirely different feature in the New Signal.
Furthermore, in an alternative embodiment, a modified version of the New Signal can be produced that is not time-modified to align with the Guide Signal, but has still been modified according to specified features of the Guide Signal which have instead been time-mapped to be applied to the corresponding parts of the New Signal.
It is well known that it is difficult for a member of the public (and often even professionals) to speak or sing along with an audio or audio / video clip such that the new performance is a precisely synchronised repetition of the original actor or singer's words. Consequently, a recording of the new performance in such circumstances is very unlikely to have its start and detailed acoustic properties synchronized with those of the original audio track. Similarly, features such as the pitch of a new singer are not likely to be as accurate or intricately varied as those of the original singer.
There are many instances, in the professional audio recording industry and in consumer computer-based games and activities such as karaoke, in which a sound recording is made of a voice whose musical pitch would benefit from adjustment, generally meaning correction to put it in tune. In addition, as mentioned above, a recording of a typical amateur singing, even in tune, will not have the skilful vocal style and capabilities of a professional singer.
Musical note-by-note pitch adjustment can be applied automatically to recorded or live singing by commercially available hardware and software devices, which generally tune incoming notes to specified fixed grids of acceptable note pitches. In such systems, each output note can be corrected automatically, but this approach often leads to unacceptable or unpleasing results because it can remove natural and desirable "human" variations. The fundamental basis for target pitch identification in the known software and hardware devices is a musical scale, which is essentially a list of the specific note frequencies to which the device should first compare the input. Most devices come with presets for standard musical scales and allow customisation of these, for example to change the target pitches or to leave certain pitched notes unaltered.
The known software devices can be set to an automatic mode, which is also generally how the hardware devices work: the device detects the input pitch, identifies the closest note in a user-specified preset scale, and changes the input signal such that the output pitch matches that scale note's pitch. The rate at which the output pitch is slewed and retuned to the target pitch, sometimes described as "speed", is controlled to help maintain natural pitch contours (i.e. pitch as a function of time) and to allow a wider variety of "styles".
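A minimal sketch of this kind of scale-grid, speed-controlled retuning follows; the scale choice, reference frequency and the particular slewing rule are illustrative assumptions, not the behaviour of any specific commercial device:

```python
import numpy as np

# Equal-tempered C-major scale degrees, in semitones above C
SCALE = np.array([0, 2, 4, 5, 7, 9, 11])

def snap_to_scale(f0_track, speed=0.3, ref=261.63):
    """Retune each frame's pitch (Hz) toward the nearest scale note.
    `speed` in (0, 1]: 1 snaps instantly; smaller values slew the
    correction gradually, preserving natural pitch contours."""
    out = []
    current = 0.0                                 # applied correction, semitones
    for f0 in f0_track:
        semis = 12.0 * np.log2(f0 / ref)          # pitch in semitones above ref
        octave, degree = divmod(semis, 12.0)
        target = SCALE[np.argmin(np.abs(SCALE - degree))] + 12.0 * octave
        # slew the applied correction toward the full correction
        current += speed * ((target - semis) - current)
        out.append(ref * 2.0 ** ((semis + current) / 12.0))
    return out
```

With `speed=1.0` each frame lands exactly on the grid, which is what produces the "robotic" over-corrected sound the text goes on to criticise.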
However, the recorded singing of an amateur cannot be enhanced by such known automatic adjustment techniques to achieve the complex and skilled pitch variations found in the performance of a professional singer.
Essentially, existing pitch-processing devices, even those driven by pitch-control scores, cannot apply these changes automatically by comparison with a pre-recorded performance. We have realised that this shortcoming arises because control signals coming from an arbitrary source are unlikely to be sufficiently synchronised with the incoming sound containing the pitch to be corrected.
There exists, therefore, the need for a method and apparatus that both establishes a timing relationship between the time-varying features of a new vocal performance and corresponding features of a guide vocal performance and uses this timing alignment path as a time map to provide the pitch adjustments correctly to the new vocal performance at precisely the right times. If done correctly, this permits all the nuances and complexity found in the guide vocal performance including vibrato, inflection curves, glides, jumps, etc. to be effectively transferred from the guide vocal performance and be imposed on the new vocal performance. Furthermore, we have realized that other features in addition to or as an alternative to pitch can be controlled, for example glottal characteristics (e.g. breathy or raspy voice), vocal tract resonances, EQ, and others, if time alignment is applied.
Embodiments of the present invention can be used in cases where an original Guide audio signal exists whose pitch provides a time-varying target to which the New Signal's pitch is to be adjusted. Because the detailed features of the New Signal are unlikely to be sufficiently time-aligned with the corresponding features of the Guide Signal, these embodiments include a means of first aligning relevant acoustic features in the New Signal to corresponding features in the Guide Signal before pitch corrections are made. Without this critical step, any pitch or other adjustments made to the New Signal would be highly likely to be applied to the wrong parts of the signal.
This invention, therefore, in one embodiment, provides a first step of creating automatically a time-aligned version of the New Signal whereby time-varying acoustic features (such as short-term spectral energy patterns measured from a digitized recording of the New Signal) are made to align with corresponding features in a digitized and recorded Guide signal before the pitch adjustments to the time-aligned New Signal are made. In an alternative embodiment, the time alignment function can inversely be utilized to map pitch contours from the Guide to the appropriate times in the New Signal. After the pitch changes are made to the New Signal, it may optionally be further edited to have the timing of the Guide signal.
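The alternative embodiment, in which the alignment function is used inversely to carry the Guide's pitch contour onto the New Signal's own time base, might be sketched as follows. A piecewise-linear warp path of (new_frame, guide_frame) pairs and a common frame rate for both feature tracks are assumptions of this sketch:

```python
import numpy as np

def map_guide_feature(warp_path, guide_feature, n_new_frames):
    """Map a per-frame Guide feature (e.g. its pitch contour) onto
    the New Signal's own time base via the time alignment path.

    warp_path: (new_frame, guide_frame) pairs from the time
    alignment process, monotonically increasing in both columns."""
    new_idx = np.array([p[0] for p in warp_path], dtype=float)
    guide_idx = np.array([p[1] for p in warp_path], dtype=float)
    # For every New Signal frame, find the corresponding Guide frame
    guide_pos = np.interp(np.arange(n_new_frames), new_idx, guide_idx)
    # Read the Guide feature at that (fractional) frame position
    return np.interp(guide_pos, np.arange(len(guide_feature)),
                     np.asarray(guide_feature, dtype=float))
```

The resulting contour can then drive the modification process at the correct points of the unwarped New Signal, so that its original timing is kept.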
There already exist systems and methods for time alignment of audio signals. A method and apparatus for automatically time-aligning one audio or speech signal to another has been described in GB2117168 and US4591928 (Bloom et al.). Other techniques for time alignment are described in J. Holmes and W. Holmes (2001), "Speech Synthesis and Recognition, 2nd Edition", Taylor and Francis, London. A method and apparatus providing the means to replace original voices in digitized film or video clips with automatically lip-synced user recordings has been described in Bloom and Ellwood (WO2004040576). In this known system, the automatic substitution of the edited output signal is achieved by muting the original audio signal (containing the singing and background music), playing back a specially prepared audio track which omits the vocal signal being replaced, and simultaneously playing back (starting at the correct time) the user's new aligned track(s), which is (are) mixed and played in correct synchronism with the specially prepared track and, if available, an accompanying video signal.
It is hence known that such a system can be implemented as, for example, an integrated computer program which runs on a PC or games machine. The known system also provides an automatic means both for indicating to the user when to record the replacement signal and for providing visual cues to the timing of the main acoustical events such as words.
However, whilst providing elements of the processing and an environment for recording and playback of time-modified signals, neither of these known lip sync systems modifies the recorded replacement voice other than by time modification (non-linear time compression and expansion) and optional well-known standard audio processing such as equalization, reverberation and simple distortion.
This present invention, therefore, extends the opportunities and possibilities for automatic audio feature modification, in particular, for example, pitch correction and/or modification, beyond those presently available by other methods.
In further embodiments, other features of a sound signal besides pitch can be modified to follow those in a Guide Signal, once a time alignment function has been created. The additional types of time-synchronous modifiable features include the modification of sound signal features such as instantaneous loudness, equalization, speech formant or resonant patterns, reverberation and echo characteristics, and even words themselves, given a suitable mechanism for analysis and modification of the specified feature is available.
In the present invention, a video signal is not necessary, and the new audio signal may simply accompany or replace another audio signal.
Brief Description of the Drawings
FIG. 1 is a block diagram of a computer system suitable for use in implementing the present invention.
FIG. 2 is a block diagram showing additional software components that can be added to the computer in FIG. 1 to implement the present invention.
FIG. 3 is a block diagram of one embodiment of the present invention showing the signals and processing modules used to create an output audio signal with pitch adjustments based on an input signal with different pitch and timing characteristics.
FIG. 4 is a graph showing an example of pitch measurements as a function of time for a professional recorded Guide singer's voice and the same measurements on a recorded New Signal from an untrained user before time alignment and pitch correction.
FIG. 5 is a graph representing a Time Warping or Alignment path.
FIG. 6 is a graph showing, against the left frequency axis, the pitch of the Guide Signal and the Aligned New Signal pitch from FIG. 4 (before pitch correction), with the computed smoothed pitch Correction Factor plotted against the right vertical axis.
FIG. 7 is a graph of the pitch of the Guide Signal and the Corrected New Signal pitch that was shown uncorrected in FIG. 6.
FIG. 8 is a block diagram of another embodiment of the present invention showing the signals and processing modules used to create an output audio signal with any general signal feature modifications based on time-aligned features of an arbitrary input signal.
FIG. 9 is a block diagram of a further embodiment having in accordance with the present invention processing in which the features of the New Signal are modified with or without simultaneous time alignment to a Guide Signal.
FIG. 10(a) is a graphic representation of an example of the relative positions and shapes of the analysis windows used to decompose a signal s'(n) into sections.
FIG. 10(b) is a graphic representation of an example of the relative positions and shapes of the synthesis windows used to create a signal s"(n) using overlap-and-add synthesis.
Computer systems capable of recording sound input whilst simultaneously playing back sound and/or video signals from digitized computer video and audio files have been described in WO2004040576 (Bloom and Ellwood). The components of a typical PC system and environment that can support these functions are shown in FIG. 1 and can be used with the software in FIG. 2 as the basis of the hardware and software environment for one embodiment of the present invention.
In FIG. 1 of the accompanying drawings a conventional computer system 100 is shown, which consists of a computer 110 with a CPU (Central Processing Unit) 112, RAM (Random Access Memory) 118, user interface hardware typically including a pointing device 120 such as a mouse, a keyboard 125 and a display screen 130, an internal storage device 140 such as a hard disk or further RAM, a device 160 for accessing data on fixed or removable storage media 165 such as a CD-ROM or DVD-ROM, and optionally a modem or network interface to provide access to the Internet 175. The pointing device 120 controls the position of a displayed screen cursor (not shown) and the selection of functions displayed on the screen 130.
The computer 110 may be any conventional home or business computer such as a PC or Apple Macintosh, or alternatively one of the latest dedicated "games machines" such as a Microsoft Xbox™ or Sony PlayStation 2™, with the pointing device 120 then being a game controller device. Some components shown in FIG. 1 may be absent from a particular games machine. FIG. 2 illustrates software that may be installed in the computer 110.
A user may obtain from a CD ROM, the Internet, or other means, a digital data file 115 containing an audio and optional accompanying video clip which, for example, could be in a common format such as the avi or QuickTime movie format and which is, for example, copied and stored on the hard disk 140 or into RAM. The computer 110 has a known operating system 135 such as that provided by any of the available versions of Microsoft Windows or Mac OS, audio software and hardware in the form of a sound card 150 or equivalent hardware on the computer's mother board, containing an ADC (Analogue to Digital Converter) to which is connected a microphone 159 for recording and containing a DAC (Digital to Analogue Converter) to which is connected one or more loudspeakers 156 for playing back audio.
As illustrated in FIG. 2, such an operating system 135 is generally shipped with audio recording and editing software 180 that supports audio recording via the sound card 150 and editing functions, such as the "Sound Recorder" application program shipped with Windows.
The recording program can use sound card 150 to convert an incoming analog audio signal into digital audio data and record that data in a computer file on the hard disk drive 140.
Audio/video player software 190, such as the Windows Media Player shipped with Windows, is used for playing composite digital video and audio files, or just audio files, through the sound card 150, further built-in video hardware and software, the display screen 130 and the speakers 156. Composite video and audio files consist of video data and one or more parallel synchronized tracks of audio data. Alternatively, audio data may be held as separate files allocated to store multiple streams of audio data. The audio data may be voice data such as dialogue or singing, instrumental music, "sound effects", or any combination of these. Blocks 180 and 190 can also, in concert with 135 and 110, represent the software and hardware that can implement the system described in Bloom and Ellwood (WO2004040576) or the signal processing systems that will be described herein.
It is an object of the present invention to provide a method and apparatus for automatically altering selected features of a second digitized audio signal to match or, alternatively, be a specified function of selected features of a first digitized audio signal.
Accordingly, in this invention, this goal is achieved in part by introducing a means for determining a time alignment function, or time-warping path, that provides an optimal time mapping between the time-varying features of the second audio signal and the corresponding time-varying features in the first audio signal. This mapping ensures that the time-varying alterations are based on the specified features in the portion of the first (control) signal that corresponds to the appropriate portion of the second signal being modified.
Measurements of the specific time-varying features used for determining the time alignment (which, it is important to note, are likely to be different features from both those being altered and those used as a control) are made every T seconds, on short portions or windows of the sampled signal waveforms, each window being of duration T', where T' may differ from T. Measurements are made on a successive frame-by-frame basis, usually with the sampling windows overlapping. This is standard for "short-time" signal analysis, as described in classic references such as L.R. Rabiner and R.W. Schafer (1978), "Digital Processing of Speech Signals", Prentice Hall.
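Such short-time, frame-by-frame measurement might be sketched as below; the hop and window durations (T = 10 ms, T' = 25 ms) and the Hann window are conventional illustrative choices, not values specified by the patent:

```python
import numpy as np

def frame_signal(x, fs, hop_s=0.010, win_s=0.025):
    """Split a sampled waveform into overlapping analysis frames:
    one frame every T = hop_s seconds, each of duration T' = win_s,
    Hann-windowed as is standard for short-time analysis."""
    hop = int(round(hop_s * fs))
    win = int(round(win_s * fs))
    window = np.hanning(win)
    n_frames = 1 + (len(x) - win) // hop          # frames fully inside x
    return np.stack([x[i * hop:i * hop + win] * window
                     for i in range(n_frames)])
```

Each row of the result is one analysis window, from which a feature (short-term spectral energy, pitch, etc.) can then be measured.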
Before processing begins, a function describing the initial relationship between the altered feature parameters and the control feature parameters must be defined and input to the system. For example, the modification function might be to set the pitch of the modified signal to match that of the Guide signal. This definition of the modification function can itself be varied with time if desired. The modification function can be saved as a data array of output vs input values, or as a mathematical function or set of processing rules in the audio processing computer system.
In further steps, the specified feature to be modified in the second signal and the specified control feature in the first signal are both measured as functions of time and these measurements are stored as data.
In the next step, the time alignment function is used to map the control feature function data to the desired signal modification process, which accesses the second digitized signal and modifies it as required, creating a new third audio signal from the second audio signal, with the third signal having the desired time-varying features determined by the specified features of the first audio signal.
In one embodiment, the new signal is time-modified (non-linearly time compressed or expanded) using the mapping information from the time alignment function so that its time-varying features align in time with the first audio signal. This time alignment can take place before or after the desired modifications described above have taken place.
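A crude overlap-add sketch of such non-linear time compression/expansion driven by the warp path is given below. The frame sizes are assumptions, and real systems add refinements (e.g. waveform-similarity searches as in WSOLA, or pitch-synchronous processing) to avoid phase discontinuities between overlapped frames:

```python
import numpy as np

def warp_timing(x, warp_path, fs, hop_s=0.010, win_s=0.025):
    """Naive overlap-add time warping: for each output frame time
    (on the Guide time base), fetch the input frame the alignment
    path maps it to, window it, and overlap-add into the output."""
    hop = int(round(hop_s * fs))
    win = int(round(win_s * fs))
    window = np.hanning(win)
    in_t = np.array([p[0] for p in warp_path], dtype=float)   # New Signal frames
    out_t = np.array([p[1] for p in warp_path], dtype=float)  # Guide frames
    n_out = int(out_t[-1]) + 1
    y = np.zeros(n_out * hop + win)
    norm = np.zeros_like(y)
    for j in range(n_out):
        src = int(np.interp(j, out_t, in_t)) * hop            # mapped input frame
        frame = x[src:src + win]
        y[j * hop:j * hop + len(frame)] += frame * window[:len(frame)]
        norm[j * hop:j * hop + len(frame)] += window[:len(frame)]
    return y / np.maximum(norm, 1e-8)             # normalise the window overlap
```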
In an alternative embodiment, the time alignment process is not performed on the new or modified waveform, but the time-warping path is used to map the Guide Signal audio control parameters to the New Signal modification processor in order to affect the appropriate parts of the New Signal's waveform and keep its original timing.
A specific example of this invention can be applied in a consumer karaoke product that lets the consumer record their voice singing a pop song to a music video in a computer-based system. When the user's recorded voice is modified and then played back, the modified voice is both lip-synchronized to the original singer's mouth movements and has the same pitch variation as the replaced singer's voice in the music video. Similarly, this process can be applied to speech recordings in Automatic Dialogue Replacement and lip-syncing, for example to add a special vocal characteristic achieved in a Target Voice to another actor's recorded voice.
Detailed Description of Preferred Embodiments
The system of Figs. 1 and 2 allows the audio playback of the original performer singing a song with or without an accompanying video. The user can play back the song and the system will digitize and record (store) the user's voice onto the computer's hard disk or other memory device. As there is a requirement to measure accurately features of the original singer's voice, it is preferred to have that voice signal separate from the backing music track.
This can most effectively be achieved by requesting the isolated voice from the record company or organization providing the media content. Alternatively, there are signal processing methods for extracting a voice from other audio (such as US Patent 5,960,391, Tateishi et al., 1999) which could provide a Guide audio signal. However, these techniques often require either considerable "training data" to establish the signal or noise characteristics, or the "noise" component to be constrained to have limited variability; such methods may therefore not provide sufficiently reliable or high-quality audio data from which to make precise measurements.
In the present embodiment a Guide signal is used which is a digitized recording of the singer performing a song in isolation (e.g. the solo vocal track transferred from a multi-track recording from the original recording session), preferably without added processing such as echo or reverberation. Such digitized Guide signals, g(n), can be provided to the user's system on CD or DVD/ROM 165 or via the Internet 175. Alternatively, the required features of a Guide signal (for both the time alignment and for the feature modification control) can be pre-analysed in another system to extract the required data, and this data can be input to the system 100 for use as data files via 165, 175 or via other data transfer methods. Data stores and processing modules of the embodiment are shown in Fig. 3. The user, operating a sound recording and playback program such as that described in WO2004040576 (Bloom and Ellwood), can play the desired song with the original singer audible or not audible and sing at the same time. The user's singing is digitized and recorded into a data file in a data store 310. This digitized signal is the New Signal, s(n).
FIG. 3 is a block diagram showing the preferred embodiment for the invention being applied to correcting the pitch and timing of the user's New Signal to mimic the pitch and timing of the Guide Signal. In this case, the feature in the Guide signal being used as a control function and the feature being modified in the New Signal are the same feature, namely the pitch contour of the respective signal. A process tracking the differences between time-aligned New Signal pitch measurements and the Guide Signal pitch measurements is used to compute a pitch adjustment function to make a modified New Signal's pitch follow that of the Guide Signal.
It is assumed that the New Signal, s(n) is similar in phrasing, content and length to the Guide Signal, g(n). For an application such as Karaoke, this is a reasonable assumption, because the user is normally trying to mimic the original vocal performance in timing, pitch, and words.
The following describes the main steps, which can be performed on the digital audio data in non-real time (i.e. off line). Using additional input and output signal buffering, the process could also be performed in real time.
In alternative embodiments, features in the New Signal do not have to be measured or input to the New Signal feature adjustment calculations, and the New Signal can simply be modified based on measurements of a feature or features of the Guide Signal. An example of this could be the application to the New Signal of reverberation or EQ that is a function of those features in the Guide Signal.
Input Signal Description and Measurement
The New Signal and the Guide Signal are highly unlikely to be adequately time aligned without processing. References, including US 4,591,928 (Bloom et al.), describe the differences between the energy patterns of non-time-aligned but similar speech signals and the use of energy-related measurements, such as filterbank outputs, as input to a time alignment process.
FIG. 4 illustrates, for the purposes of explanation only, the time series, referred to hereinafter as a pitch contour 401, obtained by measuring the pitch of a professional female singer's Guide Signal, Pg(M), as a function of measurement frame M, where M = 0, 1, 2, ... N, and the pitch contour 402 of a typical amateur's New Signal (male voice), Ps(M), before time alignment along the same time scale. This figure not only shows the differences in the pitch contours of both signals but also their misalignment in time. A first signal which is not aligned in time with a second signal cannot be directly used as a control or target pitch function for the second signal.
A data point shown as zero in a pitch contour 401 or 402 indicates the corresponding pitch measurement frame contains either silence or unvoiced speech. The non-zero measurements indicate the pitch measurement of the respective signal in that frame.
In Fig. 4 the non-zero value segments (pulses) of voiced sound in the New Signal pitch contour 402 generally both lag behind the corresponding features in the Guide Signal pitch contour 401 and have different durations. Also, the voiced sounds of the two pitch contours are in different octaves. A further point to note is that the pitch range variation in each Guide Pitch contour pulse is much wider than in the corresponding pulse in the New Signal's pitch contour. This is expected, since the Guide Pitch contour 401 is taken from a professional singer. It is such details, and the timing of the Guide pitch contour, that this invention will impose on the user's recorded singing.
Step 1 - Time Alignment of New Signal
In this embodiment shown in FIG. 3, the sampled New Signal waveform, s(n), read from data store 310, is first aligned in time to the Guide Signal, g(n), read from data store 312, using a technique such as that described in US 4,591,928, to create an intermediate audio signal, the Time-Aligned New Signal, s'(n), which is stored, e.g. on disk 330. This ensures that the details of the energy patterns in s'(n) occur at the same relative times as those in the Guide Signal. It further ensures that any required lip-syncing will be effective and that any transfer of features from the Guide Signal to the New Signal needs no further time mapping. The sampling frequency used in creating the New Signal, s(n), and the Guide Signal, g(n), in this example is 44.1 kHz.
The Time Alignment process described in US 4,591,928 measures spectral energy features (e.g. a filterbank output) every 10 ms, and generates a time alignment or "time warping" path, with a sample every 10 ms, that associates similar spectral features in the New Signal with the closest corresponding features in the Guide Signal.
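By way of illustration, the kind of time-warping path produced by aligning two sequences of spectral features can be sketched with a basic dynamic-programming alignment. This is a generic DTW sketch, not the specific constrained algorithm of US 4,591,928; the Euclidean frame distance and the unconstrained three-way step pattern are assumptions.

```python
import numpy as np

def warping_path(guide_feats, new_feats):
    """Dynamic-programming alignment of two feature sequences.

    guide_feats, new_feats: arrays of shape (frames, bands), e.g. filterbank
    energies sampled every 10 ms.  Returns W, where W[k] is the New Signal
    frame matched to frame k of the Guide Signal.
    """
    G, N = len(guide_feats), len(new_feats)
    # Pairwise Euclidean distances between all frame pairs.
    dist = np.linalg.norm(
        guide_feats[:, None, :] - new_feats[None, :, :], axis=2)
    cost = np.full((G, N), np.inf)
    cost[0, 0] = dist[0, 0]
    for i in range(G):
        for j in range(N):
            if i == 0 and j == 0:
                continue
            best = min(
                cost[i - 1, j] if i > 0 else np.inf,
                cost[i, j - 1] if j > 0 else np.inf,
                cost[i - 1, j - 1] if i > 0 and j > 0 else np.inf)
            cost[i, j] = dist[i, j] + best
    # Trace back from the end to recover the minimum-cost path.
    path, (i, j) = [], (G - 1, N - 1)
    while True:
        path.append((i, j))
        if i == 0 and j == 0:
            break
        candidates = [(a, b) for a, b in
                      [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
                      if a >= 0 and b >= 0]
        i, j = min(candidates, key=lambda ij: cost[ij])
    path.reverse()
    W = np.zeros(G, dtype=int)
    for i, j in path:
        W[i] = j    # last New-Signal frame matched to Guide frame i
    return W
```

For identical feature sequences the recovered path is the diagonal, i.e. W is the identity mapping.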
FIG. 5 shows an example of a time warping path, W(k), k = 0, 1, 2, ..., with k sampled every 10 ms. Such a warping path is created within a time-alignment processing module 320, and this path is used to control the editing (i.e. Time-Compression/-Expansion) in the module 320 of s(n) in the creation of the time-aligned New Signal s'(n) stored on disk 330. As described in US 4,591,928, the Time-Aligned New Signal, s'(n), is created by the module 320 by building up an edited version of s(n) in which portions of s(n) have been repeated or deleted according to W(k) and additional timing error feedback from the editing system, which is constrained to making pitch-synchronous edits when there is voiced sound.
Step 2 - Generate Pitch Contour of New Signal
The raw pitch contour, Ps'(M), of the aligned New Signal, s'(n), is created from measurements of s'(n) taken using a moving analysis Hann window in consecutive discrete pitch measurement frames, M = 1, 2, 3, ... To obtain accurate pitch measurements it is recommended that the length of the analysis window be 2.5 to 3.0 times the length of the longest period being measured. Therefore, in the current embodiment, to measure pitch as low as 72 Hz, with a period of approximately 0.0139 s, a 1536-sample (at 44.1 kHz sampling frequency) analysis window (approximately 35 ms) is used. The analysis window of the pitch estimator module 340 is centred in each pitch measurement frame of samples. For each pitch measurement frame, an estimate is made of the pitch using one of the well-known methods for pitch estimation (e.g. auto-correlation, comb filtering, etc.). Detailed descriptions of these techniques can be found in references such as Wolfgang Hess (1983), "Pitch Determination of Speech Signals: Algorithms and Devices," Springer-Verlag; R.J. McAulay and T.F. Quatieri (1990), "Pitch estimation and voicing detection based on a sinusoidal model," Proc. Int. Conf. on Acoustics, Speech and Signal Processing, Albuquerque, NM, pp. 249-252; and T.F. Quatieri (2002), "Discrete-Time Speech Signal Processing: Principles and Practice," Prentice Hall.
The measurements may be taken without overlap of analysis windows, but an overlap of the successive windowed data of between 25 and 50% is generally recommended. In this embodiment, the measurement frame rate of M is 100 Hz (i.e. 10 ms intervals), which provides a sufficient overlap and also coincides conveniently with the measurement of the time alignment function. In order to make the first and last few measurements correctly, in which the analysis window necessarily extends beyond the available data samples, we pad both the start and end of the signal with up to one analysis window's length of zero-magnitude samples before taking those measurements.
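A minimal sketch of such a framewise autocorrelation pitch estimator, using the window and frame parameters given above, might look as follows. The voicing threshold of 0.3 and the upper search limit of 1000 Hz are illustrative assumptions, not taken from this description.

```python
import numpy as np

FS = 44100          # sampling frequency, Hz
WIN = 1536          # analysis window, ~35 ms (2.5-3x the longest period)
HOP = 441           # 10 ms measurement frame interval (100 Hz frame rate)
F_MIN, F_MAX = 72.0, 1000.0   # assumed pitch search range

def pitch_contour(x):
    """Raw pitch contour by windowed autocorrelation; 0 marks unvoiced/silence."""
    # Pad both ends so the first and last frames can be measured.
    x = np.concatenate([np.zeros(WIN), x, np.zeros(WIN)])
    window = np.hanning(WIN)
    lag_min = int(FS / F_MAX)
    lag_max = int(FS / F_MIN)
    contour = []
    for start in range(0, len(x) - WIN, HOP):
        frame = x[start:start + WIN] * window
        # One-sided autocorrelation, lags 0 .. WIN-1.
        ac = np.correlate(frame, frame, mode="full")[WIN - 1:]
        if ac[0] <= 0:
            contour.append(0.0)          # silent frame
            continue
        lag = lag_min + int(np.argmax(ac[lag_min:lag_max]))
        if ac[lag] / ac[0] < 0.3:        # assumed voicing threshold
            contour.append(0.0)          # unvoiced frame
        else:
            contour.append(FS / lag)
    return np.array(contour)
```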
To create a final smoothed pitch contour, P's'(M), for the time-aligned New Signal, the pitch measurements of the individual frames are smoothed at a filter module 350 using a 3-point median filter followed by an averaging filter. In addition, silence and unvoiced frames of the time-aligned New Signal s'(n) are marked in P's'(M) as having zero pitch.
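The smoothing stage might be sketched as follows. The averaging-filter length and the choice to average only over voiced neighbours are assumptions; the description above specifies only a 3-point median followed by an averaging filter, with zero-pitch frames preserved.

```python
import numpy as np

def smooth_contour(p, avg_len=3):
    """3-point median filter followed by a short moving average.

    Zero-valued (silent/unvoiced) frames are kept at zero so that voicing
    decisions survive the smoothing; avg_len is an assumed filter length.
    """
    p = np.asarray(p, dtype=float)
    out = p.copy()
    # 3-point median filter (endpoints left unchanged).
    for i in range(1, len(p) - 1):
        out[i] = np.median(p[i - 1:i + 2])
    # Moving average over voiced neighbours only.
    sm = out.copy()
    for i in range(len(out)):
        if out[i] == 0.0:
            continue                     # preserve unvoiced marker
        lo = max(0, i - avg_len // 2)
        hi = min(len(out), i + avg_len // 2 + 1)
        window = out[lo:hi]
        voiced = window[window > 0]
        sm[i] = voiced.mean()
    return sm
```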
Step 3 - Generate Pitch Contour of Guide
Similarly, a pitch contour, Pg(M), of the Guide Signal, g(n), is created using the same methods and parameters as described in the previous section: measurement at a pitch estimator module 345 and smoothing at a filter module 355 to create P'g(M).
Step 4 - Calculate Pitch Adjustment
The next step is to calculate the pitch adjustment or correction factor for each frame of the time-aligned New Signal. This is done by a pitch adjustment module 370 and takes into account the ratio of the Guide pitch to the New Signal pitch and any desired shifts in octave.
The pitch of any unvoiced frame remains uncorrected. A low pass filter within module 370 then smoothes the correction factors.
Determine Octave
There are two main options considered with regard to the adjustment of pitch: a) adjust the output pitch to be the same as the Guide Pitch, or b) maintain the pitch range of the input New Signal so that the adjusted voice sounds the most natural. This latter effect is achieved in this embodiment by applying an octave adjustment. An octave adjustment module 358 computes an octave multiplier, Q, which is kept constant for the duration of the signal. In detail, the calculation of Q is performed in module 358 as follows. For each pitch analysis frame M of the time-aligned New Signal, we use the unsmoothed pitch estimates to calculate a local pitch correction, CL(M), limiting the calculation to those frames where the New Signal and its corresponding Guide frame are both voiced, that is, where both frames have a valid pitch. In those frames, the local pitch correction factor, CL(M), at the Mth frame, which would make the new pitch the same as the guide pitch, is given by

CL(M) = Pg(M) / Ps'(M)    (1)

This ratio is then mapped onto its nearest rounded octave by selecting powers of 2.
Ratio CL(M)       Octave   Comment
0.5 up to 0.75    0.5      New Signal is one octave higher
0.75 up to 1.5    1.0      New Signal is same octave
1.5 up to 3.0     2.0      New Signal is one octave lower
3.0 up to 6.0     4.0      New Signal is two octaves lower
etc.

We enter all the mapped Octave values into a histogram and choose the Octave correction value, Q, that occurs most frequently. Note that Q is not a function of time in this case, but it can be in alternative embodiments. If desired, Q could be multiplied by another factor to achieve a desired pitch frequency offset.
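The histogram selection of Q can be sketched directly from the table above, whose band boundaries fall at 0.75 * 2^k. Function and variable names here are illustrative.

```python
import math
from collections import Counter

def octave_multiplier(pg, ps):
    """Modal octave ratio Q between Guide and New pitch contours.

    pg, ps: unsmoothed pitch contours (0 = unvoiced); only frames voiced in
    both signals contribute.  Each local ratio CL = Pg/Ps is mapped onto an
    octave value 2**k using the band boundaries 0.75 * 2**k from the table,
    and the most frequent octave is returned as Q.
    """
    counts = Counter()
    for g, s in zip(pg, ps):
        if g > 0 and s > 0:
            # ratio in [0.75 * 2**k, 1.5 * 2**k) maps to octave 2**k
            k = math.floor(math.log2((g / s) / 0.75))
            counts[2.0 ** k] += 1
    if not counts:
        return 1.0   # no jointly voiced frames: leave pitch untouched
    return counts.most_common(1)[0][0]
```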
We can then use Q, the Octave parameter, to modify equation (1) so that we get an octave-corrected pitch correction factor, C(M), where

C(M) = P'g(M) / (Q * P's'(M))    (2)

where C(M) is the correction factor at frame M of the signals.
P's'(M) and P'g(M) are the smoothed estimated pitch at frame M of the time-aligned New Signal and the Guide Signal respectively.
To generate the pitch correction signal we apply Equation (2) over all frames of the time- aligned New Signal. This method ensures that the register of the modified time-aligned New Signal most closely matches that of the original New Signal.
If no corresponding Guide pitch exists at a frame M' (i.e. either the Guide is unvoiced or the time-aligned New Signal is slightly longer than the Guide signal), the last correction factor value, at M'-1, is reused. It would also be possible to use extrapolation to obtain a better estimate in this instance. Examples of correction processing are: a correction factor C(M) of 1.0 means no change to s'(n) at frame M; 0.5 means lower the pitch by one octave; 2.0 means raise the pitch by one octave; and so on.
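Combining equation (2) with the reuse rule for missing Guide pitch gives a sketch like the following. The choice of 1.0 as the initial factor before any voiced frame has occurred is an assumption.

```python
import numpy as np

def correction_factor(pg_s, ps_s, Q):
    """Octave-corrected pitch correction C(M) = P'g(M) / (Q * P's'(M)).

    pg_s, ps_s: smoothed Guide and time-aligned New pitch contours
    (0 = unvoiced).  Where no Guide pitch is available the previous factor
    is reused, as described above; unvoiced New frames keep a factor of 1.0.
    """
    C = np.ones(len(ps_s))
    last = 1.0
    for m in range(len(ps_s)):
        if ps_s[m] <= 0:
            C[m] = 1.0            # unvoiced frame: leave pitch unchanged
            continue
        g = pg_s[m] if m < len(pg_s) else 0.0
        if g > 0:
            last = g / (Q * ps_s[m])
        C[m] = last               # reuse last factor if Guide is unvoiced
    return C
```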
Step 5 - Shift Pitch of New Signal
Because of likely differences between the frame rate of the pitch correction factor and the frame rate of the pitch-shifting process, this step requires the computation of an interpolated pitch correction factor, which is input to the pitch-shifting process. This is described below.
Each sample of the Pitch Correction factor, C(M), provides the correction multiplier needed for each corresponding frame M of samples of the time-aligned New Signal, s'(n). In this example, the frame rate of C(M) is chosen to be the same as that used by the time alignment algorithm, which is 100 frames per second (fps). In other words, C(M) will have one hundred samples for every second of s'(n). To function correctly, some pitch-shifting algorithms must have a frame rate much lower than that of the time-alignment algorithm, i.e. the sampling interval is much longer. For example, time domain pitch-shifting techniques usually have a frame rate of around 25 to 30 fps if they are to work down to frequencies of 50 to 60 Hz. However, their frame rate need not be constant throughout the signal, as described in the references cited previously, and the rate can be varied, say, with the fundamental pitch of the signal s'(n). In this present embodiment, however, we use a fixed frame rate.
Because the frame rates of the pitch correction factor and the pitch-shifting process are different in this embodiment, we apply linear interpolation to derive an estimate of the pitch correction needed at the centre of each analysis frame of the pitch-shifting process from the samples of C(M) closest in time. This interpolated correction factor is derived as follows. In terms of the signal waveform sampling rate of s'(n), the factor C(M) has a frame interval or length of Lc samples, where Lc is given by:

Lc = New Signal's sampling rate / frame rate of C(M)    (3)

i.e. Lc is the reciprocal of the frame rate of C(M), expressed in samples of s'(n).
Similarly, for the pitch-shifting algorithm, we need to determine the sample number along s'(n), at the centre of each of the pitch-shifter's analysis frames, at which we require an estimate of the pitch correction.
Let Nc(Fps-1) be the sample number along s'(n) at the centre of the pitch-shifter's analysis frame Fps-1. The sample number at the centre of the next frame, Fps, will be:

Nc(Fps) = Nc(Fps-1) + Ls(Fps, To(Fps-1))    (4)

where Fps is the pitch-shifter's frame number, Fps = 0, 1, 2, ..., and Ls(Fps, To(Fps-1)) = New Signal's sampling rate / pitch-shifter's Frame Rate(Fps). In this general case, Ls is a function of the frame number and of To(Fps-1), the pitch period duration at Fps-1, to allow for a time-varying frame rate. In this embodiment, Ls(Fps) is constant and set to 1536 samples, i.e. 34.83 ms.
For the initial conditions, Nc(-1) and Nc(0) are the sample numbers at the centre of the initial frame before the first computed frame and at the centre of the first computed frame respectively. These values are dependent on the pitch-shifting algorithm. In this embodiment Nc(-1) = 0.5 * To(-1) and Nc(0) = 0.
Using Nc(Fps) and Lc we can calculate the frame numbers Fc(M) of C(M) which bound or include the sample at the centre of a specific analysis frame Fps in the pitch-shifter, i.e.
Fc(Fps) = Nc(Fps) / Lc    (5)

where / represents integer division and Fc(Fps) is the frame of C(M) occurring just before or at the centre of the pitch-shifter's frame Fps.
Lc is the fixed length of the correction analysis frame as defined above.
If Fc(Fps) is the frame occurring just before or at the centre of the pitch-shifter's frame then (Fc(Fps) +1) will be the next frame occurring after its centre.
We can now use linear interpolation between the pitch corrections C(Fc(Fps)) and C(Fc(Fps)+1) to get an estimate of the correction factor at the centre of the pitch-shifter's analysis frame to control the pitch shifter.
Cs(Fps) = C(Fc(Fps)) * (1 - alpha) + alpha * C(Fc(Fps) + 1)    (6)

where alpha = (Nc(Fps) - Lc * Fc(Fps)) / Lc
and where Fc(Fps) is obtained by integer division as in equation (5); other symbols are as defined above.
The value Cs(Fps) is smoothed by simple low pass filtering to become C's(Fps) and is represented as the output of module 370 which is supplied to the pitch changer module 380.
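Equations (3) to (6) amount to a linear interpolation of C(M) at an arbitrary sample position along s'(n), which might be sketched as follows. Clamping at the ends of C for out-of-range frames is an assumption.

```python
def interpolated_correction(C, nc, fs=44100, correction_rate=100.0):
    """Correction factor at sample position nc of s'(n), per equations (3)-(6).

    C: per-frame correction sequence sampled at `correction_rate` frames per
    second; nc: sample number at the centre of a pitch-shifter analysis frame.
    """
    Lc = fs / correction_rate            # samples per correction frame  (3)
    Fc = int(nc // Lc)                   # frame at or just before nc    (5)
    alpha = (nc - Lc * Fc) / Lc          # fractional position within frame
    c0 = C[min(Fc, len(C) - 1)]          # clamp at the ends (assumption)
    c1 = C[min(Fc + 1, len(C) - 1)]
    return c0 * (1 - alpha) + alpha * c1 # linear interpolation          (6)
```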
Each frame, Fps, of the time-aligned New Signal is shifted dynamically in pitch by its smoothed correction factor at module 380 and the pitch corrected and time aligned New Signal, s"(n), is written to disk 390 for subsequent playback with the backing music and optionally the music video. This output signal, s"(n) will have both the required time-alignment and pitch correction to be played back as a replacement for g(n) or synchronously with it.
An example of the time-aligned and corrected pitch contour 701 that would be observed in s"(n), as a result of multiplying pitch values of the time-aligned New Signal by the corresponding correction factor values illustrated in Fig. 6, is shown in FIG. 7. Note that most of the details of the Guide pitch contour 401 now appear in this example of a computed modified pitch contour 701.
The pitch shifting performed by the module 380 to create the pitch-corrected time-aligned output signal waveform, s"(n), at store 390 can be achieved using any of the standard pitch-shifting methods, such as TDHS, PS-OLA or FFT-based techniques, which are described in references such as K. Lent (1989), "An efficient method for pitch shifting digitally sampled sounds," Computer Music Journal, Vol. 13, No. 4; N. Schnell, G. Peeters, S. Lemouton, P. Manoury, and X. Rodet (2000), "Synthesizing a choir in real-time using Pitch Synchronous Overlap Add (PSOLA)," International Computer Music Conference; J. Laroche and M. Dolson (1999), "New Phase-Vocoder Techniques for Pitch-Shifting, Harmonizing and other Exotic Effects," Proc. 1999 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics; G. Peeters (1998), "Analyse-synthèse des sons musicaux par la méthode PSOLA," Journées d'Informatique Musicale, Agelonde, France; and V. Goncharoff and P. Gries (1998), "An algorithm for accurately marking pitch pulses in speech signals," International Conference on Signal and Image Processing.
In this embodiment a time domain algorithm, substantially as described in D. Malah (1979), "Time Domain Algorithms for Harmonic Bandwidth Reduction and Time Scaling of Speech Signals," IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. 27, No. 2, pp. 121-133, April 1979, is preferably used to shift the pitch of the signal s'(n).
For every frame, Fps, of s'(n) we measure its pitch period, which we define here as To(Fps).
Note that variables based on computations that include To(Fps) are also functions of Fps but, for simplicity of notation, we will not explicitly include the parameter Fps in those expressions.
In this embodiment we decompose s'(n) into a sequence of windowed samples of the signals obtained by multiplying s'(n) with a sequence of analysis windows 801 translated in time, as shown in Fig. 10(a):

s'(u,n) = h(n) * s'(n - ta(u))    (7)

where h(p), p = 0, 1, 2, ... P-1, is the pitch-shifter analysis window of length P samples, the length of which in time is equal to twice the measured pitch period of the frame Fps, i.e. 2*To(Fps). In this embodiment h(p) is a Hann window of P samples.
ta(u) is the u-th analysis instance, which is set at a pitch-synchronous rate for voiced frames, such that ta(u) - ta(u-1) = To(Fps), where u = 0, 1, 2, ... For unvoiced frames ta(u) is set at a constant rate of 10 ms. It could also be set to the last valid value of To from a voiced frame.
From the smoothed pitch correction C's(Fps) we can calculate the new output period To'(Fps) of the corrected signal. (For unvoiced signals, in frame Fps, we make To'(Fps) = To(Fps)).
To'(Fps) = To(Fps) / C's(Fps)    (8)

From this we can generate a stream of short-term synthesis windows ts(v), shown as 802, synchronized to the new output period To'(Fps), such that

ts(v) - ts(v-1) = To'(Fps)    (9)

where ts(v) is the v-th synthesis instance in the output frame.
As depicted in Fig. 10 (a) and (b), for each ts(v) we choose the closest (in time) window ta(u) of s'(n) data and add that data to an output stream buffer.
We generate the output signal stream, s"(n), one frame at a time by the method of overlap and add as described in the above references to combine all the frame's synthesis sections, ts(v). In effect we are recombining the short-time analysis signals s'(u,n) with a pitch period of To'(Fps) rather than with a period of To(Fps).
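The analysis/synthesis scheme of equations (7) to (9) can be illustrated with a deliberately simplified, constant-period PSOLA sketch. A real implementation would track To(Fps) frame by frame and handle unvoiced regions; the grain-selection and rounding details here are assumptions.

```python
import numpy as np

def psola_shift(x, period, factor):
    """Minimal constant-period PSOLA sketch: shift the pitch of x by `factor`.

    x: voiced signal; period: its (assumed constant) pitch period in samples;
    factor: pitch multiplier (>1 raises pitch).  Hann-windowed analysis grains
    of length 2*period are taken every `period` samples, per equation (7),
    and re-laid by overlap-and-add at the new period To' = To / C's,
    per equations (8) and (9).
    """
    P = 2 * period                               # analysis window length
    win = np.hanning(P)
    new_period = int(round(period / factor))     # To' = To / C's       (8)
    out = np.zeros(len(x) + P)
    ta = np.arange(0, len(x) - P, period)        # analysis instants ta(u)
    ts = 0                                       # synthesis instant ts(v)
    while ts < len(x) - P:
        u = ta[np.argmin(np.abs(ta - ts))]       # closest analysis grain
        out[ts:ts + P] += win * x[u:u + P]       # overlap-and-add
        ts += new_period                         # ts(v)-ts(v-1) = To'  (9)
    return out[:len(x)]
```

With factor = 1.0 the grains are re-laid at their original spacing and the overlapping Hann windows sum to approximately unity, so the input is recovered almost exactly away from the edges.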
Further embodiments will now be described here.
Pitch, which includes vibrato and inflection curves, is only one characteristic of a signal, and many other aspects, including those listed previously such as instantaneous loudness, glottal characteristics, speech formant or resonant patterns, equalization, reverberation and echo characteristics, etc., are measurable and also can be modified. Moreover, the New and Guide Signals are not necessarily restricted to having prosodic, rhythmic or acoustical similarities.
Accordingly, a further embodiment of this invention is shown in FIG. 8, in which a feature analysis operation is shown on the New Signal and the Guide at modules 840 and 850 respectively, to create fs(N) and fg(M). These are indicated in bold as feature vectors, specifying the selected features measured at frames N and M respectively and, moreover, these vectors do not have to be the same features. While fg (M) must contain at least one feature, fs(N) can, in a further embodiment, be a null vector with no feature.
A Feature Adjustment function, A(fs(N), fg(M), M), must be provided and here is input to the system as a signal from a source 865. This function defines the desired relationship between the two signals' feature vectors at frames N and M, where these may or may not be the same frame, the elapsed time, as represented by frame parameter M, and the time-varying signal modification process implemented in software and applied at module 870. This function and variations would generally be defined and input by the system programmer. Consequently these may be presented as a set of presets and/or offer user-defined variations that can be selected by the system user.
An example of using two different features in A(fs(N), fg(M), M), is having the loudness of the Guide signal control the centre frequency of a bandpass filter process on the New Signal with the condition that the New Signal contains energy within the moving bandpass filter's band.
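This loudness-to-filter example might be sketched as follows. The biquad band-pass design, the linear loudness-to-frequency mapping, and all parameter values are illustrative assumptions; filtering each frame with reset filter state is a simplification that a practical system would smooth across frame boundaries.

```python
import numpy as np

def bandpass_biquad(x, f0, fs=44100.0, q=2.0):
    """Band-pass biquad (0 dB peak gain) applied to the whole buffer x."""
    w0 = 2 * np.pi * f0 / fs
    alpha = np.sin(w0) / (2 * q)
    b = np.array([alpha, 0.0, -alpha])
    a = np.array([1 + alpha, -2 * np.cos(w0), 1 - alpha])
    b, a = b / a[0], a / a[0]
    y = np.zeros_like(x)
    x1 = x2 = y1 = y2 = 0.0
    for n in range(len(x)):          # direct-form I difference equation
        y[n] = b[0] * x[n] + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, x[n]
        y2, y1 = y1, y[n]
    return y

def loudness_controlled_filter(new_sig, guide_loudness, fs=44100.0,
                               frame_len=441, f_lo=300.0, f_hi=3000.0):
    """Per-frame band-pass whose centre frequency follows Guide loudness.

    guide_loudness: one RMS value per frame of the Guide, warped to
    New-Signal time; the linear mapping of loudness onto f_lo..f_hi is an
    illustrative choice, not taken from the patent.
    """
    peak = max(guide_loudness) or 1.0
    out = np.zeros_like(new_sig)
    for m, loud in enumerate(guide_loudness):
        f0 = f_lo + (f_hi - f_lo) * loud / peak   # loudness -> centre freq
        seg = slice(m * frame_len, (m + 1) * frame_len)
        out[seg] = bandpass_biquad(new_sig[seg], f0, fs)
    return out
```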
Making A a function of M also generalizes the process to include possible time-based modifications to the function.
Another embodiment is shown in FIG. 9, in which a time-aligned New Signal waveform is not generated as a first step. Instead, in this embodiment, we use the time-alignment data, obtained as in the embodiments of Figs. 3 and 8, in a module 920, to time-distort, in a module 960, the measured features of the Guide to the appropriate times in the New Signal. The time-aligned modifications are made by a module 970 to the New Signal. An optional time-alignment can be performed on the modified New Signal in the feature modification process module 970 or in a subsequent process module 975. The details of this approach are given below.
FIG. 5 can alternatively be viewed as showing an inverse of the previous time-alignment function, that is, a mapping of a matching frame of the Guide signal at a frame j to each frame of the New Signal at frame k. If we specify Fs to be a frame number of the New Signal and W(Fs) is the (inverse) time warping function (or mapping function) generated by the time alignment process module 920, then

Fag(Fs) = W(Fs)    (10)

where Fag is the corresponding frame number of the time-aligned Guide.
From this mapping we can generate a time-aligned or warped version of the Feature Adjustment function which is used in adjustment module 960 in Fig. 9.
As an example, returning to the application in pitch correction, we can compute a warped version of the pitch correction function, based on equation (1), given by:

C(Fs) = Pg(Fag(Fs)) / Ps(Fs)    (11)

and from (11) we get

C(Fs) = Pg(W(Fs)) / Ps(Fs)    (12)

where C(Fs) is the correction factor of frame Fs of the New Signal, Ps(Fs) is the estimated pitch of frame Fs of the New Signal, and W(Fs) is the corresponding frame in the Guide from the warping function. Further processing of C(Fs) as described previously, including the octave modifications (if desired), takes place in adjustment module 960, which then provides a modification function, based on equation (2), given by

C(Fs) = P'g(W(Fs)) / (Q * P's(Fs))    (13)

This modification function is applied to s(n) at modification module 970, on a frame-by-frame basis, to produce a modified output, s*(n). The processing shown in Fig. 9 is generalized as in the description of Fig. 8 to allow any features to be specified for analysis and modification.
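A sketch of equation (13), applying the inverse warping path so that the correction is indexed in New-Signal time rather than Guide time, might be:

```python
import numpy as np

def warped_correction(pg_s, ps_s, W, Q):
    """Pitch correction in New-Signal time, per equation (13).

    W maps each New-Signal frame Fs to its matching Guide frame; pg_s and
    ps_s are the smoothed Guide and (unaligned) New pitch contours, with
    0 marking unvoiced frames.  Frames lacking a valid pitch in either
    signal are left uncorrected (factor 1.0), an assumption consistent
    with the earlier embodiment.
    """
    C = np.ones(len(ps_s))
    for fs_ in range(len(ps_s)):
        g = pg_s[W[fs_]]
        if g > 0 and ps_s[fs_] > 0:
            C[fs_] = g / (Q * ps_s[fs_])   # C(Fs) = P'g(W(Fs)) / (Q*P's(Fs))
    return C
```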
One major difference in this embodiment, however, is that the modified output s*(n) in store 980 is not time-aligned with the Guide signal and instead has the timing of the original signal s(n).
Time alignment of s*(n) to g(n) can be achieved, in one embodiment for processes such as pitch modification, in a single process where feature modification in module 970 and time alignment in a module 975 are executed simultaneously. Descriptions of methods for implementing, for example, simultaneous pitch and time modification (which may reduce potential processing artefacts and improve computational efficiency) are found in references such as J. McAulay and T. Quatieri (1992), "Shape Invariant Time-Scale and Pitch Modification of Speech," IEEE Trans. Sig. Processing, March, Vol. 40, No. 3, pp. 497-510, and D. O'Brien and A. Monaghan (1999), "Shape Invariant Pitch Modification of Speech Using a Harmonic Model," EuroSpeech 1999, pp. 1059-1062. It should be noted that these references do not refer to any external control for pitch or time modifications, and instead assume either a constant shift or use measurements of the original signal to determine the amount of shift to apply. For example, if unvoiced frames are detected in the original voice waveform, it is normal practice to switch off, or at least reduce, any time or pitch modifications applied during that frame.
In an alternative embodiment, the normal time alignment function can be applied separately, in a non-linear editing process as indicated in module 975, to create a time-aligned version of s*(n).
In further embodiments, other forms of time alignment can be applied, for example the signals can be divided into more coarsely defined units, such as individual words or phonemes. This example would be more appropriate for more slowly varying processes such as the addition of time-varying EQ or reverberation.
In further embodiments, the Guide Signal can be made up of a series of different individual signals, instead of one continuous signal.
It will be appreciated that the processing modules used in the embodiments described hereinbefore will be software modules when implemented in a system such as the system of Figs. 1 and 2, but may in alternative implementations be hardware modules or a mixture of hardware and software modules.

Claims (37)

1. A method for modifying at least one acoustic feature of an audio signal, the method comprising:
comparing first and second sampled audio signals so as to determine time alignment data from timing differences between the times of occurrence of time-dependent features in the second signal and the times of occurrence of time- dependent features in the first signal; measuring at selected positions along the first signal at least one acoustic feature of the first signal to produce therefrom a sequence of first signal feature measurements; processing the sequence of first signal feature measurements to produce a sequence of feature modification data; and applying the sequence of feature modification data to the second signal to modify at least one acoustic feature of selected portions of the second signal in accordance with the time alignment data.
2. A method according to claim 1, wherein the method includes the step of measuring at selected positions along the second signal the said at least one acoustic feature of the second signal to produce therefrom a sequence of second signal feature measurements, and the step of processing the sequence of first signal measurements includes comparing the first signal feature measurements with the second signal feature measurements and determining the feature modification data from such comparison.
3. A method according to claim 1 or 2, wherein the said step of applying the feature modification data includes the steps of using the time alignment data to produce from the second sampled signal a time-aligned second signal and applying the feature modification data to the time-aligned second signal.
4. A method according to claim 2, wherein the said processing step includes the step of using the time alignment data with the first signal feature measurements to produce the feature modification data in time alignment with the second signal feature measurements.
5. A method according to any preceding claim, wherein the step of applying the feature modification data includes modulating the feature modification data in accordance with a predetermined function so as to modify the said at least one acoustic feature of the said selected portions of the second signal jointly by the feature modification data and the predetermined function.
6. A method according to any preceding claim, wherein the said at least one acoustic feature of the first signal is pitch.
7. A method according to any preceding claim, wherein the said at least one acoustic feature of the second signal is pitch.
8. A method according to any preceding claim, wherein the said time-dependent features of the first and second signals are sampled spectral energy measurements.
9. A method according to claim 1, wherein the said at least one acoustic feature of the first signal is pitch and the said at least one acoustic feature of the second signal is pitch, and the said processing step includes the step of determining from values of ratio of pitch measurement of the first signal to time-aligned pitch measurement of the second signal a multiplier factor and so including the said factor in said steps of applying the feature modification data as to reduce the magnitude of pitch changes in the second signal in the modified selected signal portions.
10. A method according to claim 9, further including the step of scaling the said multiplier factor by a power of 2 so as to change pitch in the said modified selected signal portions in accordance with a selection of the said power of 2.
11. A method according to claim 2, wherein the step of measuring at selected positions along the second signal includes the steps of using the time alignment data to produce from the second sampled signal a time-aligned second signal in which the times of occurrence of the said time-dependent features of the second sampled signal are substantially coincident with the times of occurrence of the said time-dependent features in the first sampled signal, and measuring the at least one acoustic feature in the time-aligned second signal at positions along the time-aligned second signal selected to be related in timing with the said selected positions along the first sampled signal.
12. A method according to claim 11, wherein the said positions selected to be related in timing are substantially coincident in timing with the said selected positions along the first sampled signal.
13. A method according to claim 2, wherein the said at least one acoustic feature of the first sampled signal is pitch, the said at least one acoustic feature of the second sampled signal is pitch, the said step of applying the feature modification data includes the steps of using the time alignment data to produce from the second sampled signal a time aligned second signal and applying the feature modification data to the time aligned second signal to produce a pitch modified time aligned second signal.
14. A method according to claim 13, wherein the step of applying the feature modification data includes modulating the feature modification data in accordance with a predetermined function so as to modify pitch in the said selected portions of the second signal jointly by the feature modification data and the predetermined function.
15. A method according to claim 14, wherein the predetermined function is a function of the values of the ratio of pitch measurement in the first sampled signal to corresponding pitch measurement in the second sampled signal along the second sampled signal.
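One way to read the "predetermined function" of claims 14 and 15 is as a strength control applied to the raw pitch ratio. The exponent form below is purely a hypothetical example of such a function, not the patent's:

```python
def modulated_factor(ratio, strength=0.5):
    """Attenuate a raw guide/new pitch ratio by a predetermined
    function of that ratio (claims 14-15 sketch). strength=1.0
    applies full correction; strength=0.0 leaves pitch unchanged.
    Both the exponent form and 'strength' are assumptions."""
    return ratio ** strength
```

Because the function's argument is itself the pitch ratio, this matches the claim-15 case where the predetermined function depends on the measured ratio along the signal.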
16. Apparatus for modifying at least one acoustic feature of an audio signal, the apparatus comprising: means for comparing first and second sampled audio signals so as to determine time alignment data from timing differences between the times of occurrence of time-dependent features in the second signal and the times of occurrence of time- dependent features in the first signal; means for measuring at selected positions along the first signal at least one acoustic feature of the first signal to produce therefrom a sequence of first signal feature measurements; means for processing the sequence of first signal feature measurements to produce a sequence of feature modification data; and means for applying the sequence of feature modification data to the second signal to modify at least one acoustic feature of selected portions of the second signal in accordance with the time alignment data.
17. Apparatus according to claim 16, further including means for measuring at selected positions along the second signal the said at least one acoustic feature of the second signal to produce therefrom a sequence of second signal feature measurements, and wherein the means for processing the sequence of first signal measurements includes means for comparing the first signal feature measurements with the second signal feature measurements and determining the feature modification data from such comparison.
18. Apparatus according to claim 16 or 17, wherein the said means for applying the feature modification data includes means for using the time alignment data to produce from the second sampled signal a time-aligned second signal and applying the feature modification data to the time-aligned second signal.
19. Apparatus according to claim 17, wherein the said processing means includes means for using the time alignment data with the first signal feature measurements to produce the feature modification data in time alignment with the second signal feature measurements.
20. Apparatus according to claim 16, wherein the means for applying the feature modification data includes means for modulating the feature modification data in accordance with a predetermined function so as to modify the said at least one acoustic feature of the said selected portions of the second signal jointly by the feature modification data and the predetermined function.
21. Apparatus according to claim 16, wherein the said at least one acoustic feature of the first signal is pitch.
22. Apparatus according to claim 16, wherein the said at least one acoustic feature of the second signal is pitch.
23. Apparatus according to claim 16, wherein the said time-dependent features of the first and second signals are sampled spectral energy measurements.
24. Apparatus according to claim 16, wherein the said at least one acoustic feature of the first signal is pitch and the said at least one acoustic feature of the second signal is pitch, and the said processing means includes means for determining from values of the ratio of pitch measurement of the first signal to time-aligned pitch measurement of the second signal a multiplier factor and so including the said factor in applying the feature modification data as to reduce the magnitude of pitch changes in the second signal in the modified selected signal portions.
25. Apparatus according to claim 24, further including means for scaling the said multiplier factor by a power of 2 so as to change pitch in the second modified selected signal portions in accordance with a selection of the said power of 2.
26. Apparatus according to claim 17, wherein the means for measuring at selected positions along the second signal includes means for using the time alignment data to produce from the second sampled signal a time aligned second signal in which the times of occurrence of the said timedependent features of the second sampled signal are substantially coincident with the times of occurrence of the said time-dependent features in the first sampled signal, and means for measuring the at least one acoustic feature in the time aligned second signal at positions along the time aligned second signal selected to be related in timing with the said selected positions along the first sampled signal.
27. Apparatus according to claim 26, wherein the said positions selected to be related in timing are substantially coincident in timing with the said selected positions along the first sampled signal.
28. Apparatus according to claim 17, wherein the said at least one acoustic feature of the first sampled signal is pitch, the said at least one acoustic feature of the second sampled signal is pitch, the said means for applying the feature modification data includes means for using the time alignment data to produce from the second sampled signal a time aligned second signal and applying the feature modification data to the time aligned second signal to produce a pitch modified time aligned second signal.
29. Apparatus according to claim 28, wherein the means for applying the feature modification data includes means for modulating the feature modification data in accordance with a predetermined function so as to modify pitch in the said selected portions of the second signal jointly by the feature modification data and the predetermined function.
30. Apparatus according to claim 29, wherein the predetermined function is a function of the values of the ratio of pitch measurement in the first sampled signal to corresponding pitch measurement in the second sampled signal along the second sampled signal.
31. Audio signal modification apparatus comprising: a time alignment module arranged to receive a new signal and a guide audio signal and to produce therefrom a time-aligned new signal, a first pitch measurement module coupled to the time alignment module and arranged to measure pitch in the time-aligned new signal; a second pitch measurement module arranged to receive the guide audio signal and to measure pitch in the guide audio signal; a pitch adjustment calculator coupled to the first and second pitch measurement modules and arranged to calculate a pitch correction factor, and a pitch modulator coupled to the time alignment module to receive the time aligned new signal and to the pitch adjustment calculator to receive the pitch correction factor and arranged to modify pitch in the time aligned new signal in accordance with the pitch correction factor.
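The module chain of claim 31 (time alignment, two pitch measurement modules, an adjustment calculator, a pitch modulator) can be wired up as a toy end-to-end sketch. Here the "signals" are just lists of per-frame pitch values so the wiring can be shown compactly; every function body is a hypothetical stand-in for a real DSP module, not the patented implementation.

```python
def time_align(new_frames, guide_frames):
    # Time alignment module: a real system warps the new signal using
    # spectral features; this stub just pads/truncates to guide length.
    n = len(guide_frames)
    return (new_frames + [0.0] * n)[:n]

def correction_factors(guide_frames, aligned_frames):
    # Pitch adjustment calculator: guide/new ratio per voiced frame.
    return [g / a if g > 0.0 and a > 0.0 else 1.0
            for g, a in zip(guide_frames, aligned_frames)]

def modulate(aligned_frames, factors):
    # Pitch modulator: scale each frame's pitch by its factor.
    return [f * a for f, a in zip(factors, aligned_frames)]

def correct_pitch(new_frames, guide_frames):
    aligned = time_align(new_frames, guide_frames)       # aligner output
    factors = correction_factors(guide_frames, aligned)  # calculator output
    return modulate(aligned, factors)                    # modulator output
```

With these toy modules, voiced frames of the aligned new signal come out at the guide's pitch, which is the claim-31 signal path in miniature.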
32. Audio signal modification apparatus comprising: a time alignment module arranged to receive a new audio signal and a guide audio signal and to produce therefrom a time aligned new signal; a first acoustic feature measurement module arranged to receive the guide audio signal and to measure at least one acoustic feature of the guide audio signal; an acoustic feature adjustment calculator coupled to the first acoustic feature measurement module and arranged to calculate an acoustic feature modification factor; and an acoustic feature modulator coupled to the time alignment module to receive the time aligned new signal and to the acoustic feature adjustment calculator to receive the acoustic feature modification factor and arranged to modify the said at least one acoustic feature of the time aligned new signal in accordance with the acoustic feature modification factor.
33. Audio signal modification apparatus according to claim 32, wherein a processing function module is coupled to the feature adjustment calculator to supply thereto a signal function, and the feature adjustment calculator is adapted to calculate the acoustic feature modification factor in dependence upon the signal function.
34. Audio signal modification apparatus according to claim 32 or 33, wherein a second acoustic feature measurement module is coupled to the time alignment module and arranged to measure at least one acoustic feature of the time aligned new signal; and the acoustic feature adjustment calculator is coupled to the second acoustic feature measurement module.
35. Audio signal modification apparatus comprising: a time alignment module arranged to receive a new audio signal and a guide audio signal and to produce therefrom time alignment data; a first acoustic feature measurement module arranged to receive the guide audio signal and to measure at least one acoustic feature of the guide audio signal; an acoustic feature adjustment calculator coupled to the time alignment module and to the first acoustic feature measurement module and arranged to calculate time-aligned values of an acoustic feature modification factor; and an acoustic feature modulator coupled to receive the new audio signal and to the acoustic feature adjustment calculator to receive the time-aligned values of the acoustic feature modification factor and arranged to modify the said at least one acoustic feature of the new audio signal in accordance with the time-aligned values of the acoustic feature modification factor so as to produce a modified new audio signal.
36. Audio signal modification apparatus according to claim 35, wherein a time aligner is coupled to the acoustic feature modulator to receive the modified new audio signal and to the time alignment module to receive the time alignment data and is arranged to produce a time aligned modified new signal in accordance with the said modified new audio signal and the time alignment data.
37. Audio signal modification apparatus according to claim 35 or 36, wherein a second acoustic feature measurement module is arranged to receive the new audio signal and to measure at least one acoustic feature of the new audio signal; and the acoustic feature adjustment calculator is coupled to the second acoustic feature measurement module.
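Claims 35 and 36 reverse the order of claim 31: time-aligned factor values are first applied to the new signal in its own timeline, and a separate time aligner then warps the modified result. In the hypothetical sketch below, `warp[j]` stands in for the claim-35 time alignment data, naming the guide frame matched to new-signal frame `j`; the representation and function names are assumptions.

```python
def modify_in_timeline(new_frames, guide_frames, warp):
    # Acoustic feature modulator (claim 35): apply time-aligned factor
    # values to the new signal before any time warping is performed.
    out = []
    for j, x in enumerate(new_frames):
        g = guide_frames[warp[j]]
        out.append(x * (g / x) if x > 0.0 and g > 0.0 else x)
    return out

def align_modified(modified, warp, guide_len):
    # Time aligner (claim 36): place each modified frame at the guide
    # position its warp entry points to (last writer wins on overlaps).
    aligned = [0.0] * guide_len
    for j, g_idx in enumerate(warp):
        aligned[g_idx] = modified[j]
    return aligned
```

Keeping modification and alignment as separate stages, as here, is what lets claim 36's time aligner be an optional add-on to the claim-35 apparatus.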
GB0501744A 2005-01-27 2005-01-27 Audio signal processing Withdrawn GB2422755A (en)

Priority Applications (9)

Application Number Priority Date Filing Date Title
GB0501744A GB2422755A (en) 2005-01-27 2005-01-27 Audio signal processing
ES06709573T ES2356476T3 (en) 2005-01-27 2006-01-26 PROCEDURE AND APPLIANCE FOR USE IN SOUND MODIFICATION.
PL06709573T PL1849154T3 (en) 2005-01-27 2006-01-26 Methods and apparatus for use in sound modification
DE602006018867T DE602006018867D1 (en) 2005-01-27 2006-01-26 METHOD AND DEVICES FOR USE IN SOUND MODIFICATION
AT06709573T ATE492013T1 (en) 2005-01-27 2006-01-26 METHOD AND APPARATUS FOR USE IN SOUND MODIFICATION
PCT/GB2006/000262 WO2006079813A1 (en) 2005-01-27 2006-01-26 Methods and apparatus for use in sound modification
CN2006800034105A CN101111884B (en) 2005-01-27 2006-01-26 Methods and apparatus for synchronous modification of acoustic characteristics
JP2007552713A JP5143569B2 (en) 2005-01-27 2006-01-26 Method and apparatus for synchronized modification of acoustic features
EP06709573A EP1849154B1 (en) 2005-01-27 2006-01-26 Methods and apparatus for use in sound modification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB0501744A GB2422755A (en) 2005-01-27 2005-01-27 Audio signal processing

Publications (2)

Publication Number Publication Date
GB0501744D0 GB0501744D0 (en) 2005-03-02
GB2422755A true GB2422755A (en) 2006-08-02

Family

ID=34259792

Family Applications (1)

Application Number Title Priority Date Filing Date
GB0501744A Withdrawn GB2422755A (en) 2005-01-27 2005-01-27 Audio signal processing

Country Status (5)

Country Link
CN (1) CN101111884B (en)
AT (1) ATE492013T1 (en)
DE (1) DE602006018867D1 (en)
ES (1) ES2356476T3 (en)
GB (1) GB2422755A (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101533641B (en) 2009-04-20 2011-07-20 华为技术有限公司 Method for correcting channel delay parameters of multichannel signals and device
CN102307323B (en) * 2009-04-20 2013-12-18 华为技术有限公司 Method for modifying sound channel delay parameter of multi-channel signal
US9117461B2 (en) * 2010-10-06 2015-08-25 Panasonic Corporation Coding device, decoding device, coding method, and decoding method for audio signals
US9123353B2 (en) * 2012-12-21 2015-09-01 Harman International Industries, Inc. Dynamically adapted pitch correction based on audio input
CN104538011B (en) * 2014-10-30 2018-08-17 华为技术有限公司 A kind of tone adjusting method, device and terminal device
EP3549355A4 (en) * 2017-03-08 2020-05-13 Hewlett-Packard Development Company, L.P. Combined audio signal output
JP6646001B2 (en) * 2017-03-22 2020-02-14 株式会社東芝 Audio processing device, audio processing method and program

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1994022130A1 (en) * 1993-03-17 1994-09-29 Ivl Technologies Ltd. Musical entertainment system
GB2290685A (en) * 1994-06-24 1996-01-03 Roland Kk Sound effect adding system
US5750912A (en) * 1996-01-18 1998-05-12 Yamaha Corporation Formant converting apparatus modifying singing voice to emulate model voice
US5966687A (en) * 1996-12-30 1999-10-12 C-Cube Microsystems, Inc. Vocal pitch corrector
JP2003044066A (en) * 2001-07-31 2003-02-14 Daiichikosho Co Ltd Karaoke machine with pitch shifter

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150206540A1 (en) * 2007-12-31 2015-07-23 Adobe Systems Incorporated Pitch Shifting Frequencies
US9159325B2 (en) * 2007-12-31 2015-10-13 Adobe Systems Incorporated Pitch shifting frequencies
EP2631910A1 (en) * 2012-02-27 2013-08-28 Sony Corporation Signal processing apparatus, signal processing method and program
EP3389028A1 (en) * 2017-04-10 2018-10-17 Sugarmusic S.p.A. Automatic music production from voice recording.
WO2018189082A1 (en) 2017-04-10 2018-10-18 Sugarmusic S.P.A. Auto-generated accompaniment from singing a melody
US11087727B2 (en) 2017-04-10 2021-08-10 Sugarmusic S.P.A. Auto-generated accompaniment from singing a melody
US20220293136A1 (en) * 2019-11-04 2022-09-15 Beijing Bytedance Network Technology Co., Ltd. Method and apparatus for displaying music points, and electronic device and medium
US11587593B2 (en) * 2019-11-04 2023-02-21 Beijing Bytedance Network Technology Co., Ltd. Method and apparatus for displaying music points, and electronic device and medium

Also Published As

Publication number Publication date
CN101111884A (en) 2008-01-23
ATE492013T1 (en) 2011-01-15
DE602006018867D1 (en) 2011-01-27
CN101111884B (en) 2011-05-25
ES2356476T3 (en) 2011-04-08
GB0501744D0 (en) 2005-03-02

Similar Documents

Publication Publication Date Title
US7825321B2 (en) Methods and apparatus for use in sound modification comparing time alignment data from sampled audio signals
EP1849154B1 (en) Methods and apparatus for use in sound modification
GB2422755A (en) Audio signal processing
Corey Audio production and critical listening: Technical ear training
US9847078B2 (en) Music performance system and method thereof
US8290769B2 (en) Vocal and instrumental audio effects
JP4767691B2 (en) Tempo detection device, code name detection device, and program
JP2012037722A (en) Data generator for sound synthesis and pitch locus generator
US10885894B2 (en) Singing expression transfer system
Arzt et al. Artificial intelligence in the concertgebouw
JP5229998B2 (en) Code name detection device and code name detection program
JP2016509384A (en) Acousto-visual acquisition and sharing framework with coordinated, user-selectable audio and video effects filters
Lee et al. Toward a framework for interactive systems to conduct digital audio and video streams
WO2020162392A1 (en) Sound signal synthesis method and training method for neural network
Bozkurt A system for tuning instruments using recorded music instead of theory-based frequency presets
JP2002108382A (en) Animation method and device for performing lip sinchronization
Nakano et al. VocaRefiner: An interactive singing recording system with integration of multiple singing recordings
JP6171393B2 (en) Acoustic synthesis apparatus and acoustic synthesis method
JPH11259066A (en) Musical acoustic signal separation method, device therefor and program recording medium therefor
Simon et al. Audio analogies: Creating new music from an existing performance by concatenative synthesis
US20210350783A1 (en) Sound signal synthesis method, neural network training method, and sound synthesizer
Driedger Time-scale modification algorithms for music audio signals
JP2018155936A (en) Sound data edition method
JP2000010597A (en) Speech transforming device and method therefor
Rosenzweig Interactive Signal Processing Tools for Analyzing Multitrack Singing Voice Recordings

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)