EP3327723A1 - Method for slowing down speech in an input media content - Google Patents

Method for slowing down speech in an input media content

Info

Publication number
EP3327723A1
Authority
EP
European Patent Office
Prior art keywords
segment
intervowel
speech
stretching
pitch period
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP16306550.1A
Other languages
German (de)
English (en)
Inventor
Aharon Roni LEVI
Martin PETKOVIKJ
Branislav GERAZOV
Yves Joseph Michel SERRA
Ronen OFFER
Ronen MIZRAHI
Igor SIMEVSKI
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Listen Up Technologies Ltd
Original Assignee
Listen Up Technologies Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Listen Up Technologies Ltd filed Critical Listen Up Technologies Ltd
Priority to EP16306550.1A priority Critical patent/EP3327723A1/fr
Priority to PCT/IL2017/051286 priority patent/WO2018096541A1/fr
Publication of EP3327723A1 publication Critical patent/EP3327723A1/fr
Withdrawn legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/04 - Time compression or expansion
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/15 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being formant information

Definitions

  • the field of this invention is that of audio signal processing.
  • the invention relates to a method for slowing down speech in media content.
  • Audio-visual media are omnipresent nowadays, and a large part of the information is provided in the form of speech, often with a high speaking rate for faster distribution.
  • TSM: Time Scale Modification
  • TSM may be performed by processing the audio signal directly in the time domain, or in a transformation domain, e.g. in the Fourier domain.
  • SOLA: synchronized overlap-add
  • in Time-Domain TSM algorithms, a maximum is found in every search range defined by the pitch period, and speech is divided according to these maxima. TSM is applied to these segments with linear cross-fading.
  • the processed sections of the speech are inserted in between unprocessed sections to slow down speech, and they replace the unprocessed sections to speed it up.
  • the output signals are composed according to a table, which contains a pattern of what segments to process and what segments to pass through. More specifically, low-energy segments, segments with low probability of containing a human voice, high-stationarity segments, and/or segments with no detected distortion are selected with priority. Linear cross-fading is used to perform the data compression/expansion. The correlation is calculated for the subband that contains the highest energy and the subband that contains the pitch frequency.
  • the patent document US7412379 proposes a dual approach in which unvoiced frames are expanded using a parametric technique and voiced frames are expanded using a waveform based technique, such as SOLA.
  • the unvoiced frames are expanded by inserting noise colored using linear predictive coefficients extracted from the speech signal.
  • the present invention provides according to a first aspect a method for slowing down speech in an input media content received by an equipment comprising a processing unit, the input media content comprising an input audio signal constituted by a sequence of audio frames, the method being characterized in that it comprises performing by the processing unit steps of:
  • the invention provides an equipment comprising a processing unit configured to perform:
  • the invention proposes a computer program product, comprising code instructions for executing a method according to the first aspect for slowing down speech in an input media content; and a computer-readable medium, on which is stored a computer program product comprising code instructions for executing a method according to the first aspect for slowing down speech in an input media content.
  • the present method aims to slow down speech in an input media content comprising a received input audio signal, i.e. generating an output media content comprising as output signal the input signal which has been processed so as to have a speed that is more comfortable for the listener.
  • the output audio signal is slowed down but the intonation and the spectrum are kept as similar as possible to the input audio signal.
  • the audio signal may undergo further treatments so as to change for example the pitch.
  • the input media content may only comprise an audio signal, or comprise both an audio and a video signal (visual).
  • the input media content is a TV stream.
  • the content format may be MPEG, H.264 or other formats, and may also comprise data that is not compressed.
  • the present method is performed by an equipment 1, which may be either an equipment able to directly play the output media content (for example a television, a computer, a smartphone, a tablet, etc.) or an equipment that outputs the output media content to another one which receives it and plays it (for example a set-top box, a server, etc.).
  • the equipment 1 is connected to a display 2, and the input media content is streamed from a server 3 connected to the equipment 1 through a network.
  • the equipment 1 comprises a processing unit 11, such as a processor, and a memory unit 12 for the buffers, such as RAM.
  • the equipment 1 further comprises a user interface 13 for controlling it.
  • the input media content comprises an input audio signal and advantageously an input video signal.
  • the input audio signal is constituted of a sequence of audio frames.
  • the input video signal is constituted of a sequence of video frames (image).
  • Each video frame has an accompanying audio frame.
  • a decoder decodes the data from the input data stream and outputs data packets each containing one video frame and the corresponding frame of audio samples.
  • the number of samples in each audio frame depends on the frame rate of the video signal Fr and the sampling rate of the audio Fs, and is equal to their ratio Fs / Fr.
  • a data packet is output from the decoder every 1/Fr seconds.
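  • for instance, with an audio sampling rate Fs = 48 kHz and a video frame rate Fr = 25 frames per second, each audio frame would contain 48000 / 25 = 1920 samples, and a data packet would be output every 1/25 s = 40 ms.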
  • the decoder does not necessarily decompress the video data, if frame duplication is possible in the coded video stream.
  • the Input Buffers block comprises two ring buffers that store the frames of video and audio data. On demand of the processing unit 11, the Input Buffers forward these frames to it.
  • the processing unit 11 in turn, advantageously keeps track of how many frames are in the Input Buffers in order to prevent overflow and data loss.
  • the processing unit 11 performs the present method so as to modify the audio signal by stretching the parts comprising speech, so that the speech rate is reduced to a target speech rate.
  • the processing unit 11 may produce several output streams that correspond to a set of such target speech rates. For each stream of processed audio, this block advantageously also generates a stream of video that is accordingly modified to maintain synchronization.
  • the audio/video data streams are output in frames to the Output Buffers, and the processing unit 11 makes sure that the output buffers always contain data.
  • the Output Buffers store the data frames from the different processed streams and output one to the Coder every 1 / Fr seconds, based on the settings made in the User Control interface 13.
  • the interface enables the user to select a desired output speech rate, in syllables per second (syll/s), and optionally an output lowering of the pitch, as well as to rewind the video, and fast forward through it.
  • the user can fast forward the processed stream only if there is a time lag between it and the original input stream, introduced by the stretching process.
  • the Encoder on the output end encodes the processed data back into the original data format, applying the appropriate re-compression to the audio signal and the video signal if necessary.
  • the functional schematic of a preferred embodiment of the processing unit 11 is shown in figure 3.
  • the present method starts with a step (a) of classifying the audio frames as speech, non-speech, or pause, so as to divide said audio signal of the media content into speech segments bounded by non-speech segments.
  • a “silence” is a frame without sound (a silent frame), a “speech” frame is one whose audio is speech, and a “non-speech” frame applies to anything which is not silence or speech, such as music or noise.
  • a silent frame among other silence frames is referred to as a "pause" frame.
  • each audio data frame read from the input buffers is preferably stored in an Auxiliary Circular Buffer (implemented within the processing unit 11).
  • This buffer stores a set of audio frames, of which the central one is the current frame to be processed (as will be seen, the audio frames neighbouring the current one are preferably needed for Speech/Non-speech/Pause classification).
  • the entire content of the Auxiliary Circular Buffer is normalized by the Peak Normalization block. This block amplifies the audio signal using an adaptive gain in the range of 0-12 dB, which is updated with each audio frame using a gain step factor, seeking to amplify the signal up to a maximum of 0 dBFS.
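As an illustration only, a minimal numpy sketch of such an adaptive peak normalization is given below; the function name, the 0.5 dB gain step and the exact adaptation rule are assumptions, not the implementation described above.

```python
import numpy as np

def adaptive_peak_normalize(frame, gain_db, step_db=0.5, max_gain_db=12.0):
    """Amplify one audio frame towards 0 dBFS with a slowly adapting gain.

    frame   : float samples in [-1.0, 1.0]
    gain_db : gain carried over from the previous frame, within 0..max_gain_db
    Returns the amplified frame and the updated gain in dB.
    """
    peak = np.max(np.abs(frame)) + 1e-12         # current frame peak
    headroom_db = -20.0 * np.log10(peak)         # distance of the peak below 0 dBFS
    target_db = min(max(headroom_db, 0.0), max_gain_db)
    # move the gain one step per frame towards the target, staying within 0..12 dB
    if gain_db < target_db:
        gain_db = min(gain_db + step_db, target_db)
    else:
        gain_db = max(gain_db - step_db, target_db)
    return frame * 10.0 ** (gain_db / 20.0), gain_db
```

One running gain value would be kept per stream and passed from frame to frame.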
  • step (a) thus advantageously comprises for each audio frame (from the Auxiliary Circular Buffer):
  • the determination of an audio frame being silent can be made based on the following features:
  • the speech/non-speech classifier preferably uses a neural network comprising at least three layers: one input layer, one hidden layer and one output layer.
  • the input layer of this network is fed features extracted from the audio signal in the current and neighboring frames, and the output layer generates a probability that the content of the input audio signal is speech.
  • the number of neurons in the input layer equals the number of features used.
  • the number of hidden units is a critical parameter, as more neurons increase the network's complexity and thus its performance, but also decrease its power to generalize on unseen data. There is only one neuron in the output layer.
  • a training database of speech and non-speech recordings is used. Training is stopped when the network's ability to generalize degrades for 6 epochs in a series, as determined using a cross-validation set. The trained neural network's performance is then assessed using a test set.
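Purely as an illustration of this type of classifier, a scikit-learn sketch follows; the feature dimensionality, the hidden-layer size and the random placeholder data are assumptions, not values taken from the description above.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Placeholder data: one row of features (current + neighbouring frames) per example,
# label 1 for speech and 0 for non-speech.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 26))
y = rng.integers(0, 2, size=1000)

clf = MLPClassifier(
    hidden_layer_sizes=(32,),  # one hidden layer; the size is a tunable placeholder
    early_stopping=True,       # holds out a validation set, playing the role of the cross-validation set
    n_iter_no_change=6,        # stop once validation performance stops improving for 6 epochs in a row
    max_iter=500,
)
clf.fit(X, y)
p_speech = clf.predict_proba(X)[:, 1]  # probability that each frame contains speech
```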
  • step (a) further comprises:
  • One simple approach to smoothing is to make a decision for the current frame based on an average of the decisions made for neighboring frames. Another is to take into account the decisions for the neighboring frames using weights that are a function of their distance from the current frame.
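As a minimal sketch of the second, distance-weighted variant (the window radius and the weighting law are illustrative assumptions):

```python
import numpy as np

def smooth_decisions(raw_decisions, radius=5):
    """Smooth per-frame speech decisions using neighbours weighted by distance.

    raw_decisions : raw per-frame decisions (0/1) or speech probabilities
    radius        : number of neighbouring frames considered on each side
    """
    offsets = np.arange(-radius, radius + 1)
    weights = 1.0 / (1.0 + np.abs(offsets))   # closer frames receive larger weights
    weights /= weights.sum()
    probs = np.convolve(np.asarray(raw_decisions, dtype=float), weights, mode="same")
    return probs > 0.5                        # smoothed binary decision per frame
```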
  • the schematic includes a data gate and a demultiplexer block that are introduced to control the flow of audio/video frames according to the outputs of the decision blocks.
  • the current frame is preferably evaluated as proper or not, so as to control the gate.
  • step (a) further comprises:
  • the Speech or Pause Segment Buffer can contain either speech or pause, as its name implies.
  • the Process Speech or Pause Segment block processes these frames differently according to their type, as will now be explained.
  • the method comprises, for each speech segment:
  • the method also comprises, for each pause segment:
  • an intervowel segment is a fragment of a speech segment between two successive vowels. In the case of a pause segment, the whole segment is treated as a single intervowel segment.
  • if the Speech or Pause Segment Buffer contains speech, it is first analyzed by the Calculate Syllable Rate block so as to perform step (b).1.
  • the Calculate syllable rate block estimates the average intervowel segment length, which corresponds to the average syllable rate of the speech segment, by first locating the positions of the vowels in it and then calculating the intervowel distances between them.
  • step (b).1 comprises:
  • a vowel is a sound in spoken language, pronounced with an open vocal tract, so that the tongue does not touch the lips, teeth, or roof of the mouth. This contrasts with consonants, which have a constriction or closure at some point along the vocal tract. Furthermore, a vowel carries the peak energy in a syllable.
  • Vowels have therefore acoustical characteristics that separate them from consonants, including pronounced periodicity and a concentration of energy in the lower frequency bands.
  • a neural network uses a set of Mel Frequency Cepstral Coefficients (MFCCs) extracted from short segments (frames) of the signal as input, and outputs the probability of that frame being a vowel.
  • MFCCs: Mel Frequency Cepstral Coefficients
  • peaks and dips are identified in the vector of the output vowel probabilities.
  • the identified peaks are declared to be vowel candidates.
  • An example plot of the vowel probability function and the identified peaks and dips is shown in figure 6 .
  • Peaks declared to represent vowels are presented with vertical lines in figure 6 .
  • any other machine learning algorithm can be employed with the aim of finding the vowel positions.
  • other features of the signal can be used to the same end, whether on their own or in combination with each other. Such features include but are not limited to the amplitude envelope, the Low Frequency Modulated Energy (LMFE) and the high-to-low frequency energy ratio (ER).
  • LMFE: Low Frequency Modulated Energy
  • ER: high-to-low frequency energy ratio
  • one approach is to compute Tavg as the average of all of the intervowel distances Tiv.
  • Another approach is to use the Kernel Density Estimation (KDE) algorithm to estimate Tavg using Gaussian kernels fitted to the histogram of the intervowel distances Tiv. The largest peak is then chosen to be the average intervowel distance Tavg.
  • KDE: Kernel Density Estimation
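By way of a sketch only, the chain from the per-frame vowel probabilities to the average intervowel distance could look as follows; the peak-height threshold, the frame hop and the evaluation grid are illustrative assumptions.

```python
import numpy as np
from scipy.signal import find_peaks
from scipy.stats import gaussian_kde

def average_intervowel_distance(vowel_prob, hop_s, min_height=0.5):
    """Estimate Tavg (in seconds) from a vector of per-frame vowel probabilities.

    vowel_prob : vowel probability output by the detector, one value per frame
    hop_s      : time step between consecutive frames, in seconds
    """
    peaks, _ = find_peaks(vowel_prob, height=min_height)  # vowel candidates
    t_iv = np.diff(peaks) * hop_s                          # intervowel distances Tiv
    if len(t_iv) < 2 or np.ptp(t_iv) == 0.0:               # too few vowels: plain mean
        return float(np.mean(t_iv)) if len(t_iv) else 0.0
    # KDE variant: fit Gaussian kernels to the distribution of Tiv and take the
    # location of its largest peak as the average intervowel distance Tavg.
    kde = gaussian_kde(t_iv)
    grid = np.linspace(t_iv.min(), t_iv.max(), 200)
    return float(grid[np.argmax(kde(grid))])
```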
  • the stretching curve will be used to process the audio data in order to obtain an audio stream corresponding to the target speech rate.
  • each output stream may be processed with a resampling based pitch shift to generate a plurality of streams with different levels of pitch lowering, in particular 3.
  • the Calculate stretching curve block uses the average intervowel distance Tavg to calculate the stretching transfer function, of which an example is shown in figure 7.
  • This function defines the mapping between the intervowel distances in the input speech segment Tin and the output intervowel distance targets Tout that need to be obtained using stretching in order to reach the set target intervowel distance Ttarget.
  • the nonlinearity of the stretching curve assures larger stretching of smaller intervowel distances, which are harder to perceptually process, and less stretching for longer ones, which do not impede comprehension. Additionally, this nonlinearity deals effectively with bursts of increased speech rate that might be embedded within the speech segment.
  • Said non-linear stretching transfer function in the preferred implementation is determined as a logarithmic function mapping the average intervowel distance Tavg to the target intervowel distance Ttarget.
  • An additional possible enhancement of the stretching curve is its adaptation to current speech dynamics through temporal group analysis of consecutive detected intervowel distances. This helps further targeting bursts of fast speech embedded in periods of slow speech. It also potentially discovers insertion errors made by the vowel detection algorithm.
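The exact shape of the curve is not reproduced in this text; as an illustration only, one possible log-shaped mapping with the stated properties (Tavg maps exactly to Ttarget, shorter intervowel distances are stretched relatively more) could be sketched as follows.

```python
import numpy as np

def stretching_curve(t_in, t_avg, t_target):
    """One illustrative log-shaped stretching transfer function.

    Maps an input intervowel distance t_in to an output target t_out such that
    t_out == t_target when t_in == t_avg, while shorter intervowel distances are
    stretched by a larger factor than longer ones. This is only a function with
    the same qualitative behaviour, not the curve of figure 7.
    """
    t_out = t_target * np.log1p(t_in / t_avg) / np.log(2.0)
    return np.maximum(t_out, t_in)   # never make a segment shorter than its input
```

For example, with t_avg = 0.15 s and t_target = 0.25 s, this mapping stretches a 0.10 s distance to about 0.18 s (a factor of roughly 1.8) but a 0.30 s distance only to about 0.40 s (a factor of roughly 1.3).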
  • the speech segment is split into intervowel segments based on the determined vowel locations, and it is processed one intervowel segment at a time in step (b).3.
  • the structure of a first embodiment of a Process Intervowel Segment block performing said step (b).3 is shown in figure 8a.
  • a target duration of the segment is calculated using the stretching curve, from which the number of audio samples to be generated for the whole segment is derived.
  • Frames of the audio segment are stored for processing in a Work Buffer by the Create Work Buffer block.
  • a target number of samples to be generated for the current audio frame in the Work Buffer is calculated by the Update Number of Frame Samples to Generate block.
  • This block updates the target number of samples to generate for the current frame, with the number of samples that were targeted but not achieved in the stretching of the previous frame. This number is input into the Stretch Audio block together with a calculated pitch period (see below).
  • said stretching process advantageously comprises for a frame:
  • the pitch period is determined by performing an iterative procedure as illustrated by the example of figures 9a-9b .
  • an autocorrelation of the speech signal in the frame of audio is calculated.
  • figure 9a shows the first iteration and figure 9b the second iteration of peak detection.
  • the pitch period is then calculated as the distance between the central peak of the autocorrelation and the lag of said selected peak.
  • if the determined pitch period is outside set bounds, the result is discarded and the previously determined pitch period is used.
  • the bounds may be set to 3 and 20 ms, which correspond to a pitch range between 50 and 333 Hz.
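A simplified single-pass sketch of this estimate follows (the iterative peak refinement of figures 9a-9b is not reproduced; only the 3-20 ms bounds mentioned above are kept, and all names are illustrative):

```python
import numpy as np

def estimate_pitch_period(frame, fs, prev_period, t_min=0.003, t_max=0.020):
    """Estimate the pitch period of one audio frame, in samples, by autocorrelation.

    Falls back to the previously determined period when no usable peak is found
    within the 3-20 ms lag bounds (roughly 50-333 Hz).
    """
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # lags >= 0
    lo, hi = int(t_min * fs), int(t_max * fs)
    if hi >= len(ac):
        return prev_period            # frame too short for the search range
    lag = lo + int(np.argmax(ac[lo:hi]))
    if ac[lag] <= 0.0:                # no meaningful periodicity in the search range
        return prev_period            # keep the previously determined pitch period
    return lag
```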
  • the Stretch Audio block stretches the current frame.
  • the Stretch Audio block's internal structure is shown in figure 10 .
  • the audio frame stored in the Work Buffer is preferably analyzed pitch period by pitch period, i.e. one pitch period long portions of the frame are considered.
  • the beginning pitch period portion is transferred to the First Pitch Period Buffer.
  • the next one is appended to it, and both are transferred to the Two Pitch Periods Buffer.
  • An example content of the Work Buffer and the Two Pitch Periods Buffer is shown in figure 11a.
  • This block preferably stretches the audio frame in the Work Buffer in a two iteration process that assures maximum stretching quality.
  • in the first iteration, the audio frames containing silence are stretched. Stretching silence periods gives almost no noticeable processing artifacts.
  • in the second iteration, frames with pronounced periodicity are stretched, as they are favorable for processing with the Pitch Synchronized Overlap Add (PSOLA) based algorithm.
  • PSOLA: Pitch Synchronized Overlap Add
  • the two pitch period portion is identified as silence or speech. Then its autocorrelation is calculated, also shown in figure 11a. If it is speech, the lag of the maximum autocorrelation peak above the minimum pitch period lag is determined. If it is silence, then the maximum autocorrelation peak above the current pitch lag is determined. This information is used to evaluate the suitability of the data for stretching. The data will be advantageously stretched only if the target number of samples to be generated has not been reached and either of the following two conditions is satisfied:
  • the Linear Cross Fade block stretches the audio signal, as explained, by generating a linear cross-fade between the two pitch period portion and its copy shifted by the calculated overlap, as illustrated in figure 11b. This process results in a segment of three pitch periods instead of two.
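A bare-bones sketch of this cross-fade step (array layout and names are assumptions; the shift would come from the autocorrelation analysis above):

```python
import numpy as np

def crossfade_extend(two_periods, shift):
    """Cross-fade a two-pitch-period portion with a copy of itself delayed by
    `shift` samples, so that roughly three pitch periods come out of two."""
    n = len(two_periods)
    out = np.zeros(n + shift)
    fade_out = np.linspace(1.0, 0.0, n - shift)                    # linear cross-fade ramp
    out[:shift] = two_periods[:shift]                              # unmodified head
    out[shift:n] = (two_periods[shift:] * fade_out                 # original fading out...
                    + two_periods[:n - shift] * (1.0 - fade_out))  # ...delayed copy fading in
    out[n:] = two_periods[n - shift:]                              # unmodified tail of the copy
    return out
```

With a shift of roughly one pitch period, two pitch periods on input yield three on output, the middle one being the artificially generated, cross-faded period.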
  • the first two pitch periods are split from the third one and output to the Processed Work Buffer.
  • the third leftover pitch period is transferred back to the First Pitch Period Buffer where a new pitch period will be appended to it.
  • the two pitch periods are also split: the first one is output through the Generate Output block, and the second one is again returned to the First Pitch Period Buffer. If not all pitch periods from the Work Buffer have been processed, the algorithm repeats.
  • if the Stretch Audio block is run an additional time by the Process Intervowel Segment block, which is the case when the target number of additional audio samples to be generated was not reached, the Stretch Audio block's internal structure simplifies to the one shown in figure 12. All three pitch periods are extracted for output. This assures that successions of artificially generated pitch periods are kept at minimum length. Namely, if the processed intervowel segment is input to the Stretch Audio block in the Process Intervowel Segment block a second time, then a maximum of two generated consecutive pitch periods can occur in its output. If it is passed an additional third time, which is rarely needed, then this maximum grows to three consecutive pitch periods.
  • the processed output, i.e. the stretched audio frame, is passed to the Generate Output Audio/Video Frame block. This block accumulates the processed audio data, uses it to construct audio frames and combines these audio frames with the corresponding video frame received from the Create Work Buffer block. If there is more than one audio frame of accumulated data, multiple audio frames are created and combined with copies of the same video frame. In this way synchronization between the audio and video streams is maintained.
  • the Update Intervowel Number of Samples to Generate block calculates the new target number of samples to generate. If the contents of the Processed Intervowel Segment Buffer are the same as those of the input Intervowel Segment Buffer at the end of the processing, and additional samples need to be generated, the stretching criteria for pronounced periodicity evaluated in the Stretch Audio block are relaxed.
  • the structure of a Process Intervowel Segment block performing a second embodiment of said step (b).3 is shown in figure 8b.
  • step (b).3 comprises for an intervowel segment:
  • the current intervowel segment is also stretched in an iterative manner to the target duration as calculated by the stretching curve.
  • the segment is evaluated for stretching, i.e. divided into "elementary segments" (sub-parts of the intervowel segment) such that all of the elementary segments that are favorable for stretching are identified.
  • This can be done using the amplitude and estimated periodicity of the speech signal in each frame of the intervowel segment. Frames are extracted using a sliding window.
  • the elementary segments are specifically classified between those which:
  • the portions favorable for stretching are found by thresholding the amplitude and estimated periodicity for each frame.
  • two thresholds are used for the normalized amplitude of the autocorrelation peak located in the pitch region, such that a high-amplitude pitch-related peak identifies periodic segments favorable for stretching, and a low-amplitude peak identifies aperiodic segments favorable for stretching.
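As a sketch only, such a per-frame classification could be written as follows; the two threshold values and the lag bounds are placeholders, not the thresholds of the description.

```python
import numpy as np

def classify_frame(frame, fs, low=0.35, high=0.7, min_lag_s=0.003, max_lag_s=0.020):
    """Classify one frame as 'periodic', 'aperiodic' or 'non_stretchable' from the
    normalized autocorrelation peak located in the pitch lag region."""
    frame = frame - np.mean(frame)
    energy = np.dot(frame, frame)
    if energy == 0.0:
        return "aperiodic"                        # silence is cheap to stretch
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:] / energy
    lo = int(min_lag_s * fs)
    hi = min(int(max_lag_s * fs), len(ac) - 1)
    if lo >= hi:
        return "non_stretchable"                  # frame too short to analyse
    peak = float(np.max(ac[lo:hi]))               # normalized pitch-region peak
    if peak >= high:
        return "periodic"                         # pronounced periodicity, favorable
    if peak <= low:
        return "aperiodic"                        # pronounced aperiodicity, favorable
    return "non_stretchable"                      # in between: left untouched
```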
  • this block also calculates the pitch period for each frame of the signal, in particular by using the method described for the first embodiment (see figures 9a-9b ).
  • the algorithm goes through the contents of the intervowel segment buffer that was segmented into stretchable and not stretchable parts. It loads these elementary segments one by one in the Extracted Segment Buffer and processes them accordingly. If the elementary segment has pronounced periodicity (periodic segment) it is forwarded to the process periodic segment block, if it has pronounced aperiodicity (aperiodic segment, including silences as explained), then it is forwarded to the process aperiodic segment block. Finally, if the elementary segment is not favorable for stretching (non-stretchable segment) it is forwarded directly to the Processed Intervowel Segment Buffer, where the data from the process periodic segment and process aperiodic segment is also output.
  • the Process Periodic Segment block's internal structure is shown in figure 13a .
  • the periodic segment processing is quite similar to the audio stretching of the first embodiment.
  • said periodic segment stretching process advantageously comprises for a periodic segment:
  • the first pitch period from the periodic audio segment stored in the Extracted Segment Buffer is output to the Processed Intervowel Buffer, based on the pitch found for that elementary segment.
  • the target number of samples to generate for this pitch period is calculated based on the target number of samples to be generated per input sample.
  • the last pitch period from the Processed Intervowel Segment Buffer is taken and the next pitch period from the Extracted Segment Buffer is appended to it. Both pitch periods are processed by the Generate New Pitch Period Block.
  • the data will be advantageously stretched only if the target number of samples to be generated has not been reached, and if the lag and amplitude of the autocorrelation peak are within the set range, as illustrated with the red rectangle in figure 11a. This assures pronounced periodicity in the speech signal.
  • the Generate New Pitch Period Block stretches the audio signal by generating a linear cross-fade between the two pitch period portion and its copy shifted by the calculated overlap, as illustrated in figure 11b. This process results in a segment of three pitch periods instead of two.
  • the newly generated pitch period is forwarded to the Processed Intervowel Segment Buffer where it is concatenated to the first pitch period, which was forwarded there previously.
  • if the two pitch period portion is not stretched, nothing is forwarded to the Processed Intervowel Segment Buffer.
  • the target number of samples to generate is updated with the samples generated, i.e. if a new pitch period was generated then its length is subtracted from this target.
  • if there is an excess of generated samples, a suitable amount of input samples is extracted and forwarded directly to the Processed Intervowel Segment Buffer.
  • by "suitable" it is meant in this case the number of input samples that would require a number of target samples to be generated equal to the ones already generated in excess.
  • the periodic segment processing comprises:
  • the number of samples accumulated in the Processed Intervowel Segment Buffer is evaluated, and if it has reached a length that corresponds to the video frame rate, then a video frame corresponding to the second pitch period of the two used in the Generate New Pitch Period Block is taken from the Extracted Segment Buffer and added to the audio stream in the Processed Intervowel Segment Buffer.
  • the next pitch period is extracted from the Extracted Segment Buffer to the Processed Intervowel Segment Buffer.
  • the "suitable" number is preferably obtained when the pitch period is reduced by the number of samples that still have to be generated. This allows the algorithm to effectively catch up on the target number of samples to generate the next time it generates a new pitch period.
  • the target number of samples to generate is then updated for the input samples added to the Processed Intervowel Segment Buffer. If the end of the Extracted Segment Buffer has been reached, the Process Periodic Segment algorithm terminates.
  • Two examples cover the cases when there is an excess of generated samples, and when there is an insufficiency of generated samples, in order to clarify the algorithm.
  • step (2°) inserts a shift in the pitch period portions which adds a randomization factor to the stretching that effectively increases the output quality.
  • Example 2: Insufficient stretching (figure 14b).
  • This time, assuming 1.75 new samples have to be generated per input sample and the pitch period is again 200 samples, for the first pitch period in the Processed Intervowel Segment Buffer 200 * 1.75 = 350 new samples have to be generated (1°).
  • after generating a new pitch period of 200 samples, 150 samples still have to be generated; the processing unit 11 therefore shortens the next pitch period by 150 samples, i.e. extracts only the first 50 samples from it (2°).
  • the target number of samples to generate will now equal 150 + 50 * 1.75 ≈ 238.
  • a whole pitch period from the Processed Intervowel Segment Buffer is now taken, which will comprise the 50 samples extracted and forwarded from the Extracted Segment Buffer preceded by 150 samples from the pitch period previously generated.
  • the next pitch period from the Extracted Segment Buffer is then taken and a new pitch period is generated and forwarded to the Processed Intervowel Segment Buffer.
  • the Process Aperiodic Segment Block works in a similar fashion to the Process Periodic Segment Block, except that instead of working with pitch periods it works with intervals of the speech signal extending between a predetermined number N+1 of zerocrossings with a positive slope (N+1 being for example chosen between ten and fifty, advantageously around thirty), referred to as an 'N-interzerocrossings interval'. More precisely, each said N-interzerocrossings interval is to be understood as the union of N consecutive sub-intervals each extending between two consecutive zerocrossings, i.e. the union of N '1-interzerocrossings intervals', the latter being referred to simply as an 'interzerocrossing' interval.
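A short sketch of locating the positive-slope zerocrossings and grouping them into such N-interzerocrossings intervals (n = 30 sub-intervals here, in line with the "around thirty" indication; the function name is illustrative):

```python
import numpy as np

def n_interzerocrossings_intervals(signal, n=30):
    """Split a signal into intervals bounded by N+1 consecutive positive-slope
    zerocrossings, i.e. each interval is the union of N interzerocrossing sub-intervals."""
    # sample indices where the signal crosses zero going upwards
    zc = np.where((signal[:-1] < 0) & (signal[1:] >= 0))[0] + 1
    intervals = []
    for start in range(0, len(zc) - n, n):
        intervals.append(signal[zc[start]:zc[start + n]])
    return intervals
```

Samples before the first crossing and any trailing leftover would be handled separately, as in the initialization step described below.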
  • said aperiodic segment stretching process advantageously comprises for an aperiodic segment:
  • the speech signal in the Extracted Segment Buffer is analyzed so that all of the zerocrossings with a positive slope are located in it.
  • the algorithm is then initialized by copying the first N-interzerocrossings interval, and any preceding samples, into the Processed Intervowel Segment Buffer.
  • the target number of samples is calculated based on the length of this interval and the target number of samples to generate per input sample.
  • the next N-interzerocrossings interval is then extracted from the Extracted Segment Buffer and forwarded to the Generate new M samples Block, which checks whether the interval is truly aperiodic, by checking its length, in particular by verifying that the length of the signal is below a threshold, which is preferably an upper threshold.
  • a lower threshold is advantageously also set in order to: 1) guarantee high quality in the aperiodic stretching process, which gives worse results with shorter segments, and 2) prevent processing in the case when the aperiodic segment for stretching represents a short plosive segment.
  • said next interval taken from the Extracted Segment Buffer is appended to the last interval forwarded to the Processed Intervowel Segment Buffer, so that the Generate new M samples Block, checks whether the whole resulting interval is truly aperiodic (in other words, as explained the checking of the aperiodicity may be for two N-interzerocrossings segments).
  • the target number of samples to generate is updated in a similar fashion to the Process Periodic Segment block. Again, if there is an excess of samples generated, a suitable amount of data is extracted from the Extracted Segment Buffer to the Processed Intervowel Segment Buffer. The difference here is that only whole interzerocrossings intervals are copied in this process. This means that the exact number of samples will almost never be copied, so as few interzerocrossings intervals as needed (i.e. a k-interzerocrossings interval with k ≤ N as small as possible) are copied to get a positive target number of samples for generation.
  • the aperiodic segment processing preferably comprises, for an interval extending between N consecutive zerocrossings with positive slope assessed as truly aperiodic:
  • the number of added samples in the Processed Intervowel Segment Buffer is evaluated for adding a video frame from the Extracted Segment Buffer. If it is sufficient, the video frame that corresponds to the last interzerocrossing interval is added.
  • the number of generated samples is evaluated in terms of the target number of samples for that segment.
  • the number of generated intervowel samples is checked against the target. If the target has been reached, then the contents of the Processed Intervowel Segment Buffer are forwarded to the Write to Output Buffers Block. If not, then the whole process is repeated with the contents of the Processed Intervowel Segment Buffer forwarded to replace the contents of the Intervowel Segment Buffer. The number of samples to be generated at this iteration is updated accordingly. If the target number of samples has not been reached in two iterations of stretching, then the selection criteria used in the recognition of segments favorable for stretching, and the criteria used to check for periodicity and aperiodicity in the stretching block, are relaxed. This produces more and larger segments for stretching, albeit reducing the quality of the stretching.
  • in a final step (c), the processing unit 11 generates an output media content comprising as output signal the input audio signal wherein, for each intervowel segment of each speech segment, the corresponding audio frames have been replaced by the updated audio frames.
  • step (c) also comprises generating an output video signal synchronized with the output audio signal by duplicating video frames when needed, as explained.
  • the Read from Output Buffers block controls what is sent to the encoder that generates the output data stream.
  • the internal structure of the Read from Output Buffers block is shown in figure 15 .
  • This block advantageously further gives the user three functionalities, which can be accessed through the user interface 13 (the User Control block):

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing For Digital Recording And Reproducing (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
EP16306550.1A 2016-11-24 2016-11-24 Method for slowing down speech in an input media content Withdrawn EP3327723A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP16306550.1A EP3327723A1 (fr) 2016-11-24 2016-11-24 Method for slowing down speech in an input media content
PCT/IL2017/051286 WO2018096541A1 (fr) 2016-11-24 2017-11-26 Method and system for slowing down speech in an input media content

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
EP16306550.1A EP3327723A1 (fr) 2016-11-24 2016-11-24 Method for slowing down speech in an input media content

Publications (1)

Publication Number Publication Date
EP3327723A1 true EP3327723A1 (fr) 2018-05-30

Family

ID=57485426

Family Applications (1)

Application Number Title Priority Date Filing Date
EP16306550.1A Withdrawn EP3327723A1 (fr) 2016-11-24 2016-11-24 Procédé pour freiner un discours dans un contenu multimédia entré

Country Status (2)

Country Link
EP (1) EP3327723A1 (fr)
WO (1) WO2018096541A1 (fr)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1997046999A1 (fr) * 1996-06-05 1997-12-11 Interval Research Corporation Non-uniform time scale modification of recorded audio signals
US6484137B1 (en) 1997-10-31 2002-11-19 Matsushita Electric Industrial Co., Ltd. Audio reproducing apparatus
US20040267524A1 (en) * 2003-06-27 2004-12-30 Motorola, Inc. Psychoacoustic method and system to impose a preferred talking rate through auditory feedback rate adjustment
US7065485B1 (en) * 2002-01-09 2006-06-20 At&T Corp Enhancing speech intelligibility using variable-rate time-scale modification
US7412379B2 (en) 2001-04-05 2008-08-12 Koninklijke Philips Electronics N.V. Time-scale modification of signals
US7853447B2 (en) 2006-12-08 2010-12-14 Micro-Star Int'l Co., Ltd. Method for varying speech speed
US20110004468A1 (en) * 2009-01-29 2011-01-06 Kazue Fusakawa Hearing aid and hearing-aid processing method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7337108B2 (en) * 2003-09-10 2008-02-26 Microsoft Corporation System and method for providing high-quality stretching and compression of a digital audio signal

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1997046999A1 (fr) * 1996-06-05 1997-12-11 Interval Research Corporation Non-uniform time scale modification of recorded audio signals
US6484137B1 (en) 1997-10-31 2002-11-19 Matsushita Electric Industrial Co., Ltd. Audio reproducing apparatus
US7412379B2 (en) 2001-04-05 2008-08-12 Koninklijke Philips Electronics N.V. Time-scale modification of signals
US7065485B1 (en) * 2002-01-09 2006-06-20 At&T Corp Enhancing speech intelligibility using variable-rate time-scale modification
US20040267524A1 (en) * 2003-06-27 2004-12-30 Motorola, Inc. Psychoacoustic method and system to impose a preferred talking rate through auditory feedback rate adjustment
US7853447B2 (en) 2006-12-08 2010-12-14 Micro-Star Int'l Co., Ltd. Method for varying speech speed
US20110004468A1 (en) * 2009-01-29 2011-01-06 Kazue Fusakawa Hearing aid and hearing-aid processing method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
GHAHREMANI, Pegah; BABA ALI, Bagher; POVEY, Daniel; RIEDHAMMER, Korbinian; TRMAL, Jan; KHUDANPUR, Sanjeev: "A pitch extraction algorithm tuned for automatic speech recognition", IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2014, pages 2494-2498

Also Published As

Publication number Publication date
WO2018096541A1 (fr) 2018-05-31

Similar Documents

Publication Publication Date Title
US20220180879A1 (en) Automatic conversion of speech into song, rap or other audible expression having target meter or rhythm
US5828994A (en) Non-uniform time scale modification of recorded audio
EP2388780A1 (fr) Apparatus and method for extending or compressing time sections of an audio signal
EP3190702B1 (fr) Volume leveler controller and control method
CN104079247B (zh) Equalizer controller and control method, and audio reproduction device
EP2979267B1 (fr) Apparatuses and methods for classifying and processing audio elements
CN111052232A (zh) Method and system for enhancing the speech signal of a human speaker in a video using visual information
CN104081453A (zh) System and method for acoustic transformation
Grofit et al. Time-scale modification of audio signals using enhanced WSOLA with management of transients
CN109616131B (zh) Digital real-time voice changing method
US20140019125A1 (en) Low band bandwidth extended
JP2015068897A (ja) Utterance evaluation method and apparatus, and computer program for evaluating utterances
Obin et al. On the generalization of Shannon entropy for speech recognition
KR101674597B1 (ko) Speech recognition system and method
EP3327723A1 (fr) Method for slowing down speech in an input media content
OʼShaughnessy Formant estimation and tracking
WO2004077381A1 (fr) Voice reproduction system
JP2002169579A (ja) Apparatus for embedding additional data in an audio signal and apparatus for reproducing additional data from an audio signal
KR101095867B1 (ko) Speech synthesis apparatus and method
US11302300B2 (en) Method and apparatus for forced duration in neural speech synthesis
KR100384898B1 (ko) Method for synchronizing audio/video using a speech rate control function
Jo et al. High-Quality and Low-Complexity Real-Time Voice Changing with Seamless Switching for Digital Imaging Devices
JP2007047313A (ja) Speech speed conversion device
Skrelin Allophone-and suballophone-based speech synthesis system for Russian
WO2016035022A2 (fr) Method and system for epoch-based modification of speech signals

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 20161124

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

AX Request for extension of the european patent

Extension state: BA ME

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION HAS BEEN WITHDRAWN

18W Application withdrawn

Effective date: 20181105