WO2009024442A2 - Method for synchronizing media data streams - Google Patents

Method for synchronizing media data streams

Info

Publication number
WO2009024442A2
WO2009024442A2 PCT/EP2008/060055 EP2008060055W
Authority
WO
WIPO (PCT)
Prior art keywords
synchronization
data
media
audio
data streams
Prior art date
Application number
PCT/EP2008/060055
Other languages
German (de)
English (en)
Other versions
WO2009024442A3 (fr)
Inventor
Jesus Fernando Guitarte Perez
Klaus Lukas
Original Assignee
Siemens Aktiengesellschaft
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Siemens Aktiengesellschaft filed Critical Siemens Aktiengesellschaft
Publication of WO2009024442A2 publication Critical patent/WO2009024442A2/fr
Publication of WO2009024442A3 publication Critical patent/WO2009024442A3/fr

Links

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/434Disassembling of a multiplex stream, e.g. demultiplexing audio and video streams, extraction of additional data from a video stream; Remultiplexing of multiplex streams; Extraction or processing of SI; Disassembling of packetised elementary stream
    • H04N21/4341Demultiplexing of audio and video streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/4302Content synchronisation processes, e.g. decoder synchronisation
    • H04N21/4307Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen
    • H04N21/43072Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen of multiple content streams on the same device
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/24Speech recognition using non-acoustical features
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11BINFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/10Indexing; Addressing; Timing or synchronising; Measuring tape travel
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11BINFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/10Indexing; Addressing; Timing or synchronising; Measuring tape travel
    • G11B27/19Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier
    • G11B27/28Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier by using information signals recorded by the same method as the main recording
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/236Assembling of a multiplex stream, e.g. transport stream, by combining a video stream with other content or additional data, e.g. inserting a URL [Uniform Resource Locator] into a video stream, multiplexing software data into a video stream; Remultiplexing of multiplex streams; Insertion of stuffing bits into the multiplex stream, e.g. to obtain a constant bit-rate; Assembling of a packetised elementary stream
    • H04N21/2368Multiplexing of audio and video streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439Processing of audio elementary streams
    • H04N21/4394Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10Transforming into visible information
    • G10L2021/105Synthesis of the lips movements from speech, e.g. for talking heads

Definitions

  • the present invention relates to a method and apparatus for synchronizing media data streams, such as multimedia audiovisual data.
  • audio and video data are transmitted via frame-based data streams.
  • Synchronization information which is indicative of a respective elapsed time in the audio or video data stream may be used.
  • The artificial synchronization information embedded in the data streams, such as specific time or synchronization flags carrying a respective time identification, is then matched during the playback of the multimedia content.
  • FIG. 1 schematically shows, for example, an audio data stream AU and a video data stream VI.
  • synchronization identifiers M1-M5 are embedded in the corresponding data frames which identify predefined points in time.
  • sync flags N1-N5 are provided in the video data stream VI, which correspond to the same times as the flags M1-M5.
  • During transmission, the data streams AU and VI may be subject to different delays.
  • The synchronization markers M1-M5, N1-N5, measured against a linear time axis, can therefore also be distorted. This is illustrated in FIG. 1 by the irregular spacing between the markers M1-M5 and N1-N5.
  • In the combined multimedia data stream AV, however, these times must coincide, since, for example, the lip movement shown at synchronization time N2 must match the audio signal at synchronization time M2.
  • This mapping to a common synchronous time MNR is represented by the arrows M.
  • When reproducing, or decoding and merging, the two data streams AU and VI, the corresponding data contents must thus be synchronized with one another such that the times of the markers M1 and N1, M2 and N2, M3 and N3, M4 and N4, M5 and N5 fall on common instants during playback. This is indicated in FIG. 1 by the markings MN0-MN5 of the combined multimedia stream AV.
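As an illustration of this conventional flag-based approach, the matching of marker pairs onto common playback instants can be sketched as follows; the marker timestamps and the "delay the earlier stream" policy are assumptions made for illustration, not values from the application.

```python
# Sketch of the prior-art flag-based alignment: each audio marker Mi and video
# marker Ni nominally denote the same instant; playback maps both onto a common
# time MNi by delaying whichever stream arrives first (timestamps are made up).
audio_marks = {"M1": 0.40, "M2": 1.10, "M3": 2.05}   # arrival times in seconds
video_marks = {"N1": 0.55, "N2": 1.30, "N3": 2.20}

for m, n in zip(sorted(audio_marks), sorted(video_marks)):
    t_a, t_v = audio_marks[m], video_marks[n]
    common = max(t_a, t_v)                            # common playback instant MNi
    print(f"{m}/{n}: delay audio {common - t_a:.2f}s, delay video {common - t_v:.2f}s")
```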
  • US 2007/0153089 A1 proposes methods for the synchronization between audio and video data using lip and tooth characteristics. Accordingly, the video frames are analyzed for facial images and visemes are recognized depending on the mouth shape. In parallel and independently of this, an audio analysis of the audio data stream is carried out by a statistical evaluation of determined Fourier transformation data.
  • The proposed approach has the disadvantage that both the audio analysis and the video analysis require a high computational effort.
  • A method for synchronizing media data streams is proposed, wherein the media data streams each carry media data contents of a given media class.
  • A temporal synchronization of the data streams takes place as a function of the media data contents.
  • The media data contents are continuously monitored, and predetermined data contents are recorded as synchronization points in the media data streams.
  • monitoring of a second media data stream takes place as a function of a detected synchronization point of a first media data stream.
  • The invention thus provides for synchronization on the basis of the data contents themselves, that is to say, for example, audio data in the form of speech or video data in the form of image sequences, for example of faces.
  • Data streams are understood as a continuous sequence of data records whose end cannot be predicted in advance.
  • the individual data sets or frames within a data stream are of a fixed predetermined type, such as data frames having audio data.
  • The number of data sets or frames per unit of time can vary, so that the data rates of different data streams may differ.
  • audio or video data each form its own media class. However, other media classes are conceivable which can be adapted to the respective application.
  • Audio data may include phonemes and video data may include visemes as data sub-contents.
  • A phoneme is understood to be the smallest meaning-distinguishing, but not itself meaningful, unit of a language. Methods for phoneme recognition are known, in particular from speech recognition.
  • The term "viseme" refers to the smallest units of mouth or lip movements to which a meaning can be assigned. For example, the phoneme /o/ can be assigned a viseme which designates the associated open mouth position.
  • In the case of a media data stream of audio data, it is provided to monitor specified phonemes or phoneme combinations as synchronization points.
  • A media data stream of video data, or its video data content, can then be monitored in order to recognize given visemes as synchronization points.
  • a synchronization or an adjustment of the two data streams can take place.
  • Preferably, a synchronization table is created which associates media data sub-contents of a first media class, such as phonemes, with media data sub-contents of a second media class, such as visemes.
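Such a synchronization table can be pictured as a simple phoneme-to-viseme lookup. The entries and names below are illustrative placeholders only and do not reproduce the actual table MP of FIG. 6.

```python
# Hypothetical phoneme-to-viseme synchronization table (placeholder entries).
# Bilabial plosives map cleanly onto a "closed-then-released lips" viseme,
# which is what makes them convenient synchronization points.
SYNC_TABLE = {
    "p": "lips_closed_release",
    "b": "lips_closed_release",
    "m": "lips_closed",
    "o": "lips_rounded_open",
}

def candidate_visemes(phoneme: str) -> list[str]:
    """Return the viseme(s) associated with a recognized phoneme, if any."""
    viseme = SYNC_TABLE.get(phoneme)
    return [viseme] if viseme is not None else []

print(candidate_visemes("p"))   # ['lips_closed_release']
```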
  • the method for synchronizing media data streams preferably has one or more of the following method steps: Buffering a first and at least a second media data stream;
  • bilabial phonemes and / or plosive phonemes are possible as synchronization points in the case of audio data contents.
  • the phoneme recognition and determination of the respective synchronization point can be supported, for example, by continuously monitoring an audio energy of the audio contents.
  • Bilabial visemes are likewise suitable as synchronization points in a video data stream.
  • Preferably, the video content is continuously monitored for a video energy, or for a parameter which, as a visual energy, characterizes a particularly rapid lip movement.
  • The respective energies of the audio or video data, or parameters extractable from the video and audio data, can be derived, for example, by determining coefficients of trigonometric transformations, such as discrete cosine transformations.
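The continuous monitoring of the audio energy mentioned above can be approximated by a short-time energy, i.e. the sum of squared sample amplitudes over a sliding window. The sketch below assumes NumPy, 16 kHz mono audio and arbitrarily chosen window sizes.

```python
import numpy as np

def short_time_energy(samples: np.ndarray, win: int = 400, hop: int = 160) -> np.ndarray:
    """Short-time audio energy: sum of squared amplitudes over a sliding window
    (400/160 samples correspond to 25 ms frames with a 10 ms hop at 16 kHz)."""
    frames = []
    for start in range(0, len(samples) - win + 1, hop):
        frame = samples[start:start + win].astype(np.float64)
        frames.append(float(np.sum(frame * frame)))
    return np.asarray(frames)

# Synthetic example: silence followed by a burst produces a clear energy step.
signal = np.concatenate([np.zeros(8000), 0.5 * np.random.randn(8000)])
energy = short_time_energy(signal)
print("first high-energy frame:", int(np.argmax(energy > 1.0)))
```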
  • the lip images are preferably monitored and recorded.
  • A variant of the method provides for comparing the recorded lip images with a representation in a predefined basis of lip modes. For example, it is possible to determine geometric lines as lip modes which, in a linear combination, represent a predetermined lip shape.
  • A natural delay between a detected phoneme and a detected viseme is preferably taken into account. For example, for certain sounds the physiological speech apparatus must prepare for the output of a particular phoneme, for example by temporarily closing the mouth. In this case, an associated viseme occurs before the actual utterance.
  • The invention further relates to a synchronization device according to claim 23.
  • This synchronization device is designed such that a method for the synchronization of media data streams is carried out as described above.
  • The synchronization device preferably has a phoneme recognition unit, a viseme recognition unit and a synchronization unit.
  • Preferred fields of application of corresponding synchronization devices are, for example, receiving devices for multimedia data streams.
  • a video conferencing system or even a mobile telephone as well as multimedia data reception equipment can be equipped with a corresponding synchronization device.
  • The advantage in this case is, in particular, that the receiving device is functional without a dedicated transmitting device that equips the transmitted data streams with synchronization flags in a standardized form.
  • A particularly preferred implementation of the method, or of a synchronization device, is as an embedded system, which can be provided in a variety of mobile communication devices.
  • the invention relates to a computer program product, which causes the implementation of a corresponding method for synchronizing media data streams on a program-controlled computer device.
  • A program-controlled computer device is, for example, a PC equipped with appropriate software for video conferencing or for the reception of audio and video data.
  • the computer program product can be implemented, for example, in the form of a data carrier, such as a USB stick, floppy disk, CDROM, DVD, or else implemented on a server device as a downloadable program file.
  • FIG. 1 shows examples of media data streams with synchronization flags
  • FIG. 2 shows examples of data streams with media data contents without synchronization flags
  • FIG. 3 shows a flow chart of a variant of the method for synchronizing media data streams
  • FIG. 4 is a block diagram of one embodiment of a synchronization device
  • Fig. 5 is an overview of viseme modes
  • Fig. 6 is a table of phoneme-to-viseme assignments.
  • FIG. 2 schematically shows data streams of audio data AU and video data VI.
  • The data streams conventionally have data frames F1-F10 and G1-G7, respectively.
  • The data frames do not always have to have the same length and can also be delayed differently during the transmission of the respective data streams AU, VI. This is illustrated by way of example by the data frame G6, which is shorter in time than the surrounding data frames G5 and G7.
  • The respective data frames comprise data contents; for example, the data frame F3 carries an audio content A1 and the data frame F4 an audio content A2.
  • the video data stream VI carries, for example, video data contents V1 and V2 in the data frames G2 and G3.
  • The data streams AU and VI to be received and processed have no synchronization flags or time information. Rather, the audio data stream AU merely carries correspondingly coded audio data A1, A2, such as digitally coded speech signals.
  • the video data stream VI comprises data frames G1-G7 with video contents V1, V2, which comprise, for example, a sequence of coded individual images of a scene, in particular with the representation of faces and lip movements.
  • FIG. 3 shows, as a schematic flow chart, an example of a synchronization of audio and video data, the audio data comprising coded voice signals and the video data comprising corresponding scenes with lip movements.
  • the time axis runs vertically downwards.
  • The audio and video data, or the corresponding frames F1-F10 and G1-G7 as indicated in FIG. 2, can be transmitted via a suitable transmission method, for example to a receiving or display device.
  • In step S0, corresponding data streams containing both video and audio data are thus received, for example via the Internet.
  • In step S1, a splitting into audio data and video data takes place, so that the audio data stream AU and video data stream VI shown in FIG. 2 are present.
  • In step S2, a phoneme analysis is carried out, which can be performed, for example, according to known methods of speech recognition.
  • A viseme analysis is performed in step S22.
  • Various methods are also conceivable in viseme analysis, some of which are outlined below as examples.
  • Bilabial plosive sounds such as /p/ are detected in the data content A2, and their time of occurrence with respect to the local time t, with which, for example, the receiving or display device operates, is registered.
  • The viseme analysis recognizes, for example, whether visemes that can be assigned to particular phonemes are present, such as a quick rounded lip opening when the sound /p/ is uttered. This is also recorded in time.
  • Since both phonemes and visemes can in principle span a longer period of time, a specific point in time can be marked within the recognized synchronization point (phoneme or viseme) which must be rendered simultaneously when both data stream contents are reproduced. It is conceivable, for example, to check the audio and/or video signal with regard to a particular energy. For plosive sounds, there is a short, rapid increase in audio signal energy that can be used as the synchronization time. Analogously, a corresponding video signal can be assigned a "visual energy" which indicates a particularly rapid lip movement, such as the explosive opening of the mouth in the case of bilabial plosives.
  • Corresponding viseme analyses are based on a graphical evaluation of the video content: for example, a mouth or facial region is first recognized via pattern recognition, the occurring lip movements in particular are then monitored and, as in speech recognition, a corresponding pattern recognition is carried out.
  • In steps S2 and S22 it is thus determined whether synchronization points, such as bilabial plosive sounds, occur in the audio data stream AU and whether matching visemes are present in the temporal vicinity, i.e. in a surrounding time window, in the video data stream VI.
  • In steps S3 and S33, these monitoring results from the phoneme analysis and the viseme analysis are compared.
  • If a time offset is detected in the temporal sequence of phonemes and visemes that belong together and that, with correct synchronization, must occur at a common synchronization time, an adjustment can now take place in step S4, so that both events, namely the audio reproduction of the bilabial plosive sound and the display of the matching video content with the corresponding facial or lip movement, occur simultaneously.
  • Fig. 4 shows a block diagram of a possible implementation of the synchronization method in a multimedia data receiving device.
  • the receiving device 1 has a synchronization device 2 and a display device 3.
  • The synchronization device 2 is supplied, via an input 13, with a multimedia signal comprising media data streams AV.
  • A synchronized media data stream, for example with audiovisual data, can then be tapped off.
  • The multimedia data AVS can be, for example, synchronized audio and video data streams; in Fig. 4 this is shown only by a simple arrow.
  • The synchronized audiovisual data AVS are then reproduced by a display device 3, for example a screen device 10 which is equipped with a loudspeaker 11 and a display 12. Both the synchronization device 2 and the display device can be implemented in a computer system.
  • the synchronization device 2 has a splitting unit 4, which generates an audio data stream AU and a video data stream VI from the mixed media data stream AV.
  • The data streams AU, VI can, for example, take the form shown in FIG. 2. In terms of the method steps of FIG. 3, the reception of the data streams and the splitting into audio and video data streams according to steps S0 and S1 thus take place in the splitting unit 4.
  • a phoneme recognition is performed on the audio data stream AU via a phoneme recognition unit 5.
  • Viseme recognition is performed on the video data stream VI via a viseme recognition unit 6.
  • Parameters of the audio and video signals are determined, such as MFCCs (Mel frequency cepstrum coefficients).
  • MFCCs are often used in automatic speech recognition because they result in a compact representation of the spectrum.
  • the phonemes are then recognized.
  • Features or parameters of the video signal are determined and used in the viseme recognition.
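MFCC features of the kind mentioned for the phoneme recognition unit are available in common audio libraries. The sketch below assumes the librosa package and a 16 kHz mono signal; it is only one possible front end, not the specific implementation of the application.

```python
import numpy as np
import librosa  # assumed to be installed; any MFCC implementation would do

def audio_features(samples: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Compute MFCCs (Mel frequency cepstrum coefficients), a compact
    frame-wise representation of the audio spectrum used for phoneme recognition."""
    mfcc = librosa.feature.mfcc(y=samples.astype(np.float32), sr=sr, n_mfcc=13)
    return mfcc.T  # shape: (number of frames, 13)

features = audio_features(np.random.randn(16000))  # one second of dummy audio
print(features.shape)
```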
  • A phoneme analysis unit 7 processes the data AE supplied by the phoneme recognition unit 5, and a viseme analysis unit 8 processes the data VE supplied by the viseme recognition unit 6.
  • The exact times required for synchronization are determined on the basis of the energy profile of the audio or video signals, or of variables derived therefrom, within the phonemes or visemes.
  • the processes denoted by steps S2 and S22 in FIG. 3 thus take place in the phoneme recognition and analysis unit 5, 7 or the viseme recognition and analysis unit 6, 8.
  • a synchronization unit 9 receives corresponding synchronization data SDA for the audio portion and SDV for the video portion of the audiovisual data AV.
  • The synchronization unit 9 delays and matches the differently detected synchronization points, which are defined by the recognized visemes or phonemes, and outputs synchronized data streams AVS.
  • The elements designated in FIG. 4 as splitting unit 4, phoneme recognition unit 5, viseme recognition unit 6, phoneme analysis unit 7, viseme analysis unit 8 and synchronization unit 9 can be implemented, for example, in the form of corresponding computer program applications. In this case, an implementation of the method shown by way of example in FIG. 3 is realized.
  • the synchronization described above takes advantage of the dependencies between the audio and video information in a language process. For example, lip-reading methods and procedures may use the correlation between lip movements and simultaneous audio information to achieve improved recognition rates in speech recognition. For example, there are some phonemes that are relatively easily detectable in both audio and video perception. This is the case in particular with the so-called bilabial plosive phonemes.
  • These are, for example, the phonemes /p/ and /b/.
  • When such a sound is uttered, the energy of the audio signal rises sharply within a very short period of time; that is, there is a rapid transition from a low energy level to a higher energy level of the audio signal.
  • The audio energy is a function of an integral over the square of the audio amplitude, the integration being carried out over a moving time window.
  • Plosive phonemes are thus essentially characterized by an initial period of relatively low audio energy followed by a rapid increase in energy. Physiologically, the air flow is first stopped and then the accumulated air is released explosively. Hence the term plosive, or explosive, sounds.
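The plosive pattern just described, a stretch of low audio energy followed by a steep rise, can be turned into a simple onset detector over the short-time energy. The thresholds below are arbitrary assumptions for illustration.

```python
import numpy as np

def plosive_onsets(energy: np.ndarray, low: float = 0.05, ratio: float = 5.0) -> list[int]:
    """Return frame indices where a quiet frame is followed by a sharp energy
    jump, the pattern characteristic of plosive (explosive) sounds.
    'low' and 'ratio' are illustrative thresholds, not values from the application."""
    onsets = []
    for i in range(1, len(energy)):
        if energy[i - 1] < low and energy[i] > ratio * max(energy[i - 1], 1e-9):
            onsets.append(i)
    return onsets

e = np.array([0.01, 0.02, 0.01, 0.90, 0.80, 0.02, 0.01, 0.70])
print(plosive_onsets(e))   # [3, 7]
```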
  • Visually, a corresponding plosive sound can be recognized by the fact that the speaker first closes the mouth, so that the lips lie on each other, and then opens it suddenly with the pent-up air flow. This change of the lip state, or of the lip arrangement in the face of a speaker, can be associated with a corresponding visual energy.
  • FIG. 5 shows representations of five viseme modes MO1, MO2, MO3, MO4 and MO5.
  • The rhombuses shown in the various diagrams correspond to marked points on the lips of a speaker. These can be defined in a standardized manner, for example.
  • The middle column shows the five viseme modes MO1-MO5 in an averaged, normalized form.
  • Using an Active Shape Model (ASM), a lip movement or a lip pattern can be represented in terms of these basic modes MO1-MO5.
  • The left and right columns show the basic modes MO1-MO5 deviated by ±3 standard deviations.
  • The respective linear coefficients of the ASM modes are continuously measured and considered as a measure of a viseme energy or visual energy.
  • The presence of a particular viseme in the data stream can thus be recognized by a specific combination of linear coefficients occurring.
  • In this case, for example, the first linear coefficient, for the mode MO1, increases suddenly and rapidly, which is detected as a synchronization point.
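The idea of representing an observed lip shape as a linear combination of basis modes and watching the leading coefficient can be sketched as a least-squares projection. The basis below is random and merely stands in for the modes MO1-MO5 of a trained Active Shape Model; the jump threshold is likewise an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
MEAN_SHAPE = rng.normal(size=40)        # 20 lip landmarks, (x, y) flattened
MODES = rng.normal(size=(5, 40))        # placeholder rows standing in for MO1..MO5

def mode_coefficients(lip_points: np.ndarray) -> np.ndarray:
    """Project an observed lip shape onto the mode basis (least-squares fit)."""
    coeffs, *_ = np.linalg.lstsq(MODES.T, lip_points - MEAN_SHAPE, rcond=None)
    return coeffs

def is_sync_point(coeff_history: list, jump: float = 2.0) -> bool:
    """Flag a synchronization point when the first mode coefficient jumps,
    e.g. at the sudden lip opening of a bilabial plosive (threshold assumed)."""
    return len(coeff_history) >= 2 and abs(coeff_history[-1][0] - coeff_history[-2][0]) > jump

frame = MEAN_SHAPE + 3.0 * MODES[0]     # synthetic frame dominated by the first mode
print(round(float(mode_coefficients(frame)[0]), 2))   # ~3.0
```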
  • Eye and mouth positions identify the central region. For this purpose, for example, a corresponding grayscale image of a single frame of the video data stream can first be generated and a facial color classification carried out. Subsequently, a horizontal filtering of the grayscale image can be performed, whereby contours become more easily recognizable. Known methods can be used to identify the lip or mouth region. Furthermore, based on the recognized lip shape, pattern recognition can then be carried out, for example on the basis of a hidden Markov model. The pattern recognition is similar to pattern recognition for speech recognition, but lip patterns are used for the corresponding models and identified by the recognition algorithm.
  • An alternative classification of lip movements or lip patterns can be performed by means of a discrete cosine transformation of the image region in which the lips of a person are visible. Similar to a spectral analysis of the audio signal, the corresponding Fourier coefficients or coefficients of the discrete cosine transformation (DCT coefficients) are examined in the context of pattern recognition, for example with a hidden Markov model or with methods that use neural networks.
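The DCT-based classification of the lip region can be sketched with SciPy's two-dimensional discrete cosine transform. Keeping a small block of low-frequency coefficients as the feature vector, and squaring a single coefficient as a "visual energy", are assumptions made here for illustration.

```python
import numpy as np
from scipy.fft import dctn  # two-dimensional discrete cosine transform

def lip_dct_features(lip_roi: np.ndarray, k: int = 4) -> np.ndarray:
    """DCT of a grayscale lip region; the k x k low-frequency block serves as a
    compact feature vector for viseme pattern recognition."""
    coeffs = dctn(lip_roi.astype(np.float64), norm="ortho")
    return coeffs[:k, :k].ravel()

def visual_energy(lip_roi: np.ndarray) -> float:
    """One possible 'visual energy' measure: the square of a single DCT
    coefficient of the lip region (the choice of coefficient is assumed)."""
    coeffs = dctn(lip_roi.astype(np.float64), norm="ortho")
    return float(coeffs[0, 1] ** 2)

roi = np.random.rand(32, 32)            # dummy lip region
print(lip_dct_features(roi).shape)      # (16,)
```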
  • FIG. 6 shows a table MP in which phonemes and visemes which are correlated with one another are listed.
  • FIG. 6 shows the phonemes and assignable visemes for English pronunciation.
  • In the table MP there is not always a 1:1 assignment, as exists, for example, for the bilabial plosive sounds or for the sounds AH, AY.
  • It is possible to use the phoneme recognition, which can be carried out with limited computational effort, and in particular phonemes already recognized in the audio data stream, in order to restrict the search space for the viseme recognition.
  • FIG. 7 shows an example of the plosive sound /p/.
  • the audio signal ASG is displayed in arbitrary units over a period of 3.5 seconds.
  • At time T1, at about 2.2 s, the sound "Peh" is pronounced.
  • This audio signal shown in FIG. 7A is present, for example, in an audio data stream.
  • FIG. 7B also shows the time profile of a parameter for a visual energy VPE, in arbitrary units, over the same period of time. For example, as shown in FIG. 7B, a measure of the visual energy VPE may be derived from a DCT coefficient of the lip region.
  • the square of the corresponding value can be calculated and considered as a measure of visual energy.
  • With reference to a local time or a clock of the receiving device, it is now possible, as shown in FIG. 7A, to set a synchronization point ASS in the audio data stream at the time T1.
  • In the video data stream, a corresponding synchronization point VSS can be set at time T2.
  • the two synchronization points VSS and ASS occur at different times. This is illustrated in Fig. 7C.
  • For synchronized reproduction, the two data streams must be aligned with each other. This can be done, for example, by shifting the synchronization time T1 for the audio signal earlier in time by ΔT, so that both synchronization points VSS, ASS coincide at the same synchronization time ST during the reproduction of the audiovisual, i.e. multimedia, contents.
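The offset correction described here amounts to measuring the difference ΔT between the audio and video synchronization points and shifting one stream so that both fall on a common synchronization time ST. Below is a minimal sketch with assumed detection times; holding back the earlier stream is only one possible policy.

```python
def align_sync_points(t_audio: float, t_video: float) -> dict:
    """Compute the offset between the audio sync point ASS (detected at t_audio)
    and the video sync point VSS (detected at t_video), and the per-stream
    playback delays that make both coincide at a common sync time ST."""
    delta_t = t_video - t_audio              # positive: the video event lags the audio event
    st = max(t_audio, t_video)               # common synchronization time ST
    return {
        "delta_t": delta_t,
        "sync_time": st,
        "audio_delay": st - t_audio,         # how long to hold back audio playback
        "video_delay": st - t_video,         # how long to hold back video playback
    }

# Example with the audio sync point of FIG. 7 in mind (T1 about 2.2 s; T2 assumed):
print(align_sync_points(t_audio=2.2, t_video=2.5))
```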
  • The time scale t corresponds to the simultaneous reproduction of all data streams.
  • The time scale t' can be distorted or linearly stretched in comparison with the other time scales, since it is conceivable that audio data streams are transmitted much faster than, for example, video data streams. This can be due to the bandwidth of the transmission medium or to the processing speed for the decoding of the corresponding data.
  • In FIG. 7A it is also indicated that, within a time window TT surrounding a recognized synchronization event ASS or a synchronization instant T1, the simultaneous viseme recognition in particular can be carried out especially thoroughly or sensitively.
  • In this case, a delay time which naturally occurs, for example, in the utterance of a phoneme "kah" is taken into account.
  • The mouth is first prepared for the pronunciation of the vowel "ah", so that the corresponding lip movement can be detected visually before the ejection of air and thus before the corresponding audio signal.
  • the envelope or time derivatives of the audio signal can be used for phoneme recognition.
  • the natural processes that can be extracted from the audio and video content are used to synchronize the two media channels.
  • the invention has the particular advantage that no further synchronization flags need to be provided in the data streams. Rather, the natural bimodal events, which represent a temporal correlation between audio and video information, are used.
  • The synchronization method is particularly reliable if the lip movements of the speaker from whom the audio signal originates are visible and detectable.
  • Further synchronization points are conceivable. For example, when a piano keyboard is played, the ordered sequence of keys and their operation can be recognized visually and assigned, in the audiovisual reproduction, to the corresponding sounds that are clearly recognizable in the audio signal, and thus synchronized.
  • In general, any content that can be registered both visually and acoustically is conceivable.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Synchronisation In Digital Transmission Systems (AREA)

Abstract

The present invention relates to a method and a device for synchronizing media data streams (AU, VI) each comprising media data contents (A1, A2, V1, V2) of a predefined media class. A temporal synchronization of the data streams (AU, VI) is carried out as a function of the media data contents (A1, A2, V1, V2).
PCT/EP2008/060055 2007-08-22 2008-07-31 Procédé de synchronisation de flux de données médiatiques WO2009024442A2 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
DE102007039603.3 2007-08-22
DE102007039603A DE102007039603A1 (de) 2007-08-22 2007-08-22 Verfahren zum Synchronisieren von medialen Datenströmen

Publications (2)

Publication Number Publication Date
WO2009024442A2 true WO2009024442A2 (fr) 2009-02-26
WO2009024442A3 WO2009024442A3 (fr) 2009-04-23

Family

ID=40263335

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2008/060055 WO2009024442A2 (fr) 2007-08-22 2008-07-31 Procédé de synchronisation de flux de données médiatiques

Country Status (2)

Country Link
DE (1) DE102007039603A1 (fr)
WO (1) WO2009024442A2 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106067989A (zh) * 2016-04-28 2016-11-02 江苏大学 一种人像语音视频同步校准装置及方法

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR2955183B3 (fr) * 2010-01-11 2012-01-13 Didier Calle Procede de traitement automatique de donnees numeriques destinees a des doublages ou a des post-synchronisations de videos
US20140365685A1 (en) * 2013-06-11 2014-12-11 Koninklijke Kpn N.V. Method, System, Capturing Device and Synchronization Server for Enabling Synchronization of Rendering of Multiple Content Parts, Using a Reference Rendering Timeline

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0336032A1 (fr) * 1988-04-07 1989-10-11 Research Triangle Institute Reconnaissance acoustique et optique de la parole
US5572261A (en) * 1995-06-07 1996-11-05 Cooper; J. Carl Automatic audio to video timing measurement device and method
WO2005099251A1 (fr) * 2004-04-07 2005-10-20 Koninklijke Philips Electronics N.V. Synchronisation video-audio
WO2006113409A2 (fr) * 2005-04-13 2006-10-26 Pixel Instruments, Corp. Procede, systeme et produit-programme de mesure de synchronisation audio video a l'aide de caracteristiques de levres et de dents

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100236974B1 (ko) * 1996-12-13 2000-02-01 정선종 동화상과 텍스트/음성변환기 간의 동기화 시스템
US7499104B2 (en) * 2003-05-16 2009-03-03 Pixel Instruments Corporation Method and apparatus for determining relative timing of image and associated information
US20050047664A1 (en) * 2003-08-27 2005-03-03 Nefian Ara Victor Identifying a speaker using markov models

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0336032A1 (fr) * 1988-04-07 1989-10-11 Research Triangle Institute Reconnaissance acoustique et optique de la parole
US5572261A (en) * 1995-06-07 1996-11-05 Cooper; J. Carl Automatic audio to video timing measurement device and method
WO2005099251A1 (fr) * 2004-04-07 2005-10-20 Koninklijke Philips Electronics N.V. Synchronisation video-audio
WO2006113409A2 (fr) * 2005-04-13 2006-10-26 Pixel Instruments, Corp. Procede, systeme et produit-programme de mesure de synchronisation audio video a l'aide de caracteristiques de levres et de dents

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
POTAMIANOS G ET AL: "An image transform approach for HMM based automatic lipreading", IMAGE PROCESSING, 1998. ICIP 98. PROCEEDINGS. 1998 INTERNATIONAL CONFERENCE ON, CHICAGO, IL, USA, 4-7 OCT. 1998, LOS ALAMITOS, CA, USA, IEEE COMPUT. SOC, US, vol. 3, 4 October 1998 (1998-10-04), pages 173-177, XP010586875, ISBN: 978-0-8186-8821-8 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106067989A (zh) * 2016-04-28 2016-11-02 江苏大学 一种人像语音视频同步校准装置及方法

Also Published As

Publication number Publication date
DE102007039603A1 (de) 2009-02-26
WO2009024442A3 (fr) 2009-04-23

Similar Documents

Publication Publication Date Title
DE4436692C2 (de) Trainingssystem für ein Spracherkennungssystem
DE102019001775A1 (de) Nutzung von Maschinenlernmodellen zur Bestimmung von Mundbewegungen entsprechend Live-Sprache
DE60123747T2 (de) Spracherkennungsbasiertes Untertitelungssystem
DE60108373T2 (de) Verfahren zur Detektion von Emotionen in Sprachsignalen unter Verwendung von Sprecheridentifikation
DE102004023436B4 (de) Vorrichtung und Verfahren zum Analysieren eines Informationssignals
WO2017001607A1 (fr) Procédé et dispositif pour créer une base de données
DE19753454C2 (de) Text/Sprache-Umsetzungssystem zur Synchronisierung synthetisierter Sprache mit einem Film in einer Multimediaumgebung und Verfahren für eine derartige Synchronisierung
DE19753453B4 (de) System zum Synchronisieren eines Films mit einem Text/Sprache-Umsetzer
Broad et al. Formant‐Frequency Trajectories in Selected CVC‐Syllable Nuclei
DE112016007138T5 (de) Vorrichtung und verfahren zur überwachung eines tragezustandes eines ohrhörers
WO2015078689A1 (fr) Dispositif de correction auditive avec modification de la fréquence fondamentale
WO2009024442A2 (fr) Procédé de synchronisation de flux de données médiatiques
DE2104622B2 (de) Verfahren und schaltungsanordnung zur synchronisation von signalen
Berthommier A phonetically neutral model of the low-level audio-visual interaction
DE102019126688A1 (de) System und verfahren zur automatischen untertitelanzeige
WO2022013045A1 (fr) Procédé de lecture automatique sur des lèvres au moyen d'un élément fonctionnel et de fourniture dudit élément fonctionnel
EP1976291B1 (fr) Procédé et système de communication vidéo destinés à la commande en temps réel basée sur la gestuelle d'un avatar
WO2001047335A2 (fr) Procede pour eliminer des composantes de signaux parasites dans un signal d'entree d'un systeme auditif, mise en oeuvre dudit procede et appareil auditif
DE602004011292T2 (de) Vorrichtung zur Sprachdetektion
EP2548382A1 (fr) Procédé d'essai d'appareils d'aide auditive
DE69816078T2 (de) Verbesserungen im bezug auf visuelle sprachsynthese
DE4015381A1 (de) Spracherkennungsgeraet und verfahren zur spracherkennung
EP4178212A1 (fr) Procédé de synchronisation d'un signal supplémentaire à un signal principal
DE10305369B4 (de) Benutzeradaptives Verfahren zur Geräuschmodellierung
JP2000506327A (ja) トレーニングプロセス

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 08786680

Country of ref document: EP

Kind code of ref document: A2

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 08786680

Country of ref document: EP

Kind code of ref document: A2