US20210219012A1 - System and a computerized method for audio lip synchronization of video content - Google Patents

System and a computerized method for audio lip synchronization of video content

Info

Publication number
US20210219012A1
Authority
US
United States
Prior art keywords
video
audio
lip sync
scene
content
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US17/200,450
Inventor
Oren Jack MAURICE
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ichannel Io Ltd
Original Assignee
Ichannel Io Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ichannel Io Ltd
Priority to US17/200,450
Assigned to ICHANNEL.IO LTD; assignment of assignors interest (see document for details). Assignors: MAURICE, Oren Jack
Publication of US20210219012A1
Legal status: Abandoned

Classifications

    • H: Electricity; H04: Electric communication technique; H04N: Pictorial communication, e.g. television
    • H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/242: Synchronization processes, e.g. processing of PCR [Program Clock References]
    • H04N 21/4302: Content synchronisation processes, e.g. decoder synchronisation
    • H04N 21/4307: Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen
    • H04N 21/4394: Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • H04N 21/44008: Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H04N 21/8547: Content authoring involving timestamps for synchronizing content

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Security & Cryptography (AREA)
  • Television Signal Processing For Recording (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

Audiovisual content in the form of video clip files, streamed or broadcast, may present a problem known as a lip sync error, i.e., the motion of the lips of a speaker does not correspond to the sound at the same time. To overcome the problem, the video content is segmented according to video scene cuts. Similarly, the audio is segmented at audio scene cuts. An analyzer compares the timing of the various cuts and determines if a lip sync error has occurred and, if so, whether the system can provide a correction to overcome the problem. When a lip sync error is detected, based on a comparison between the video scene cuts and the audio scene cuts, a correction may be either suggested or automatically applied.

Description

    CROSS-REFERENCE TO RELATED APPLICATIONS
  • This application is a continuation of PCT Application No. PCT/IL2019/051022, filed Sep. 12, 2019, which claims the benefit of U.S. Provisional Application No. 62/730,555, filed on Sep. 13, 2018, the contents of which are hereby incorporated by reference.
  • TECHNICAL FIELD
  • The disclosure relates to lip synchronization (lip sync) between a video signal and its respective audio signal, and in particular to the correction of lip sync errors between the video signal and the audio signal.
  • BACKGROUND
  • The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, issues identified with respect to one or more approaches should not be assumed to have been recognized in any prior art on the basis of this section, unless otherwise indicated.
  • Lip synchronization error, also referred to as lip sync error, occurs when the timing of a video portion deviates from the timing of its respective audio portion. Such a mismatch between the video signal and the audio signal, especially when the mismatch is above a certain threshold, is bothersome to viewers and considered to be of poor quality. Unless care is taken to maintain the audio and video in sync, this phenomenon may persist and even worsen as transmission continues. The timing differential, which may be static or dynamic, is typically referred to as the lip sync error. That is, the visual effect of the motion of a speaker's lips is out of sync (i.e., not synchronized) with the audio heard. The requirement for lip synchronization arises in broadcast and live streaming as well as in video clip transmission from files.
  • The prior art teaches a variety of ways to reduce the lip sync error. One method calls for manual adjustment of the lip sync error based on an observation made by a user of a control system. Once the observer detects a lip sync error, a manual adjustment, for example delaying the video or delaying the audio, resolves it. This method has many drawbacks, including its subjectivity, i.e., it depends on a particular user's experience rather than on an objective metric; it is error prone; and it is difficult to scale as the number of video channels increases exponentially over time. The adjustment may also be applied automatically if a previously detected delay is known and a delay factor is automatically used. This method is deficient because it typically requires an arbitrary delay factor that may or may not be suitable for a particular case. Moreover, it does not resolve any dynamic changes in the lip sync error that may occur during the delivery of a video clip to a client. Yet other prior art methods for detection of lip sync errors include the insertion of a video signal in sync with an audio synchronization signal, also referred to as a "pip". This allows for occasional synchronization between the video signal and the audio signal at rendezvous points. Yet another type of solution attempts to analyze the lip motion from its visual cues and correlate them to the audio provided by the audio track. One of ordinary skill in the art would readily appreciate that these methods require specialized and mostly expensive equipment. Such prior methods cannot support the exponential growth of video delivery and the need to significantly reduce costs.
  • It is therefore desirable to provide a solution that allows for affordable, simple, and real-time lip sync correction to support the ever-increasing demand to resolve the lip sync error problem.
  • SUMMARY
  • A summary of several example embodiments of the disclosure follows. This summary is provided for the convenience of the reader to provide a basic understanding of such embodiments and does not wholly define the breadth of the disclosure. This summary is not an extensive overview of all contemplated embodiments, and is intended to neither identify key or critical elements of all embodiments nor to delineate the scope of any or all aspects. Its sole purpose is to present some concepts of one or more embodiments in a simplified form as a prelude to the more detailed description that is presented later. For convenience, the term “certain embodiments” may be used herein to refer to a single embodiment or multiple embodiments of the disclosure.
  • Certain embodiments disclosed herein include a system for lip synchronization of audiovisual content, comprising: a video cut analyzer adapted to receive a video portion of the audiovisual content and output video segments at video scene cuts; an audio cut analyzer adapted to receive an audio portion of the audiovisual content and output audio segments at audio scene cuts; a video-audio scene delta analyzer adapted to receive the video segments and the audio segments, determine therefrom at least a time delta value between the video segments and the audio segments, and determine at least a correction factor; and a lip sync error correction unit adapted to receive the video segments, the audio segments, and the correction factor and output lip sync corrected audiovisual content, wherein the correction factor is used to reduce the time delta value of the lip sync corrected audiovisual content to below a predetermined threshold value.
  • Certain embodiments disclosed herein include a method for lip synchronization of audiovisual content, comprising: receiving audiovisual content that requires lip sync; detecting all video scene cuts in the received video content of the audiovisual content; detecting all audio scene cuts in the received audio content of the audiovisual content; performing a comparison analysis between the video cuts and the audio cuts to determine a sync error; generating a notification that a lip sync is required for the audiovisual content but cannot be performed, upon determination that the sync error is not within correctable parameters; generating a notification that no lip sync is required for the audiovisual content, upon determination that the lip sync error is within correctable parameters and that an offset between the video content and the audio content is below a predetermined threshold value; and performing lip sync error correction to reduce the lip sync error between the video content and the audio content, upon determination that the lip sync error is within correctable parameters and that the offset between the video content and the audio content exceeds the predetermined threshold value.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • The foregoing and other objects, features and advantages will become apparent and more readily appreciated from the following detailed description taken in conjunction with the accompanying drawings, in which:
  • FIG. 1 is a schematic illustration of a system according to an embodiment.
  • FIG. 2A is a schematic illustration of a first audio and video stream unsynchronized by a time difference, according to an embodiment.
  • FIG. 2B is a schematic illustration of a second audio and video stream unsynchronized by a time difference, according to an embodiment.
  • FIG. 3 is a schematic block diagram of a system for lip sync error correction according to an embodiment.
  • FIG. 4 is a schematic illustration of a first flowchart for detection and correction of lip sync error according to an embodiment.
  • FIG. 5 is a schematic illustration of a second flowchart for detection and correction of lip sync error according to an embodiment.
  • FIG. 6 is a schematic illustration of a third flowchart providing details of the determination of mismatch cost for the second flowchart.
  • DETAILED DESCRIPTION
  • Below, exemplary embodiments will be described in detail with reference to accompanying drawings so as to be easily realized by a person having ordinary knowledge in the art. The exemplary embodiments may be embodied in various forms without being limited to the exemplary embodiments set forth herein. Descriptions of well-known parts are omitted for clarity, and like reference numerals refer to like elements throughout.
  • It is important to note that the embodiments disclosed herein are only examples of the many advantageous uses of the innovative teachings herein. In general, statements made in the specification of the present application do not necessarily limit any of the various claims. Moreover, some statements may apply to some inventive features but not to others. In general, unless otherwise indicated, singular elements may be in plural and vice versa with no loss of generality.
  • Audiovisual content in the form of video clip files, streamed or broadcast, may present a problem known as a lip sync error, i.e., the motion of the lips of a speaker does not correspond to the sound at the same time. To overcome the problem, the video content is segmented according to video scene cuts. Similarly, the audio is segmented at audio scene cuts. An analyzer compares the timing of the various cuts and determines if a lip sync error has occurred and, if so, whether the system can provide a correction to overcome the problem. When a lip sync error is detected, based on a comparison between the video scene cuts and the audio scene cuts, a correction may be either suggested or automatically applied.
  • Reference is now made to FIG. 1, where an exemplary and non-limiting schematic illustration 100 of a synchronized audio and video stream is provided. While reference herein is made to an audiovisual content stream, it should be understood that the application of the invention disclosed herein is broader and applies to content that is streamed, provided from a file, or otherwise broadcast. The video stream 110 has various video scenes Vs1 through Vs7. According to the principles of the invention, these video scenes are determined based on analysis of neighboring frames, searching, for example and without limitation, for a sudden spike in the difference between the neighboring frames, or according to any of a plurality of prior art methods including, without limitation, those specified herein. Such frame differences tend to spike at the change from one video scene to another. For example, as a video clip moves from a scene inside a home to a scene on the street, a cut, for example cut 111, is determined and another scene begins. That is, in this particular example and without limitation, scene Vs1 is in a home while scene Vs2 is in the street, the cut between the scenes being at 111. Then the scene may move into a car, changing the video frame content abruptly and therefore suggesting a scene cut, indicated for example as cut 112. As a result, the subsequent scene Vs3 is a scene happening within a car. A similar process, with obvious adaptations for the different type of media, is performed in order to slice the audio track into segments, looking for abrupt changes in the ambient sound, or according to any of the listed prior art methods. In this exemplary and non-limiting example, the audio stream 120 is perfectly aligned with the video stream 110; that is, As3 and As5 are in sync with Vs3 and Vs5, while As4 is in sync with Vs4. A case like this would not require any lip sync correction as no lip sync error is actually shown. The division into the segments Vs1 through Vs7 and the corresponding As1 through As7 is integral to the principles of the invention, though ways of performing such segmentation of video and/or audio are found in the prior art and are outside the scope of the current invention. One of ordinary skill in the art would readily appreciate that even imperfect alignment between the audio and the video may be tolerable to a user if it is below a predetermined threshold. In the industry, a misalignment between audio and video of up to 80 milliseconds is typically considered acceptable, and therefore no lip sync error correction may be needed in such a case. The invention is concerned with a novel and inventive use of such segmentation.
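  • As a concrete illustration of the neighboring-frame analysis described above, the following minimal sketch flags a video scene cut when the mean absolute difference between consecutive frames spikes well above its recent average. This is one possible reading of the described technique, not the patent's reference implementation; the window length and spike ratio are arbitrary assumed values.

```python
import numpy as np

def detect_video_cuts(frames, fps, window=12, spike_ratio=4.0):
    """Return scene cut times (seconds) found by neighboring-frame analysis.

    frames: iterable of grayscale frames as 2-D uint8 numpy arrays.
    A cut (e.g. cut 111 between Vs1 and Vs2) is declared when the mean
    absolute difference between neighboring frames exceeds `spike_ratio`
    times the average difference over the last `window` frame pairs.
    """
    cuts, history, prev = [], [], None
    for i, frame in enumerate(frames):
        if prev is not None:
            diff = float(np.mean(np.abs(frame.astype(np.int16) - prev.astype(np.int16))))
            if history and diff > spike_ratio * max(float(np.mean(history)), 1e-6):
                cuts.append(i / fps)
            history.append(diff)
            if len(history) > window:
                history.pop(0)
        prev = frame
    return cuts
```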
  • FIG. 2A is an exemplary and non-limiting schematic illustration 200A of a first audio stream 220 and video stream 210 unsynchronized by a time difference TΔ. As can be seen, the time difference between the video stream 210 and the audio stream is constant for the purpose of this illustration. The value of TΔ may also fluctuate to a certain degree around a threshold value Δ without departing from the scope of the disclosure herein. Therefore, when a segmentation of the video stream 210 and the audio stream 220, performed according to the principles of the invention, shows a delta value between the audio and the video, then, if TΔ is above a predetermined threshold value Δ, a correction may be either attempted automatically, or a notification may be generated to alert an operator that an adjustment may be necessary. FIG. 2B is an exemplary and non-limiting schematic illustration 200B of a second audio and video stream unsynchronized by a time difference. This illustration, however, differs from that shown in FIG. 2A. While the same video sequence from Vs1 through Vs7 is shown, the audio stream is different. For Vs3 through Vs5 no audio cut or segment boundary is found; rather, a continuous audio segment As3 is detected. Thereafter the TΔ values for the lip sync error continue. As will be explained herein, a decision may be taken as to the lip sync error correction to apply; for example, if this occurs at a low enough frequency throughout the received audiovisual content, it may be assumed that a TΔ lip sync correction should take place.
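  • One simple way to obtain the TΔ values illustrated in FIGS. 2A and 2B is to pair each video cut with the nearest audio cut and record the signed difference. The sketch below is illustrative only; the pairing window is an assumed value, and unpaired cuts (as for Vs3 through Vs5 against the continuous segment As3 in FIG. 2B) are simply skipped.

```python
def measure_deltas(video_cuts, audio_cuts, max_pair_gap=0.5):
    """Return (video_cut_time, TΔ) pairs, where TΔ = audio_cut - video_cut.

    Each video cut is matched to its nearest audio cut if one lies within
    max_pair_gap seconds; a positive TΔ means the audio lags the video
    (assumed sign convention). Video cuts with no nearby audio cut
    produce no pair, mirroring the FIG. 2B situation.
    """
    pairs = []
    for v in video_cuts:
        if not audio_cuts:
            break
        nearest = min(audio_cuts, key=lambda a: abs(a - v))
        if abs(nearest - v) <= max_pair_gap:
            pairs.append((v, nearest - v))
    return pairs
```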
  • Reference is now made to FIG. 3, which is an exemplary and non-limiting schematic block diagram of a system 300 for lip sync error correction according to an embodiment. An audiovisual content 302 is provided and the video content 304 is directed to a video cut analyzer 310. The video cut analyzer 310 is enabled to segment the video content 304 into a plurality of video segments, which are then provided by the video cut analyzer 310 to a video/audio scene delta analyzer 330 as well as to a lip sync error correction unit 340. The video cut analyzer 310 performs the segment cuts based on, for example but not by way of limitation, segmentation techniques known in the art. The audio content 306 of the audiovisual content 302 is provided to an audio cut analyzer 320. The audio cut analyzer 320 is enabled to segment the audio content 306 into a plurality of audio segments, which are then provided by the audio cut analyzer 320 to the video/audio scene delta analyzer 330 as well as to the lip sync error correction unit 340. The audio cut analyzer 320 performs the segment cuts based on, for example but not by way of limitation, detection of changes in ambient noise (or sound) when changing from one scene to another. One out of many prior art solutions for such scene change detection is discussed in Lin et al., "Acoustic Scene Change Detection by Spectro-Temporal Filtering on Spectrogram Using Chirps". Another scene change detection method is provided by Kyperountas et al., in "Enhanced Eigen-Audioframes for Audiovisual Scene Change Detection". The video/audio scene delta analyzer (also referred to herein as the delta analyzer) 330 performs an analysis of the TΔ values between the video segments, as cut by the video cut analyzer 310, and the audio segments, as cut by the audio cut analyzer 320. Assuming there are a sufficient number of both audio and video segments, the analyzer may provide several different types of notification on the notification signal 335. The first notification is that no lip sync errors were detected, which would mean that the TΔ values found are below a predetermined Δ threshold value, or that the number of cases where the TΔ values exceed the minimum Δ threshold value is below another predetermined threshold value K. In one example, but not by way of limitation, the value of Δ is 60 milliseconds and the value of K is 10%. In such cases no lip sync error correction may be necessary. Both the Δ and K threshold values may be programmable so as to allow for tighter or looser thresholds depending on the desired quality of service with respect to lip sync errors. Another case is where it is impossible to make any kind of lip sync error correction, and the system 300 provides a notification on signal 335 of this case. Such a case may happen when the lip sync error is above the Δ threshold and has an inconsistent value. The inconsistency may be determined as a variation of the TΔ values that is above a predetermined E threshold. In this case a notification may be provided on the notification signal 335 to alert an operator of the system 300 that manual intervention may be required, as automatic lip sync error correction cannot be performed by the system 300.
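  • The sketch below illustrates both pieces just described: an audio cut detector based on abrupt changes in the short-term spectrum of the ambient sound (spectral flux is our choice of measure, not one prescribed by the patent), and the delta analyzer's notification logic using the example values Δ = 60 ms and K = 10%. The E value and the use of the spread of the TΔ values as the consistency test are assumptions based on the surrounding text.

```python
import statistics
import numpy as np

DELTA = 0.060   # Δ: 60 ms threshold (example value from the text)
K = 0.10        # K: at most 10% of TΔ values may exceed Δ (example value)
E = 0.500       # E: maximum correctable error/spread, seconds (assumed value)

def detect_audio_cuts(samples, sample_rate, frame_sec=0.5, flux_threshold=0.6):
    """Flag an audio scene cut where the normalized magnitude spectrum of
    the ambient sound changes abruptly between adjacent windows."""
    hop = int(frame_sec * sample_rate)
    cuts, prev = [], None
    for start in range(0, len(samples) - hop + 1, hop):
        spec = np.abs(np.fft.rfft(samples[start:start + hop]))
        spec = spec / max(float(spec.sum()), 1e-9)
        if prev is not None and float(np.abs(spec - prev).sum()) > flux_threshold:
            cuts.append(start / sample_rate)
        prev = spec
    return cuts

def classify(deltas):
    """Map measured TΔ values to one of the notifications on signal 335."""
    if not deltas:
        return "insufficient segments"
    exceeding = sum(1 for d in deltas if abs(d) > DELTA)
    if exceeding / len(deltas) <= K:
        return "no lip sync error detected"
    if statistics.pstdev(deltas) > E or max(abs(d) for d in deltas) > E:
        return "lip sync error present but not automatically correctable"
    return "lip sync error correctable"
```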
  • In between these two cases there are two other cases that may be handled according to the principles of the invention. The first case is when TΔ is of a consistent value above Δ but below a predetermined E error value. The second case is when TΔ is of a consistently increasing or decreasing value above Δ but below a predetermined E error value. In both cases the lip sync error is correctable and lip sync error correction takes place. Such error correction is performed by the lip sync error correction unit 340, which receives the video segments from the video cut analyzer 310 and the audio segments from the audio cut analyzer 320, as well as any necessary information regarding the analysis performed by the video/audio scene delta analyzer 330. Hence, if the video/audio scene delta analyzer 330 has concluded that the TΔ value is below the predetermined E threshold value, then the correction is possible. A correction factor is used by the lip sync error correction unit 340 to compensate for the TΔ value. If the distribution around the TΔ value is small, then a correction can be made; however, if the distribution is large, i.e., the value is inconsistent, then it is not possible to make a lip sync error correction using this particular solution. However, if the TΔ value is constant, or has a tendency to either increase or decrease over time in a linear fashion, but within the boundaries of the maximum E threshold, then the correction is possible using appropriate factor equations. According to one embodiment, the factor may change over time if changes in the TΔ value are relatively infrequent, or, in other words, if the distribution around the TΔ value is not too wide. The lip sync error correction unit 340 provides lip sync corrected audiovisual content 345, thereby overcoming deficiencies that may have occurred in the audiovisual input content 302. It should therefore be understood that the error correction may include, but is not limited to, linear drift correction and non-linear drift correction.
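  • For the two correctable cases, a constant TΔ and a TΔ that increases or decreases linearly over time, the correction reduces to retiming the audio. The sketch below fits TΔ(t) as a line over the matched cut pairs and resamples the audio accordingly; the least-squares fit and linear interpolation are our choices for brevity, not techniques prescribed by the patent, and the sign convention assumes a positive TΔ means the audio lags the video.

```python
import numpy as np

def fit_drift(cut_times, deltas):
    """Fit TΔ(t) ≈ slope * t + offset over the matched cut pairs.

    slope ≈ 0 corresponds to the constant-offset case; a nonzero slope
    corresponds to the linearly increasing or decreasing case. If the
    residual spread around the fit is large (an inconsistent TΔ), this
    particular correction should not be applied.
    """
    slope, offset = np.polyfit(cut_times, deltas, deg=1)
    return float(slope), float(offset)

def apply_correction(audio, sample_rate, slope, offset):
    """Retime the audio so that TΔ(t) becomes approximately zero.

    The corrected sample at time t is read from the original audio at
    time t + TΔ(t), i.e. the lagging audio is pulled earlier.
    """
    audio = np.asarray(audio, dtype=float)
    n = len(audio)
    t = np.arange(n) / sample_rate
    src = np.clip((t + slope * t + offset) * sample_rate, 0, n - 1)
    return np.interp(src, np.arange(n), audio)
```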
  • FIG. 4 is an exemplary and non-limiting schematic illustration of a flowchart 400 for detection and correction of lip sync error according to an embodiment. In S410 audiovisual content is received. It may be received from a file or as an audiovisual stream. In the latter case it is necessary to collect or otherwise analyze a sufficient number of video segments and audio segments before an analysis according to the invention can take place. Thereafter, corrections and updates can take place as new audiovisual content (for example audiovisual content 302) is provided, and an updated analysis takes place that takes into account the newly received content. In S420 video scene cuts in the video content (for example video content 304) of the received audiovisual content are determined, using, for example but not by way of limitation, techniques described herein. In S430 audio scene cuts in the audio content (for example audio content 306) of the received audiovisual content are determined, using, for example but not by way of limitation, techniques described herein. In S440 a comparison analysis is performed to check correlations between the video scene cuts and the audio scene cuts to determine matches as well as TΔ values between video segments and audio segments. It should be understood, as noted with respect to FIG. 2B, that there are cases where there is no one-to-one match between each video segment and each audio segment, and such a mismatch, as long as it is infrequent, can be overcome by the system 300 by skipping to the next possible match. In S450 it is checked whether the lip sync error is within correctable parameters of the system 300; the error is not correctable when, for example but not by way of limitation, TΔ is above E and is inconsistent, as described herein in more detail. If the error is within correctable parameters, execution continues with S470; otherwise, execution continues with S460, where a notification is provided noting that the system, for example system 300, cannot perform lip sync correction on the received audiovisual content though a lip sync problem does exist, and thereafter execution terminates. In S470 it is checked whether the offset between the audio segments and the video segments is smaller than a predetermined threshold, i.e., whether TΔ is smaller than Δ; if not, execution continues with S490; otherwise, execution continues with S480, where a notification may be generated noting that no lip sync error correction is required. In S490 lip sync error correction is performed so as to compensate for the TΔ between the video segments and the audio segments, for example using techniques discussed herein. The compensation may involve either of the two cases discussed herein in more detail, i.e., the first case where TΔ is constant, or thereabout, and the second case where TΔ continuously increases or decreases over time. Once correction has completed, execution terminates.
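  • Using the helper functions sketched above (detect_video_cuts, detect_audio_cuts, measure_deltas, classify, fit_drift, apply_correction, all of which are illustrative assumptions), the flowchart of FIG. 4 maps to the following control flow.

```python
def process(video, audio, sample_rate, fps):
    """Illustrative control flow of flowchart 400; content arrives in S410."""
    video_cuts = detect_video_cuts(video, fps)                     # S420
    audio_cuts = detect_audio_cuts(audio, sample_rate)             # S430
    pairs = measure_deltas(video_cuts, audio_cuts)                 # S440
    deltas = [d for _, d in pairs]
    verdict = classify(deltas)                                     # S450
    if verdict == "lip sync error present but not automatically correctable":
        return None, "lip sync needed but cannot be performed"     # S460
    if verdict != "lip sync error correctable":                    # S470: TΔ below Δ
        return (video, audio), "no lip sync error correction required"  # S480
    times = [t for t, _ in pairs]
    slope, offset = fit_drift(times, deltas)                       # S490: compensate TΔ
    return (video, apply_correction(audio, sample_rate, slope, offset)), "corrected"
```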
  • FIG. 5 is a schematic illustration of a second flowchart 500 for detection and correction of lip sync error according to an embodiment, and FIG. 6 is a schematic illustration of a third flowchart 600 providing details of the determination of mismatch cost for the second flowchart. Essentially, the method starts by obtaining a list of audio and video scene cuts (S505), which may be detected using prior-art solutions, or other solutions which are outside the scope of the current invention. It then generates a collection of start/end audio/video offset sets (S510). Each such set points to a specific scene cut (up to a predetermined value X from the list's start) as a possible start, for either list, and to another scene cut (up to X scene cuts from the end of the list) as its end, again from either list. These sets cover all the possibilities for start and end cuts, on either list, resulting in X⁴ such sets. The method initializes the best found cost to infinity. It thereafter iterates (S520) over each of these possible sets to determine the A and B factors (S525) for that set, as follows: Af=Vs−As and Bf=(Ve−Vs)/(Ae−As), where Vs is the selected video start time of a specific set; Ve the selected video end time; As the selected audio start time; and Ae the selected audio end time. Thereafter, a new list of corrected audio scene change times is determined as follows: A[i]=(A[i]−As)*Bf+Af+As. The method then determines the cost (S530) for this set of A,B factors. The determination is performed (S530) as follows: setting (S530-10) the cost accumulator to 0, the number of detected mismatches to 0, and the pointers inside the lists, for both audio and video, to 0 (Pa=Pv=0). Thereafter the method loops until both pointers reach the end of their lists, based on the following logic: determine the distance between the pointed-to scene cuts as D=|A[Pa]−V[Pv]|. If the pointed-to scene cuts are close enough to count as a match (D<=Dm), but not a perfect match (D>Dp), the distance between them is added to the accumulated cost (S530-20), after which both Pa and Pv are incremented (S530-25) unless one has reached the end of its list, in which case it is not incremented. In the case where the pointed-to scene cuts are close enough to count as a perfect match (D<=Dp), both Pa and Pv are incremented (S530-25), again unless one has reached the end of its list, in which case it is not incremented. In the case where the delta is too big (D>Dm), the mismatch counter is incremented (S530-30), and then the pointer pointing to the scene change time that is "further behind" is incremented (S530-40 or S530-45, as the case may be), unless that pointer has reached the end of its list, in which case the other one is incremented. Once both pointers reach the end of their respective lists, the number of mismatches is evaluated (S530-55). If that value is above a predetermined value, then the cost of this set is considered to be infinite (S530-60) and the set is not considered a good option. If the number of mismatches is below or equal to the predetermined threshold, then the accumulated cost is the cost of the set (S530-65) and is compared (S535) to the best accumulated cost thus far. If the cost is lower for this set, its cost is saved (S540) as the best cost, and its A,B factors are saved as the best factors thus far. Once all the sets have been evaluated, the following options exist (S550, S560): a. The best cost is still infinity, which means that no good match was found, and therefore a notification is provided that the lip sync cannot be corrected (S555); b. The best cost is not infinity, the best A factor is 0, and the best B factor is 1, in which case a notification is provided that the lip sync appears to be perfect as-is and no correction is necessary (S565); or, c. The best cost is not infinity, but the best factors differ from Af=0, Bf=1, resulting in a notification that the lip sync is not good but can be corrected by applying these factors to the audio (S570).
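  • The search of FIGS. 5 and 6 can be written compactly as below, following the described steps: Af = Vs − As, Bf = (Ve − Vs)/(Ae − As), the corrected audio times A[i] = (A[i] − As)*Bf + Af + As, and the two-pointer cost walk with the Dp and Dm distances. The values chosen for X, Dp, Dm, and the mismatch limit are assumptions, and counting unmatched leftover cuts at the end of a list as mismatches is one reading of the pointer-advance rule rather than the patent's exact wording.

```python
import itertools
import math

X = 4                 # candidate cuts taken from each list's start and end (assumed)
DP = 0.010            # Dp: distance counted as a perfect match, seconds (assumed)
DM = 0.200            # Dm: maximum distance still counted as a match (assumed)
MAX_MISMATCHES = 3    # predetermined mismatch limit (assumed)

def set_cost(corrected_audio, video_cuts):
    """Accumulated mismatch cost for one candidate set (FIG. 6, S530)."""
    cost, mismatches, pa, pv = 0.0, 0, 0, 0            # S530-10
    while pa < len(corrected_audio) and pv < len(video_cuts):
        d = abs(corrected_audio[pa] - video_cuts[pv])
        if d <= DM:                                    # close enough to match
            if d > DP:                                 # matched, but not perfectly
                cost += d                              # S530-20
            pa += 1                                    # S530-25: advance both
            pv += 1
        else:                                          # S530-30: a mismatch
            mismatches += 1
            if corrected_audio[pa] < video_cuts[pv]:   # advance the pointer behind
                pa += 1                                # S530-40
            else:
                pv += 1                                # S530-45
    mismatches += (len(corrected_audio) - pa) + (len(video_cuts) - pv)
    if mismatches > MAX_MISMATCHES:                    # S530-55
        return math.inf                                # S530-60
    return cost                                        # S530-65

def best_factors(audio_cuts, video_cuts):
    """Search the X^4 candidate start/end sets (FIG. 5) for the best A,B factors."""
    best_cost, best_af, best_bf = math.inf, 0.0, 1.0   # best cost starts at infinity
    candidates = itertools.product(video_cuts[:X], video_cuts[-X:],
                                   audio_cuts[:X], audio_cuts[-X:])
    for vs, ve, as_, ae in candidates:                 # S520
        if ve <= vs or ae <= as_:
            continue                                   # degenerate start/end choice
        af = vs - as_                                  # Af = Vs - As        (S525)
        bf = (ve - vs) / (ae - as_)                    # Bf = (Ve-Vs)/(Ae-As)
        corrected = [(a - as_) * bf + af + as_ for a in audio_cuts]
        cost = set_cost(corrected, video_cuts)         # S530
        if cost < best_cost:                           # S535
            best_cost, best_af, best_bf = cost, af, bf # S540
    if math.isinf(best_cost):                          # S550/S555
        return None, "lip sync cannot be corrected"
    if best_af == 0 and best_bf == 1:                  # S560/S565
        return (best_af, best_bf), "lip sync appears perfect; no correction needed"
    return (best_af, best_bf), "correctable by applying Af, Bf to the audio"  # S570
```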
  • The various embodiments disclosed herein can be implemented as hardware, firmware, software, or any combination thereof. Moreover, the software is preferably implemented as an application program tangibly embodied on a program storage unit or computer readable medium consisting of parts, or of certain devices and/or a combination of devices. The application program may be uploaded to, and executed by, a machine comprising any suitable architecture. Preferably, the machine is implemented on a computer platform having hardware such as one or more central processing units (“CPUs”), a memory, and input/output interfaces. The computer platform may also include an operating system and microinstruction code. The various processes and functions described herein may be either part of the microinstruction code or part of the application program, or any combination thereof, which may be executed by a CPU, whether or not such a computer or processor is explicitly shown. In addition, various other peripheral units may be connected to the computer platform such as an additional data storage unit and a printing unit. Furthermore, a non-transitory computer readable medium is any computer readable medium except for a transitory propagating signal.
  • All examples and conditional language recited herein are intended for pedagogical purposes to aid the reader in understanding the principles of the disclosed embodiment and the concepts contributed by the inventor to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions. Moreover, all statements herein reciting principles, aspects, and embodiments of the disclosed embodiments, as well as specific examples thereof, are intended to encompass both structural and functional equivalents thereof. Additionally, it is intended that such equivalents include both currently known equivalents as well as equivalents developed in the future, i.e., any elements developed that perform the same function, regardless of structure.

Claims (17)

1. A system for lip synchronization of audiovisual content, comprising:
a video cut analyzer adapted to receive a video portion of the audiovisual content and output video segments at video scene cuts;
an audio cut analyzer adapted to receive an audio portion of the audiovisual content and output audio segments at audio scene cuts;
a video-audio scene delta analyzer adapted to receive the video segments and the audio segments and determine therefrom at least a time delta value between the video segments and the audio segments and determine at least a correction factor; and
a lip sync error correction unit adapted to receive the video segments, the audio segments and the correction factor and output a lip sync corrected audiovisual content, wherein the correction factor is used to reduce the time delta value of the lip sync corrected audiovisual content to below a predetermined threshold value.
2. The system of claim 1, wherein the video cut analyzer determines a video scene change for the video scene cut based on an abrupt difference between neighboring frames of the video portion.
3. The system of claim 1, wherein the video cut analyzer determines a video scene change for the video scene cut based on a change from a frame in a video scene having a first background to a frame in a video scene having a second background.
4. The system of claim 1, wherein the audio cut analyzer determines an audio scene change for the audio scene cut based on a change in an ambient sound.
5. The system of claim 1, wherein the audio cut analyzer determines an audio scene change for the audio scene cut based on a change in an ambient noise.
6. The system of claim 1, wherein the audio cut analyzer determines an audio scene change for the audio scene cut by performing a spectro-temporal filtering.
7. The system of claim 1, wherein the lip sync error correction unit provides a notification that lip sync correction cannot be performed upon determination that the lip sync error is not within correctable parameters.
8. The system of claim 1, wherein the lip sync error correction unit provides a notification that lip sync correction is unnecessary when the lip sync error is smaller than a predetermined threshold value for the offset between audio and video.
9. The system of claim 1, wherein the lip sync error correction unit performs the lip sync error correction upon determination that the lip sync error is within correctable parameters but above a predetermined threshold value for the offset between audio and video.
10. The system of claim 1, wherein the audiovisual content is at least one of: a video clip file, streamed video content, and broadcast video content.
11. The system of claim 1, wherein the lip sync error correction unit is further adapted to perform at least one of: a linear drift correction and a non-linear drift correction.
12. A method for lip synchronization of audiovisual content comprises:
receiving audiovisual content that requires lip sync;
detecting all video scene cuts in the received video content of the audiovisual content;
detecting all audio scene cuts in the received audio content of the audiovisual content;
performing a comparison analysis between the video scene cuts and the audio scene cuts to determine a sync error;
generating a notification that a lip sync is required for the audiovisual content but cannot be performed upon determination that the sync error is not within correctable parameters;
generating a notification that no lip sync is required for the audiovisual content upon determination that the lip sync error is within correctable parameters and that an offset between the video content and the audio content is below a predetermined threshold value; and
performing a lip sync error correction to reduce the lip sync error between the video content and the audio content upon determination that the lip sync error is within correctable parameters and that the offset between the video content and the audio content exceeds the predetermined threshold value.
13. The method of claim 12, wherein a detection of a video scene cut comprises:
determining an abrupt difference between neighboring frames of the video content.
14. The method of claim 12, wherein a detection of a video scene cut comprises:
determining a change from a frame in a video scene having a first background to a frame in a video scene having a second background.
15. The method of claim 12, wherein a detection of an audio scene cut comprises:
determining a change for the audio scene cut based on a change in an ambient sound.
16. The method of claim 12, wherein a detection of an audio scene cut comprises:
determining a change for the audio scene cut by performing a spectro-temporal filtering.
17. The method of claim 12, wherein performing a lip sync error correction comprises performing at least one of: a linear drift correction and a non-linear drift correction.
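By way of a non-limiting sketch of the techniques recited above: the first function below flags video scene cuts from abrupt neighboring-frame differences (cf. claims 2 and 13), the second flags audio scene cuts from abrupt ambient-level changes (cf. claims 4, 5, and 15), and the third applies a linear drift correction (cf. claims 11 and 17). All thresholds, window sizes, and the linear model t' = a + b·t are illustrative assumptions, not features of the claims.

```python
import numpy as np

def detect_video_cuts(frames, fps: float, diff_threshold: float = 30.0):
    """Flag a video scene cut wherever neighboring frames differ abruptly.
    `frames` is an iterable of grayscale frames as 2-D numpy arrays;
    returns cut times in seconds. The threshold is an assumption."""
    cuts, prev = [], None
    for i, frame in enumerate(frames):
        cur = frame.astype(np.float64)
        if prev is not None and np.mean(np.abs(cur - prev)) > diff_threshold:
            cuts.append(i / fps)
        prev = cur
    return cuts

def detect_audio_cuts(samples, sr: int, win: int = 2048, ratio: float = 2.0):
    """Flag an audio scene cut where the ambient level (windowed RMS)
    jumps or drops sharply between adjacent windows; the window size
    and ratio are assumptions."""
    x = np.asarray(samples, dtype=np.float64)
    rms = [float(np.sqrt(np.mean(x[i:i + win] ** 2))) + 1e-12
           for i in range(0, len(x) - win + 1, win)]
    return [i * win / sr for i in range(1, len(rms))
            if rms[i] / rms[i - 1] > ratio or rms[i - 1] / rms[i] > ratio]

def correct_linear_drift(audio_times, a: float, b: float):
    """Apply a linear drift correction t' = a + b * t to audio event
    times; a non-linear correction would replace this with a piecewise
    or higher-order mapping."""
    return [a + b * t for t in audio_times]
```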
US17/200,450 2018-09-13 2021-03-12 System and a computerized method for audio lip synchronization of video content Abandoned US20210219012A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/200,450 US20210219012A1 (en) 2018-09-13 2021-03-12 System and a computerized method for audio lip synchronization of video content

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US201862730555P 2018-09-13 2018-09-13
PCT/IL2019/051022 WO2020053861A1 (en) 2018-09-13 2019-09-12 A system and a computerized method for audio lip synchronization of video content
US17/200,450 US20210219012A1 (en) 2018-09-13 2021-03-12 System and a computerized method for audio lip synchronization of video content

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
PCT/IL2019/051022 Continuation WO2020053861A1 (en) 2018-09-13 2019-09-12 A system and a computerized method for audio lip synchronization of video content

Publications (1)

Publication Number Publication Date
US20210219012A1 true US20210219012A1 (en) 2021-07-15

Family ID=69778425


Family Applications (1)

Application Number Title Priority Date Filing Date
US17/200,450 Abandoned US20210219012A1 (en) 2018-09-13 2021-03-12 System and a computerized method for audio lip synchronization of video content

Country Status (3)

Country Link
US (1) US20210219012A1 (en)
EP (1) EP3841758A4 (en)
WO (1) WO2020053861A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11871068B1 (en) * 2019-12-12 2024-01-09 Amazon Technologies, Inc. Techniques for detecting non-synchronization between audio and video

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111354235A (en) * 2020-04-24 2020-06-30 刘纯 Piano remote teaching system
CN111510758A (en) * 2020-04-24 2020-08-07 怀化学院 Synchronization method and system in piano video teaching
EP4024878A1 (en) * 2020-12-30 2022-07-06 Advanced Digital Broadcast S.A. A method and a system for testing audio-video synchronization of an audio-video player
CN113516985A (en) * 2021-09-13 2021-10-19 北京易真学思教育科技有限公司 Speech recognition method, apparatus and non-volatile computer-readable storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7149686B1 (en) * 2000-06-23 2006-12-12 International Business Machines Corporation System and method for eliminating synchronization errors in electronic audiovisual transmissions and presentations
US7130316B2 (en) * 2001-04-11 2006-10-31 Ati Technologies, Inc. System for frame based audio synchronization and method thereof
KR100694060B1 (en) * 2004-10-12 2007-03-12 삼성전자주식회사 Apparatus and method for synchronizing video and audio
CA2654574A1 (en) * 2006-06-08 2007-12-13 Thomson Licensing Scene change detection for video
WO2013086027A1 (en) * 2011-12-06 2013-06-13 Doug Carson & Associates, Inc. Audio-video frame synchronization in a multimedia stream

Also Published As

Publication number Publication date
WO2020053861A1 (en) 2020-03-19
EP3841758A1 (en) 2021-06-30
EP3841758A4 (en) 2022-06-22

Similar Documents

Publication Publication Date Title
US20210219012A1 (en) System and a computerized method for audio lip synchronization of video content
US11432037B2 (en) Method and system for detecting and responding to changing of media channel
CN111213385B (en) Content modification method, media client and non-transitory computer readable medium
JP5698318B2 (en) Feature optimization and reliability prediction for audio and video signature generation and detection
US11979631B2 (en) Audio video synchronization
US10356492B2 (en) Video management
JP4482911B2 (en) System and method for determining synchronization between an audio signal and a video signal
US11445266B2 (en) System and computerized method for subtitles synchronization of audiovisual content using the human voice detection for synchronization
KR101741747B1 (en) Apparatus and method for processing real time advertisement insertion on broadcast
US9955229B2 (en) Using scene-change transitions to output an alert indicating a functional state of a back-up video-broadcast system
CN111510758A (en) Synchronization method and system in piano video teaching
JP2011146783A (en) Relay device, program, system and method, for correcting synchronization between video frame and audio frame
US11722729B2 (en) Method and system for use of earlier and/or later single-match as basis to disambiguate channel multi-match with non-matching programs
US9582244B2 (en) Using mute/non-mute transitions to output an alert indicating a functional state of a back-up audio-broadcast system
US20170098467A1 (en) Method and apparatus for detecting frame synchronicity between master and ancillary media files

Legal Events

Date Code Title Description
AS Assignment

Owner name: ICHANNEL.IO LTD, ISRAEL

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:MAURICE, OREN JACK;REEL/FRAME:055580/0234

Effective date: 20210311

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER

STPP Information on status: patent application and granting procedure in general

Free format text: FINAL REJECTION MAILED

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION