WO2020010883A1 - Method for synchronizing video data and audio data, storage medium, and electronic device - Google Patents

Method for synchronizing video data and audio data, storage medium, and electronic device

Info

Publication number
WO2020010883A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
image
video data
face
audio data
Prior art date
Application number
PCT/CN2019/081591
Other languages
English (en)
French (fr)
Inventor
王正博
沈亮
Original Assignee
北京大米科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京大米科技有限公司 filed Critical 北京大米科技有限公司
Publication of WO2020010883A1 publication Critical patent/WO2020010883A1/zh

Links

Images

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/4302Content synchronisation processes, e.g. decoder synchronisation
    • H04N21/4307Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/441Acquiring end-user identification, e.g. using personal code sent by the remote control or by inserting a card
    • H04N21/4415Acquiring end-user identification, e.g. using personal code sent by the remote control or by inserting a card using biometric characteristics of the user, e.g. by voice recognition or fingerprint scanning

Definitions

  • the present invention relates to the field of digital signal processing, and in particular, to a method, a storage medium, and an electronic device for synchronizing video data and audio data.
  • Embodiments of the present invention provide a method, a storage medium, and an electronic device for synchronizing video data and audio data, so as to synchronize video data with audio data.
  • a method for synchronizing video data and audio data includes:
  • the first sequence is a time sequence of facial feature parameters, and the facial feature parameters are used to characterize the state of the lips (i.e., the mouth) of a face in the video data;
  • the second sequence is a time sequence of the strength of the speech signal in the audio data, and the second sequence uses the same sampling period as the first sequence;
  • sliding cross-correlation is performed on the first sequence and the second sequence to obtain cross-correlation coefficients corresponding to different time axis deviations; and
  • the video data and the audio data are synchronized according to the time axis deviation having the maximum cross-correlation coefficient.
  • a computer-readable storage medium on which computer program instructions are stored, wherein the computer program instructions, when executed by a processor, implement the method according to the first aspect.
  • an electronic device including a memory and a processor, wherein the memory is used to store one or more computer program instructions, and the one or more computer program instructions are executed by the processor to implement the method as described in the first aspect.
  • the time axis deviation at which the change in the lip state and the change in the voice signal strength are most strongly correlated is obtained by a sliding cross-correlation search.
  • synchronization is performed based on that time axis deviation, realizing audio-video synchronization of the video data and the audio data.
  • FIG. 1 is a flowchart of a method for synchronizing video data and audio data in one or more embodiments
  • FIG. 2 is a flowchart of a method for obtaining a first sequence according to an embodiment of the present invention
  • FIG. 3 is a flowchart of sliding cross-correlation between a first sequence and a second sequence according to an embodiment of the present invention
  • FIG. 4 is a block diagram of an electronic device according to an embodiment of the present invention.
  • the online playback program will play according to the index order of video files and audio files and timeline information. Due to the inconsistency in the length of the video file and the audio file, the audio and video will not be synchronized during playback.
  • FIG. 1 is a flowchart of a method of synchronizing video data and audio data in one or more embodiments.
  • the process of synchronizing video data and audio data recorded in an online classroom is described as an example.
  • the method in this embodiment includes the following steps:
  • Step S100 Obtain a first sequence according to the video data.
  • the first sequence is a time sequence of facial feature parameters, and the facial feature parameters are used to characterize a lip state of a human face in video data.
  • the video data processed in step S100 is a video file recorded online and processed in segments.
  • the first sequence is obtained by sampling the video data according to a predetermined sampling period to obtain an image at each sampling point, and then processing each image to obtain a facial feature parameter.
  • synchronization is performed based on a positive correlation between the intensity of a person's speech and the degree of opening of a person's mouth. For example, the greater the mouth opening, the greater the intensity of the speech.
  • synchronization of video data and audio data is performed by utilizing the above-mentioned relationship.
  • FIG. 2 is a flowchart of a method for obtaining a first sequence according to an embodiment of the present invention. As shown in FIG. 2, step S100 includes:
  • Step S110 Sampling the video data according to a predetermined sampling period to obtain a first image sequence.
  • the first image sequence includes images obtained by sampling.
  • the video data is regarded as a continuous image sequence
  • the first image sequence can be obtained by extracting one image from the video data every sampling period along the time axis.
  • the data amount of the first image sequence obtained after extraction is much smaller than the original video data, which can reduce the computational load of subsequent data processing.
  • the sampling period is set according to the frequency of face and mouth movements in the video data and the configured computing power.
  • Step S120 Perform face recognition on each image in the first image sequence to obtain face area information of each image.
  • the face detection is implemented by various existing image processing algorithms, such as a reference template method, a face rule method, a feature sub-face method, and a sample recognition method.
  • the obtained face area information may be represented by a data structure R (X, Y, W, H) of the face area.
  • R (X, Y, W, H) defines a rectangular area including the main part of the face in the image, wherein X and Y define the coordinates of one corner of the rectangular area, and W and H define the width and height of the rectangular area, respectively.
  • Step S130 Obtain keypoint information of the face and lips according to each image in the first image sequence and corresponding face area information.
  • the image in the facial area can be further detected to obtain the positions of the facial features.
  • the correlation between the opening degree of the human mouth and the strength of the voice signal is used to synchronize the video data and audio data.
  • the state of the human lip is detected by detecting the human face and lip and acquiring key point information of the human face and lip.
  • Dlib is used to perform the above-mentioned face detection and lip keypoint information acquisition.
  • Dlib is a C++ open-source toolkit containing machine learning algorithms.
  • the facial features and contours of a face are identified by 68 key points.
  • the contour of the lip is defined by a number of key points.
  • Step S140 Acquire the facial feature parameters according to the keypoint information of the face and lips of each image in the first image sequence.
  • the facial feature parameters are used to characterize the lip state of the human face. In one or more embodiments, the facial feature parameters need to be able to represent the degree of mouth opening, so as to facilitate subsequent association with the strength of the voice signal. In one or more embodiments, the facial feature parameter may be any one of a height of a face lip image, an area of the face lip image, and a ratio of a height to a width of the face lip image. In one or more embodiments, these parameters are used to characterize the degree of opening of a person's face and mouth.
  • because the ratio of the height to the width of the face lip image is a relative parameter, it can eliminate the deviation caused by the face moving back and forth relative to the camera and characterize the degree of mouth opening across different images.
  • a function of at least one of the height of the face lip image, the area of the face lip image, and the ratio of the height to the width of the face lip image may also be used as the facial feature parameter.
  • Step S150 Obtain the first sequence according to the facial feature parameters corresponding to each image in the first image sequence.
  • the first sequence thus obtained can effectively characterize the trend of the movement state of the face and mouth in the video data over time.
  • Step S200 Acquire a second sequence according to the audio data.
  • the second sequence is a time sequence of voice signal strength in audio data.
  • the second sequence uses the same sampling period as the first sequence.
  • in step S200, the voice signal strength is extracted from the audio data according to the sampling period to obtain the second sequence; the audio data is an audio file recorded synchronously with the video data, segmented, and with the portions containing no voice signal removed.
  • the operation of removing the voiceless signal portion is performed by calculating the energy spectrum of the audio data and performing endpoint detection.
  • the audio data is an audio file that is directly segmented according to time without any processing after synchronous recording.
  • speech extraction is implemented by various existing speech signal extraction algorithms, such as linear prediction analysis, perceptual linear prediction coefficients, and Fbank feature extraction based on filter banks.
  • the obtained second sequence characterizes a change trend of the strength of the speech signal in the audio data.
  • step S100 and step S200 are performed one after the other. In one or more embodiments, step S200 is performed first and then step S100. In one or more embodiments, S200 and S100 are performed simultaneously. In any case, both the first sequence and the second sequence must be extracted successfully before the sliding cross-correlation operation is performed.
  • the sampling period used is 1 s. This sampling rate appropriately reduces the number of samples, thereby reducing the computational load and memory required by steps S100-S400, and quickly achieves the purpose of synchronizing the video data with the audio data.
  • Step S300 Perform sliding cross-correlation on the first sequence and the second sequence to obtain cross-correlation coefficients corresponding to different time axis deviations.
  • the cross-correlation coefficient of two time series characterizes the similarity between the values of the two sequences at different times, and can therefore characterize how well the two sequences match each other under a given offset.
  • the cross-correlation coefficient is calculated to characterize the degree of correlation between the first sequence and the second sequence under different time axis offsets, that is, how well the mouth state in the video data matches the voice signal strength in the relatively offset audio data.
  • FIG. 3 is a flowchart of performing sliding cross-correlation between a first sequence and a second sequence according to an embodiment of the present invention.
  • step S300 may include the following steps:
  • Step S310 Perform a time axis offset on the first sequence according to a possible time axis deviation to obtain a first sequence after the offset corresponding to each possible time axis deviation.
  • Step S320 Cross-correlate the second sequence with each offset first sequence to obtain the cross-correlation coefficient corresponding to each possible time axis deviation.
  • step S300 includes:
  • Step S310 ' Perform a time axis offset on the second sequence according to a possible time axis deviation to obtain a second sequence after the offset corresponding to each possible time axis deviation.
  • Step S320' Cross-correlate the first sequence with each offset second sequence to obtain the cross-correlation coefficient corresponding to each possible time axis deviation.
  • in step S320, the cross-correlation coefficient obtained for each possible time axis deviation is corr(Δt) = Σ A(t_i)·I(t_i − Δt), the sum running over the sampling points i = 1, …, n, where
  • Δt is the possible time axis deviation,
  • corr(Δt) is the cross-correlation coefficient corresponding to the possible time axis deviation,
  • i indexes the sampling points obtained by using the sampling period,
  • A(t) is the first sequence,
  • I(t) is the second sequence,
  • I(t − Δt) is the offset second sequence, and
  • n is the length of the first sequence and the second sequence.
  • the above formula is a simplified way of computing the cross-correlation coefficient; it is adopted to further reduce the required amount of computation.
  • the standard cross-correlation coefficient formula can also be used to calculate the cross-correlation.
  • Step S400 Synchronize the video data and the audio data according to the time axis deviation with the maximum cross-correlation coefficient.
  • the cross-correlation coefficient represents the degree of matching between the first sequence and the time-axis-shifted second sequence, that is, the degree of matching between the face lip state and the strength of the voice signal. Therefore, the time axis deviation with the maximum cross-correlation coefficient brings the mouth state and the voice signal strength into the best match; at this point the voice content is consistent with the mouth movements of the face, and the video data and audio data can be synchronized by applying the corresponding relative offset.
  • by acquiring the change in the lip state of the face in the video data and the change in the voice signal strength in the audio data, sliding cross-correlation determines the time axis deviation at which the two changes are most strongly correlated.
  • synchronization is performed based on that time axis deviation. Therefore, audio-video synchronization of video data and audio data can be performed quickly. In one or more embodiments, good video and audio synchronization is achieved without relying on timestamp information, enhancing the user experience.
  • FIG. 4 is a schematic diagram of an electronic device according to an embodiment of the present invention.
  • the electronic device shown in FIG. 4 is a general-purpose data processing apparatus including a general-purpose computer hardware structure including at least a processor 41 and a memory 42.
  • the processor 41 and the memory 42 are connected via a bus 43.
  • the memory 42 is adapted to store instructions or programs executable by the processor 41.
  • the processor 41 may be an independent microprocessor or a set of one or more microprocessors. The processor 41 executes the instructions stored in the memory 42 to carry out the method flow of the embodiments of the present invention described above, thereby processing data and controlling other devices.
  • the bus 43 connects the above components together and also connects them to the display controller 44, the display device, and the input/output (I/O) device 45.
  • the input/output (I/O) device 45 may be a mouse, keyboard, modem, network interface, touch input device, somatosensory input device, printer, or other device known in the art.
  • typically, the input/output (I/O) device 45 is connected to the system through an input/output (I/O) controller 46.
  • the memory 42 may store software components, such as an operating system, a communication module, an interaction module, and an application program. Each module and application described above corresponds to a set of executable program instructions that perform one or more functions and methods described in the embodiments of the invention.
  • aspects of the embodiments of the present invention may be implemented as a system, method, or computer program product. Therefore, various aspects of the embodiments of the present invention may take the form of a completely hardware implementation, a completely software implementation (including firmware, resident software, microcode, etc.), or an implementation combining software and hardware aspects that may generally be referred to herein as a "circuit", "module", or "system". Furthermore, aspects of the invention may take the form of a computer program product implemented in one or more computer-readable media having computer-readable program code implemented thereon.
  • the computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium.
  • the computer-readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing.
  • a computer-readable storage medium may be any tangible medium capable of containing or storing a program for use by or in connection with an instruction execution system, apparatus, or device.
  • the computer-readable signal medium may include a propagated data signal having computer-readable program code implemented therein, such as in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electromagnetic, optical, or any suitable combination thereof.
  • the computer-readable signal medium may be any computer-readable medium that is not a computer-readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
  • Computer program code for performing operations directed to aspects of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, C++, PHP, and Python, and conventional procedural programming languages such as the "C" programming language or similar programming languages.
  • the program code may be executed entirely on the user's computer, partly on the user's computer as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server.
  • the remote computer can be connected to the user's computer through any type of network, including a local area network (LAN) or wide area network (WAN), or can be connected to an external computer (for example, through the Internet using an Internet service provider).

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Theoretical Computer Science (AREA)
  • Television Signal Processing For Recording (AREA)
  • Image Analysis (AREA)

Abstract

Disclosed are a method for synchronizing video data and audio data, a storage medium, and an electronic device. In embodiments of the present invention, the change in the lip state of a face in the video data and the change in the voice signal strength in the audio data are acquired; a sliding cross-correlation finds the time axis deviation at which the lip-state change and the voice-signal-strength change are most strongly correlated; and synchronization is performed based on this time axis deviation. Audio-video synchronization of the video data and the audio data can thus be performed quickly.

Description

Method for synchronizing video data and audio data, storage medium, and electronic device
This application claims priority to Chinese Patent Application No. 2018107599943, filed on July 11, 2018 and entitled "Method for synchronizing video data and audio data, storage medium and electronic device", the entire contents of which are incorporated herein by reference.
Technical Field
The present invention relates to the field of digital signal processing, and in particular to a method for synchronizing video data and audio data, a storage medium, and an electronic device.
Background
With the rapid development of Internet technology, online video viewing has become more and more widespread. Current video typically stores the audio data and the video data in separate files; during playback, information is read from the video file and the audio file respectively. However, if the time axes of the separately stored audio data and video data are not synchronized, the audio and the picture will be out of sync.
Summary of the Invention
Embodiments of the present invention provide a method for synchronizing video data and audio data, a storage medium, and an electronic device, so as to synchronize video data with audio data.
According to a first aspect of the embodiments of the present invention, a method for synchronizing video data and audio data is provided, wherein the method includes:
acquiring a first sequence according to the video data, the first sequence being a time sequence of facial feature parameters, the facial feature parameters being used to characterize the state of the lips (that is, the mouth) of a face in the video data;
acquiring a second sequence according to the audio data, the second sequence being a time sequence of the voice signal strength in the audio data, the second sequence using the same sampling period as the first sequence;
performing sliding cross-correlation on the first sequence and the second sequence to obtain cross-correlation coefficients corresponding to different time axis deviations; and
synchronizing the video data and the audio data according to the time axis deviation having the maximum cross-correlation coefficient.
According to a second aspect of the embodiments of the present invention, a computer-readable storage medium is provided, on which computer program instructions are stored, wherein the computer program instructions, when executed by a processor, implement the method according to the first aspect.
According to a third aspect of the embodiments of the present invention, an electronic device is provided, including a memory and a processor, wherein the memory is configured to store one or more computer program instructions, and the one or more computer program instructions are executed by the processor to implement the method according to the first aspect.
In one or more embodiments, by acquiring the change in the lip state of the face in the video data and the change in the voice signal strength in the audio data, a sliding cross-correlation search finds the time axis deviation at which the lip-state change and the voice-signal-strength change are most strongly correlated; synchronization is performed based on this time axis deviation, realizing audio-video synchronization of the video data and the audio data.
Brief Description of the Drawings
The above and other objects, features, and advantages of the present invention will become clearer from the following description of embodiments of the present invention with reference to the accompanying drawings, in which:
FIG. 1 is a flowchart of a method for synchronizing video data and audio data in one or more embodiments;
FIG. 2 is a flowchart of obtaining a first sequence in the method of an embodiment of the present invention;
FIG. 3 is a flowchart of performing sliding cross-correlation between a first sequence and a second sequence according to an embodiment of the present invention;
FIG. 4 is a block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The present invention is described below based on embodiments, but the present invention is not limited to these embodiments. In the following detailed description of the present invention, some specific details are described exhaustively; the present invention can be fully understood by those skilled in the art even without these details. To avoid obscuring the essence of the present invention, well-known methods, processes, flows, elements, and circuits are not described in detail.
In addition, those of ordinary skill in the art should understand that the drawings provided herein are for illustrative purposes and are not necessarily drawn to scale.
Unless the context clearly requires otherwise, words such as "comprise" and "include" throughout the specification and claims should be interpreted in an inclusive sense rather than an exclusive or exhaustive sense; that is, in the sense of "including but not limited to".
In the description of the present invention, it should be understood that terms such as "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance. In addition, in the description of the present invention, unless otherwise stated, "a plurality of" means two or more.
The inventors are aware that, for video data and audio data recorded online, the portions of the audio data containing no voice signal are removed in order to minimize the storage space occupied by the data, so that segmented audio files of different time lengths are stored. Meanwhile, the video data is also stored in segments as multiple different video files. During playback, the online playback program plays according to the index order of the video files and audio files and the time axis information. Because the lengths of the video files and the audio files are inconsistent, the audio and the picture become unsynchronized during playback.
FIG. 1 is a flowchart of a method for synchronizing video data and audio data in one or more embodiments. In one or more embodiments, the process of synchronizing video data and audio data recorded synchronously for an online classroom is described as an example. As shown in FIG. 1, the method of this embodiment includes the following steps:
Step S100: acquire a first sequence according to the video data, where the first sequence is a time sequence of facial feature parameters, and the facial feature parameters are used to characterize the lip state of a face in the video data.
In one or more embodiments, the video data processed in step S100 is a video file recorded online and segmented. In one or more embodiments, the first sequence is obtained by sampling the video data at a predetermined sampling period to obtain an image at each sampling point, and then processing each image to obtain a facial feature parameter. In one or more embodiments, synchronization is based on the positive correlation between the intensity of a person's speech and the degree to which the person's mouth opens; for example, the wider the mouth opens, the greater the speech intensity usually is. In one or more embodiments, this relationship is used to synchronize the video data and the audio data.
FIG. 2 is a flowchart of obtaining the first sequence in the method of an embodiment of the present invention. As shown in FIG. 2, step S100 includes:
Step S110: sample the video data at a predetermined sampling period to obtain a first image sequence, where the first image sequence includes the sampled images.
In one or more embodiments, the video data is regarded as a continuous image sequence, and the first image sequence can be obtained by extracting one image from the video data every sampling period along the time axis. In one or more embodiments, the amount of data in the first image sequence obtained after extraction is far smaller than that of the original video data, which reduces the computational burden of subsequent data processing. In one or more embodiments, the sampling period is set according to the frequency of face and mouth movements in the video data and the available computing power.
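As a concrete illustration of step S110, the following sketch extracts one frame per sampling period; the use of OpenCV (cv2) and the one-second default period are assumptions of this example rather than requirements of the method.

    import cv2

    def sample_frames(video_path, period_s=1.0):
        """Extract one frame every period_s seconds to form the first image sequence."""
        cap = cv2.VideoCapture(video_path)
        fps = cap.get(cv2.CAP_PROP_FPS) or 25.0       # fall back if FPS metadata is missing
        step = max(int(round(fps * period_s)), 1)     # frames between consecutive samples
        frames, idx = [], 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if idx % step == 0:                       # keep one frame per sampling period
                frames.append(frame)
            idx += 1
        cap.release()
        return frames                                 # the first image sequence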
Step S120: perform face recognition on each image in the first image sequence to obtain the face area information of each image.
In one or more embodiments, in step S120, the face detection is implemented by various existing image processing algorithms, such as the reference template method, the face rule method, the feature sub-face method, and the sample recognition method. In one or more embodiments, the obtained face area information can be represented by a data structure R(X, Y, W, H) for the face region, where R(X, Y, W, H) defines a rectangular region of the image containing the main part of the face, X and Y define the coordinates of one corner of the rectangular region, and W and H define the width and height of the rectangular region, respectively.
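A minimal sketch of step S120 using dlib's frontal face detector as one possible detection method; the choice of dlib, and of OpenCV for grayscale conversion, is an assumption of this example, and the result is packed into the R(X, Y, W, H) form described above.

    import cv2
    import dlib

    detector = dlib.get_frontal_face_detector()

    def detect_face_region(frame_bgr):
        """Return the face area as R = (X, Y, W, H), or None if no face is found."""
        gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
        rects = detector(gray, 1)                               # upsample once to catch small faces
        if not rects:
            return None
        r = max(rects, key=lambda d: d.width() * d.height())    # keep the largest detected face
        return (r.left(), r.top(), r.width(), r.height())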
Step S130: acquire face lip key point information according to each image in the first image sequence and the corresponding face area information.
Because the layout of facial features is highly similar across faces, once the face area information has been detected, the image within the face region can be further analyzed to locate the facial features. In one or more embodiments, the correlation between the degree of mouth opening and the voice signal strength is used to synchronize the video data and the audio data. In one or more embodiments, in this step, the lip state is detected by detecting the face lips and acquiring the face lip key point information.
In one or more embodiments, Dlib is used to perform the above face detection and lip key point extraction. Dlib is a C++ open-source toolkit containing machine learning algorithms. In Dlib, the facial features and contour of a face are identified by 68 key points. In one or more embodiments, the contour of the lips is defined by a number of these key points, so the current state of the mouth in an image can be obtained by extracting the lip key points.
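Continuing the Dlib approach described above, the sketch below extracts the lip key points from the 68-point landmark model (in that layout the mouth occupies indices 48-67); the landmark model file name is the one conventionally distributed with dlib and is an assumption of this example.

    import dlib

    # Pretrained 68-point landmark model; the file name is an assumption of this example.
    predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

    def lip_keypoints(gray, face_rect):
        """Return the lip key points [(x, y), ...] for a detected dlib face rectangle."""
        shape = predictor(gray, face_rect)            # all 68 facial landmarks
        # Indices 48-67 cover the outer and inner lip contours in the 68-point layout.
        return [(shape.part(i).x, shape.part(i).y) for i in range(48, 68)]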
Step S140: acquire the facial feature parameter according to the face lip key point information of each image in the first image sequence.
In one or more embodiments, the facial feature parameter is used to characterize the lip state of the face. In one or more embodiments, the facial feature parameter needs to be able to characterize the degree of mouth opening, so that it can subsequently be related to the voice signal strength. In one or more embodiments, the facial feature parameter may be any one of the height of the face lip image, the area of the face lip image, and the ratio of the height to the width of the face lip image. In one or more embodiments, these parameters are used to characterize the degree of mouth opening. In one or more embodiments, because the ratio of the height to the width of the face lip image is a relative parameter, it eliminates the deviation caused by the face moving back and forth relative to the camera and characterizes the degree of mouth opening across different images. In one or more embodiments, the above parameters are further processed, and a function of at least one of the height of the face lip image, the area of the face lip image, and the ratio of the height to the width of the face lip image is used as the facial feature parameter.
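One possible realization of step S140 is sketched below: the facial feature parameter is taken as the height-to-width ratio of the inner lip contour. The specific landmark indices used (62 and 66 for the vertical extent, 60 and 64 for the horizontal extent) are one common convention for the 68-point model and are an assumption of this example.

    def mouth_open_ratio(lip_points):
        """Facial feature parameter: height / width of the inner lip contour.

        lip_points are the 20 lip key points (landmarks 48-67), so landmark 60
        maps to lip_points[12], 62 to [14], 64 to [16], and 66 to [18].
        """
        left, top = lip_points[12], lip_points[14]
        right, bottom = lip_points[16], lip_points[18]
        width = ((right[0] - left[0]) ** 2 + (right[1] - left[1]) ** 2) ** 0.5
        height = ((bottom[0] - top[0]) ** 2 + (bottom[1] - top[1]) ** 2) ** 0.5
        return height / width if width else 0.0

Collecting this ratio for every sampled image yields the first sequence described in step S150 below.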
Step S150: obtain the first sequence according to the facial feature parameter corresponding to each image in the first image sequence.
The first sequence obtained in this way effectively characterizes how the movement state of the face and mouth in the video data changes over time.
Step S200: acquire a second sequence according to the audio data. In one or more embodiments, the second sequence is a time sequence of the voice signal strength in the audio data. In one or more embodiments, the second sequence uses the same sampling period as the first sequence.
In one or more embodiments, in step S200, the voice signal strength is extracted from the audio data according to the sampling period to obtain the second sequence, and the audio data is an audio file recorded synchronously with the video data with the portions containing no voice signal removed. In one or more embodiments, the removal of the portions containing no voice signal is performed by computing the energy spectrum of the audio data and performing endpoint detection. In one or more embodiments, the audio data is an audio file that, after synchronous recording, is segmented directly by time without any further processing.
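A sketch of step S200 under the assumption that the audio is a 16-bit mono WAV file: the signal is cut into windows of one sampling period and the RMS amplitude of each window is taken as the voice signal strength. The fixed RMS threshold is a crude stand-in for the energy-spectrum and endpoint-detection processing mentioned above and is an assumption of this example.

    import wave
    import numpy as np

    def intensity_sequence(wav_path, period_s=1.0, silence_rms=300.0):
        """Second sequence: one RMS voice-strength value per sampling period."""
        with wave.open(wav_path, "rb") as wf:
            sr = wf.getframerate()
            pcm = np.frombuffer(wf.readframes(wf.getnframes()), dtype=np.int16)
        pcm = pcm.astype(np.float64)
        win = int(sr * period_s)                      # audio samples per sampling period
        rms = np.array([np.sqrt(np.mean(pcm[k * win:(k + 1) * win] ** 2))
                        for k in range(len(pcm) // win)])
        rms[rms < silence_rms] = 0.0                  # suppress windows with no voice signal
        return rms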
In one or more embodiments, speech extraction is implemented by various existing speech signal extraction algorithms, such as linear prediction analysis, perceptual linear prediction coefficients, and filter-bank-based Fbank feature extraction.
In one or more embodiments, the obtained second sequence characterizes the change trend of the voice signal strength in the audio data.
In one or more embodiments, step S100 and step S200 are performed one after the other. In one or more embodiments, step S200 is performed first and then step S100. In one or more embodiments, S200 and S100 are performed simultaneously. In any case, both the first sequence and the second sequence must be extracted successfully before the sliding cross-correlation operation is performed.
In one or more embodiments, the sampling period used is 1 s. This sampling rate appropriately reduces the number of samples, thereby reducing the computational load and memory required by steps S100-S400, and quickly achieves the purpose of synchronizing the video data with the audio data.
Step S300: perform sliding cross-correlation on the first sequence and the second sequence to obtain cross-correlation coefficients corresponding to different time axis deviations.
In one or more embodiments, the cross-correlation coefficient of two time series characterizes the similarity between the values of the two sequences at different times, and can therefore characterize how well the two sequences match each other under a given offset. In one or more embodiments, the cross-correlation coefficient is calculated to characterize the degree of correlation between the first sequence and the second sequence under different time axis offsets, that is, how well the mouth state in the video data matches the voice signal strength in the relatively offset audio data.
FIG. 3 is a flowchart of performing sliding cross-correlation between the first sequence and the second sequence according to an embodiment of the present invention. In an optional implementation, as shown in FIG. 3, step S300 may include the following steps:
Step S310: offset the first sequence along the time axis according to possible time axis deviations, obtaining the offset first sequence corresponding to each possible time axis deviation.
Step S320: cross-correlate the second sequence with each offset first sequence, obtaining the cross-correlation coefficient corresponding to each possible time axis deviation.
In one or more embodiments, offsetting the first sequence along the time axis can be replaced by offsetting the second sequence along the time axis. In this case, step S300 includes:
Step S310': offset the second sequence along the time axis according to possible time axis deviations, obtaining the offset second sequence corresponding to each possible time axis deviation.
Step S320': cross-correlate the first sequence with each offset second sequence, obtaining the cross-correlation coefficient corresponding to each possible time axis deviation.
In one or more embodiments, in step S320, the cross-correlation coefficient obtained for each possible time axis deviation is:
corr(Δt) = Σ A(t_i)·I(t_i − Δt), the sum running over the sampling points i = 1, …, n,
where Δt is the possible time axis deviation, corr(Δt) is the cross-correlation coefficient corresponding to the possible time axis deviation, i indexes the sampling points obtained with the sampling period, A(t) is the first sequence, I(t) is the second sequence, I(t − Δt) is the offset second sequence, and n is the length of the first sequence and the second sequence. When the lengths of the first sequence and the second sequence differ, the time lengths of the video data and the audio data differ, and n is then the length of the shorter of the two sequences. It should also be understood that the above formula is a simplified way of computing the cross-correlation coefficient; it is adopted to further reduce the required amount of computation. The standard cross-correlation coefficient formula can also be used.
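A direct sketch of steps S310-S320 and the simplified coefficient above: the second sequence is shifted by each candidate deviation Δt (expressed in sampling periods) and the summed product with the first sequence is recorded. The search range is an assumption of this example.

    import numpy as np

    def sliding_cross_correlation(a_seq, i_seq, max_shift):
        """Return {Δt: corr(Δt)} for Δt in [-max_shift, max_shift] sampling periods."""
        a = np.asarray(a_seq, dtype=np.float64)       # first sequence A(t)
        s = np.asarray(i_seq, dtype=np.float64)       # second sequence I(t)
        n = min(len(a), len(s))                       # use the shorter common length as n
        scores = {}
        for dt in range(-max_shift, max_shift + 1):
            if dt >= 0:
                x, y = a[dt:n], s[:n - dt]            # pairs A(t) with I(t - Δt)
            else:
                x, y = a[:n + dt], s[-dt:n]
            scores[dt] = float(np.dot(x, y))          # corr(Δt) = Σ A(t)·I(t − Δt)
        return scores

The mapping returned here is what step S400 consumes to pick the deviation with the maximum coefficient.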
Step S400: synchronize the video data and the audio data according to the time axis deviation having the maximum cross-correlation coefficient.
In one or more embodiments, the cross-correlation coefficient characterizes the degree of matching between the first sequence and the time-axis-shifted second sequence, that is, the degree of matching between the face lip state and the voice signal strength. Therefore, the time axis deviation with the maximum cross-correlation coefficient brings the mouth state and the voice signal strength into the best match; at this point the voice content is consistent with the mouth movements of the face, and synchronization is achieved by applying the corresponding relative offset to the video data and the audio data.
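A sketch of step S400 under the assumptions of the previous examples: the deviation with the maximum coefficient is selected and the audio samples are padded or trimmed accordingly. Adjusting the container time stamps with a muxing tool would be an equivalent way to apply the offset.

    import numpy as np

    def synchronize_audio(pcm, sr, scores, period_s=1.0):
        """Shift the audio samples by the time axis deviation with the maximum corr(Δt)."""
        best_dt = max(scores, key=scores.get)         # deviation with the maximum coefficient
        shift = int(best_dt * period_s * sr)          # deviation expressed in audio samples
        if shift > 0:                                 # positive Δt: delay the audio start
            return np.concatenate([np.zeros(shift, dtype=pcm.dtype), pcm]), best_dt
        if shift < 0:                                 # negative Δt: advance the audio start
            return pcm[-shift:], best_dt
        return pcm, best_dt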
In one or more embodiments, by acquiring the change in the lip state of the face in the video data and the change in the voice signal strength in the audio data, a sliding cross-correlation search finds the time axis deviation at which the lip-state change and the voice-signal-strength change are most strongly correlated, and synchronization is performed based on this time axis deviation. Audio-video synchronization of the video data and the audio data can therefore be performed quickly. In one or more embodiments, good video and audio synchronization can be achieved without relying on timestamp information, enhancing the user experience.
FIG. 4 is a schematic diagram of an electronic device according to an embodiment of the present invention. The electronic device shown in FIG. 4 is a general-purpose data processing apparatus with a general-purpose computer hardware structure, including at least a processor 41 and a memory 42. The processor 41 and the memory 42 are connected via a bus 43. The memory 42 is adapted to store instructions or programs executable by the processor 41. The processor 41 may be an independent microprocessor or a set of one or more microprocessors; it executes the instructions stored in the memory 42 to carry out the method flow of the embodiments of the present invention described above, thereby processing data and controlling other devices. The bus 43 connects the above components together and also connects them to a display controller 44, a display device, and an input/output (I/O) device 45. The input/output (I/O) device 45 may be a mouse, keyboard, modem, network interface, touch input device, somatosensory input device, printer, or other device known in the art. Typically, the input/output (I/O) device 45 is connected to the system through an input/output (I/O) controller 46.
The memory 42 may store software components, such as an operating system, a communication module, an interaction module, and an application program. Each of the modules and applications described above corresponds to a set of executable program instructions that perform one or more of the functions and methods described in the embodiments of the invention.
The flowcharts and/or block diagrams of the methods, devices (systems), and computer program products according to the embodiments of the present invention describe various aspects of the present invention. It should be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, special-purpose computer, or other programmable data processing device to produce a machine, such that the instructions (executed via the processor of the computer or other programmable data processing device) create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
Meanwhile, as those skilled in the art will appreciate, various aspects of the embodiments of the present invention may be implemented as a system, method, or computer program product. Therefore, various aspects of the embodiments of the present invention may take the form of a completely hardware implementation, a completely software implementation (including firmware, resident software, microcode, etc.), or an implementation combining software and hardware aspects that may generally be referred to herein as a "circuit", "module", or "system". Furthermore, aspects of the present invention may take the form of a computer program product implemented in one or more computer-readable media having computer-readable program code implemented thereon.
Any combination of one or more computer-readable media may be used. The computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. The computer-readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of the embodiments of the present invention, a computer-readable storage medium may be any tangible medium capable of containing or storing a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer-readable signal medium may include a propagated data signal with computer-readable program code implemented therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including but not limited to electromagnetic, optical, or any suitable combination thereof. The computer-readable signal medium may be any computer-readable medium that is not a computer-readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Computer program code for performing operations directed to aspects of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, C++, PHP, and Python, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may be executed entirely on the user's computer, partly on the user's computer as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the latter case, the remote computer can be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or can be connected to an external computer (for example, through the Internet using an Internet service provider).
The above are merely preferred embodiments of the present invention and are not intended to limit the present invention. For those skilled in the art, the present invention may have various modifications and changes. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.

Claims (10)

  1. A method for synchronizing video data and audio data, characterized in that the method comprises:
    acquiring a first sequence according to the video data, the first sequence being a time sequence of facial feature parameters, the facial feature parameters being used to characterize the lip state of a face in the video data;
    acquiring a second sequence according to the audio data, the second sequence being a time sequence of voice signal strength in the audio data, the second sequence using the same sampling period as the first sequence;
    performing sliding cross-correlation on the first sequence and the second sequence to obtain cross-correlation coefficients corresponding to different time axis deviations;
    synchronizing the video data and the audio data according to the time axis deviation having the maximum cross-correlation coefficient.
  2. The method according to claim 1, characterized in that acquiring the first sequence according to the video data comprises:
    sampling the video data at a predetermined sampling period to obtain a first image sequence, the first image sequence including the sampled images;
    acquiring the facial feature parameter corresponding to each image in the first image sequence to obtain the first sequence.
  3. The method according to claim 2, characterized in that acquiring the facial feature parameter corresponding to each image in the first image sequence comprises:
    performing face detection on each image in the first image sequence to obtain face area information for each image;
    acquiring face lip key point information according to the face area information corresponding to each image in the first image sequence;
    acquiring the facial feature parameter according to the face lip key point information of each image in the first image sequence.
  4. The method according to claim 1, characterized in that the facial feature parameter is: any one of the height of the face lip image, the area of the face lip image, and the ratio of the height to the width of the face lip image; or
    a function including at least one of the height of the face lip image, the area of the face lip image, and the ratio of the height to the width of the face lip image.
  5. The method according to claim 2, characterized in that acquiring the second sequence according to the audio data comprises:
    extracting the voice signal strength from the audio data according to the sampling period to obtain the second sequence.
  6. The method according to claim 1, characterized in that the video data is a video file recorded online, and the audio data is an audio file recorded synchronously with the video data with the portions containing no voice signal removed.
  7. The method according to claim 1, characterized in that performing sliding cross-correlation on the first sequence and the second sequence comprises:
    offsetting the first sequence along the time axis according to possible time axis deviations to obtain an offset first sequence corresponding to each possible time axis deviation;
    cross-correlating the second sequence with each offset first sequence to obtain the cross-correlation coefficient corresponding to each possible time axis deviation.
  8. The method according to claim 1, characterized in that performing sliding cross-correlation on the first sequence and the second sequence comprises:
    offsetting the second sequence along the time axis according to possible time axis deviations to obtain an offset second sequence corresponding to each possible time axis deviation;
    cross-correlating the first sequence with each offset second sequence to obtain the cross-correlation coefficient corresponding to each possible time axis deviation.
  9. A computer-readable storage medium on which computer program instructions are stored, characterized in that the computer program instructions, when executed by a processor, implement the method according to any one of claims 1-8.
  10. An electronic device, comprising a memory and a processor, characterized in that the memory is configured to store one or more computer program instructions, wherein the one or more computer program instructions are executed by the processor to implement the method according to any one of claims 1-8.
PCT/CN2019/081591 2018-07-11 2019-04-04 同步视频数据和音频数据的方法、存储介质和电子设备 WO2020010883A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810759994.3 2018-07-11
CN201810759994.3A CN108924617B (zh) 2018-07-11 2018-07-11 同步视频数据和音频数据的方法、存储介质和电子设备

Publications (1)

Publication Number Publication Date
WO2020010883A1 true WO2020010883A1 (zh) 2020-01-16

Family

ID=64411602

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/081591 WO2020010883A1 (zh) 2018-07-11 2019-04-04 同步视频数据和音频数据的方法、存储介质和电子设备

Country Status (2)

Country Link
CN (1) CN108924617B (zh)
WO (1) WO2020010883A1 (zh)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108924617B (zh) * 2018-07-11 2020-09-18 北京大米科技有限公司 同步视频数据和音频数据的方法、存储介质和电子设备
CN110099300B (zh) * 2019-03-21 2021-09-03 北京奇艺世纪科技有限公司 视频处理方法、装置、终端及计算机可读存储介质
CN110544270A (zh) * 2019-08-30 2019-12-06 上海依图信息技术有限公司 结合语音识别且实时预测人脸追踪轨迹方法及装置
CN112653916B (zh) * 2019-10-10 2023-08-29 腾讯科技(深圳)有限公司 一种音视频同步优化的方法及设备
CN113362849A (zh) * 2020-03-02 2021-09-07 阿里巴巴集团控股有限公司 一种语音数据处理方法以及装置
CN111461235B (zh) 2020-03-31 2021-07-16 合肥工业大学 音视频数据处理方法、系统、电子设备及存储介质
CN111225237B (zh) 2020-04-23 2020-08-21 腾讯科技(深圳)有限公司 一种视频的音画匹配方法、相关装置以及存储介质
CN113096223A (zh) * 2021-04-25 2021-07-09 北京大米科技有限公司 图像生成方法、存储介质和电子设备
CN114422825A (zh) * 2022-01-26 2022-04-29 科大讯飞股份有限公司 音视频同步方法、装置、介质、设备及程序产品
CN115547357B (zh) * 2022-12-01 2023-05-09 合肥高维数据技术有限公司 音视频伪造同步方法及其构成的伪造系统

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103517044A (zh) * 2012-06-25 2014-01-15 鸿富锦精密工业(深圳)有限公司 视频会议装置及其唇形同步的方法
CN105512348A (zh) * 2016-01-28 2016-04-20 北京旷视科技有限公司 用于处理视频和相关音频的方法和装置及检索方法和装置
US20160134785A1 (en) * 2014-11-10 2016-05-12 Echostar Technologies L.L.C. Video and audio processing based multimedia synchronization system and method of creating the same
CN105959723A (zh) * 2016-05-16 2016-09-21 浙江大学 一种基于机器视觉和语音信号处理相结合的假唱检测方法
CN108924617A (zh) * 2018-07-11 2018-11-30 北京大米科技有限公司 同步视频数据和音频数据的方法、存储介质和电子设备

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5387943A (en) * 1992-12-21 1995-02-07 Tektronix, Inc. Semiautomatic lip sync recovery system
US7149686B1 (en) * 2000-06-23 2006-12-12 International Business Machines Corporation System and method for eliminating synchronization errors in electronic audiovisual transmissions and presentations
WO2007035183A2 (en) * 2005-04-13 2007-03-29 Pixel Instruments, Corp. Method, system, and program product for measuring audio video synchronization independent of speaker characteristics
US9111580B2 (en) * 2011-09-23 2015-08-18 Harman International Industries, Incorporated Time alignment of recorded audio signals
CN106067989B (zh) * 2016-04-28 2022-05-17 江苏大学 一种人像语音视频同步校准装置及方法
US10397516B2 (en) * 2016-04-29 2019-08-27 Ford Global Technologies, Llc Systems, methods, and devices for synchronization of vehicle data with recorded audio

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103517044A (zh) * 2012-06-25 2014-01-15 鸿富锦精密工业(深圳)有限公司 视频会议装置及其唇形同步的方法
US20160134785A1 (en) * 2014-11-10 2016-05-12 Echostar Technologies L.L.C. Video and audio processing based multimedia synchronization system and method of creating the same
CN105512348A (zh) * 2016-01-28 2016-04-20 北京旷视科技有限公司 用于处理视频和相关音频的方法和装置及检索方法和装置
CN105959723A (zh) * 2016-05-16 2016-09-21 浙江大学 一种基于机器视觉和语音信号处理相结合的假唱检测方法
CN108924617A (zh) * 2018-07-11 2018-11-30 北京大米科技有限公司 同步视频数据和音频数据的方法、存储介质和电子设备

Also Published As

Publication number Publication date
CN108924617B (zh) 2020-09-18
CN108924617A (zh) 2018-11-30

Similar Documents

Publication Publication Date Title
WO2020010883A1 (zh) 同步视频数据和音频数据的方法、存储介质和电子设备
US10497382B2 (en) Associating faces with voices for speaker diarization within videos
US10181325B2 (en) Audio-visual speech recognition with scattering operators
WO2021082941A1 (zh) 视频人物识别方法、装置、存储介质与电子设备
JP6339489B2 (ja) 画像分割方法および画像分割装置
KR20070118038A (ko) 정보처리 장치 및 정보처리 방법과 컴퓨터·프로그램
WO2020215722A1 (zh) 视频处理方法和装置、电子设备及计算机可读存储介质
WO2020019591A1 (zh) 用于生成信息的方法和装置
WO2021203823A1 (zh) 图像分类方法、装置、存储介质及电子设备
CN113242361B (zh) 一种视频处理方法、装置以及计算机可读存储介质
JP2018159788A5 (ja) 情報処理装置、感情認識方法、及び、プログラム
JP2014049125A (ja) 手順を文書記録する方法及び装置
WO2020052062A1 (zh) 检测方法和装置
JP2008015848A (ja) 物体領域探索方法,物体領域探索プログラムおよび物体領域探索装置
US20150304705A1 (en) Synchronization of different versions of a multimedia content
JP2017146672A (ja) 画像表示装置、画像表示方法、画像表示プログラム及び画像表示システム
US11163822B2 (en) Emotional experience metadata on recorded images
JP6690442B2 (ja) プレゼンテーション支援装置、プレゼンテーション支援システム、プレゼンテーション支援方法及びプレゼンテーション支援プログラム
Shipman et al. Speed-accuracy tradeoffs for detecting sign language content in video sharing sites
US20140285426A1 (en) Signal processing device and signal processing method
Kunka et al. Multimodal English corpus for automatic speech recognition
Lin et al. Detecting Deepfake Videos Using Spatiotemporal Trident Network
CN111128190A (zh) 一种表情匹配的方法及系统
EP2136314A1 (en) Method and system for generating multimedia descriptors
WO2021244468A1 (zh) 视频处理

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19833799

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19833799

Country of ref document: EP

Kind code of ref document: A1