WO2021244468A1 - Video processing - Google Patents

Video processing

Info

Publication number
WO2021244468A1
WO2021244468A1 (PCT/CN2021/097192, CN2021097192W)
Authority
WO
WIPO (PCT)
Prior art keywords
key point
original
original image
audio
sequence
Prior art date
Application number
PCT/CN2021/097192
Other languages
English (en)
French (fr)
Inventor
郭明坤
祝夭龙
Original Assignee
北京灵汐科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京灵汐科技有限公司 filed Critical 北京灵汐科技有限公司
Publication of WO2021244468A1


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/222Studio circuitry; Studio devices; Studio equipment
    • H04N5/262Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
    • H04N5/265Mixing

Definitions

  • the present disclosure relates to the field of computer technology, in particular to video processing.
  • The purpose of the embodiments of the present disclosure is to provide a video processing method, a video processing device, a storage medium, and an electronic device, which are used to adjust images relatively quickly and accurately, thereby improving the match between the images and the audio in a video stream and enhancing the viewing experience.
  • According to a first aspect, a video processing method is provided, comprising: acquiring an original image sequence, the original image sequence being a plurality of original images ordered in time, each of the original images including original key point information; acquiring an audio sequence, the audio sequence being a plurality of audio signals ordered in time, each of the audio signals including an acoustic feature; and adjusting, according to a time correspondence relationship, the original key point information of the original images in the original image sequence according to the acoustic feature of each of the audio signals in the audio sequence, to form a target image sequence, the target image sequence including a plurality of target images ordered in time.
  • According to a second aspect, a video processing device is provided, comprising: an image acquisition unit for acquiring an original image sequence, the original image sequence being a plurality of original images ordered in time, each of the original images including original key point information; an audio acquisition unit for acquiring an audio sequence, the audio sequence being a plurality of audio signals ordered in time, each of the audio signals including an acoustic feature; and an adjustment unit for adjusting, according to a time correspondence relationship, the original key point information of the original images in the original image sequence according to the acoustic feature of each of the audio signals in the audio sequence, to form a target image sequence, the target image sequence including a plurality of target images ordered in time.
  • a computer-readable storage medium on which computer program instructions are stored.
  • the computer program instructions implement the method as described in the first aspect when being executed by the processor.
  • an electronic device including a memory and a processor.
  • the memory is used to store one or more computer program instructions, and the one or more computer program instructions are executed by the processor to implement the method according to the first aspect.
  • At least one original key point information of the original image in the original image sequence is adjusted according to the time correspondence relationship and the key point feature corresponding to the acoustic feature of each audio signal in the audio sequence to obtain the target image sequence.
  • Fig. 1 is a flowchart of a video processing method according to an exemplary embodiment of the present disclosure
  • Fig. 2 is a flowchart of acquiring a target image sequence by a method according to an exemplary embodiment of the present disclosure
  • Fig. 3 is a schematic diagram of acquiring a target image in an exemplary embodiment of the present disclosure
  • Fig. 4 is a schematic diagram of a video processing device according to an exemplary embodiment of the present disclosure.
  • Fig. 5 is a schematic diagram of an electronic device according to an exemplary embodiment of the present disclosure.
  • Fig. 1 is a flowchart of a video processing method according to an exemplary embodiment of the present disclosure. As shown in FIG. 1, the method of this embodiment may include the following steps S100, S200, and S300.
  • Step S100 Obtain an original image sequence.
  • the original image sequence is a plurality of original images sorted by time, which may be captured by an image acquisition device (such as a camera, a video camera, etc.), or may be manually drawn.
  • This embodiment does not specifically limit the method of obtaining the original image sequence.
  • Each original image in the original image sequence includes original key point information.
  • the original key point information is used to characterize the information of the key parts in the face image that have a greater influence on the expression change, and may specifically be the contour shape and coordinates of the key parts.
  • the key parts can specifically be eyes, lips, etc. In this embodiment, the key point information of the lips is selected as the original key point information. Therefore, the server can perform face detection on each original image in the original image sequence, obtain the face area information of each original image, and determine the original key point information of each original image according to the face area information.
  • various video processing algorithms well known to those skilled in the art can be used to realize face detection, such as a reference template method, a face rule method, a feature sub-face method, and a sample recognition method.
  • The obtained face area information can be expressed in many different forms. For example, it can be represented by the face area data structure R(X, Y, W, H), where R(X, Y, W, H) defines a rectangular area of the image that includes the main part of the face, X and Y define the coordinates of one vertex of the rectangular area, and W and H respectively define the width and height of the rectangular area.
  • Dlib may be used to perform face detection and obtain key point information of the lips.
  • Dlib is a C++ open source toolkit containing machine learning algorithms.
  • Dlib can identify the facial features and contours of the face through 68 key points. Among them, the contour of the lips can be defined by multiple key points.
  • the server can extract key point information of the lips from each original image based on Dlib, and determine the original key point information in the original image sequence according to the extracted key point information of each lip and the time stamp information of each original image.
  • Step S200 Acquire an audio sequence.
  • The audio sequence can be recorded synchronously with the image sequence, can be recorded afterwards for the image sequence, or can be obtained by converting each character (or word) of a predetermined text into speech in the order of the time axis of the original image sequence.
  • The time axis correspondence between the image sequence and the audio sequence may be determined in advance, or may be determined in any manner well known to those skilled in the art, which is not specifically limited in the present disclosure. For example, the method described in "Qi Chengming. Research and Realization of Audio and Video Synchronization. Harbin Institute of Technology, Master's Thesis, 2009" can be used to synchronize the time axes of the image sequence and the audio sequence, thereby determining the time axis correspondence between each image in the image sequence and each acoustic feature in the audio sequence.
  • the audio sequence may specifically be a plurality of audio signals sorted in time, and each of the audio signals includes an acoustic feature.
  • the acoustic feature may be used to characterize at least one of audio signal strength and audio signal frequency.
  • The audio signal strength can reflect the volume, and the audio signal frequency can reflect the pitch.
  • Based on the audio signal strength and the audio signal frequency, the server can effectively distinguish whether the audio signal at the current timestamp is a human voice or an environmental sound, so that at least one original image corresponding in time can subsequently be adjusted relatively accurately according to the acoustic feature of that audio signal.
  • Step S300 According to the time correspondence relationship, the original key point information of the original image in the original image sequence is adjusted according to the acoustic characteristics of each audio signal in the audio sequence to form a target image sequence.
  • the target image sequence includes a plurality of target images sorted in time.
  • Fig. 2 is a flowchart of acquiring a target image sequence by a method according to an exemplary embodiment of the present disclosure. As shown in FIG. 2, in an optional implementation of this embodiment, step S300 may be specifically: for each audio signal in the audio sequence, performing processing including the following steps.
  • Step S310 Acquire key point features corresponding to the acoustic features of the audio signal.
  • the key point feature may include, for example, a lip feature.
  • Step S310 may specifically include step S310A: judging whether the audio signal of the current timestamp is a human voice, and if it is a human voice, acquiring a lip feature corresponding to the acoustic feature of the audio signal.
  • Ways of judging whether the audio signal is a human voice include judging whether the audio signal frequency falls within the frequency range of the human voice, or judging whether the audio signal strength falls within the intensity range of the human voice, and so on. This embodiment does not limit how it is determined whether the audio signal is a human voice.
  • lip features may include lip width and lip height.
  • the intensity of a person's speech is positively correlated with the degree of opening of the person's mouth. In other words, the greater the opening of the mouth, the greater the strength of the audio signal. Therefore, the server can determine the corresponding lip feature when the audio signal is a human voice according to the predetermined corresponding relationship between the acoustic feature and the lip feature.
  • the lip feature may also include the shape of the mouth to improve the accuracy of subsequent adjustment of the original key point information.
  • the server can perform voice recognition on the audio signal in any manner well known to those skilled in the art.
  • For example, the speech recognition system described in "Cui Tianyu. Research and Implementation of an HMM-based Speech Recognition System. Jilin University, Master's Thesis, 2016" can be used to perform speech recognition on the audio signal and obtain a speech recognition result corresponding to the audio signal. The server can then obtain the corresponding lip feature according to the speech recognition result.
  • the result of speech recognition is used to characterize characters or phonemes.
  • the server can determine the lip feature corresponding to the voice recognition result according to the corresponding relationship between the voice recognition result and the lip feature. It is easy to understand that the server can also determine the lip feature corresponding to the voice recognition result according to the language corresponding to the voice recognition result and the corresponding relationship between the voice recognition result and the lip feature in this language.
  • the corresponding relationship between the acoustic features and the lip features can also be adjusted according to the ratio of the size of the human face in the original image to the size of the actual human face.
  • Step S320 Adjust the original key point information of the original image corresponding to the audio signal in time in the original image sequence according to the key point feature to obtain target key point information.
  • the server may adjust the original key point information corresponding in time (for example, having the same or adjacent timestamp) according to the characteristics of each key point to obtain the corresponding target key point information.
  • When the audio signal at any timestamp is a human voice, it can be considered that the mouth on the face in the corresponding original image needs to be open to a certain degree; therefore, the server can adjust the original key point information of the corresponding original image according to the lip feature to obtain the target key point information.
  • For example, if the lip feature obtained from the acoustic feature of the audio signal at a particular timestamp is a width of 1.5 cm and a height of 1 cm, the server can determine the position of the center point of the mouth on the face according to the original key point information in the original image corresponding to that timestamp, then adjust, according to the coordinates of the center point, the coordinates of the key points in the original key point information that characterize the two lip corners, the upper end of the lip, and the lower end of the lip, and adaptively adjust the coordinates of the key points adjacent to those key points, so that the distance between the two lip corners is 1.5 cm and the distance between the upper end and the lower end of the lip is 1 cm, thereby obtaining the target key point information.
  • As another example, if the mouth shape obtained from the acoustic feature of the audio signal at a particular timestamp is shape 1, the server can determine the position of the center point of the mouth on the face according to the original key point information in the original image corresponding to that timestamp as a first center point, determine a second center point according to shape 1, make the second center point coincide with the first center point, and then adjust the original key point information according to shape 1, thereby obtaining the target key point information.
  • the server may further determine whether to adjust the original key point information.
  • Step S320A Determine whether the difference between the original key point information and the key point feature corresponding to the acoustic feature of the audio signal is less than a first threshold.
  • the first threshold is used to determine whether the difference between the original key point information and the key point feature corresponding to the acoustic feature of the audio signal is small. For example, it can be determined whether the difference between the first distance between any two original key points in the original key point information and the second distance between the corresponding two key points in the key point feature is less than the first threshold.
  • If the difference is less than the first threshold, the server may leave the original key point information unadjusted; if the difference is greater than or equal to the first threshold, the server may execute step S320. In other words, step S320A can be executed before step S320.
  • Through step S320A, when the difference between the original key point information and the key point feature corresponding to the acoustic feature of the audio signal is small, the server does not need to obtain the target key point information, which appropriately reduces the server's computational load and thus improves the efficiency of key point adjustment.
  • Step S330 Adjust the original image corresponding to the audio signal in time in the original image sequence according to the target key point information, and obtain a target image in the target image sequence corresponding in time to the audio signal.
  • In this step, the server can replace the original key point information in the original image with the target key point information according to the time axis correspondence between the audio sequence and the original image sequence, thereby obtaining the target image.
  • the target image sequence can be determined based on the target image corresponding to each audio signal that is human voice and the original image corresponding to each audio signal that is not human voice.
  • the original key point information of each original image in the original image sequence can be sorted by time to form the original key point sequence; the target key point information of each target image in the target image sequence can also be sorted by time to form the target key point sequence. In this way, when it is necessary to obtain key point information separately, for example, in image frame smoothing processing that considers video fluency, the key points of adjacent timestamps can be directly read from the corresponding original key point sequence or target key point sequence information.
  • step S310 may further include step S310B: acquiring the emotion coefficient corresponding to the acoustic feature of the audio signal.
  • the emotion coefficient is used to characterize the intensity of emotion. Generally, the greater the intensity of the audio signal and/or the higher the frequency of the audio signal, the stronger the emotion of a person when speaking. Therefore, the server may determine the emotion coefficient corresponding to the acoustic feature according to at least one of the audio signal strength and the audio signal frequency. Optionally, the server may determine the emotion coefficient corresponding to the acoustic feature of the audio signal of the human voice according to the corresponding relationship between the acoustic feature and the emotion coefficient.
  • For example, if the audio signal strength is 61-65 decibels, the emotion coefficient can be 1; if the audio signal strength is 66-70 decibels, the emotion coefficient can be 1.5.
  • step S310B is executed before step S320. Specifically, step S310A may be executed first, and then step S310B; or step S310B may be executed first, and then step S310A may be executed. As long as the emotion coefficient corresponding to the acoustic feature is obtained through step S310B before step S320, it belongs to the protection scope of the present disclosure.
  • step S320 may specifically include: adjusting the original key point information of the original image corresponding to the audio signal in time in the original image sequence according to the lip feature or the lip feature and the emotion coefficient to obtain the target key point information .
  • the target key point information can more accurately reflect different emotions in the audio, and the accuracy of the key point adjustment is further improved.
  • step S310 may specifically include: obtaining key point features corresponding to the acoustic features of the audio signal based on the classification model.
  • the server may obtain the emotion coefficient corresponding to the audio signal when it is a human voice, and then based on the pre-trained classification model, determine the corresponding key point feature according to the acoustic feature and the emotion coefficient of the audio signal.
  • the determination method of the emotion coefficient is similar to the determination method of the emotion coefficient in step S310B, and will not be repeated here.
  • the classification model may be a decision tree, a neural network, a support vector machine (SVM, Support Vector Machine), etc., which is not specifically limited in this embodiment.
  • Taking a neural network as an example: a neural network, in full an artificial neural network (ANN), is an information processing model formed by a large number of interconnected processing units. Common artificial neural networks include the Convolutional Neural Network (CNN) and the Recurrent Neural Network (RNN).
  • An ANN is characterized by non-linearity (suitable for processing non-linear information), non-limitation (that is, the overall behavior of a system depends on the interactions between its processing units), non-constancy (that is, it is adaptive, self-organizing, and self-learning, and can keep learning while processing information), and non-convexity (the activation function of the model has multiple extrema, giving the model multiple relatively stable equilibrium states, so that its evolution is diverse). It can therefore be widely applied in various fields for relatively accurate data prediction.
  • the server can obtain the category labels of various lip features in advance, and then convert the problem of predicting lip features into a classification problem.
  • the server can train the classification model based on historical data.
  • the historical data may include acoustic features of multiple audio signals, and emotion coefficients and category labels corresponding to the acoustic features of each audio signal.
  • the server can train the classification model with the acoustic characteristics of each audio signal and the corresponding emotion coefficient as input, and the category label as the output.
  • After the classification model is trained, the server can take the acoustic feature of a human-voice audio signal and the corresponding emotion coefficient as input, obtain the category label corresponding to the acoustic feature of the audio signal based on the trained (i.e., pre-trained) classification model, and then determine the lip feature corresponding to the audio signal according to that category label.
  • the accuracy of adjusting the key point information in the corresponding image in time can be further improved.
  • Since human speech is continuous, when the timestamp interval between two adjacent audio signals is smaller than a predetermined threshold, the lip features in the corresponding images should not change abruptly during that short blank interval. Therefore, the server can adjust the original key point information of the original images corresponding to the timestamp interval to the target key point information of the target image of an adjacent timestamp, so as to enhance the continuity of the target image sequence.
  • The original key point information corresponding to the timestamp interval can be adjusted to the target key point information immediately before the timestamp interval, or to the target key point information immediately after the timestamp interval.
  • For example, video generally runs at 25 frames per second. Suppose the speech of the first word corresponds in time to frames 1-6, the speech of the second word corresponds to frames 7-13, and the speech of the third word corresponds to frames 16-23, with a blank (i.e., no sound) between the end of the second word's speech and the start of the third word's speech. Due to the inertia of the lip features during speech, the frames corresponding to this blank also need key point adjustment.
  • the lip features of the three frames of images corresponding to the blank sound can be adjusted to the lip features consistent with the 13th frame image, or the lip features consistent with the 16th frame image.
  • Fig. 3 is a schematic diagram of acquiring a target image in an exemplary embodiment of the present disclosure.
  • image 31 is an original image with a time stamp of 0 minutes and 33 seconds in the original image sequence.
  • The server can obtain the audio signal with a timestamp of 0 minutes 33 seconds in the audio sequence. If the acoustic feature of that audio signal indicates a human voice, the server may obtain the key point feature, including at least a lip feature, corresponding to the acoustic feature of the audio signal, and then determine whether the difference between the obtained key point feature and the original key point information of image 31 is less than the first threshold. When the difference is greater than or equal to the first threshold, the server may adjust the original key point information according to the key point feature to obtain target key point information, and then replace the original key point information in image 31 with the target key point information, thereby obtaining the target image with a timestamp of 0 minutes 33 seconds, that is, image 32.
  • the server may use the time axis sequence of the original image sequence as the time axis sequence of the target image sequence.
  • the server can synthesize the target image sequence and the audio sequence according to the time axis correspondence between the target image sequence and the audio sequence, thereby obtaining the target video segment.
  • The original key point information of the original images in the original image sequence is adjusted, according to the time correspondence relationship, according to the key point feature corresponding to the acoustic feature of each audio signal in the audio sequence, to obtain the target image sequence.
  • the matching degree between the image and the audio in the video stream can be improved relatively quickly and accurately, and the viewing experience can be enhanced.
  • Fig. 4 is a schematic diagram of a video processing device according to an exemplary embodiment of the present disclosure. As shown in FIG. 4, the device of this embodiment includes an image acquisition unit 41, an audio acquisition unit 42, and an adjustment unit 43.
  • the image acquisition unit 41 is configured to acquire an original image sequence, the original image sequence is a plurality of original images sorted by time, and each of the original images includes original key point information.
  • the audio acquisition unit 42 is configured to acquire an audio sequence, the audio sequence is a plurality of audio signals sorted in time, and each of the audio signals includes an acoustic feature.
  • the adjustment unit 43 is configured to adjust the original key point information of the original image in the original image sequence according to the time correspondence relationship according to the acoustic characteristics of each of the audio signals in the audio sequence to form a target An image sequence, where the target image sequence includes a plurality of target images sorted in time.
  • At least one original key point information of the original image in the original image sequence is adjusted according to the time correspondence relationship according to the acoustic characteristics of each audio signal in the audio sequence to obtain the target image sequence.
  • Fig. 5 is a schematic diagram of an electronic device according to an exemplary embodiment of the present disclosure.
  • the electronic device shown in FIG. 5 is a general-purpose data processing device, has a general-purpose computer hardware structure, and includes at least a processor 51 and a memory 52.
  • the processor 51 and the memory 52 are connected by a bus 53.
  • the memory 52 is suitable for storing instructions or programs executable by the processor 51.
  • the processor 51 may be an independent microprocessor, or may be a collection including one or more microprocessors. As a result, the processor 51 executes the command stored in the memory 52 to execute the method flow of the embodiment of the present disclosure as described above.
  • the bus 53 connects the above-mentioned multiple components together, and at the same time connects the above-mentioned components to the display controller 54 and the display device and the input/output (I/O) device 55.
  • the input/output (I/O) device 55 may be a mouse, a keyboard, a modem, a network interface, a touch input device, a motion sensing input device, a printer, and other devices known in the art.
  • an input/output (I/O) device 55 is connected to the system through an input/output (I/O) controller 56.
  • the memory 52 may store software components, such as an operating system, a communication module, an interaction module, and an application program. Each module and application program described above corresponds to a set of executable program instructions that complete one or more functions and methods described in the disclosed embodiments.
  • Aspects of the embodiments of the present disclosure may be implemented as a system, a method, or a computer program product. Therefore, various aspects of the embodiments of the present disclosure may take the form of a complete hardware implementation, a complete software implementation (including firmware, resident software, microcode, etc.), or an implementation combining software and hardware aspects, which may generally be referred to herein as a "circuit", "module", or "system".
  • aspects of the present disclosure may take the following form: a computer program product implemented in one or more computer-readable media, the computer-readable medium having computer-readable program code implemented thereon.
  • the computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium.
  • the computer-readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, device or device, or any appropriate combination of the foregoing.
  • a computer-readable storage medium may be any tangible medium that can contain or store a program used by an instruction execution system, device, or device, or a program used in conjunction with an instruction execution system, device, or device.
  • the computer-readable signal medium may include a propagated data signal having computer-readable program code implemented therein as in baseband or as part of a carrier wave. Such a propagated signal can take any of a variety of forms, including but not limited to: electromagnetic, optical, or any suitable combination thereof.
  • The computer-readable signal medium can be any computer-readable medium that is not a computer-readable storage medium and that can communicate, propagate, or transmit a program used by, or in conjunction with, an instruction execution system, apparatus, or device.
  • The computer program code used to perform operations directed to various aspects of the present disclosure can be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, C++, PHP, and Python, and conventional procedural programming languages such as the "C" programming language or similar programming languages.
  • the program code can be executed as an independent software package entirely on the user's computer, partly on the user's computer; partly on the user's computer and partly on a remote computer; or entirely on the remote computer or server.
  • The remote computer can be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or can be connected to an external computer (for example, through the Internet using an Internet service provider).

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Signal Processing (AREA)
  • Evolutionary Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • User Interface Of Digital Computer (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the present disclosure disclose a video processing method, a video processing device, a storage medium, and an electronic device. According to one example of the video processing method, after an original image sequence and an audio sequence are acquired, where the original image sequence includes a plurality of original images ordered in time, each original image includes original key point information, the audio sequence includes a plurality of audio signals ordered in time, and each audio signal includes an acoustic feature, at least one piece of original key point information of the original images in the original image sequence can be adjusted, according to a time correspondence relationship, according to the acoustic feature of each audio signal in the audio sequence, so as to obtain a target image sequence that includes a plurality of target images ordered in time.

Description

Video processing
Technical field
The present disclosure relates to the field of computer technology, and in particular to video processing.
Background
With the continuous development of computer technology, video image processing is applied in an increasingly wide range of fields. For industries that deal with video images, such as the film and television industry and the animation industry, post-processing of a shot or produced video may require adjusting the images and/or the audio in the video, which may cause the images at a given timestamp, and in particular the mouth shapes of the people in those images, to no longer match the audio. It is therefore necessary to adjust the images relatively accurately so as to enhance the match between the images and the audio in the video stream.
Summary
In view of this, the purpose of the embodiments of the present disclosure is to provide a video processing method, a video processing device, a storage medium, and an electronic device, which are used to adjust images relatively quickly and accurately, thereby improving the match between the images and the audio in a video stream and enhancing the viewing experience.
According to a first aspect of the embodiments of the present disclosure, a video processing method is provided, the method comprising: acquiring an original image sequence, the original image sequence being a plurality of original images ordered in time, each of the original images including original key point information; acquiring an audio sequence, the audio sequence being a plurality of audio signals ordered in time, each of the audio signals including an acoustic feature; and adjusting, according to a time correspondence relationship, the original key point information of the original images in the original image sequence according to the acoustic feature of each of the audio signals in the audio sequence, so as to form a target image sequence, the target image sequence including a plurality of target images ordered in time.
According to a second aspect of the embodiments of the present disclosure, a video processing device is provided, the device comprising: an image acquisition unit for acquiring an original image sequence, the original image sequence being a plurality of original images ordered in time, each of the original images including original key point information; an audio acquisition unit for acquiring an audio sequence, the audio sequence being a plurality of audio signals ordered in time, each of the audio signals including an acoustic feature; and an adjustment unit for adjusting, according to a time correspondence relationship, the original key point information of the original images in the original image sequence according to the acoustic feature of each of the audio signals in the audio sequence, so as to form a target image sequence, the target image sequence including a plurality of target images ordered in time.
According to a third aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided, on which computer program instructions are stored, wherein the computer program instructions, when executed by a processor, implement the method according to the first aspect.
According to a fourth aspect of the embodiments of the present disclosure, an electronic device is provided, including a memory and a processor, wherein the memory is used to store one or more computer program instructions, and the one or more computer program instructions are executed by the processor to implement the method according to the first aspect.
In the embodiments of the present disclosure, at least one piece of original key point information of the original images in the original image sequence is adjusted, according to the time correspondence relationship, according to the key point feature corresponding to the acoustic feature of each audio signal in the audio sequence, so as to obtain the target image sequence. Through the method of the embodiments of the present disclosure, the match between the images and the audio in a video stream can be improved relatively quickly and accurately, and the viewing experience can be enhanced.
Brief description of the drawings
The above and other purposes, features, and advantages of the present disclosure will become clearer through the following description of the embodiments of the present disclosure with reference to the accompanying drawings, in which:
Fig. 1 is a flowchart of a video processing method according to an exemplary embodiment of the present disclosure;
Fig. 2 is a flowchart of acquiring a target image sequence with a method according to an exemplary embodiment of the present disclosure;
Fig. 3 is a schematic diagram of acquiring a target image in an exemplary embodiment of the present disclosure;
Fig. 4 is a schematic diagram of a video processing device according to an exemplary embodiment of the present disclosure;
Fig. 5 is a schematic diagram of an electronic device according to an exemplary embodiment of the present disclosure.
Detailed description
The present disclosure is described below on the basis of embodiments, but the present disclosure is not limited to these embodiments. In the following detailed description of the present disclosure, some specific details are described exhaustively. The present disclosure can be fully understood by those skilled in the art without the description of these details. To avoid obscuring the essence of the present disclosure, well-known methods, procedures, processes, elements, and circuits are not described in detail.
In addition, those of ordinary skill in the art should understand that the drawings provided herein are for illustration purposes and are not necessarily drawn to scale.
Unless the context clearly requires otherwise, words such as "comprise" and "include" in the specification should be interpreted in an inclusive rather than an exclusive or exhaustive sense; that is, in the sense of "including but not limited to".
In the description of the present disclosure, it should be understood that terms such as "first" and "second" are used for descriptive purposes only and should not be understood as indicating or implying relative importance. In addition, in the description of the present disclosure, unless otherwise stated, "a plurality of" means two or more.
For industries that deal with images, such as the film and television industry and the animation industry, post-processing of a shot or produced video may require adjusting the images and/or the audio in the video. For example, because a voice actor made a slip of the tongue while the audio was being recorded, a sentence in the audio may be re-recorded at a later stage, which may cause the images at the same timestamp, and in particular the mouth shapes of the people in those images, to no longer match the audio. Therefore, for industries that deal with images, such as the film and television industry and the animation industry, the images need to be adjusted relatively accurately to enhance the match between the images and the audio.
In the embodiments of the present disclosure, the case where the face is a human face is taken as an example for description. However, those skilled in the art will readily understand that the method of this embodiment is equally applicable when the face is another type of face, such as the face of a cartoon character or an animal.
Fig. 1 is a flowchart of a video processing method according to an exemplary embodiment of the present disclosure. As shown in Fig. 1, the method of this embodiment may include the following steps S100, S200, and S300.
Step S100: acquire an original image sequence.
In this embodiment, the original image sequence is a plurality of original images ordered in time, and may be captured by an image acquisition device (such as a camera or a video camera) or drawn manually. This embodiment does not specifically limit the way in which the original image sequence is obtained. Each original image in the original image sequence includes original key point information. The original key point information is used to characterize information about key parts of the face image that have a relatively large influence on expression changes, and may specifically be the contour shapes, coordinates, and so on of the key parts. The key parts may specifically be the eyes, the lips, and so on. In this embodiment, the key point information of the lips is selected as the original key point information. Therefore, the server can perform face detection on each original image in the original image sequence, obtain the face area information of each original image, and determine the original key point information of each original image according to the face area information.
In this embodiment, face detection can be realized by various video processing algorithms well known to those skilled in the art, such as a reference template method, a face rule method, a feature sub-face method, and a sample recognition method. The obtained face area information can be expressed in many different forms. For example, it can be represented by the face area data structure R(X, Y, W, H), where R(X, Y, W, H) defines a rectangular area of the image that includes the main part of the face, X and Y define the coordinates of one vertex of the rectangular area, and W and H respectively define the width and height of the rectangular area.
In an optional implementation of this embodiment, Dlib can be used to perform face detection and to obtain the lip key point information. Dlib is a C++ open-source toolkit containing machine learning algorithms. Dlib can mark the facial features and contour of a face with 68 key points, among which the contour of the lips can be defined by a plurality of key points. The server can therefore extract lip key point information from each original image based on Dlib, and determine the original key point information of the original image sequence according to the extracted lip key point information and the timestamp information of each original image.
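As an illustration only (not part of the original disclosure), the Dlib-based extraction described above might be sketched in Python as follows. The predictor file name and the use of landmark indices 48-67 for the lip contour follow Dlib's standard 68-point layout; representing timestamps by frame indices is an assumption.

```python
# Sketch: extracting lip key points from each original image with Dlib.
# Assumes dlib and opencv-python are installed and that the standard 68-point
# predictor file "shape_predictor_68_face_landmarks.dat" is available locally.
import dlib
import cv2

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def lip_keypoints(image_bgr):
    """Return [(x, y), ...] for the lip contour of the first detected face, or None."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)          # face area information, conceptually R(X, Y, W, H)
    if not faces:
        return None
    shape = predictor(gray, faces[0])  # 68 landmarks for the first face
    return [(shape.part(i).x, shape.part(i).y) for i in range(48, 68)]  # lip indices

def original_keypoint_sequence(frames):
    """Pair each frame's lip key points with its timestamp (here, the frame index)."""
    return [(t, lip_keypoints(frame)) for t, frame in enumerate(frames)]
```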
步骤S200,获取音频序列。
在本实施例中,音频序列可以随图像序列同步录制得到,也可以根据图像序列后期录制得到,还可以按原始图像序列的时间轴顺序将预定文本中的每个字(或词)转化为语音得到。图像序列与音频序列的时间轴对应关系可以预先确定,也可以通过本领域技术人员熟知的任意方式来确定,本公开对此不做具体限定。例如,可通过《齐成明.音视频同步问题的研究与实现.哈尔滨工业大学.2009年硕士学位论文》中记载的方法,来对图像序列与音频序列进行时间轴同步,从而确定图像序列中的各图像与音频序列中的各声学特征之间的时间轴对应关系。
音频序列具体可以为按时间排序的多个音频信号,每个所述音频信号包括声学特征。声学特征可以用于表征音频信号强度以及音频信号频率中的至少一项。其中,音频信号强度可以反映音量大小,音频信号频率可以反映音调高低。根据音频信号强度以及音频信号频率,服务器可以有效区分当前时间戳的音频信号为人声或环境声音,从而在后续根据该音频信号的声学特征对在时间上对应的至少一个原始图像进行相对准确地调整。
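The following sketch illustrates, under simple assumptions, how the two acoustic features named here could be computed and used to separate a human voice from environmental sound. The dBFS intensity floor and the 85-255 Hz pitch band are illustrative values chosen for this sketch, not values given in the disclosure.

```python
# Sketch: a crude human-voice check from signal strength and dominant frequency.
import numpy as np

def acoustic_features(samples, sample_rate):
    """Return (intensity_dbfs, dominant_frequency_hz) for one audio signal window."""
    samples = np.asarray(samples, dtype=np.float64)
    rms = np.sqrt(np.mean(samples ** 2)) + 1e-12
    intensity_dbfs = 20.0 * np.log10(rms)             # strength -> volume
    spectrum = np.abs(np.fft.rfft(samples))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    dominant_hz = freqs[1:][np.argmax(spectrum[1:])]   # frequency -> pitch (skip DC)
    return intensity_dbfs, dominant_hz

def is_human_voice(intensity_dbfs, dominant_hz,
                   voice_band=(85.0, 255.0), min_dbfs=-35.0):
    """Treat the signal as voice only inside a plausible pitch band and above a volume floor."""
    return voice_band[0] <= dominant_hz <= voice_band[1] and intensity_dbfs >= min_dbfs
```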
Step S300: according to a time correspondence relationship, adjust the original key point information of the original images in the original image sequence according to the acoustic feature of each audio signal in the audio sequence, so as to form a target image sequence. The target image sequence includes a plurality of target images ordered in time.
Fig. 2 is a flowchart of acquiring a target image sequence with a method according to an exemplary embodiment of the present disclosure. As shown in Fig. 2, in an optional implementation of this embodiment, step S300 may specifically be: for each audio signal in the audio sequence, performing processing that includes the following steps.
Step S310: acquire a key point feature corresponding to the acoustic feature of the audio signal. The key point feature may include, for example, a lip feature.
In some embodiments, the server may acquire the lip feature corresponding to the acoustic feature. Step S310 may specifically include step S310A: judging whether the audio signal at the current timestamp is a human voice, and if so, acquiring the lip feature corresponding to the acoustic feature of that audio signal. Ways of judging whether the audio signal is a human voice include judging whether the audio signal frequency falls within the frequency range of the human voice, or judging whether the audio signal strength falls within the intensity range of the human voice, and so on. This embodiment does not limit how it is determined whether the audio signal is a human voice.
Optionally, the lip feature may include a lip width and a lip height. Generally, the intensity of a person's speech is positively correlated with how wide the person's mouth is open; in other words, the wider the mouth is open, the greater the audio signal strength usually is. Therefore, the server can determine the lip feature corresponding to the audio signal when it is a human voice according to a predetermined correspondence between acoustic features and lip features.
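A minimal sketch of one possible form of this predetermined correspondence follows; the anchor points relating decibels to lip width and height are invented for illustration and are not taken from the disclosure.

```python
# Sketch: a predetermined correspondence between audio signal strength and the
# lip feature (width, height), realized as piecewise-linear interpolation.
import numpy as np

# (intensity in dB, lip width in cm, lip height in cm) -- assumed anchor points
INTENSITY_TO_LIPS = [(40.0, 1.0, 0.2), (55.0, 1.3, 0.6), (70.0, 1.6, 1.2)]

def lip_feature_for_intensity(intensity_db):
    """Louder speech maps to a wider and taller mouth opening."""
    dbs, widths, heights = zip(*INTENSITY_TO_LIPS)
    width = float(np.interp(intensity_db, dbs, widths))
    height = float(np.interp(intensity_db, dbs, heights))
    return width, height
```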
Optionally, the lip feature may also include a mouth shape, to improve the accuracy of the subsequent adjustment of the original key point information. The server can perform speech recognition on the audio signal in any manner well known to those skilled in the art, for example with the speech recognition system described in "Cui Tianyu. Research and Implementation of an HMM-based Speech Recognition System. Jilin University, Master's Thesis, 2016", to obtain a speech recognition result corresponding to the audio signal. The server can then obtain the corresponding lip feature according to the speech recognition result. The speech recognition result is used to characterize a character or a phoneme. Characters or phonemes generally correspond to lip features, and characters or phonemes with similar pronunciations have similar lip features; for example, the lip features of "赢" (ying) and "音" (yin) are quite close. Therefore, the server can determine the lip feature corresponding to the speech recognition result according to the correspondence between speech recognition results and lip features. It is easy to understand that the server can also determine the lip feature corresponding to the speech recognition result according to the language of the speech recognition result and the correspondence between speech recognition results and lip features in that language.
Optionally, the above correspondence between acoustic features and lip features can also be adjusted according to the ratio of the size of the human face in the original image to the size of an actual human face.
Step S320: adjust, according to the key point feature, the original key point information of the original image in the original image sequence that corresponds in time to the audio signal, to obtain target key point information.
In some embodiments, the server may adjust the original key point information corresponding in time (for example, having the same or an adjacent timestamp) according to each key point feature, to obtain the corresponding target key point information.
In some embodiments, when the audio signal at any timestamp is a human voice, it can be considered that the mouth on the face in the original image corresponding to that timestamp needs to be open to a certain degree; therefore, the server can adjust the original key point information of the original image corresponding to that timestamp according to the lip feature, to obtain the target key point information.
For example, if the lip feature obtained by the server from the acoustic feature of the audio signal at a particular timestamp is a width of 1.5 cm and a height of 1 cm, the server can determine the position of the center point of the mouth on the face according to the original key point information in the original image corresponding to that timestamp, then adjust, according to the coordinates of the center point, the coordinates of the key points in the original key point information that characterize the two lip corners, the upper end of the lip, and the lower end of the lip, and adaptively adjust the coordinates of the key points adjacent to those key points, so that the distance between the two lip corners is 1.5 cm and the distance between the upper end of the lip and the lower end of the lip is 1 cm, thereby obtaining the target key point information.
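The following sketch shows one way the adjustment in this example could be carried out. The indices chosen for the lip corners and the upper and lower lip points are assumptions about how the extracted contour is ordered, and the target sizes are assumed to have already been converted from centimetres into pixel units (for example via the face-size ratio mentioned above).

```python
# Sketch: scaling the original lip key points about the mouth centre so that
# corner-to-corner distance matches the target width and top-to-bottom distance
# matches the target height.
import numpy as np

LEFT_CORNER, RIGHT_CORNER, TOP_MID, BOTTOM_MID = 0, 6, 3, 9  # assumed contour indices

def adjust_lip_keypoints(lip_points, target_width, target_height):
    """Return target key point information derived from the original lip key points."""
    pts = np.asarray(lip_points, dtype=np.float64)
    center = pts.mean(axis=0)                           # centre point of the mouth
    cur_width = np.linalg.norm(pts[RIGHT_CORNER] - pts[LEFT_CORNER])
    cur_height = np.linalg.norm(pts[BOTTOM_MID] - pts[TOP_MID])
    sx = target_width / max(cur_width, 1e-6)
    sy = target_height / max(cur_height, 1e-6)
    # Scale every key point (including the neighbours of the four named ones)
    # about the centre, which keeps the contour consistent.
    return center + (pts - center) * np.array([sx, sy])
```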
As another example, if the mouth shape obtained by the server from the acoustic feature of the audio signal at a particular timestamp is shape 1, the server can determine the position of the center point of the mouth on the face according to the original key point information in the original image corresponding to that timestamp as a first center point, determine a second center point according to shape 1, make the second center point coincide with the first center point, and then adjust the original key point information according to shape 1, thereby obtaining the target key point information.
Optionally, the server may further judge whether to adjust the original key point information.
Step S320A: judge whether the difference between the original key point information and the key point feature corresponding to the acoustic feature of the audio signal is less than a first threshold.
The first threshold is used to judge whether the difference between the original key point information and the key point feature corresponding to the acoustic feature of the audio signal is small. For example, it can be judged whether the difference between a first distance between any two original key points in the original key point information and a second distance between the corresponding two key points in the key point feature is less than the first threshold.
If the difference between the original key point information and the key point feature is less than the first threshold, the server may leave the original key point information unadjusted; if the difference is greater than or equal to the first threshold, the server may execute step S320. In other words, step S320A can be executed before step S320.
Through step S320A, when the difference between the original key point information and the key point feature corresponding to the acoustic feature of the audio signal is small, the server does not need to obtain the target key point information, which appropriately reduces the server's computational load and thus improves the efficiency of key point adjustment.
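A sketch of the first-threshold test of step S320A follows. Comparing the width and height deviations is only one possible choice of the distances mentioned above, and the index constants and the threshold value are assumptions.

```python
# Sketch: step S320A -- only adjust when the original key points deviate from the
# key point feature by at least the first threshold.
import numpy as np

LEFT_CORNER, RIGHT_CORNER, TOP_MID, BOTTOM_MID = 0, 6, 3, 9  # same assumed indices as above

def needs_adjustment(lip_points, target_width, target_height, first_threshold=0.1):
    """Return True if the width or height difference reaches the first threshold."""
    pts = np.asarray(lip_points, dtype=np.float64)
    cur_width = np.linalg.norm(pts[RIGHT_CORNER] - pts[LEFT_CORNER])
    cur_height = np.linalg.norm(pts[BOTTOM_MID] - pts[TOP_MID])
    diff = max(abs(cur_width - target_width), abs(cur_height - target_height))
    return diff >= first_threshold
```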
Step S330: adjust, according to the target key point information, the original image in the original image sequence that corresponds in time to the audio signal, to obtain the target image in the target image sequence that corresponds in time to the audio signal.
In this step, after adjusting the original key point information of the original image corresponding to the audio signal at a particular timestamp to obtain the target key point information, the server can replace the original key point information in the original image with the target key point information according to the time axis correspondence between the audio sequence and the original image sequence, thereby obtaining the target image. In this way, the target image sequence can be determined from the target images corresponding to the audio signals that are human voices and the original images corresponding to the audio signals that are not. In addition, the original key point information of each original image in the original image sequence can be ordered in time to form an original key point sequence, and the target key point information of each target image in the target image sequence can likewise be ordered in time to form a target key point sequence. Thus, when the key point information needs to be used on its own, for example in image frame smoothing that considers video fluency, the key point information of adjacent timestamps can be read directly from the corresponding original key point sequence or target key point sequence.
In some embodiments, the acquisition of the target key point information also incorporates an emotion coefficient corresponding to the acoustic feature. As one way of acquiring the key point feature, specifically, step S310 may further include step S310B: acquiring the emotion coefficient corresponding to the acoustic feature of the audio signal.
The emotion coefficient is used to characterize the intensity of emotion. Generally, the greater the audio signal strength and/or the higher the audio signal frequency, the stronger the emotion of a person when speaking. Therefore, the server may determine the emotion coefficient corresponding to the acoustic feature according to at least one of the audio signal strength and the audio signal frequency. Optionally, the server may determine the emotion coefficient corresponding to the acoustic feature of a human-voice audio signal according to a correspondence between acoustic features and emotion coefficients.
For example, if the audio signal strength is 61-65 decibels, the emotion coefficient can be 1; if the audio signal strength is 66-70 decibels, the emotion coefficient can be 1.5.
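A small sketch of this lookup follows. Only the 61-65 dB and 66-70 dB bands come from the text; the coefficients assigned outside that range are assumptions.

```python
# Sketch: mapping audio signal strength (decibels) to an emotion coefficient.
def emotion_coefficient(intensity_db):
    if intensity_db < 61:
        return 1.0   # assumed: calm speech below the quoted range
    if intensity_db <= 65:
        return 1.0   # 61-65 dB -> coefficient 1 (from the text)
    if intensity_db <= 70:
        return 1.5   # 66-70 dB -> coefficient 1.5 (from the text)
    return 2.0       # assumed: stronger emotion above 70 dB
```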
Step S310B is executed before step S320. Specifically, step S310A may be executed first and then step S310B, or step S310B may be executed first and then step S310A. As long as the emotion coefficient corresponding to the acoustic feature is obtained through step S310B before step S320, it falls within the protection scope of the present disclosure.
In this case, step S320 may specifically be: adjusting the original key point information of the original image in the original image sequence that corresponds in time to the audio signal according to the lip feature, or according to the lip feature and the emotion coefficient, to obtain the target key point information.
When adjusting the original key point information according to the lip feature and the emotion coefficient, the server may adjust the original key point information according to the product of the lip feature and the emotion coefficient. For example, if the audio signal at a certain timestamp is a human voice, the emotion coefficient corresponding to the acoustic feature of that audio signal is 1.5, and the lip feature is a width of 1.5 cm and a height of 1 cm, then the server can determine that the product of the width and the emotion coefficient is 1.5 * 1.5 = 2.25 and the product of the height and the emotion coefficient is 1 * 1.5 = 1.5. The original key point information of the original image at that timestamp can then be adjusted according to these two products to obtain the target key point information.
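The product rule in this example can be written directly, as the following one-function sketch shows.

```python
# Sketch: combining the lip feature with the emotion coefficient by multiplication,
# as in the worked example (1.5 cm x 1.5 = 2.25 cm, 1 cm x 1.5 = 1.5 cm).
def scaled_lip_feature(width_cm, height_cm, emotion_coeff):
    return width_cm * emotion_coeff, height_cm * emotion_coeff

# scaled_lip_feature(1.5, 1.0, 1.5) -> (2.25, 1.5)
```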
Through the above steps S310B and S320, the target key point information can reflect the different emotions in the audio relatively accurately, further improving the accuracy of the key point adjustment.
As another way of obtaining the target key point information, step S310 may specifically be: obtaining the key point feature corresponding to the acoustic feature of the audio signal based on a classification model. In this step, the server may obtain the emotion coefficient corresponding to the audio signal when it is a human voice, and then, based on a pre-trained classification model, determine the corresponding key point feature according to the acoustic feature of the audio signal and the emotion coefficient. In this step, the emotion coefficient is determined in a way similar to that in step S310B, which is not repeated here.
In this embodiment, the classification model may be a decision tree, a neural network, a support vector machine (SVM), and so on, which is not specifically limited in this embodiment. Taking a neural network as an example: a neural network, in full an artificial neural network (ANN), is an information processing model formed by a large number of interconnected processing units. Common artificial neural networks include the Convolutional Neural Network (CNN) and the Recurrent Neural Network (RNN). An ANN is characterized by non-linearity (suitable for processing non-linear information), non-limitation (that is, the overall behavior of a system depends on the interactions between its processing units), non-constancy (that is, it is adaptive, self-organizing, and self-learning, and can keep learning while processing information), and non-convexity (the activation function of the model has multiple extrema, giving the model multiple relatively stable equilibrium states, so that its evolution is diverse). It can therefore be widely applied in various fields for relatively accurate data prediction.
Because characters or phonemes with similar pronunciations form similar mouth shapes when a person speaks, the number of distinct lip features is limited. The server can therefore obtain category labels for the various lip features in advance and turn the problem of predicting lip features into a classification problem.
The server can train the classification model based on historical data. The historical data may include the acoustic features of a plurality of audio signals, and the emotion coefficient and category label corresponding to the acoustic feature of each audio signal. During training of the classification model, the server can use the acoustic feature of each audio signal and the corresponding emotion coefficient as input and the category label as output. After the classification model is trained, the server can take the acoustic feature of a human-voice audio signal and the corresponding emotion coefficient as input, obtain the category label corresponding to the acoustic feature of that audio signal based on the trained (i.e., pre-trained) classification model, and then determine the lip feature corresponding to the audio signal according to that category label.
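The following sketch shows how such a classification model might be trained and used with scikit-learn, taking the SVM option mentioned above. The feature layout, the sample values, and the mapping from category labels back to lip features are all assumptions made for illustration.

```python
# Sketch: training an SVM that maps (acoustic features, emotion coefficient) to a
# category label for a lip-feature class, then using it at prediction time.
import numpy as np
from sklearn.svm import SVC

# Historical data: rows of [intensity_db, dominant_hz, emotion_coefficient]
X_train = np.array([
    [62.0, 120.0, 1.0],
    [68.0, 180.0, 1.5],
    [64.0, 220.0, 1.0],
    [69.0, 140.0, 1.5],
])
y_train = np.array([0, 2, 1, 2])     # assumed category labels for lip-feature classes

LABEL_TO_LIP_FEATURE = {             # assumed label -> (width, height) in cm
    0: (1.2, 0.4),
    1: (1.4, 0.7),
    2: (1.6, 1.1),
}

model = SVC(kernel="rbf")            # could equally be a decision tree or an ANN
model.fit(X_train, y_train)

def predict_lip_feature(intensity_db, dominant_hz, emotion_coeff):
    label = int(model.predict([[intensity_db, dominant_hz, emotion_coeff]])[0])
    return LABEL_TO_LIP_FEATURE[label]
```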
By determining the corresponding key point feature, for example the lip feature, from the acoustic feature of the audio signal and the emotion coefficient in this model-based way, the accuracy of adjusting the key point information in the temporally corresponding image can be further improved.
Because human speech is continuous, when the timestamp interval between two adjacent audio signals is smaller than a predetermined threshold, this means that although there is a short blank between the two audio signals, the lip features of the face in the corresponding images should not change abruptly because of that blank. Therefore, the server can adjust the original key point information of the original images corresponding to that timestamp interval to the target key point information of a target image of an adjacent timestamp, so as to enhance the continuity of the target image sequence.
The original key point information corresponding to the timestamp interval can be adjusted to the target key point information immediately before the timestamp interval, or to the target key point information immediately after the timestamp interval.
For example, video generally runs at 25 frames per second. Suppose the speech of the first word corresponds in time to frames 1-6, the speech of the second word corresponds to frames 7-13, and the speech of the third word corresponds to frames 16-23, and between the speech of the second word (specifically, its end point) and the speech of the third word (specifically, its start point) there is a blank corresponding to 3 frames, i.e., no sound. Due to the inertia of the lip features while a person is speaking, the frames corresponding to this blank also need key point adjustment. For example, the lip features of these frames can be adjusted to lip features consistent with the 13th frame, or with the 16th frame.
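A sketch of filling such a blank interval with the key point information of an adjacent frame follows; the frame-indexed dictionary is an assumed representation of the key point sequences, not one prescribed by the disclosure.

```python
# Sketch: give the frames of a blank interval the target key point information of
# the last voiced frame before the gap (or, equivalently, the first one after it).
def fill_blank_frames(keypoints_by_frame, gap_frames, use_previous=True):
    """keypoints_by_frame: dict frame_index -> key point info (None for blank frames)."""
    source = min(gap_frames) - 1 if use_previous else max(gap_frames) + 1
    for f in gap_frames:
        keypoints_by_frame[f] = keypoints_by_frame[source]
    return keypoints_by_frame

# In the example above, the blank frames between frame 13 and frame 16 would take
# frame 13's lip key points (use_previous=True) or frame 16's (use_previous=False).
```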
Fig. 3 is a schematic diagram of acquiring a target image in an exemplary embodiment of the present disclosure. As shown in Fig. 3, image 31 is the original image with a timestamp of 0 minutes 33 seconds in the original image sequence. The server can obtain the audio signal with a timestamp of 0 minutes 33 seconds in the audio sequence. If the acoustic feature of that audio signal indicates a human voice, the server can obtain the key point feature, including at least a lip feature, corresponding to the acoustic feature of that audio signal, and then judge whether the difference between the obtained key point feature and the original key point information of image 31 is less than the first threshold. When the difference is greater than or equal to the first threshold, the server can adjust the original key point information according to the key point feature to obtain target key point information, and then replace the original key point information in image 31 with the target key point information, thereby obtaining the target image with a timestamp of 0 minutes 33 seconds, that is, image 32.
Optionally, after the target image sequence is determined, since the target image sequence is derived from the original image sequence, the server can use the time axis order of the original image sequence as the time axis order of the target image sequence. The server can then synthesize the target image sequence and the audio sequence according to the time axis correspondence between them, thereby obtaining a target video segment.
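A sketch of this final synthesis step follows, writing the target images with OpenCV and muxing in the audio with the ffmpeg command line. The file names, frame rate, and codec choices are assumptions about the surrounding pipeline, and the local OpenCV build is assumed to support the mp4v codec.

```python
# Sketch: synthesize the target image sequence and the audio sequence into one
# target video segment on a shared time axis.
import subprocess
import cv2

def synthesize_target_video(target_frames, audio_path, out_path, fps=25.0):
    h, w = target_frames[0].shape[:2]
    writer = cv2.VideoWriter("frames_only.mp4",
                             cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    for frame in target_frames:          # frames already ordered by the original time axis
        writer.write(frame)
    writer.release()
    # Mux the silent video with the audio sequence.
    subprocess.run(["ffmpeg", "-y", "-i", "frames_only.mp4", "-i", audio_path,
                    "-c:v", "copy", "-c:a", "aac", "-shortest", out_path],
                   check=True)
```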
In this embodiment, the original key point information of the original images in the original image sequence is adjusted, according to the time correspondence relationship, according to the key point feature corresponding to the acoustic feature of each audio signal in the audio sequence, so as to obtain the target image sequence. Through the method of this embodiment, the match between the images and the audio in a video stream can be improved relatively quickly and accurately, and the viewing experience can be enhanced.
Fig. 4 is a schematic diagram of a video processing device according to an exemplary embodiment of the present disclosure. As shown in Fig. 4, the device of this embodiment includes an image acquisition unit 41, an audio acquisition unit 42, and an adjustment unit 43.
The image acquisition unit 41 is used to acquire an original image sequence, the original image sequence being a plurality of original images ordered in time, each of the original images including original key point information. The audio acquisition unit 42 is used to acquire an audio sequence, the audio sequence being a plurality of audio signals ordered in time, each of the audio signals including an acoustic feature. The adjustment unit 43 is used to adjust, according to a time correspondence relationship, the original key point information of the original images in the original image sequence according to the acoustic feature of each of the audio signals in the audio sequence, so as to form a target image sequence, the target image sequence including a plurality of target images ordered in time.
In this embodiment, at least one piece of original key point information of the original images in the original image sequence is adjusted, according to the time correspondence relationship, according to the acoustic feature of each audio signal in the audio sequence, so as to obtain the target image sequence. Through the method of the embodiments of the present disclosure, the match between the images and the audio in a video stream can be improved relatively quickly and accurately, and the viewing experience can be enhanced.
Fig. 5 is a schematic diagram of an electronic device according to an exemplary embodiment of the present disclosure. The electronic device shown in Fig. 5 is a general-purpose data processing apparatus with a general-purpose computer hardware structure, and includes at least a processor 51 and a memory 52, which are connected by a bus 53. The memory 52 is suitable for storing instructions or programs executable by the processor 51. The processor 51 may be an independent microprocessor or a set of one or more microprocessors. The processor 51 executes the instructions stored in the memory 52 to carry out the method flows of the embodiments of the present disclosure described above. The bus 53 connects the above components together and also connects them to a display controller 54, a display device, and an input/output (I/O) device 55. The input/output (I/O) device 55 may be a mouse, a keyboard, a modem, a network interface, a touch input device, a motion-sensing input device, a printer, or another device known in the art. Typically, the input/output (I/O) device 55 is connected to the system through an input/output (I/O) controller 56.
The memory 52 may store software components, such as an operating system, a communication module, an interaction module, and an application program. Each module and application described above corresponds to a set of executable program instructions that accomplish one or more functions and the methods described in the disclosed embodiments.
The flowcharts and/or block diagrams of the methods, devices (systems), and computer program products according to the embodiments of the present disclosure describe various aspects of the present disclosure. It should be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, or other programmable data processing device to produce a machine, such that the instructions (executed via the processor of the computer or other programmable data processing device) create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
As those skilled in the art will appreciate, various aspects of the embodiments of the present disclosure may be implemented as a system, a method, or a computer program product. Therefore, various aspects of the embodiments of the present disclosure may take the form of a complete hardware implementation, a complete software implementation (including firmware, resident software, microcode, etc.), or an implementation combining software and hardware aspects, which may generally be referred to herein as a "circuit", "module", or "system". In addition, aspects of the present disclosure may take the form of a computer program product implemented in one or more computer-readable media having computer-readable program code implemented thereon.
Any combination of one or more computer-readable media may be used. The computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. The computer-readable storage medium may be, for example but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any appropriate combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include: an electrical connection with one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any appropriate combination of the foregoing. In the context of the embodiments of the present disclosure, a computer-readable storage medium may be any tangible medium that can contain or store a program used by, or in conjunction with, an instruction execution system, apparatus, or device.
A computer-readable signal medium may include a propagated data signal with computer-readable program code implemented therein, for example in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including but not limited to electromagnetic, optical, or any suitable combination thereof. A computer-readable signal medium may be any computer-readable medium that is not a computer-readable storage medium and that can communicate, propagate, or transmit a program used by, or in conjunction with, an instruction execution system, apparatus, or device.
Computer program code for carrying out operations directed to aspects of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, C++, PHP, and Python, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer as a standalone software package, partly on the user's computer, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the latter case, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The above are only preferred embodiments of the present disclosure and are not intended to limit the present disclosure. For those skilled in the art, the present disclosure may have various modifications and changes. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present disclosure shall be included within the protection scope of the present disclosure.

Claims (13)

  1. A video processing method, comprising:
    acquiring an original image sequence, the original image sequence being a plurality of original images ordered in time, each of the original images including original key point information;
    acquiring an audio sequence, the audio sequence being a plurality of audio signals ordered in time, each of the audio signals including an acoustic feature; and
    adjusting, according to a time correspondence relationship, the original key point information of the original images in the original image sequence according to the acoustic feature of each of the audio signals in the audio sequence, so as to form a target image sequence, the target image sequence including a plurality of target images ordered in time.
  2. The method according to claim 1, wherein
    the original key point information includes lip key point information; and
    acquiring the original image sequence comprises:
    performing face detection on each of the original images in the original image sequence to obtain face area information of each of the original images; and
    obtaining the lip key point information of each of the original images according to the face area information of each of the original images.
  3. The method according to claim 1 or 2, wherein adjusting, according to the time correspondence relationship, the original key point information of the original images in the original image sequence according to the acoustic feature of each of the audio signals in the audio sequence so as to form the target image sequence comprises:
    for each of the audio signals in the audio sequence,
    acquiring a key point feature corresponding to the acoustic feature of the audio signal, the key point feature including a lip feature, and the lip feature including a lip width and a lip height;
    adjusting, according to the key point feature, the original key point information of the original image in the original image sequence that corresponds in time to the audio signal, to obtain target key point information; and
    adjusting, according to the target key point information, the original image in the original image sequence that corresponds in time to the audio signal, to obtain a target image in the target image sequence that corresponds in time to the audio signal.
  4. The method according to claim 3, wherein acquiring the key point feature corresponding to the acoustic feature of the audio signal comprises:
    judging whether the acoustic feature of the audio signal indicates that the audio signal is a human voice;
    in a case where the audio signal is determined to be a human voice, performing speech recognition on the audio signal to obtain a speech recognition result corresponding to the audio signal, the speech recognition result being used to characterize any one or more of a language, a character, and a phoneme; and
    obtaining, according to the speech recognition result, the lip feature corresponding to the acoustic feature of the audio signal.
  5. The method according to claim 4, wherein acquiring the key point feature corresponding to the acoustic feature of the audio signal further comprises:
    acquiring an emotion coefficient corresponding to the acoustic feature of the audio signal, the emotion coefficient being used to characterize the intensity of emotion; and
    determining, according to the lip feature and the emotion coefficient, the key point feature corresponding to the acoustic feature of the audio signal.
  6. The method according to any one of claims 3 to 5, wherein the key point feature corresponding to the acoustic feature of the audio signal is acquired using a pre-trained classification model, the classification model being obtained by training based on historical data.
  7. The method according to any one of claims 3 to 6, wherein adjusting, according to the key point feature, the original key point information of the original image in the original image sequence that corresponds in time to the audio signal to obtain the target key point information comprises:
    judging whether a difference between the key point feature and the original key point information of the original image in the original image sequence that corresponds in time to the audio signal is less than a first threshold; and
    in a case where the difference is greater than or equal to the first threshold, adjusting, according to the key point feature, the original key point information of the original image in the original image sequence that corresponds in time to the audio signal, to obtain the target key point information.
  8. The method according to claim 3, wherein adjusting, according to the target key point information, the original image in the original image sequence that corresponds in time to the audio signal to obtain the target image in the target image sequence that corresponds in time to the audio signal comprises:
    replacing the original key point information of the original image in the original image sequence that corresponds in time to the audio signal with the target key point information, to obtain the target image.
  9. The method according to any one of claims 3 to 8, wherein adjusting, according to the time correspondence relationship, the original key point information of the original images in the original image sequence according to the acoustic feature of each of the audio signals in the audio sequence so as to form the target image sequence further comprises:
    judging whether a timestamp interval between a first audio signal and a second audio signal that are adjacent in the audio sequence is less than a predetermined threshold;
    when the timestamp interval is less than the predetermined threshold, adjusting the original key point information of the original image in the original image sequence that corresponds in time to the timestamp interval to the target key point information of the target image that corresponds in time to the first audio signal or the second audio signal, so as to obtain a target image in the target image sequence that corresponds in time to the timestamp interval; and
    composing the target image sequence, in time order, from the obtained target images and the original images in the original image sequence whose key point information has not been adjusted.
  10. The method according to any one of claims 1 to 9, wherein the acoustic feature includes at least one of:
    an audio signal strength; and
    an audio signal frequency.
  11. A video processing device, comprising:
    an image acquisition unit for acquiring an original image sequence, the original image sequence being a plurality of original images ordered in time, each of the original images including original key point information;
    an audio acquisition unit for acquiring an audio sequence, the audio sequence being a plurality of audio signals ordered in time, each of the audio signals including an acoustic feature; and
    an adjustment unit for adjusting, according to a time correspondence relationship, the original key point information of the original images in the original image sequence according to the acoustic feature of each of the audio signals in the audio sequence, so as to form a target image sequence, the target image sequence including a plurality of target images ordered in time.
  12. A computer-readable storage medium on which computer program instructions are stored, wherein the computer program instructions, when executed by a processor, implement the method according to any one of claims 1 to 10.
  13. An electronic device, including a memory and a processor, wherein the memory is used to store one or more computer program instructions, and the one or more computer program instructions are executed by the processor to implement the method according to any one of claims 1 to 10.
PCT/CN2021/097192 2020-06-05 2021-05-31 Video processing WO2021244468A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010508016.9 2020-06-05
CN202010508016.9A CN113761988A (zh) Image processing method, image processing device, storage medium and electronic device

Publications (1)

Publication Number Publication Date
WO2021244468A1 (zh)

Family

ID=78785194

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/097192 WO2021244468A1 (zh) 2021-05-31 Video processing

Country Status (2)

Country Link
CN (1) CN113761988A (zh)
WO (1) WO2021244468A1 (zh)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0674315A1 (en) * 1994-03-18 1995-09-27 AT&T Corp. Audio visual dubbing system and method
CN110866968A (zh) * 2019-10-18 2020-03-06 平安科技(深圳)有限公司 基于神经网络生成虚拟人物视频的方法及相关设备
CN111212245A (zh) * 2020-01-15 2020-05-29 北京猿力未来科技有限公司 一种合成视频的方法和装置

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0674315A1 (en) * 1994-03-18 1995-09-27 AT&T Corp. Audio visual dubbing system and method
CN110866968A (zh) * 2019-10-18 2020-03-06 平安科技(深圳)有限公司 基于神经网络生成虚拟人物视频的方法及相关设备
CN111212245A (zh) * 2020-01-15 2020-05-29 北京猿力未来科技有限公司 一种合成视频的方法和装置

Also Published As

Publication number Publication date
CN113761988A (zh) 2021-12-07

Similar Documents

Publication Publication Date Title
US10997764B2 (en) Method and apparatus for generating animation
CN107799126B (zh) 基于有监督机器学习的语音端点检测方法及装置
US7636662B2 (en) System and method for audio-visual content synthesis
Vougioukas et al. Video-driven speech reconstruction using generative adversarial networks
US20150325240A1 (en) Method and system for speech input
CN108538308B (zh) 基于语音的口型和/或表情模拟方法及装置
JP2012014394A (ja) ユーザ指示取得装置、ユーザ指示取得プログラムおよびテレビ受像機
Yang et al. Analysis and predictive modeling of body language behavior in dyadic interactions from multimodal interlocutor cues
Eyben et al. Audiovisual classification of vocal outbursts in human conversation using long-short-term memory networks
Li et al. Expressive Speech Driven Talking Avatar Synthesis with DBLSTM Using Limited Amount of Emotional Bimodal Data.
CN114121006A (zh) 虚拟角色的形象输出方法、装置、设备以及存储介质
US11322151B2 (en) Method, apparatus, and medium for processing speech signal
CN113129867A (zh) 语音识别模型的训练方法、语音识别方法、装置和设备
Youssef et al. Articulatory features for speech-driven head motion synthesis
CN115312030A (zh) 虚拟角色的显示控制方法、装置及电子设备
WO2021244468A1 (zh) 视频处理
Eyben et al. Audiovisual vocal outburst classification in noisy acoustic conditions
Asadiabadi et al. Multimodal speech driven facial shape animation using deep neural networks
WO2023035969A1 (zh) 语音与图像同步性的衡量方法、模型的训练方法及装置
Ivanko Audio-visual Russian speech recognition
JP4864783B2 (ja) パタンマッチング装置、パタンマッチングプログラム、およびパタンマッチング方法
Lan et al. Low level descriptors based DBLSTM bottleneck feature for speech driven talking avatar
CN114360491A (zh) 语音合成方法、装置、电子设备及计算机可读存储介质
Campr et al. Automatic fingersign to speech translator
Ishi et al. Evaluation of a formant-based speech-driven lip motion generation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21818285

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 03/05/2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21818285

Country of ref document: EP

Kind code of ref document: A1