CN118101856A - Image processing method and electronic device - Google Patents

Image processing method and electronic device

Info

Publication number
CN118101856A
Authority
CN
China
Prior art keywords
image
target
motion
training
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410339775.5A
Other languages
Chinese (zh)
Other versions
CN118101856B (en)
Inventor
范凯波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Honor Device Co Ltd
Original Assignee
Honor Device Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Honor Device Co Ltd filed Critical Honor Device Co Ltd
Priority to CN202410339775.5A
Publication of CN118101856A
Application granted
Publication of CN118101856B
Legal status: Active
Anticipated expiration


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 5/00 Details of television systems
    • H04N 5/14 Picture signal circuitry for video frequency region
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/439 Processing of audio elementary streams
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Processing Or Creating Images (AREA)

Abstract


The embodiments of the present application relate to the field of image processing and provide an image processing method and an electronic device. The method includes: obtaining an image to be processed; and generating, from the image to be processed, N target images in a target video and N target audios in the target video, where the N target images correspond one-to-one to the N target audios, N is an integer greater than 1, and each target audio is played while its corresponding target image is displayed. The technical method of the present application makes the content of the played target audio more consistent with the content of the displayed target image, thereby improving the user experience.

Description

Image processing method and electronic device

Technical Field

The present application relates to the field of image processing and, more specifically, to an image processing method and an electronic device.

Background

Nature is always in motion. Motion is one of the most eye-catching visual signals, and humans are highly sensitive to it. Generating an animated image from a still image can improve the user experience.

At the same time, nature is full of varied sounds. Matching audio to an animated image allows the audio to be played while the image is displayed, enhancing emotional expression and making the displayed content more interactive and engaging. However, the matched audio may not be consistent with the content of the displayed animated image, which degrades the user experience.

Summary of the Invention

The present application provides an image processing method and an electronic device that improve the consistency between the played sound and the displayed image content, thereby improving the user experience.

In a first aspect, an image processing method is provided, including: acquiring an image to be processed; and generating, from the image to be processed, N target images in a target video and N target audios in the target video, where the N target images correspond one-to-one to the N target audios, N is an integer greater than 1, and each target audio is played while its corresponding target image is displayed.

In the method provided by the present application, multiple target images are generated from the image to be processed, and audio is generated for each target image, so that the played target audio is better coordinated with the content of the displayed target image, improving the user experience.

In some possible implementations, generating the N target images in the target video and the N target audios in the target video from the image to be processed includes: using an image processing system to process multiple first images in sequence to obtain, for each first image, at least one corresponding second image and the target audio corresponding to each second image. The N target images include the at least one second image corresponding to each first image, and each such second image follows its first image in the target video. The first image processed first by the image processing system is the image to be processed, and the i-th first image processed by the system is one of the at least one second image obtained from the (i-1)-th first image, where i is an integer greater than 1. The image processing system includes a trained neural network model.

By feeding a second image obtained in one pass back into the image processing system as the first image for the next pass, drift or divergence in the generated images can be avoided even for long target videos, so the target images retain good image quality and the user experience is improved.
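The iterative scheme above can be sketched as a short loop. Here `process` is a hypothetical stand-in for the trained image processing system; all names are illustrative and not taken from the patent:

```python
def generate_video(image_to_process, process, n_frames):
    """Autoregressively generate target images and per-image target audio.

    `process` stands in for the image processing system: given a first
    image, it returns (second_images, audios) for the next time steps.
    """
    target_images, target_audios = [], []
    first_image = image_to_process  # the first "first image" is the input
    while len(target_images) < n_frames:
        second_images, audios = process(first_image)
        target_images.extend(second_images)
        target_audios.extend(audios)
        # the next first image is taken from the previous pass's outputs,
        # which is what limits drift over long target videos
        first_image = second_images[-1]
    return target_images[:n_frames], target_audios[:n_frames]
```

With a toy `process` that increments an integer "image" and labels its audio, `generate_video(0, process, 5)` yields five images paired one-to-one with five audios.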

In some possible implementations, the method further includes: obtaining a target subject indicated by the user in the image to be processed. When the first image processed by the image processing system is the image to be processed, the system processes the first image together with target subject area information to obtain the at least one second image corresponding to the first image and the target audio corresponding to each second image. The target subject area information identifies a target subject area; the target subject is recorded in the target subject area of the first image, and in each second image the content recorded outside the target subject area is identical to the content recorded in the corresponding area of the first image.

Compared with the first image, the part of each second image outside the target subject area does not change. The motion recorded in the target video therefore follows the user's instruction, increasing the user's sense of participation in video generation and improving the user experience.
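The constraint that content outside the target subject area stays unchanged can be enforced by simple compositing. This sketch assumes a per-pixel boolean mask representing the target subject area; the function name and data layout are illustrative assumptions:

```python
def composite_outside_region(first_image, generated_image, subject_mask):
    """Keep generated content inside the target subject area; copy every
    pixel outside it from the first image, so only the subject changes.

    Images and mask are equally sized 2-D lists; subject_mask[y][x] is
    True for pixels inside the target subject area.
    """
    return [
        [gen if inside else orig
         for orig, gen, inside in zip(row_orig, row_gen, row_mask)]
        for row_orig, row_gen, row_mask
        in zip(first_image, generated_image, subject_mask)
    ]
```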

In some possible implementations, when the first image processed by the image processing system is the image to be processed, the image processing system processes the first image and the target subject area information to obtain the at least one second image corresponding to the first image and the target audio corresponding to each second image.

When the first image is the image to be processed, the image processing system processes both the first image and the target subject area information; when the first image being processed is not the image to be processed, the system may process the first image alone, and the target subject area information need no longer be an input to the system. The target video can then reflect both the motion of the target subject and the motion of other subjects influenced by the target subject, making the motion recorded in the target video more plausible and improving the user experience.

In some possible implementations, the method further includes: obtaining a target motion trend, indicated by the user, for the target subject. When the first image processed by the image processing system is the image to be processed, the system processes the first image, the target subject area information, and motion information to obtain the target image and target audio corresponding to each of at least one second moment; in the target images corresponding to those second moments, the target subject moves according to the target motion trend represented by the motion information.

The image processing system generates the target video from the user-indicated target subject and its target motion trend, so the subject's motion in the target video follows that trend, increasing the user's sense of participation in video generation and improving the user experience.

In some possible implementations, the image processing system includes a feature prediction model, an image generation model, and an audio generation model. The feature prediction model processes the first image to obtain at least one predicted feature; different predicted features correspond to different second moments, each of which is a moment after the first moment corresponding to the first image in the target video. The image generation model processes each predicted feature to obtain the second image corresponding to each second moment. The audio generation model processes each predicted feature to obtain the target audio corresponding to each second moment.

The image processing system processes the first image to obtain the predicted feature for a second moment, then generates both the target audio and the second image for that moment from the same predicted feature, so the target audio and target image for the same moment are better coordinated, improving the user experience.
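The three-model structure can be sketched as follows; the key point is that the image generation model and the audio generation model consume the same predicted feature for each second moment, which is what keeps the two outputs coordinated. The callables are hypothetical stand-ins for the trained models:

```python
def run_system(first_image, predict_features, generate_image, generate_audio):
    """Feature prediction yields one predicted feature per second moment;
    that same feature drives both the image and the audio generation
    models, so the image and audio for one moment stay coordinated."""
    predicted = predict_features(first_image)  # one feature per second moment
    images = [generate_image(f) for f in predicted]
    audios = [generate_audio(f) for f in predicted]
    return list(zip(images, audios))  # (second image, target audio) pairs
```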

In some possible implementations, the feature prediction model includes a motion displacement field prediction model, a motion feature extraction model, an image feature extraction model, and an adjustment model. The motion displacement field prediction model processes the first image to obtain a motion displacement field for each second moment; each such field represents the displacements of multiple pixels of the first image at that second moment relative to the first moment. The motion feature extraction model performs feature extraction on each motion displacement field to obtain the motion feature for each second moment. The image feature extraction model performs feature extraction on the first image to obtain an image feature. The adjustment model adjusts the image feature according to the motion feature for each second moment to obtain the predicted feature for that moment.

By predicting a motion displacement field that represents the displacement of multiple pixels of the first image between the first moment and the second moment, and adjusting the image feature of the first image according to the motion features of that field, the predicted features become more accurate.
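A motion displacement field can be illustrated with a simplified nearest-pixel forward warp. The patent's models operate on learned features rather than raw pixels, so this is only a conceptual sketch under assumed data layouts:

```python
def warp_by_displacement(image, flow):
    """Move each pixel of `image` by its (dy, dx) displacement in `flow`.

    `flow[y][x]` is the displacement of pixel (y, x) between the first
    and second moments; pixels with zero displacement stay put, which
    matches the zero-displacement constraint outside the target subject
    area. Pixels warped out of bounds are dropped; vacated positions
    keep the fill value 0.
    """
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dy, dx = flow[y][x]
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                out[ny][nx] = image[y][x]
    return out
```

With an all-zero field the warp is the identity, the behaviour required of every pixel outside the target subject area.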

In some possible implementations, the method further includes: obtaining a target subject indicated by the user in the image to be processed. The motion displacement field prediction model processes the first image and the target subject area information to obtain the motion displacement field for each second moment; the target subject area information identifies the target subject area in which the target subject is recorded in the image to be processed. In each motion displacement field, the displacement of out-of-area pixels is 0, an out-of-area pixel being a pixel of the first image located outside the target subject area.

In some possible implementations, the method further includes: obtaining a target motion trend, indicated by the user, for the target subject. When the first image processed by the image processing system is the image to be processed, the motion displacement field prediction model processes the first image, the target subject area information, and the motion information to obtain the motion displacement field for each second moment. The target displacement, represented by each field, of each target subject pixel at the second moment relative to the first moment conforms to the target motion trend represented by the motion information, a target subject pixel being a pixel of the first image located on the target subject.

In a second aspect, a training method for an image processing system is provided, including: obtaining a training sample, a label image, and label audio, where the training sample includes a sample image from a training video, the label image is an image that follows the sample image in the training video, and the label audio is the audio of the training video played while the label image is displayed; processing the training sample with an initial image processing system to obtain a training image and training audio; and adjusting the parameters of the initial image processing system according to a first difference between the training image and the label image and a second difference between the training audio and the label audio. The adjusted initial image processing system is the trained image processing system.
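The parameter-adjustment step can be sketched as a single scalar loss combining the first difference (training vs. label image) and the second difference (training vs. label audio). Mean-squared error and the weighting are illustrative assumptions; the patent does not specify the loss form:

```python
def training_loss(training_image, label_image, training_audio, label_audio,
                  image_weight=1.0, audio_weight=1.0):
    """Combine the first difference (image) and the second difference
    (audio) into one scalar used to adjust the initial image processing
    system's parameters. Inputs are flat lists of numbers standing in
    for image pixels and audio samples."""
    first_diff = sum((t - l) ** 2 for t, l in zip(training_image, label_image))
    first_diff /= len(training_image)
    second_diff = sum((t - l) ** 2 for t, l in zip(training_audio, label_audio))
    second_diff /= len(training_audio)
    return image_weight * first_diff + audio_weight * second_diff
```

In practice the gradient of this loss would be backpropagated through all three sub-models at once, so the shared predicted features are trained to serve both the image and the audio outputs.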

In some possible implementations, the training sample may further include training subject area information, which identifies a training subject area. A training target subject is recorded in the training subject area of the sample image, and the content recorded outside the training subject area in the label image is identical to the content recorded in the corresponding area of the sample image.

In some possible implementations, when the sample image is the first frame of the training video, the training sample includes the training subject area information.

In some possible implementations, the training sample includes training motion information. Relative to the sample image, the training target subject in the label image moves according to the training motion trend represented by the training motion information.

For example, when the sample image is the first frame of the training video, the training sample may include the training motion information.

In some possible implementations, the initial image processing system includes an initial feature prediction model, an initial image generation model, and an initial audio generation model. The initial feature prediction model processes the training sample to obtain a training predicted feature; the initial image generation model processes the training predicted feature to obtain the training image; and the initial audio generation model processes the training predicted feature to obtain the training audio. After parameter adjustment, the initial feature prediction model, the initial image generation model, and the initial audio generation model become, respectively, the feature prediction model, the image generation model, and the audio generation model of the image processing system.

In some possible implementations, the initial feature prediction model includes a motion displacement field prediction model, an initial motion feature extraction model, an initial image feature extraction model, and an adjustment model. The motion displacement field prediction model processes the training sample to obtain a training motion displacement field, which represents the displacements of multiple training pixels of the sample image at a second training moment (corresponding to the label image) relative to a first training moment (corresponding to the sample image); these moments can be understood as the times of the label image and the sample image in the training video. The initial motion feature extraction model performs feature extraction on the training motion displacement field to obtain a training motion feature. The initial image feature extraction model performs feature extraction on the sample image to obtain a training image feature. The adjustment model adjusts the training image feature according to the training motion feature to obtain the training predicted feature. After parameter adjustment, the initial motion feature extraction model and the initial image feature extraction model become, respectively, the motion feature extraction model and the image feature extraction model of the image processing system.

In a third aspect, an image processing apparatus is provided, including units for performing the method of the first aspect and/or the second aspect. The apparatus may be a terminal device or a chip within a terminal device.

In a fourth aspect, an electronic device is provided, including one or more processors and a memory. The memory is coupled to the one or more processors and stores computer program code comprising computer instructions; the one or more processors invoke the computer instructions to cause the electronic device to perform the method of the first aspect and/or the second aspect.

In a fifth aspect, a chip system applied to an electronic device is provided. The chip system includes one or more processors, which invoke computer instructions to cause the electronic device to perform the method of the first aspect and/or the second aspect.

In a sixth aspect, a computer-readable storage medium is provided, containing instructions that, when run on an electronic device, cause the electronic device to perform the method of the first aspect and/or the second aspect.

In a seventh aspect, a computer program product is provided that, when run on an electronic device, causes the electronic device to perform the method of the first aspect and/or the second aspect.

Brief Description of the Drawings

FIG. 1 is a schematic diagram of a hardware system of an electronic device to which the present application applies;

FIG. 2 is a schematic diagram of a software system of an electronic device to which the present application applies;

FIG. 3 is a schematic flowchart of an image processing method provided in an embodiment of the present application;

FIG. 4 is a schematic structural diagram of an image processing system provided in an embodiment of the present application;

FIG. 5 is a schematic diagram of the principle of a generative model applicable to an embodiment of the present application;

FIG. 6 is a schematic structural diagram of a random motion displacement field prediction model provided in an embodiment of the present application;

FIG. 7 is a schematic diagram of how the positions of pixels in a video change over time;

FIG. 8 is a schematic diagram of a latent diffusion model provided in an embodiment of the present application;

FIG. 9 is a schematic flowchart of a training method for an image processing system provided in an embodiment of the present application;

FIG. 10 is a schematic structural diagram of a system architecture provided in an embodiment of the present application;

FIGS. 11 to 14 are schematic diagrams of graphical user interfaces provided in embodiments of the present application;

FIG. 15 is a schematic structural diagram of an image processing apparatus provided in the present application.

Detailed Description

The technical solutions in the embodiments of the present application are described below with reference to the accompanying drawings.

FIG. 1 shows a hardware system of an electronic device to which the present application applies.

The method provided in the embodiments of the present application can be applied to mobile phones, tablet computers, wearable devices, laptop computers, netbooks, personal digital assistants (PDAs), and other electronic devices capable of image processing; the embodiments impose no restriction on the specific type of electronic device.

FIG. 1 shows a schematic structural diagram of the electronic device 100. The electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, antenna 1, antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, buttons 190, a motor 191, an indicator 192, a camera 193, a display screen 194, and a subscriber identification module (SIM) card interface 195. The sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, a barometric pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and so on.

It should be understood that the structure illustrated in this embodiment does not constitute a specific limitation on the electronic device 100. In other embodiments of the present application, the electronic device 100 may include more or fewer components than shown, combine or split certain components, or arrange the components differently. The components shown may be implemented in hardware, software, or a combination of software and hardware.

The processor 110 may include one or more processing units; for example, it may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU). The different processing units may be independent devices or may be integrated into one or more processors.

The controller may be the nerve center and command center of the electronic device 100. It generates operation control signals according to instruction opcodes and timing signals, controlling instruction fetch and execution.

A memory may further be disposed in the processor 110 to store instructions and data. In some embodiments, the memory in the processor 110 is a cache. The memory may store instructions or data that the processor 110 has just used or uses cyclically. If the processor 110 needs to use the instructions or data again, it may call them directly from the memory. This avoids repeated access, reduces the waiting time of the processor 110, and thus improves system efficiency.

The electronic device 100 implements a display function through the GPU, the display screen 194, the application processor, and the like. The GPU is a microprocessor for image processing and connects the display screen 194 to the application processor. The GPU is configured to perform mathematical and geometric calculations for graphics rendering. The processor 110 may include one or more GPUs that execute program instructions to generate or change display information.

The display screen 194 is configured to display images, videos, and the like. The display screen 194 includes a display panel. The display panel may be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini-LED, a Micro-LED, a Micro-OLED, a quantum dot light-emitting diode (QLED), or the like. In some embodiments, the electronic device 100 may include 1 or N display screens 194, where N is a positive integer greater than 1.

The external memory interface 120 may be configured to connect an external memory card, for example, a Micro SD card, to expand the storage capacity of the electronic device 100. The external memory card communicates with the processor 110 through the external memory interface 120 to implement a data storage function, for example, to store files such as music and videos in the external memory card.

The internal memory 121 may be configured to store computer-executable program code, and the executable program code includes instructions. The processor 110 executes various functional applications and data processing of the electronic device 100 by running the instructions stored in the internal memory 121. The internal memory 121 may include a program storage area and a data storage area. The program storage area may store an operating system, an application required by at least one function (for example, a sound playback function or an image playback function), and the like. The data storage area may store data (for example, audio data and a phone book) created during use of the electronic device 100. In addition, the internal memory 121 may include a high-speed random access memory, and may further include a non-volatile memory, for example, at least one magnetic disk storage device, a flash memory device, or a universal flash storage (UFS).

The pressure sensor 180A is configured to sense a pressure signal and may convert the pressure signal into an electrical signal. In some embodiments, the pressure sensor 180A may be disposed on the display screen 194. There are many types of pressure sensors 180A, for example, resistive pressure sensors, inductive pressure sensors, and capacitive pressure sensors. A capacitive pressure sensor may include at least two parallel plates with conductive material. When a force acts on the pressure sensor 180A, the capacitance between the electrodes changes, and the electronic device 100 determines the intensity of the pressure based on the change in capacitance. When a touch operation acts on the display screen 194, the electronic device 100 detects the intensity of the touch operation by using the pressure sensor 180A. The electronic device 100 may also calculate the touch position based on the detection signal of the pressure sensor 180A. In some embodiments, touch operations that act on the same touch position but have different touch operation intensities may correspond to different operation instructions. For example, when a touch operation whose intensity is less than a first pressure threshold acts on the short message application icon, an instruction to view short messages is executed; when a touch operation whose intensity is greater than or equal to the first pressure threshold acts on the short message application icon, an instruction to create a new short message is executed.
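The threshold-based dispatch described above can be sketched as follows. This is a minimal illustration only; the function name, the instruction identifiers, and the threshold value are hypothetical, not part of the original description.

```python
# Hypothetical sketch: the same touch position maps to different instructions
# depending on the detected touch intensity. The threshold value is illustrative.
FIRST_PRESSURE_THRESHOLD = 0.5  # assumed normalized intensity

def dispatch_touch_on_sms_icon(intensity: float) -> str:
    """Return the instruction triggered by a touch on the SMS app icon."""
    if intensity < FIRST_PRESSURE_THRESHOLD:
        return "view_sms"    # light press: view short messages
    return "create_sms"      # firm press: create a new short message
```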

The touch sensor 180K is also called a "touch panel". The touch sensor 180K may be disposed on the display screen 194, and the touch sensor 180K and the display screen 194 form a touchscreen, also called a "touch screen". The touch sensor 180K is configured to detect a touch operation acting on or near it. The touch sensor may pass the detected touch operation to the application processor to determine the type of the touch event. Visual output related to the touch operation may be provided through the display screen 194. In other embodiments, the touch sensor 180K may also be disposed on a surface of the electronic device 100 at a position different from that of the display screen 194.

The software system of the electronic device 100 may use a layered architecture, an event-driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture. In the embodiments of the present application, the Android system with a layered architecture is used as an example to describe the software structure of the electronic device 100.

In the software structure of the electronic device 100, the layered architecture divides the software into several layers, and each layer has a clear role and division of labor. The layers communicate with each other through software interfaces. In some embodiments, the Android system is divided into four layers, as shown in FIG. 2, which are, from top to bottom, an application layer, an application framework layer, the Android runtime and system libraries, and a kernel layer.

The application layer may include a series of application packages. The application layer may include applications (APPs) such as Camera, Calendar, Phone, Maps, Navigation, WLAN, Bluetooth, Music, Videos, Messages, Wallpaper, Gallery, and Settings. One or more applications such as Wallpaper, Gallery, and Settings may be used to perform the image processing method provided in the embodiments of the present application. Gallery may also be referred to as an album, a media library, or the like.

The application framework layer provides an application programming interface (API) and a programming framework for the applications in the application layer. The application framework layer includes some predefined functions.

The application framework layer may include a window manager, a content provider, a view system, a phone manager, a resource manager, a notification manager, a data integration module, and the like.

The window manager is used to manage window programs. The window manager can obtain the size of the display screen, determine whether there is a status bar, lock the screen, capture the screen, and so on. The content provider is used to store and obtain data and make the data accessible to applications. The view system includes visual controls, for example, controls for displaying text and controls for displaying pictures. The view system can be used to build applications. A display interface may consist of one or more views. For example, a display interface including a short message notification icon may include a view for displaying text and a view for displaying pictures. The phone manager is used to provide communication functions of the electronic device 100. The resource manager provides various resources for applications. The notification manager enables an application to display notification information in the status bar; it can be used to convey notification-type messages that disappear automatically after a short stay without user interaction.

The Android runtime includes core libraries and a virtual machine. The Android runtime is responsible for scheduling and management of the Android system.

The core libraries include two parts: one part is the functions that the Java language needs to call, and the other part is the core libraries of Android.

The application layer and the application framework layer run in the virtual machine. The virtual machine executes the Java files of the application layer and the application framework layer as binary files. The virtual machine is used to perform functions such as object lifecycle management, stack management, thread management, security and exception management, and garbage collection.

The system libraries may include a plurality of functional modules, for example, a surface manager, media libraries, a three-dimensional graphics processing library, and a 2D graphics engine.

The surface manager is used to manage the display subsystem and provide fusion of 2D and 3D layers for a plurality of applications. The media libraries support playback and recording of a plurality of common audio and video formats, as well as static image files and the like. The media libraries may support a plurality of audio and video encoding formats. The three-dimensional graphics processing library is used to implement three-dimensional graphics drawing, image rendering, composition, layer processing, and the like. The 2D graphics engine is a drawing engine for 2D drawing.

The kernel layer is the layer between hardware and software. The kernel layer may include driver modules such as a display driver, a camera driver, an audio driver, and a sensor driver.

As the functions of electronic devices become increasingly rich, people can use the photographing function of an electronic device to capture images, and can also use the communication function of the electronic device to receive images.

Generally, a user stores a large number of pictures in the gallery of a mobile phone. The user usually wants to select some exquisite pictures, or pictures that have been manually retouched, to use as screen wallpapers. However, these exquisite pictures can only serve as static wallpapers in the form of background pictures and lack a dynamic display effect.

The natural world is always in motion; even seemingly still scenes contain subtle oscillations caused by wind, water flow, breathing, or other natural rhythms. Motion is one of the most striking visual cues, and humans are particularly sensitive to it. An image captured without motion (or with unrealistic motion) usually feels unnatural or unreal. Static wallpapers can no longer meet people's needs.

A dynamic image may be used as a wallpaper. However, producing a dynamic image is a multimodal problem that integrates images, sound, effect rendering, and artificial intelligence, and the processing is relatively complex.

Artificial intelligence (AI) refers to the theories, methods, technologies, and application systems that use digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and produce a new type of intelligent machine that can respond in a manner similar to human intelligence. Artificial intelligence is also the study of the design principles and implementation methods of various intelligent machines, so that machines have the functions of perception, reasoning, and decision-making.

Machine learning is an important branch of artificial intelligence, and deep learning is in turn an important branch of machine learning. Deep learning refers to using multi-layer neural network structures to learn, from big data, representations of various real-world things that can be used directly in computer calculations (for example, objects in images or sounds in audio). In the field of image processing, deep learning has achieved excellent results in problems such as object detection, image generation, and image segmentation.

Machine learning can be applied to the generation of dynamic images. A dynamic image may be understood as a video that includes a plurality of images but does not include sound.

Various sounds exist in nature. If only a dynamic image is displayed, the user experience is poor.

By matching audio to a dynamic image, the dynamic image can be dubbed. This can enhance emotional expression and make the displayed content more interactive and attractive. However, when matched audio is played while the dynamic image is displayed, the sound may be inconsistent with the image content, which affects the user experience.

To solve the foregoing problems, the present application provides an image processing method and an electronic device.

The image processing method provided in the embodiments of the present application is described in detail below with reference to FIG. 3. The execution subject of the method provided in the present application may be an electronic device, or a software/hardware module in the electronic device that is capable of image processing. For ease of description, the electronic device is used as an example in the following embodiments.

FIG. 3 is a schematic flowchart of the image processing method provided in an embodiment of the present application. The method may include step S310 to step S320, which are described in detail below.

Step S310: Obtain an image to be processed.

The image to be processed may be obtained by reading it from a memory, by receiving it from another electronic device, or the like. The image to be processed may also be an image selected by the user from an album.

Step S320: Based on the image to be processed, generate N target images in a target video and N target audios in the target video, where the N target images are in one-to-one correspondence with the N target audios, N is an integer greater than 1, and each target audio is to be played while the target image corresponding to that target audio is displayed.

That is, a plurality of target images and the target audio corresponding to each target image may be generated based on the image to be processed, and the target video may include the plurality of target images and the target audio corresponding to each target image. The target audio corresponding to each target image may also be understood as the audio that matches that target image.

Because a plurality of target images are generated based on the image to be processed, and audio is generated for each target image, the content of the played target audio is better coordinated with the content of the displayed target image, which improves the user experience.

In step S320, an image processing system may be used to process the image to be processed to obtain the N target images and the N target audios. However, if too many target images are generated from a single image, the target images may drift or diverge and become blurred, resulting in poor image quality and a poor user experience.

When the target video is long and the number of target images is large, in order for the generated target images to have high clarity and image quality, the image processing system may be used to process a plurality of first images in sequence, to obtain at least one second image corresponding to each first image and the audio corresponding to each second image. The plurality of target images include the at least one second image corresponding to each first image. The audio corresponding to each second image is the target audio corresponding to that second image. The at least one second image corresponding to each first image consists of images that follow that first image in the target video.

The first first image processed by the image processing system may be the image to be processed. The i-th first image processed by the image processing system may be an image in the at least one second image corresponding to the (i-1)-th first image processed by the image processing system, where i is a positive integer greater than 1. For example, the i-th first image may be the image, among the at least one second image corresponding to the (i-1)-th first image, that comes last in the target video.
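The sequential, chunk-by-chunk generation described above can be sketched as an autoregressive loop in which the last frame of each round is fed back as the next first image. This is an illustrative sketch only; `process_chunk` is a hypothetical stand-in for the image processing system, and the loop structure reflects one reading of the scheme, not the exact implementation.

```python
def generate_video(image_to_process, process_chunk, num_rounds):
    """Chunked generation: process_chunk(first_image) returns
    (list_of_second_images, list_of_matching_audios) for that first image."""
    frames, audios = [], []
    first_image = image_to_process           # round 1: the image to be processed
    for _ in range(num_rounds):
        second_images, chunk_audios = process_chunk(first_image)
        frames.extend(second_images)
        audios.extend(chunk_audios)
        # round i uses the last second image of round i-1 as its first image
        first_image = second_images[-1]
    return frames, audios
```

With a toy `process_chunk` that produces two "frames" per round, three rounds yield six frames whose chain is continuous from the initial image.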

The image processing system includes a neural network model obtained through training.

For example, when the duration of the target video is greater than or equal to a preset duration threshold, the image processing system may be used to process the plurality of first images in sequence.

When a first image corresponds to a plurality of second images, the processing of the first image by the image processing system may be understood as the processing of the first image together with time information, where different time information indicates different time intervals relative to the first image. By processing the first image and a given piece of time information, the image processing system can obtain the second image whose time interval from the first image is the interval indicated by that time information.

The differences between the time intervals indicated by the time information may be integer multiples of a preset duration, or may be other values. The preset duration may be understood as the time interval between two adjacent target images. In other words, the time intervals between adjacent target images may be the same or different.

For example, when the differences between the time intervals indicated by the time information are all integer multiples of the preset duration, the duration of each target audio may be the preset duration, and within the preset duration, the sound in the target audio may or may not change.

When the differences between the time intervals indicated by the time information are not determined based on the preset duration, the playback duration of each target audio may be determined based on the time interval indicated by the time information, and the sound of the target audio may remain unchanged.

When each first image corresponds to exactly one second image, the second image may be the image in the target video that is a preset duration after the first image. In other words, the time intervals between adjacent target images may be equal.

The image processing system may include a feature prediction model, an image generation model, and an audio generation model.

The feature prediction model is used to process the first image to obtain at least one predicted feature respectively corresponding to at least one second moment. Each second moment is a moment, in the target video, after the first moment corresponding to the first image.

The image generation model is used to separately process the predicted feature corresponding to each second moment, to generate the second image corresponding to that second moment.

The audio generation model is used to separately process the predicted feature corresponding to each second moment, to generate the target audio corresponding to that second moment.

The target audio whose second moment is the same as that of a given second image may be understood as the target audio corresponding to that second image.

The image generation model and the audio generation model may be generative models.

The first image is processed to obtain the predicted feature corresponding to a second moment, and both the target audio and the second image corresponding to that second moment are generated from that predicted feature. This makes the target audio more strongly correlated with the second image, that is, their content is better coordinated, which improves the user experience.
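The shared-feature design above can be illustrated as one predicted feature per second moment driving both an image head and an audio head, so that the generated frame and its audio stay consistent. In this sketch the two linear "heads" are toy stand-ins for the image generation model and the audio generation model; all shapes and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
feature_dim, image_dim, audio_dim = 8, 16, 4
W_img = rng.standard_normal((feature_dim, image_dim))  # stand-in image head
W_aud = rng.standard_normal((feature_dim, audio_dim))  # stand-in audio head

def generate_frame_and_audio(predicted_feature):
    """Both outputs are derived from the SAME predicted feature for one moment."""
    frame = predicted_feature @ W_img   # second image for this second moment
    audio = predicted_feature @ W_aud   # target audio for the same moment
    return frame, audio
```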

For example, the feature prediction model may include a motion displacement field prediction model, a motion feature extraction model, an image feature extraction model, and an adjustment model.

The motion displacement field prediction model, which may also be called a random motion displacement field prediction model, is used to process the first image to obtain the motion displacement field corresponding to each second moment. The motion displacement field corresponding to each second moment represents the displacements of a plurality of pixels in the first image at that second moment relative to the first moment.

The motion feature extraction model is used to separately perform feature extraction on the at least one motion displacement field, to obtain the motion feature corresponding to each second moment.

The motion feature extraction model may be a convolutional neural network. The motion feature may include the output of the last layer of the motion feature extraction model, and may further include the outputs of a plurality of hidden layers of the motion feature extraction model.

The image feature extraction model is used to perform feature extraction on the first image to obtain an image feature.

The image feature extraction model may be a convolutional neural network. The image feature may include the output of the last layer of the image feature extraction model, and may further include the outputs of a plurality of hidden layers of the image feature extraction model.

The adjustment model is used to adjust the image feature based on the motion feature corresponding to each second moment, to obtain the predicted feature corresponding to that second moment.

The adjustment model may be an activation function, for example, the softmax activation function.

A motion displacement field representing the displacements of a plurality of pixels in the first image between the first moment and a second moment is predicted, and the image feature of the first image is adjusted based on the motion feature of the motion displacement field, which makes the predicted feature more accurate.
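One plausible reading of the adjustment step is that the motion feature is turned into per-channel weights via softmax and used to modulate the image feature. The sketch below illustrates that reading only; the shapes and the multiplicative modulation rule are assumptions, not the patented implementation.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D motion feature vector."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def adjust(image_feature, motion_feature):
    """Adjust the image feature with motion-derived weights to get the
    predicted feature for one second moment (illustrative rule)."""
    weights = softmax(motion_feature)   # softmax as the adjustment model
    return image_feature * weights
```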

The motion displacement field prediction model may include a first transform module, a motion texture prediction model, and a second transform module.

The first transform module is used to perform a Fourier transform on the first image to obtain image frequency-domain data. The image frequency-domain data may be understood as the frequency-domain representation of the first image.

The motion texture prediction model, which may also be called a random motion texture prediction model, is used to process the image frequency-domain data to generate a motion texture. The motion texture prediction model may be a generative model, for example, a diffusion model or a latent diffusion model (LDM).

The second transform module is used to perform an inverse Fourier transform on the motion texture to obtain the motion displacement field.
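The transform sandwich described above (forward FFT, a frequency-domain prediction step, inverse FFT back to a spatial motion displacement field) can be sketched with NumPy. `predict_motion_texture` is a hypothetical stand-in for the learned motion texture prediction model; here it is an identity placeholder so the round trip is exact.

```python
import numpy as np

def predict_motion_texture(freq):
    """Placeholder for the learned motion texture prediction model."""
    return freq  # identity stand-in

def motion_displacement_field(first_image):
    freq = np.fft.fft2(first_image)          # first transform module: FFT
    texture = predict_motion_texture(freq)   # motion texture prediction model
    field = np.fft.ifft2(texture).real       # second transform module: inverse FFT
    return field
```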

When the motion texture prediction model is an LDM, the LDM includes a compression model and a motion texture generation model.

The compression model is used to compress the image frequency-domain data to obtain compressed image data. The compression model may be an encoder, for example, the encoder in a variational autoencoder (VAE).

The motion texture generation model is used to process the compressed image data to generate the motion texture. The motion texture generation model may be a diffusion model.

For example, the motion texture generation model may include a diffusion model and a decompression model. The diffusion model may be used to perform multiple denoising passes on compressed frequency-domain noise data, conditioned on the compressed image data, to obtain a compressed motion texture. The decompression model may be used to decompress the compressed motion texture to obtain the motion texture. The denoising processing may be understood as the removal of Gaussian noise.

The compressed frequency-domain noise data may be obtained by compressing frequency-domain noise data with the compression model, or may be preset.
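The LDM-style pipeline above (compress, iteratively denoise compressed noise conditioned on the compressed image data, then decompress) can be sketched with toy stand-ins. All four functions below are hypothetical placeholders for learned models, chosen only so that the control flow is visible; the denoising rule simply moves the latent toward the conditioning signal over several passes.

```python
import numpy as np

def compress(x):
    """Toy 'encoder': downsample by 2 (stands in for the compression model)."""
    return x[::2]

def decompress(z):
    """Toy 'decoder': upsample by repetition (stands in for decompression)."""
    return np.repeat(z, 2)

def denoise_step(noisy, condition, step, total):
    """Toy denoiser: move a fraction of the way from noise toward the condition."""
    return noisy + (condition - noisy) / (total - step)

def generate_motion_texture(image_freq_data, noise, steps=4):
    cond = compress(image_freq_data)       # compressed image data (conditioning)
    z = compress(noise)                    # compressed frequency-domain noise
    for s in range(steps):                 # multiple denoising passes
        z = denoise_step(z, cond, s, steps)
    return decompress(z)                   # decompress -> motion texture
```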

Before step S320 is performed, a target subject indicated by the user in the image to be processed may also be obtained.

Based on the target subject region indicated by the user, target subject region information may be determined. The target subject region information indicates the target subject region. The target subject region in the image to be processed records the target subject indicated by the user.

The image processing system is used to process the first image and the target subject region information, to obtain the at least one second image corresponding to the image to be processed and the target audio corresponding to each second image.

In each second image, the content recorded in the regions outside the target subject region is the same as the content recorded in those regions in the first image.

That is, in those other regions, the second image and the first image have the same pixel value at each pixel. Compared with the first image, the part outside the target subject region where the target subject is located does not change in the second image. As a result, the motion recorded in the target video follows the user's instruction, which increases the user's sense of participation in video generation and improves the user experience.
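The pixel-level constraint above can be expressed as a mask composite: inside the user-indicated target subject region, take the generated pixels; outside it, keep the first image's pixels exactly. This is a minimal sketch of that constraint, assuming a binary `mask` that is 1 inside the target subject region and 0 elsewhere.

```python
import numpy as np

def composite(first_image, generated, mask):
    """Keep generated content inside the target subject region (mask == 1)
    and the first image's pixels everywhere else (mask == 0)."""
    return mask * generated + (1 - mask) * first_image
```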

图像的像素值可以是一个颜色值,像素值可以是表示颜色的长整数。例如,像素值可以是通过红绿蓝(red green blue,RGB)表示的颜色值,各个颜色分量中,数值越小,亮度越低,数值越大,亮度越高。对于灰度图像来说,像素值可以是灰度值。The pixel value of an image can be a color value, and the pixel value can be a long integer representing a color. For example, the pixel value can be a color value represented by red, green, and blue (RGB). In each color component, the smaller the value, the lower the brightness, and the larger the value, the higher the brightness. For a grayscale image, the pixel value can be a grayscale value.
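例如,可以用如下Python代码演示RGB像素值与表示颜色的长整数之间的对应关系(仅为示意)。For illustration only, the following Python snippet shows how an RGB pixel value maps to a long integer representing the color.

```python
# RGB像素值示例:每个颜色分量取值0~255,数值越大亮度越高。
# 三个分量可以打包为一个表示颜色的长整数。
r, g, b = 255, 128, 0               # 橙色 orange
packed = (r << 16) | (g << 8) | b   # 打包为长整数
print(hex(packed))                  # 0xff8000

# 从长整数中还原各颜色分量
assert (packed >> 16) & 0xFF == r
assert (packed >> 8) & 0xFF == g
assert packed & 0xFF == b
```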

图像处理系统中,特征预测模型用于对第一图像和目标主体区域信息进行处理,以得到每个第二时刻对应的预测特征。从而,在预测特征中体现了用户指示的目标主体的影响。图像生成模型和音频生成模型可以基于体现了用户指示的目标主体影响的预测特征进行后续处理。In the image processing system, the feature prediction model is used to process the first image and the target subject area information to obtain the prediction features corresponding to each second moment. Thus, the influence of the target subject indicated by the user is reflected in the prediction features. The image generation model and the audio generation model can perform subsequent processing based on the prediction features reflecting the influence of the target subject indicated by the user.

示例性地,特征预测模型中的运动位移场预测模型可以用于,对第一图像和目标主体区域信息进行处理,以得到每个第二时刻对应的运动位移场。从而,运动位移场中体现了用户指示的目标主体的影响。运动特征提取模型和调整模型可以基于体现了用户指示的目标主体影响的运动位移场进行后续处理。Exemplarily, the motion displacement field prediction model in the feature prediction model can be used to process the first image and the target subject area information to obtain the motion displacement field corresponding to each second moment. Thus, the motion displacement field reflects the influence of the target subject indicated by the user. The motion feature extraction model and the adjustment model can perform subsequent processing based on the motion displacement field reflecting the influence of the target subject indicated by the user.

运动位移场预测模型对于第一图像和目标主体区域信息的处理过程中,第一变换模块可以对第一图像进行傅里叶变换,以得到图像频域数据。运动纹理预测模型用于对图像频域数据和目标主体区域信息进行处理,以得到每个第二时刻对应的运动纹理。第二变换模块用于对运动纹理进行逆傅里叶变换,以得到运动位移场。During the processing of the first image and the target subject area information by the motion displacement field prediction model, the first transformation module can perform Fourier transformation on the first image to obtain image frequency domain data. The motion texture prediction model is used to process the image frequency domain data and the target subject area information to obtain the motion texture corresponding to each second moment. The second transformation module is used to perform inverse Fourier transformation on the motion texture to obtain the motion displacement field.
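该处理流程可以用如下Python代码示意(predict_motion_texture为假设的占位函数,真实的运动纹理预测模型为神经网络;此处仅演示傅里叶变换与逆傅里叶变换如何衔接预测步骤)。A Python sketch of this pipeline, with predict_motion_texture as a hypothetical placeholder for the real neural model, showing only how the Fourier transform and its inverse bracket the prediction step.

```python
import numpy as np

rng = np.random.default_rng(0)
first_image = rng.random((16, 16))              # 第一图像

freq = np.fft.fft2(first_image)                 # 第一变换模块:傅里叶变换

def predict_motion_texture(freq_data, region_mask):
    """假设的占位实现:真实的运动纹理预测模型为神经网络。"""
    return freq_data * region_mask

mask = np.zeros((16, 16))
mask[4:12, 4:12] = 1.0                          # 目标主体区域信息(掩码)

texture = predict_motion_texture(freq, mask)    # 某个第二时刻对应的运动纹理
displacement_field = np.fft.ifft2(texture).real # 第二变换模块:逆傅里叶变换
print(displacement_field.shape)                 # (16, 16)
```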

运动纹理预测模型中,压缩模型可以用于对图像频域数据进行压缩,以得到图像压缩数据。运动纹理生成模型可以用于对图像压缩数据和目标主体区域信息进行处理,以生成运动纹理。In the motion texture prediction model, the compression model can be used to compress the image frequency domain data to obtain image compression data. The motion texture generation model can be used to process the image compression data and the target subject area information to generate motion texture.

或者,压缩模型还可以用于对目标主体区域信息进行压缩,以得到压缩区域信息。运动纹理生成模型可以用于对压缩区域信息和图像压缩数据进行处理,以生成运动纹理。Alternatively, the compression model can also be used to compress the target subject area information to obtain compressed area information. The motion texture generation model can be used to process the compressed area information and the image compression data to generate motion texture.

在获取待处理图像中用户指示的目标主体之后,如果对于每个第一图像,图像处理系统均对第一图像和目标主体区域信息进行处理,可能导致目标视频中主体的运动不合理。例如,在实际中,目标主体的运动可能导致其他主体的运动,而其他主体可能位于目标主体区域之外。After obtaining the target subject indicated by the user in the image to be processed, if the image processing system processes the first image and the target subject area information for each first image, the movement of the subject in the target video may be unreasonable. For example, in practice, the movement of the target subject may cause the movement of other subjects, and the other subjects may be located outside the target subject area.

为了使得目标视频中主体的运动更加合理,在图像处理系统处理的第一图像为待处理图像的情况下,图像处理系统可以对第一图像和目标主体区域信息进行处理;而在图像处理系统处理的第一图像不是待处理图像的情况下,图像处理系统可以仅对第一图像进行处理,目标主体区域信息可以不再作为图像处理系统的输入。从而,目标视频中可以反映出目标主体的运动以及其他主体受目标主体影响而进行的运动,使得目标视频中主体的运动更加合理,提高用户体验。In order to make the movement of the subject in the target video more reasonable, when the first image processed by the image processing system is the image to be processed, the image processing system can process the first image and the target subject area information; and when the first image processed by the image processing system is not the image to be processed, the image processing system can process the first image alone, and the target subject area information may no longer be used as an input of the image processing system. Thus, the movement of the target subject and the movement of other subjects affected by the target subject can be reflected in the target video, making the movement of the subject in the target video more reasonable and improving the user experience.

在获取待处理图像中用户指示的目标主体的情况下,在步骤S320之前,还可以获取用户指示的目标主体的目标运动趋势。In the case of acquiring the target subject indicated by the user in the image to be processed, before step S320 , the target motion trend of the target subject indicated by the user may also be acquired.

目标主体的目标运动趋势可以包括目标主体的运动方向、运动幅度、运动速度等中的一个或多个。The target motion trend of the target subject may include one or more of the target subject's motion direction, motion amplitude, motion speed, and the like.

图像处理系统可以用于对所述第一图像、目标主体区域信息和运动信息进行处理,以得到至少一个第二时刻中每个第二时刻对应的所述目标图像,以及每个第二时刻对应的目标音频,在所述至少一个第二时刻对应的至少一个所述目标图像中所述目标主体是按照运动信息表示的目标运动趋势进行运动的。The image processing system can be used to process the first image, target subject area information and motion information to obtain the target image corresponding to each second moment in at least one second moment, and the target audio corresponding to each second moment. In at least one of the target images corresponding to the at least one second moment, the target subject moves according to the target motion trend represented by the motion information.

图像处理系统根据用户指示的目标主体和目标主体的目标运动趋势,生成目标视频,目标视频中的目标主体的运动是按照目标运动趋势进行的,提高用户对视频生成的参与感,提高用户体验。The image processing system generates a target video according to the target subject indicated by the user and the target motion trend of the target subject. The motion of the target subject in the target video is performed according to the target motion trend, thereby increasing the user's sense of participation in video generation and improving the user experience.

用户指示的目标主体的目标运动趋势一般是对待处理图像中的目标主体而言的。由于目标主体的运动可能受到多种因素的影响,在目标视频中,目标主体的运动趋势可以不是一直与目标运动趋势相同。The target motion trend of the target subject indicated by the user is generally for the target subject in the image to be processed. Since the motion of the target subject may be affected by various factors, the motion trend of the target subject in the target video may not always be the same as the target motion trend.

在图像处理系统处理的第一图像为待处理图像的情况下,图像处理系统可以对第一图像、目标主体区域信息和运动信息进行处理;而在图像处理系统处理的第一图像不是待处理图像的情况下,图像处理系统可以仅对第一图像进行处理,或者,图像处理系统可以对第一图像和目标主体区域信息进行处理,运动信息可以不再作为图像处理系统的输入。When the first image processed by the image processing system is the image to be processed, the image processing system can process the first image, the target subject area information and the motion information; and when the first image processed by the image processing system is not the image to be processed, the image processing system can process the first image alone, or the image processing system can process the first image and the target subject area information, and the motion information may no longer be used as an input to the image processing system.

示例性地,特征预测模型中的运动位移场预测模型可以用于,对第一图像、目标主体区域信息和运动信息进行处理,以得到每个第二时刻对应的运动位移场。从而,运动位移场中体现了用户指示的目标主体的目标运动趋势影响。运动特征提取模型和调整模型可以基于运动位移场进行后续处理。Exemplarily, the motion displacement field prediction model in the feature prediction model can be used to process the first image, the target subject area information and the motion information to obtain the motion displacement field corresponding to each second moment. Thus, the motion displacement field reflects the target motion trend influence of the target subject indicated by the user. The motion feature extraction model and the adjustment model can perform subsequent processing based on the motion displacement field.

运动位移场预测模型对于第一图像、目标主体区域信息和运动信息的处理过程中,第一变换模块可以对第一图像进行傅里叶变换,以得到图像频域数据。运动纹理预测模型用于对图像频域数据、目标主体区域信息和运动信息进行处理,以得到每个第二时刻对应的运动纹理。第二变换模块用于对运动纹理进行逆傅里叶变换,以得到运动位移场。In the process of processing the first image, the target subject area information and the motion information by the motion displacement field prediction model, the first transformation module can perform Fourier transformation on the first image to obtain image frequency domain data. The motion texture prediction model is used to process the image frequency domain data, the target subject area information and the motion information to obtain the motion texture corresponding to each second moment. The second transformation module is used to perform inverse Fourier transformation on the motion texture to obtain the motion displacement field.

运动纹理预测模型中,压缩模型可以用于对图像频域数据进行压缩,以得到图像压缩数据。运动纹理生成模型可以用于对图像压缩数据、目标主体区域信息和运动信息进行处理,以生成运动纹理。In the motion texture prediction model, the compression model can be used to compress the image frequency domain data to obtain image compression data. The motion texture generation model can be used to process the image compression data, target subject area information and motion information to generate motion texture.

或者,压缩模型还可以用于对目标主体区域信息进行压缩,以得到压缩区域信息。运动纹理生成模型可以用于对压缩区域信息、图像压缩数据和运动信息进行处理,以生成运动纹理。Alternatively, the compression model can also be used to compress the target subject area information to obtain compressed area information. The motion texture generation model can be used to process the compressed area information, image compression data and motion information to generate motion texture.

又或者,压缩模型可以对图像频域数据、目标主体区域信息和运动信息分别进行压缩,运动纹理生成模型可以用于对压缩后的图像频域数据、压缩后的目标主体区域信息和压缩后的运动信息进行处理,以生成运动纹理。Alternatively, the compression model can compress the image frequency domain data, the target subject area information and the motion information respectively, and the motion texture generation model can be used to process the compressed image frequency domain data, the compressed target subject area information and the compressed motion information to generate the motion texture.

从而,目标视频中可以反映出目标主体的运动以及其他主体受目标主体影响而进行的运动,并且目标主体按照目标运动趋势开始进行运动,使得目标视频中主体的运动更加合理,提高用户体验。Thus, the target video can reflect the movement of the target subject and the movement of other subjects affected by the target subject, and the target subject starts to move according to the target motion trend, making the movement of the subject in the target video more reasonable and improving the user experience.

本申请实施例提供的图像处理方法,根据待处理图像生成多个目标图像,并且生成每个目标图像对应的音频,使得播放的目标音频的内容与显示的目标图像内容的协调性更高,提高用户体验。The image processing method provided in the embodiment of the present application generates multiple target images according to the image to be processed, and generates audio corresponding to each target image, so that the content of the played target audio is more coordinated with the content of the displayed target image, thereby improving the user experience.

应理解,上述举例说明是为了帮助本领域技术人员理解本申请实施例,而非要将本申请实施例限于所例示的具体数值或具体场景。本领域技术人员根据所给出的上述举例说明,显然可以进行各种等价的修改或变化,这样的修改或变化也落入本申请实施例的范围内。It should be understood that the above examples are intended to help those skilled in the art understand the embodiments of the present application, rather than to limit the embodiments of the present application to the specific numerical values or specific scenarios illustrated. Those skilled in the art can obviously make various equivalent modifications or changes based on the above examples, and such modifications or changes also fall within the scope of the embodiments of the present application.

下面结合图4,对图3所示的方法中使用的图像处理系统进行说明。The image processing system used in the method shown in FIG. 3 is described below in conjunction with FIG. 4 .

图4是本申请实施例提供的一种图像处理系统的示意性结构图。FIG. 4 is a schematic structural diagram of an image processing system provided in an embodiment of the present application.

图像处理系统400包括随机运动位移场预测模型410、运动特征提取模型420、图像特征提取模型430、调整模型440、图像生成模型450、音频生成模型460。The image processing system 400 includes a random motion displacement field prediction model 410 , a motion feature extraction model 420 , an image feature extraction model 430 , an adjustment model 440 , an image generation model 450 , and an audio generation model 460 .

图像处理系统400可以用于依次对多个第一图像进行处理,以得到多个第二图像以及每个第二图像对应的音频。图像处理系统400的输出包括声音和图像,图像处理系统400的输出可以理解为是多模态的。The image processing system 400 can be used to process multiple first images in sequence to obtain multiple second images and audio corresponding to each second image. The output of the image processing system 400 includes sound and image, and the output of the image processing system 400 can be understood as multimodal.

图像处理系统400对多个第一图像的处理是依次进行的。该多个第一图像中,图像处理系统400进行第一次处理所使用的第一图像可以是待处理图像。图像处理系统400进行第二次及之后的处理所使用的第一图像可以是上一次处理得到的第二图像。根据该多个第二图像以及每个第二图像对应的音频,可以得到视频。该视频可以包括多个第二图像以及每个第二图像对应的音频。该多个第二图像中,不同的第二图像对应不同的时刻。该视频可以理解为图像处理系统400生成的视频。The image processing system 400 processes the multiple first images sequentially. Among the multiple first images, the first image used by the image processing system 400 for the first processing may be the image to be processed. The first image used by the image processing system 400 for the second and subsequent processing may be the second image obtained in the previous processing. A video can be obtained based on the multiple second images and the audio corresponding to each second image. The video may include multiple second images and the audio corresponding to each second image. Among the multiple second images, different second images correspond to different moments. The video can be understood as a video generated by the image processing system 400.
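这一"依次处理"的过程可以用如下Python代码示意(process为假设的占位函数,代表图像处理系统400的一次处理)。The sequential processing can be sketched as follows in Python; process is a hypothetical placeholder for one pass of the image processing system 400.

```python
def process(first_image):
    """假设的占位函数:代表图像处理系统400的一次处理,
    输出一个第二图像及其对应的音频。"""
    second_image = first_image + 1
    audio = f"audio-{second_image}"
    return second_image, audio

def generate_video(image_to_process, num_rounds=3):
    frames = [image_to_process]        # 待处理图像可作为视频的第一帧
    audios = []
    current = image_to_process         # 第一次处理所使用的第一图像是待处理图像
    for _ in range(num_rounds):
        current, audio = process(current)  # 之后的第一图像是上一次得到的第二图像
        frames.append(current)
        audios.append(audio)
    return frames, audios

frames, audios = generate_video(0)
print(frames)   # [0, 1, 2, 3]
```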

示例性地,该视频还可以包括待处理图像。待处理图像可以理解为该视频的第一帧图像。Exemplarily, the video may also include an image to be processed, which may be understood as the first frame of the video.

图像处理系统400每次对第一图像进行处理得到的第二图像的数量可以是一个或多个。The number of second images obtained by the image processing system 400 each time processing a first image may be one or more.

在图像处理系统400对第一图像处理得到的第二图像的数量是一个的情况下,该第二图像是该第一图像对应的时刻之后的时刻对应的图像。该第二图像可以作为下一次图像处理系统400进行处理的第一图像。When the number of second images obtained by the image processing system 400 from processing the first image is one, the second image is an image corresponding to a time after the time corresponding to the first image. The second image can be used as the first image to be processed by the image processing system 400 next time.

在图像处理系统400对第一图像处理得到的第二图像的数量是多个的情况下,该多个第二图像分别是对应于该第一图像对应的时刻之后的多个时刻的图像。该多个第二图像中对应的时刻最晚的图像可以作为下一次图像处理系统400进行处理的第一图像。该多个第二图像中对应的时刻最晚的图像,也可以理解为对应的时刻与第一图像对应的时刻之间的时间间隔最大的图像。When the number of second images obtained by the image processing system 400 from processing the first image is multiple, the multiple second images are images corresponding to multiple moments after the moment corresponding to the first image. Among the multiple second images, the image corresponding to the latest moment can be used as the first image to be processed by the image processing system 400 next time. The image corresponding to the latest moment among the multiple second images can also be understood as the image whose corresponding moment has the largest time interval from the moment corresponding to the first image.

在第一图像是待处理图像的情况下,图像处理系统400的输入还可以包括目标主体区域信息和/或运动信息等。In the case where the first image is the image to be processed, the input of the image processing system 400 may further include target subject area information and/or motion information, etc.

目标主体区域信息用于表示待处理图像中的目标主体区域。目标主体区域可以是根据用户指示的待处理图像中的目标主体确定的。待处理图像中,目标主体位于目标主体区域中。目标主体区域的形状可以是预设的,也可以是根据目标主体的形状确定的。目标主体区域的形状可以是规则或不规则的。The target subject region information is used to indicate the target subject region in the image to be processed. The target subject region may be determined according to the target subject in the image to be processed indicated by the user. In the image to be processed, the target subject is located in the target subject region. The shape of the target subject region may be preset or determined according to the shape of the target subject. The shape of the target subject region may be regular or irregular.

目标主体区域还可以包括待处理图像中位于目标主体周围的区域,例如,目标主体区域可以包括待处理图像中与目标主体所在的区域之间的距离小于或等于预设距离的全部或部分区域。The target subject area may also include an area around the target subject in the image to be processed. For example, the target subject area may include all or part of an area in the image to be processed whose distance to the area where the target subject is located is less than or equal to a preset distance.

目标主体区域信息可以通过图像或其他方式表示。例如,目标主体区域信息可以是区域图像。区域图像与待处理图像具有相同的尺寸。区域图像可以通过每个像素点的值表示该像素点是否位于目标主体区域中。示例性地,位于目标主体区域中的像素点的值可以是1,位于目标主体区域之外的像素点的值可以是0;或者,位于目标主体区域中的像素点的值大于或等于预设值,位于目标主体区域之外的像素点的值可以小于该预设值。不同的像素点可以理解为不同的点,即不同的位置。The target subject area information may be represented by an image or other means. For example, the target subject area information may be an area image. The area image has the same size as the image to be processed. The area image may indicate whether the pixel is located in the target subject area by the value of each pixel. Exemplarily, the value of the pixel located in the target subject area may be 1, and the value of the pixel located outside the target subject area may be 0; or, the value of the pixel located in the target subject area may be greater than or equal to a preset value, and the value of the pixel located outside the target subject area may be less than the preset value. Different pixels may be understood as different points, i.e., different positions.
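例如,可以用如下Python代码构造一个区域图像(取值0/1的掩码,仅为示意)。For example, the following Python snippet builds a region image as a 0/1 mask (illustration only).

```python
import numpy as np

h, w = 6, 8                                   # 与待处理图像相同的尺寸
region_image = np.zeros((h, w), dtype=np.uint8)
region_image[2:5, 3:7] = 1                    # 值为1:该像素点位于目标主体区域中

print(int(region_image.sum()))                # 12,即目标主体区域内像素点的个数
```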

在目标主体区域的形状为预设形状的情况下,目标主体区域信息可以通过位置和尺寸等信息表示目标主体区域。When the shape of the target subject area is a preset shape, the target subject area information may represent the target subject area through information such as position and size.

例如,如果预设形状为方形,目标主体区域信息可以包括位于目标主体区域的对角线的两个顶点在待处理图像中的位置,或者,目标主体区域信息可以包括目标主体区域的中心在待处理图像中的位置,以及用于表示目标主体区域的长度和宽度的信息,如长度和宽度,或者长度和长宽比等。而如果预设形状为圆形,目标主体区域信息可以包括目标主体区域的圆心在待处理图像中的位置,以及用于表示目标主体区域的半径的信息。For example, if the preset shape is a square, the target subject area information may include the positions, in the image to be processed, of two vertices located on a diagonal of the target subject area; or, the target subject area information may include the position of the center of the target subject area in the image to be processed, and information used to represent the length and width of the target subject area, such as the length and width, or the length and aspect ratio, etc. If the preset shape is a circle, the target subject area information may include the position of the center of the target subject area in the image to be processed, and information used to represent the radius of the target subject area.

在待处理图像上,以待处理图像的中心或某个顶点为原点,可以建立坐标系。某个点在待处理图像中的位置,可以通过坐标的方式表示。On the image to be processed, a coordinate system can be established with the center or a vertex of the image to be processed as the origin. The position of a point in the image to be processed can be expressed in the form of coordinates.
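基于上述坐标表示,可以用如下Python代码示意方形和圆形两种预设形状的目标主体区域信息,以及某个点是否位于区域内的判断(字段名均为假设,仅为示意)。Based on the coordinate representation above, the following Python sketch illustrates square and circular preset-shape region information and point-membership tests (field names are assumptions, illustration only).

```python
# 假设的字段名,仅为示意
square_info = {"top_left": (40, 60), "bottom_right": (140, 200)}   # 对角线两顶点
circle_info = {"center": (90, 130), "radius": 50}                  # 圆心与半径

def in_square(p, info):
    """判断点p是否位于由对角线两顶点表示的方形区域内。"""
    (x0, y0), (x1, y1) = info["top_left"], info["bottom_right"]
    return x0 <= p[0] <= x1 and y0 <= p[1] <= y1

def in_circle(p, info):
    """判断点p是否位于由圆心和半径表示的圆形区域内。"""
    cx, cy = info["center"]
    return (p[0] - cx) ** 2 + (p[1] - cy) ** 2 <= info["radius"] ** 2

print(in_square((90, 130), square_info), in_circle((90, 130), circle_info))
# True True
```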

运动信息可以用于表示用户指示的目标主体的目标运动趋势,例如运动幅度、运动方向、运动速度等中的一个或多个。The motion information may be used to indicate the target motion trend of the target subject indicated by the user, such as one or more of the motion amplitude, motion direction, motion speed, etc.

第一图像可以表示为I0。随机运动位移场预测模型410用于对第一图像I0进行处理,以得到第一图像I0的至少一个随机运动位移场,第一图像I0的至少一个随机运动位移场对应于在第一图像I0对应的时刻之后的至少一个时刻。不同的随机运动位移场对应的时刻不同。The first image may be represented as I 0. The random motion displacement field prediction model 410 is used to process the first image I 0 to obtain at least one random motion displacement field of the first image I 0 , and the at least one random motion displacement field of the first image I 0 corresponds to at least one time after the time corresponding to the first image I 0. Different random motion displacement fields correspond to different times.

位移场是指在空间中多个点的位移。每个点的位移包括大小和方向。每个点的位移可以通过矢量表示。The displacement field refers to the displacement of multiple points in space. The displacement of each point includes magnitude and direction. The displacement of each point can be represented by a vector.

第一图像I0在某个时刻的随机运动位移场表示第一图像I0中的多个像素在该时刻所在的位置相对第一图像I0中该多个像素的位置的位移。随机运动位移场表示的第一图像I0中的多个像素可以是第一图像I0中的全部或部分像素。The random motion displacement field of the first image I0 at a certain moment represents the displacement of the positions of the multiple pixels in the first image I0 at that moment relative to the positions of the multiple pixels in the first image I0. The multiple pixels in the first image I0 represented by the random motion displacement field may be all or part of the pixels in the first image I0.
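下面用一段Python代码示意位移场的含义:位移场为每个像素给出一个二维位移矢量(大小和方向),将像素从第一图像中的位置移动到新时刻的位置(最近邻取整、越界丢弃,仅为示意)。A Python sketch of what a displacement field means: a 2-D displacement vector (magnitude and direction) per pixel that moves each pixel of the first image to its position at the new moment (nearest-neighbour rounding, out-of-bounds pixels dropped, illustration only).

```python
import numpy as np

rng = np.random.default_rng(0)
I0 = rng.random((8, 8))                        # 第一图像
D = rng.integers(-1, 2, size=(8, 8, 2))        # 随机运动位移场:每点一个(dx, dy)矢量

warped = np.zeros_like(I0)
for y in range(8):
    for x in range(8):
        ny = y + int(D[y, x, 1])               # 位移的大小和方向由矢量给出
        nx = x + int(D[y, x, 0])
        if 0 <= ny < 8 and 0 <= nx < 8:        # 越界的像素简单丢弃(仅为示意)
            warped[ny, nx] = I0[y, x]
print(warped.shape)                            # (8, 8)
```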

运动特征提取模型420用于对第一图像I0的随机运动位移场D进行特征提取,可以得到运动特征。运动特征可以理解为第一图像I0的随机运动位移场D的特征。The motion feature extraction model 420 is used to extract features from the random motion displacement field D of the first image I 0 , and obtain motion features. The motion features can be understood as features of the random motion displacement field D of the first image I 0 .

图像特征提取模型430用于对第一图像I0进行特征提取,以得到第一特征。第一特征可以理解为第一图像I0的特征,也可以称为图像特征。The image feature extraction model 430 is used to extract features from the first image I 0 to obtain a first feature. The first feature can be understood as a feature of the first image I 0 , and can also be called an image feature.

调整模型440用于根据运动特征对第一特征进行调整,以得到第二特征。第二特征也可以称为预测特征。The adjustment model 440 is used to adjust the first feature according to the motion feature to obtain the second feature. The second feature can also be called a prediction feature.

图像生成模型450用于对第二特征进行处理,以生成第二图像。The image generation model 450 is used to process the second feature to generate a second image.

音频生成模型460用于对第二特征进行处理,以生成第二图像对应的音频。The audio generation model 460 is used to process the second feature to generate audio corresponding to the second image.

第一图像I0的随机运动位移场与第一图像I0可以具有相同的尺寸。The random motion displacement field of the first image I 0 may have the same size as the first image I 0 .

随机运动位移场预测模型410、运动特征提取模型420、图像特征提取模型430、图像生成模型450、音频生成模型460均可以是深度神经网络(deep neural network,DNN)。The random motion displacement field prediction model 410 , the motion feature extraction model 420 , the image feature extraction model 430 , the image generation model 450 , and the audio generation model 460 may all be deep neural networks (DNN).

神经网络可以是由神经单元组成的,神经单元可以是指以x_s和截距1为输入的运算单元,该运算单元的输出可以为:A neural network may be composed of neural units, and a neural unit may refer to an operation unit with x_s and an intercept 1 as input, and the output of the operation unit may be:

h_{W,b}(x) = f(W^T x) = f(∑_{s=1}^{n} W_s·x_s + b)

其中,s=1、2、……n,n为大于1的自然数,W_s为x_s的权重,b为神经单元的偏置。f为神经单元的激活函数(activation functions),用于将非线性特性引入神经网络中,来将神经单元中的输入信号转换为输出信号。该激活函数的输出信号可以作为下一层卷积层的输入。激活函数可以是sigmoid函数。神经网络是将许多个上述单一的神经单元联结在一起形成的网络,即一个神经单元的输出可以是另一个神经单元的输入。每个神经单元的输入可以与前一层的局部接受域相连,来提取局部接受域的特征,局部接受域可以是由若干个神经单元组成的区域。Where s = 1, 2, ... n, n is a natural number greater than 1, W_s is the weight of x_s, and b is the bias of the neural unit. f is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into the output signal. The output signal of the activation function can be used as the input of the next convolutional layer. The activation function can be a sigmoid function. A neural network is a network formed by connecting many of the above-mentioned single neural units together, that is, the output of one neural unit can be the input of another neural unit. The input of each neural unit can be connected to the local receptive field of the previous layer to extract the features of the local receptive field. The local receptive field can be an area composed of several neural units.
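上述神经单元的运算可以用如下Python代码示意(激活函数取sigmoid,权重与输入仅为示例数值)。The neural unit's computation above can be sketched in Python as follows (sigmoid activation; weights and inputs are example values only).

```python
import math

def sigmoid(z):
    """激活函数f:将非线性特性引入神经网络。"""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(xs, ws, b):
    """神经单元:输出 f(∑ W_s·x_s + b)。"""
    return sigmoid(sum(w * x for w, x in zip(ws, xs)) + b)

out = neuron(xs=[1.0, 2.0], ws=[0.5, -0.25], b=0.0)
print(out)   # 0.5,因为加权和恰为0,而sigmoid(0)=0.5
```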

DNN也可以称为多层神经网络,可以理解为具有多层隐含层的神经网络。按照不同层的位置对DNN进行划分,DNN内部的神经网络可以分为三类:输入层,隐含层,输出层。一般来说第一层是输入层,最后一层是输出层,中间的层数都是隐含层。层与层之间是全连接的,也就是说,第i层的任意一个神经元一定与第i+1层的任意一个神经元相连。DNN can also be called a multi-layer neural network, which can be understood as a neural network with multiple hidden layers. According to the position of different layers, DNN can be divided into three categories: input layer, hidden layer, and output layer. Generally speaking, the first layer is the input layer, the last layer is the output layer, and the layers in between are all hidden layers. The layers are fully connected, that is, any neuron in the i-th layer must be connected to any neuron in the i+1-th layer.

虽然DNN看起来很复杂,但是就每一层的工作来说,其实并不复杂。深度神经网络中的每一层的工作可以用数学表达式 y = α(W·x + b) 来描述:从物理层面深度神经网络中的每一层的工作可以理解为通过五种对输入空间(输入向量的集合)的操作,完成输入空间到输出空间的变换(即矩阵的行空间到列空间),这五种操作包括:1、升维/降维;2、放大/缩小;3、旋转;4、平移;5、"弯曲"。其中1、2、3的操作由W·x完成,4的操作由+b完成,5的操作则由α()来实现。这里之所以用"空间"二字来表述是因为被分类的对象并不是单个事物,而是一类事物,空间是指这类事物所有个体的集合。其中,W是权重向量,该向量中的每一个值表示该层神经网络中的一个神经元的权重值。该向量W决定着上文所述的输入空间到输出空间的空间变换,即每一层的权重W控制着如何变换空间。训练深度神经网络的目的,也就是最终得到训练好的神经网络的所有层的权重矩阵(由很多层的向量W形成的权重矩阵)。因此,神经网络的训练过程本质上就是学习控制空间变换的方式,更具体的就是学习权重矩阵。Although a DNN looks complicated, the work of each layer is not complicated. The work of each layer in a deep neural network can be described by the mathematical expression y = α(W·x + b): from a physical level, the work of each layer in a deep neural network can be understood as completing the transformation from the input space to the output space (i.e., from the row space to the column space of a matrix) through five operations on the input space (the set of input vectors). These five operations include: 1. dimension increase/reduction; 2. enlargement/reduction; 3. rotation; 4. translation; 5. "bending". Operations 1, 2 and 3 are completed by W·x, operation 4 is completed by +b, and operation 5 is realized by α(). The word "space" is used here because the object being classified is not a single thing but a class of things, and space refers to the set of all individuals of this class of things. Here, W is a weight vector, and each value in the vector represents the weight value of a neuron in this layer of the neural network. The vector W determines the spatial transformation from the input space to the output space described above, that is, the weight W of each layer controls how the space is transformed. The purpose of training a deep neural network is to finally obtain the weight matrices of all layers of the trained neural network (weight matrices formed by the vectors W of many layers). Therefore, the training process of a neural network is essentially learning how to control the spatial transformation, and more specifically, learning the weight matrix.

因此,DNN简单来说就是如下线性关系表达式:y = α(W·x + b),其中,x是输入向量,y是输出向量,b是偏移向量,W是权重矩阵(也称系数),α是激活函数。每一层仅仅是对输入向量x经过如此简单的操作得到输出向量y。由于DNN层数多,系数W和偏移向量b的数量也比较多。这些参数在DNN中的定义如下所述:以系数W为例:假设在一个三层的DNN中,第二层的第4个神经元到第三层的第2个神经元的线性系数定义为W_24^3。上标3代表系数W所在的层数,而下标对应的是输出的第三层索引2和输入的第二层索引4。Therefore, a DNN is simply the following linear relationship expression: y = α(W·x + b), where x is the input vector, y is the output vector, b is the offset vector, W is the weight matrix (also called coefficients), and α is the activation function. Each layer simply performs this simple operation on the input vector x to obtain the output vector y. Since a DNN has many layers, there are also many coefficients W and offset vectors b. These parameters are defined in the DNN as follows. Taking the coefficient W as an example: suppose that in a three-layer DNN, the linear coefficient from the 4th neuron in the second layer to the 2nd neuron in the third layer is defined as W_24^3. The superscript 3 represents the layer where the coefficient W is located, and the subscripts correspond to the output index 2 of the third layer and the input index 4 of the second layer.
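逐层计算y = α(W·x + b)的过程可以用如下Python代码示意(α取ReLU,权重为随机初始化,仅为示意)。The layer-by-layer computation y = α(W·x + b) can be sketched as follows (ReLU activation, randomly initialized weights, illustration only).

```python
import numpy as np

rng = np.random.default_rng(0)
# 每层由权重矩阵W和偏移向量b构成(随机初始化,仅为示意)
layers = [(rng.standard_normal((4, 3)), np.zeros(4)),   # 第1层:3维 -> 4维
          (rng.standard_normal((2, 4)), np.zeros(2))]   # 第2层:4维 -> 2维

def forward(x, layers):
    for W, b in layers:
        x = np.maximum(W @ x + b, 0.0)   # y = α(W·x + b),α取ReLU
    return x

y = forward(np.ones(3), layers)
print(y.shape)   # (2,)
```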

综上,第L−1层的第k个神经元到第L层的第j个神经元的系数定义为W_jk^L。In summary, the coefficient from the kth neuron in the (L−1)th layer to the jth neuron in the Lth layer is defined as W_jk^L.

需要注意的是,输入层是没有W参数的。在深度神经网络中,更多的隐含层让网络更能够刻画现实世界中的复杂情形。理论上而言,参数越多的模型复杂度越高,“容量”也就越大,也就意味着它能完成更复杂的学习任务。训练深度神经网络的也就是学习权重矩阵的过程,其最终目的是得到训练好的深度神经网络的所有层的权重矩阵(由很多层的向量W形成的权重矩阵)。It should be noted that the input layer does not have a W parameter. In a deep neural network, more hidden layers allow the network to better depict complex situations in the real world. Theoretically, the more parameters a model has, the higher its complexity and the greater its "capacity", which means it can complete more complex learning tasks. Training a deep neural network is the process of learning the weight matrix, and its ultimate goal is to obtain the weight matrix of all layers of the trained deep neural network (the weight matrix formed by many layers of vector W ).

随机运动位移场预测模型410可以是神经网络模型,例如可以是卷积神经网络。The random motion displacement field prediction model 410 may be a neural network model, for example, a convolutional neural network.

卷积神经网络是一种带有卷积结构的深度神经网络,是一种深度学习(deep learning)架构,深度学习架构是指通过机器学习的算法,在不同的抽象层级上进行多个层次的学习。作为一种深度学习架构,CNN是一种前馈(feed-forward)人工神经网络,该前馈人工神经网络中的各个神经元可以对输入其中的数据作出响应。A convolutional neural network is a deep neural network with a convolutional structure. It is a deep learning architecture. A deep learning architecture refers to multiple levels of learning at different levels of abstraction through machine learning algorithms. As a deep learning architecture, CNN is a feed-forward artificial neural network in which each neuron can respond to the data input into it.

卷积神经网络包含了一个由卷积层和子采样层构成的特征抽取器。该特征抽取器可以看作是滤波器,卷积过程可以看作是使用一个可训练的滤波器与一个输入的数据或者卷积特征平面(feature map)做卷积。卷积层是指卷积神经网络中对输入信号进行卷积处理的神经元层。在卷积神经网络的卷积层中,一个神经元可以只与部分邻层神经元连接。一个卷积层中,通常包含若干个特征平面,每个特征平面可以由一些矩形排列的神经单元组成。同一特征平面的神经单元共享权重,这里共享的权重就是卷积核。以输入的数据为图像为例,共享权重可以理解为提取图像信息的方式与位置无关。这其中隐含的原理是:图像的某一部分的统计信息与其他部分是一样的。即意味着在某一部分学习的图像信息也能用在另一部分上。所以对于图像上的所有位置,我们都能使用同样的学习得到的图像信息。在同一卷积层中,可以使用多个卷积核来提取不同的图像信息,一般地,卷积核数量越多,卷积操作反映的图像信息越丰富。Convolutional neural networks contain a feature extractor consisting of a convolution layer and a subsampling layer. The feature extractor can be regarded as a filter, and the convolution process can be regarded as using a trainable filter to convolve an input data or convolution feature plane (feature map). The convolution layer refers to the neuron layer in the convolutional neural network that performs convolution processing on the input signal. In the convolution layer of the convolutional neural network, a neuron can only be connected to some neurons in the adjacent layers. A convolution layer usually contains several feature planes, each of which can be composed of some rectangularly arranged neural units. The neural units in the same feature plane share weights, and the shared weights here are the convolution kernels. Taking the input data as an image as an example, shared weights can be understood as the way to extract image information is independent of position. The implicit principle is that the statistical information of a part of the image is the same as that of other parts. This means that the image information learned in a certain part can also be used in another part. So for all positions on the image, we can use the same learned image information. In the same convolution layer, multiple convolution kernels can be used to extract different image information. Generally speaking, the more convolution kernels there are, the richer the image information reflected by the convolution operation.
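共享权重的卷积运算可以用如下Python代码示意:同一个卷积核(权重矩阵)在输入的所有位置滑动,提取信息的方式与位置无关(仅为示意)。Weight-sharing convolution can be sketched as follows: the same kernel (weight matrix) slides over every position of the input, so feature extraction is position-independent (illustration only).

```python
import numpy as np

def conv2d_valid(img, kernel):
    """同一个卷积核(共享权重)在输入的所有位置滑动做卷积(valid模式)。"""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

img = np.arange(16.0).reshape(4, 4)
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])   # 一个2x2卷积核(权重矩阵)
out = conv2d_valid(img, kernel)
print(out.shape)   # (3, 3)
```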

卷积核可以以随机大小的矩阵的形式初始化,在卷积神经网络的训练过程中卷积核可以通过学习得到合理的权重。另外,共享权重带来的直接好处是减少卷积神经网络各层之间的连接,同时又降低了过拟合的风险。The convolution kernel can be initialized in the form of a matrix of random size, and the convolution kernel can obtain reasonable weights through learning during the training process of the convolutional neural network. In addition, the direct benefit of shared weights is to reduce the connections between the layers of the convolutional neural network, while reducing the risk of overfitting.

卷积神经网络可以包括输入层、卷积层以及神经网络层。卷积神经网络还可以包括池化层。A convolutional neural network may include an input layer, a convolutional layer, and a neural network layer. A convolutional neural network may also include a pooling layer.

卷积层可以包括很多个卷积算子,卷积算子也称为核,其在自然语言处理中的作用相当于一个从输入的语音或语义信息中提取特定信息的过滤器,卷积算子本质上可以是一个权重矩阵,这个权重矩阵通常被预先定义。The convolution layer can include many convolution operators, also called kernels. Its role in natural language processing is equivalent to a filter that extracts specific information from the input speech or semantic information. The convolution operator can essentially be a weight matrix, which is usually predefined.

这些权重矩阵中的权重值在实际应用中需要经过大量的训练得到,通过训练得到的权重值形成的各个权重矩阵可以从输入数据中提取信息,从而帮助卷积神经网络进行正确的预测。The weight values in these weight matrices need to be obtained through a lot of training in practical applications. The weight matrices formed by the weight values obtained through training can extract information from the input data, thereby helping the convolutional neural network to make correct predictions.

由于常常需要减少训练参数的数量,因此卷积层之后常常需要周期性的引入池化层。可以是一层卷积层后面跟一层池化层,也可以是多层卷积层后面接一层或多层池化层。Since it is often necessary to reduce the number of training parameters, it is often necessary to periodically introduce a pooling layer after the convolution layer. It can be a convolution layer followed by a pooling layer, or multiple convolution layers can be followed by one or more pooling layers.

在数据处理过程中,池化层的唯一目的就是减少数据的空间大小。池化层可以包括平均池化算子和/或最大池化算子,以用于对输入数据特征进行采样得到较小尺寸的数据特征。In the data processing process, the only purpose of the pooling layer is to reduce the spatial size of the data. The pooling layer may include an average pooling operator and/or a maximum pooling operator to sample the input data features to obtain data features of smaller size.
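The average/maximum pooling operators mentioned above can be sketched in numpy as follows; this is an illustrative reduction of spatial size, not the patent's implementation, and the 4x4 input is hypothetical.

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Non-overlapping pooling that reduces the spatial size of a feature map."""
    h, w = x.shape
    x = x[:h - h % size, :w - w % size]   # drop any ragged edge
    blocks = x.reshape(h // size, size, w // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))    # maximum pooling operator
    return blocks.mean(axis=(1, 3))       # average pooling operator

x = np.arange(16, dtype=float).reshape(4, 4)
pooled_max = pool2d(x, mode="max")        # 2x2 output: smaller-sized features
pooled_avg = pool2d(x, mode="avg")
```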

在经过卷积层/池化层的处理后,卷积神经网络还不足以输出所需要的输出信息。因为如前所述,卷积层/池化层只会提取特征,并减少输入数据带来的参数。然而为了生成最终的输出信息(所需要的类信息或其他相关信息),卷积神经网络需要利用神经网络层来生成一个或者一组所需要的类的数量的输出。因此,在神经网络层中可以包括多层隐含层以及输出层,该多层隐含层中所包含的参数可以根据具体的任务类型的相关训练数据进行预先训练得到,例如该任务类型可以包括语音或语义识别、分类或生成等等。After being processed by the convolution layer/pooling layer, the convolutional neural network is not sufficient to output the required output information. Because as mentioned above, the convolution layer/pooling layer will only extract features and reduce the parameters brought by the input data. However, in order to generate the final output information (the required class information or other related information), the convolutional neural network needs to use the neural network layer to generate one or a group of outputs of the required number of classes. Therefore, the neural network layer may include multiple hidden layers and an output layer, and the parameters contained in the multiple hidden layers can be pre-trained according to the relevant training data of the specific task type. For example, the task type may include speech or semantic recognition, classification or generation, etc.

在神经网络层中的多层隐含层之后,也就是整个卷积神经网络的最后层为输出层,该输出层具有类似分类交叉熵的损失函数,具体用于计算预测误差,一旦整个卷积神经网络的前向传播完成,反向传播就会开始更新前面提到的各层的权重值以及偏差,以减少卷积神经网络的损失,及卷积神经网络通过输出层输出的结果和理想结果之间的误差。After the multiple hidden layers in the neural network layer, the last layer of the entire convolutional neural network is the output layer. The output layer has a loss function similar to the classification cross entropy, which is specifically used to calculate the prediction error. Once the forward propagation of the entire convolutional neural network is completed, the back propagation will begin to update the weight values and biases of the previously mentioned layers to reduce the loss of the convolutional neural network and the error between the result output by the convolutional neural network through the output layer and the ideal result.

应当理解,上述对于卷积神经网络的介绍仅作为一种卷积神经网络的示例,在具体的应用中,卷积神经网络还可以以其他网络模型的形式存在。例如,多个卷积层/池化层并行,将分别提取的特征均输入给全神经网络层进行处理等。It should be understood that the above introduction to the convolutional neural network is only an example of a convolutional neural network. In specific applications, the convolutional neural network can also exist in the form of other network models. For example, multiple convolutional layers/pooling layers are in parallel, and the features extracted from each layer are input to the full neural network layer for processing.

随机运动位移场预测模型410可以是生成模型。The random motion displacement field prediction model 410 may be a generative model.

生成模型本质上是一组概率分布。如图5中的(a)所示，训练数据集中的训练数据可以理解为从分布为Pdata的数据集中取出的、与该数据集独立同分布的随机样本。右侧就是其生成式模型(即生成模型)。生成模型确定与分布Pdata距离最近的分布Pθ，接着在分布Pθ上进行新样本的采集，便可以获得源源不断的新数据。The generative model is essentially a set of probability distributions. As shown in (a) of Figure 5, the training data in the training dataset can be understood as random samples that are independent and identically distributed with a dataset whose distribution is Pdata. The right side of the figure is its generative model. The generative model determines the distribution Pθ closest to the distribution Pdata, and then collects new samples from the distribution Pθ, so as to obtain a steady stream of new data.

生成模型的结构可以是生成对抗网络(generative adversarial networks,GAN)、变分自动编码器(variational autoencoder,VAE)、基于流的生成模型(flow-based model)或扩散模型等。The structure of the generative model can be a generative adversarial network (GAN), a variational autoencoder (VAE), a flow-based model, or a diffusion model.

图5中的(b)示出了GAN的原理。GAN至少包括两个模块：一个模块是生成模型(generative model或generator)，另一个模块是判别模型(discriminative model或discriminator)，通过这两个模块互相博弈学习，从而产生更好的输出。GAN的基本原理如下：以生成图片的GAN为例，假设有两个网络，生成模型G和判别模型D，其中G是一个生成图片的网络，它接收一个随机的噪声z，通过这个噪声生成图片，记做G(z)；D是一个判别网络，用于判别一张图片是不是“真实的”。在理想的状态下，G可以生成足以“以假乱真”的图片G(z)，而D难以判定G生成的图片究竟是不是真实的，即D(G(z)) = 0.5。这样就得到了一个优异的生成模型G，它可以用来生成图片或其他类型的数据。(b) in Figure 5 shows the principle of GAN. A GAN includes at least two modules: one is the generative model (generator), and the other is the discriminative model (discriminator). These two modules learn through an adversarial game with each other, so as to produce better outputs. The basic principle of GAN is as follows: taking a GAN that generates pictures as an example, suppose there are two networks, the generative model G and the discriminative model D, where G is a network that generates pictures. It receives a random noise z and generates a picture from this noise, denoted G(z); D is a discriminative network used to determine whether a picture is "real". Under ideal conditions, G can generate pictures G(z) that pass for real, while D has difficulty determining whether the pictures generated by G are real, that is, D(G(z)) = 0.5. In this way, an excellent generative model G is obtained, which can be used to generate pictures or other types of data.
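A small numerical sketch (illustrative, not from the patent) of the adversarial objectives: D is trained to score real samples as 1 and generated samples as 0, G is trained to make D score its output as 1. At the ideal equilibrium described above, D(G(z)) = 0.5 and the discriminator's loss settles at 2·ln 2 per real/fake pair.

```python
import math

def bce(p, label):
    """Binary cross-entropy of a single predicted probability p against label 1 (real) or 0 (fake)."""
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

def discriminator_loss(d_real, d_fake):
    # D wants real samples scored 1 and generated samples scored 0.
    return bce(d_real, 1) + bce(d_fake, 0)

def generator_loss(d_fake):
    # G wants D to believe its output is real.
    return bce(d_fake, 1)

# At the ideal state D(G(z)) = 0.5: D can do no better than chance.
equilibrium_d_loss = discriminator_loss(0.5, 0.5)   # = 2 * ln(2) ≈ 1.386
```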

图5中的(c)示出了VAE的原理。VAE是以自编码器结构为基础的深度生成模型。自编码器在降维和特征提取等领域应用广泛。VAE包括编码器和解码器，通过编码器的编码过程将数据映射为低维空间的隐变量，然后通过解码器进行解码过程，将隐变量还原为重构数据。为了使得解码过程具有生成能力，而不是唯一的映射关系，隐变量为服从正态分布的随机变量。(c) in Figure 5 shows the principle of VAE. VAE is a deep generative model based on the autoencoder structure. Autoencoders are widely used in fields such as dimensionality reduction and feature extraction. A VAE includes an encoder and a decoder. The data is mapped to latent variables in a low-dimensional space through the encoding process of the encoder, and then the decoder performs a decoding process to restore the latent variables to reconstructed data. In order for the decoding process to have generative capability rather than being a unique mapping, the latent variables are random variables that obey a normal distribution.
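The point that the latent variable is a random variable, not a unique mapping, can be sketched with the reparameterization z = mu + sigma·eps. The toy encoder/decoder below are fixed linear maps standing in for learned networks; all names and constants are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x):
    """Toy encoder: map data to the mean and std of a normal latent distribution.
    (A real VAE learns these mappings with a neural network.)"""
    mu = 0.5 * x
    sigma = np.full_like(x, 0.1)
    return mu, sigma

def sample_latent(mu, sigma):
    # The latent variable is drawn at random, so decoding is generative
    # rather than a unique mapping of the input.
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

def decode(z):
    """Toy decoder: map the latent variable back to data space."""
    return 2.0 * z

x = np.array([1.0, -2.0])
mu, sigma = encode(x)
z1 = sample_latent(mu, sigma)
z2 = sample_latent(mu, sigma)    # a different draw from the same distribution
x_rec = decode(z1)               # reconstructed data, close to x but not identical
```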

图5中的(d)示出了基于流的生成模型的原理。基于流的生成模型也可以称为流模型。流模型包括一系列的可逆变换器。利用该一系列可逆变换器对数据x进行处理，可以表示为f(x)，得到变换后的数据z。对变换后的数据z进行f(x)的逆变换f⁻¹(z)，可以得到重构数据。流模型可以使得模型能够更加精确地学习到数据分布，它的损失函数是一个负对数似然函数。流模型的训练过程，可以理解为学习从数据x的复杂分布到数据z的简单分布的转换所需的可逆变换f(x)。流模型的生成过程，则是随机采样得到数据z，并利用f(x)的逆变换f⁻¹(z)将数据z转换为重构数据。(d) in Figure 5 shows the principle of the flow-based generative model, which can also be called a flow model. A flow model includes a series of invertible transforms. Processing the data x with this series of invertible transforms, which can be expressed as f(x), yields the transformed data z. Applying the inverse transform f⁻¹(z) of f(x) to the transformed data z yields the reconstructed data. The flow model enables the model to learn the data distribution more accurately, and its loss function is a negative log-likelihood function. The training process of the flow model can be understood as learning the invertible transform f(x) required to convert the complex distribution of data x into the simple distribution of data z. In the generation process of the flow model, data z is obtained by random sampling, and the inverse transform f⁻¹(z) of f(x) is used to convert data z into reconstructed data.
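A minimal sketch of one invertible transform from such a chain (illustrative only; a real flow stacks many learned transforms and also tracks the log-determinant for the negative log-likelihood). The affine form and its parameters are assumptions for demonstration.

```python
import numpy as np

class AffineFlow:
    """One invertible transform f(x) = a*x + b from the chain of a flow model."""
    def __init__(self, a, b):
        assert a != 0, "a must be nonzero so the transform stays invertible"
        self.a, self.b = a, b

    def forward(self, x):
        # f(x): data -> transformed data z
        return self.a * x + self.b

    def inverse(self, z):
        # f^{-1}(z): transformed data -> reconstructed data
        return (z - self.b) / self.a

flow = AffineFlow(a=2.0, b=1.0)
x = np.array([0.0, 1.0, -3.0])
z = flow.forward(x)
x_rec = flow.inverse(z)    # exact reconstruction, by invertibility
```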

图5中的(e)示出了扩散模型的原理。扩散模型本质上是一个马尔可夫链架构，只是其中训练过程用到了深度学习的反向传播算法。扩散模型的思想来源于非平衡热力学，算法的理论基础是通过变分推断训练参数化的马尔可夫链。马尔可夫链的一个关键性质是平稳性，即如果一个概率随时间变化，那么在马尔可夫链的作用下，它会趋向于某种平稳分布，时间越长，分布越平稳。该过程要求具备“无记忆性”的性质，即下一状态的概率分布只能由当前状态决定，时间序列中它前面的事件均与之无关。(e) in Figure 5 shows the principle of the diffusion model. The diffusion model is essentially a Markov chain architecture, except that its training process uses the back-propagation algorithm of deep learning. The idea of the diffusion model comes from non-equilibrium thermodynamics, and the theoretical basis of the algorithm is to train parameterized Markov chains through variational inference. A key property of the Markov chain is stationarity: if a probability changes over time, then under the action of the Markov chain it tends toward a certain stationary distribution, and the longer the time, the more stationary the distribution. This process requires the property of "memorylessness", that is, the probability distribution of the next state is determined only by the current state, and the events before it in the time series are irrelevant to it.

扩散模型的框架可以采用去噪扩散概率模型(denoising diffusionprobabilistic model,DDPM)、基于分数的生成模型(score-based Generative Models,SGM)或随机微分方程(stochastic differential equation,SDE)等。The framework of the diffusion model can adopt the denoising diffusion probabilistic model (DDPM), score-based generative models (SGM) or stochastic differential equation (SDE).

扩散模型的模型结构可以是卷积神经网络,例如可以是U型网络(U-NET)、基于梯度去噪的分数模型(noise conditional score networks,NCSN)、NCSN的升级版(NCSN++)等。The model structure of the diffusion model can be a convolutional neural network, for example, a U-net, a noise conditional score network (NCSN), an upgraded version of NCSN (NCSN++), etc.

扩散模型的使用涉及正向扩散过程(forward diffusion process)和反向扩散过程(reverse diffusion process)。The use of diffusion models involves forward diffusion process and reverse diffusion process.

正向扩散过程也可以称为前向扩散过程、前行过程或扩散过程，对应着分子热动力学中的扩散过程。反向扩散过程也可以称为逆扩散过程、抽象过程、反向过程或去噪过程。在正向扩散过程中，通过缓慢添加噪声生成样本的马尔可夫链。在反向扩散过程中，通过缓慢去除噪声生成样本的马尔可夫链。The forward diffusion process may also be called the forward process or the diffusion process, and corresponds to the diffusion process in molecular thermodynamics. The reverse diffusion process may also be called the inverse diffusion process, the abstraction process, the backward process or the denoising process. In the forward diffusion process, a Markov chain of samples is generated by slowly adding noise. In the reverse diffusion process, a Markov chain of samples is generated by slowly removing noise.

马尔科夫链包括一系列的状态和一系列的变化概率,这里的状态指的是含有不同的噪音等级的数据,变化概率指的是从当前状态变化到下一状态的概率,使用变化矩阵来实现。The Markov chain consists of a series of states and a series of change probabilities. The state here refers to data with different noise levels, and the change probability refers to the probability of changing from the current state to the next state, which is implemented using a change matrix.

如图5中的(e)所示，对于数据x0，在正向扩散过程的T步的第t步中，向数据xt-1添加少量高斯噪声，以得到数据xt。数据xt表示经过t步得到的数据。其中，t=1,2…,T，T为正整数。也就是说，对于数据x0，经过T步，每一步得到的数据分别为x1，x2，…，xT。数据xT为纯噪声。As shown in (e) of Figure 5, for the data x0, in the t-th of the T steps of the forward diffusion process, a small amount of Gaussian noise is added to the data xt-1 to obtain the data xt, where xt represents the data obtained after t steps, t=1,2…,T, and T is a positive integer. That is to say, for the data x0, after T steps, the data obtained at each step are x1, x2, …, xT, and the data xT is pure noise.

在不同的步数索引t下添加的噪声可以是不同的,不同的步数情况下添加的噪声的变化情况可以称为差异时间表,例如可以是线性时间表、平方时间表、方差时间表等。差异时间表可以是预设的。The noise added under different step indexes t may be different, and the change of the noise added under different step conditions may be called a difference schedule, for example, a linear schedule, a square schedule, a variance schedule, etc. The difference schedule may be preset.
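As an illustrative sketch outside the patent text, the linear and square difference schedules mentioned above can be written directly in numpy; the endpoint values 1e-4 and 0.02 are conventional DDPM choices assumed here for demonstration.

```python
import numpy as np

T = 1000

# Per-step noise amount beta_t under two common difference schedules:
linear = np.linspace(1e-4, 0.02, T)                       # linear schedule
square = np.linspace(1e-4 ** 0.5, 0.02 ** 0.5, T) ** 2    # square schedule

# Under either schedule the added noise grows with the step index t.
assert np.all(np.diff(linear) > 0) and np.all(np.diff(square) > 0)
```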

对于图5中的(f)所示的图像数据，正向扩散过程就是不断在图像上加噪声直到图像变成一个纯噪声。从数据x0到数据xT就是一个马尔可夫链，表示状态空间中经过从一个状态到另一个状态转换的随机过程。For the image data shown in (f) in Figure 5, the forward diffusion process continuously adds noise to the image until the image becomes pure noise. The chain from data x0 to data xT is a Markov chain, representing a random process of transitions from one state to another in the state space.

正向扩散过程中,步数索引t可以表示添加噪声的次数。During the forward diffusion process, the step index t can represent the number of times noise is added.

数据x0，x1，…，xT可以是尺寸相同的图像，也可以是音频、频域数据或其他类型的数据。The data x0, x1, …, xT can be images of the same size, or audio, frequency-domain data, or other types of data.

利用通过正向扩散过程得到的数据x1至数据xT，可以训练扩散模型。利用初始扩散模型对步数索引t分别为t=1,2…,T的情况下每个步数索引t对应的训练噪声数据进行处理以得到每个步数索引t对应的训练去噪数据，并根据该步数索引t对应的训练去噪数据与该步数索引t对应的标签去噪数据之间的差异，调整初始扩散模型的参数。该差异可以通过损失值表示。扩散模型为参数调整后的初始扩散模型。The data x1 to xT obtained through the forward diffusion process can be used to train the diffusion model. The initial diffusion model processes the training noise data corresponding to each step index t (t=1,2…,T) to obtain the training denoised data corresponding to that step index t, and the parameters of the initial diffusion model are adjusted according to the difference between the training denoised data corresponding to step index t and the label denoised data corresponding to the same step index t. The difference can be represented by a loss value. The diffusion model is the initial diffusion model after parameter adjustment.

步数索引t=T对应的训练噪声数据可以是纯噪声数据,t为T之外的其他值情况下步数索引t对应的训练噪声数据可以理解为在不带噪声的数据上添加t次噪声得到的数据,步数索引t对应的标签去噪数据可以理解为在不带噪声的数据上添加t-1次噪声得到的数据。The training noise data corresponding to the step index t=T can be pure noise data. When t is other values than T, the training noise data corresponding to the step index t can be understood as the data obtained by adding t times of noise to the noise-free data. The label denoised data corresponding to the step index t can be understood as the data obtained by adding t-1 times of noise to the noise-free data.
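The construction of training noise data "with t additions of noise" can be sketched with DDPM's closed form, which collapses the t Gaussian additions into a single draw: xt = sqrt(ᾱt)·x0 + sqrt(1-ᾱt)·ε. This closed form and the schedule endpoints are standard DDPM assumptions brought in for illustration, not taken from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)     # cumulative signal-retention factor

def make_training_pair(x0, t):
    """Training noise data for step index t (1-based): x0 with t steps of noise.
    Returns (x_t, eps); eps is the target when the model predicts the added noise."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t - 1]) * x0 + np.sqrt(1.0 - alpha_bar[t - 1]) * eps
    return x_t, eps

x0 = rng.standard_normal((8, 8))            # stand-in for a clean image
x_small, _ = make_training_pair(x0, 1)      # barely perturbed
x_large, _ = make_training_pair(x0, T)      # close to pure noise
```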

也就是说，步数索引t=T对应的训练噪声数据可以理解为数据xT，t为T之外的其他值情况下步数索引t对应的训练噪声数据可以理解为数据xt，步数索引t对应的标签去噪数据可以理解为数据xt-1。That is, the training noise data corresponding to step index t=T can be understood as the data xT; when t takes a value other than T, the training noise data corresponding to step index t can be understood as the data xt, and the label denoised data corresponding to step index t can be understood as the data xt-1.

在正向扩散过程中,步数索引t=T对应的训练噪声数据可以是纯噪声数据,t为T之外的其他值情况下步数索引t对应的训练噪声数据可以是在初始扩散模型对纯噪声数据去除t-1次噪声得到的。In the forward diffusion process, the training noise data corresponding to the step index t=T can be pure noise data. When t is a value other than T, the training noise data corresponding to the step index t can be obtained by removing t-1 noises from the pure noise data in the initial diffusion model.

初始扩散模型的输出可以是训练去噪数据,即训练去噪数据可以是初始扩散模型的输出。或者,初始扩散模型的输出可以是噪声,即训练去噪数据可以是在训练噪声数据的基础上去除初始扩散模型的输出表示的噪声得到的。The output of the initial diffusion model may be training denoised data, that is, the training denoised data may be the output of the initial diffusion model. Alternatively, the output of the initial diffusion model may be noise, that is, the training denoised data may be obtained by removing the noise represented by the output of the initial diffusion model based on the training noise data.

在步数索引t对应的训练去噪数据是在步数索引t对应的训练噪声数据的基础上去除初始扩散模型的输出表示的噪声得到的情况下,根据步数索引t对应的训练去噪数据与标签去噪数据之间的差异调整初始扩散模型的参数,可以理解为初始扩散模型的输出与训练噪声数据相对训练去噪数据添加的噪声之间的差异调整初始扩散模型的参数。该差异可以通过损失值表示。In the case where the training denoised data corresponding to the step index t is obtained by removing the noise represented by the output of the initial diffusion model based on the training noise data corresponding to the step index t, the parameters of the initial diffusion model are adjusted according to the difference between the training denoised data corresponding to the step index t and the label denoised data. This can be understood as adjusting the parameters of the initial diffusion model by the difference between the output of the initial diffusion model and the noise added by the training noise data relative to the training denoised data. The difference can be represented by a loss value.

反向扩散过程可以理解为扩散模型的推理过程。在反向扩散过程中,t可以理解为剩余迭代次数加1。The back diffusion process can be understood as the inference process of the diffusion model. In the back diffusion process, t can be understood as the remaining number of iterations plus 1.

反向扩散过程中，从噪声数据xT出发，逐渐去除数据xT中的噪声，将数据xT还原至原始数据x0。图5中的(e)以原始数据x0为图像为例进行说明。对于图5中的(e)所示的图像数据，反向扩散过程从纯噪声图像出发逐渐对其进行去噪，直到得到真实的图像。该纯噪声图像也可以称为高斯噪声图像或噪声图像。In the reverse diffusion process, starting from the noise data xT, the noise in the data xT is gradually removed to restore it to the original data x0. In (e) of Figure 5, the original data x0 is an image. For the image data shown in (e) of Figure 5, the reverse diffusion process starts from a pure-noise image and gradually denoises it until a real image is obtained. The pure-noise image may also be called a Gaussian noise image or a noise image.

如图5中的(g)所示，利用训练得到的扩散模型，基于步数索引t，对输入数据进行去噪，得到输出数据。输入数据可以理解为数据xt，输出数据可以理解为数据xt-1。As shown in (g) of Figure 5, the trained diffusion model is used to denoise the input data based on the step index t to obtain the output data. The input data can be understood as the data xt, and the output data as the data xt-1.
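The step-by-step reverse diffusion described above can be sketched as a DDPM-style sampling loop. The update rule is the standard DDPM posterior step, assumed here for illustration; `predict_noise` is a placeholder for the trained network, so the loop demonstrates only the control flow, not real generation.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def predict_noise(x_t, t):
    """Placeholder for the trained diffusion model (a neural network in practice)."""
    return np.zeros_like(x_t)

def reverse_diffusion(shape):
    x = rng.standard_normal(shape)            # x_T: pure Gaussian noise
    for t in range(T, 0, -1):                 # step index t counts down
        eps = predict_noise(x, t)
        coef = betas[t - 1] / np.sqrt(1.0 - alpha_bar[t - 1])
        x = (x - coef * eps) / np.sqrt(alphas[t - 1])    # remove one step of noise
        if t > 1:                             # no extra noise at the final step
            x = x + np.sqrt(betas[t - 1]) * rng.standard_normal(shape)
    return x                                  # x_0: the generated data

sample = reverse_diffusion((4, 4))
```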

正向扩散过程和反向扩散过程，都可以通过相对应的随机微分方程和常微分方程表示。并且扩散模型的优化目标(预测每一步所添加的噪声)实际上可以理解为学习一个当前输入对目标数据分布最优的梯度方向。这非常符合直观理解，即我们对输入所添加的噪声实际上使得输入远离了其原本的数据分布，而学习一个数据空间上最优的梯度方向实际上便等同于往去除噪声的方向去行走。在反向扩散过程中，使用确定性的常微分方程可以得到确定性的采样结果；在正向扩散过程中，也可以通过构造反向扩散过程的常微分方程的逆过程，来得到正向扩散过程最终的加噪声结果(实际上，如果我们有一条确定性的路径，那么正向扩散过程和反向扩散过程，无非是正着走一遍和反着走一遍)。这个结论使得扩散生成变得高度可控，不用担心扩散模型对数据xT处理得到的数据与原始数据x0是否完全相似，使得一系列的调控成为可能。Both the forward diffusion process and the reverse diffusion process can be represented by corresponding stochastic differential equations and ordinary differential equations. The optimization goal of the diffusion model (predicting the noise added at each step) can actually be understood as learning the gradient direction that best moves the current input toward the target data distribution. This matches intuition: the noise added to the input pushes it away from its original data distribution, and learning the optimal gradient direction in data space is equivalent to walking in the direction that removes noise. In the reverse diffusion process, deterministic sampling results can be obtained using a deterministic ordinary differential equation; in the forward diffusion process, the final noise-adding result can likewise be obtained by constructing the inverse of the ordinary differential equation of the reverse diffusion process (in fact, given a deterministic path, the forward and reverse diffusion processes are nothing more than walking it forward and backward). This conclusion makes diffusion generation highly controllable: there is no need to worry about whether the data obtained by the diffusion model from the data xT is exactly similar to the original data x0, which makes a series of controls possible.

扩散模型的训练过程,可以简单的理解为通过神经网络学习从纯噪声数据逐渐对数据去除高斯噪声的过程。The training process of the diffusion model can be simply understood as the process of gradually removing Gaussian noise from pure noise data through neural network learning.

对于一个T步的扩散模型，每一步的索引为t。在正向扩散过程中，我们从一个真实图像x0开始，在每一步随机生成一些高斯噪声，然后将生成的噪声逐步加入到输入图像中，当T足够大时，得到的加噪声后的图像便接近一个高斯噪声图像，例如对于模型DDPM可以设置T=1000。通过一个训练过程使得神经网络学习从数据xt-1到数据xt添加的噪声，然后在反向扩散过程中，从噪声数据xT开始(训练时是真实图像加噪声的结果，采样时是随机噪声)，通过逐渐去除噪声的方式得到最后要生成的图像。这意味着扩散模型的每一次处理，用于生成与输入的数据xt相似的数据xt-1。从根本上说，扩散模型的工作原理，是通过连续添加高斯噪声来破坏数据x0，然后通过反转这个噪声逐渐增加的过程，来学习恢复数据。利用扩散模型对随机采样的噪声进行多次去除高斯噪声的去噪处理，通过反向扩散过程，可以生成数据。For a T-step diffusion model, the index of each step is t. In the forward diffusion process, we start from a real image x0, randomly generate some Gaussian noise at each step, and gradually add the generated noise to the input image; when T is large enough, the noised image is close to a Gaussian noise image. For example, T=1000 can be set for the DDPM model. Through a training process, the neural network learns the noise added from the data xt-1 to the data xt. Then, in the reverse diffusion process, starting from the noise data xT (the result of adding noise to a real image during training, and random noise during sampling), the final image to be generated is obtained by gradually removing noise. This means that each processing of the diffusion model generates data xt-1 similar to the input data xt. Fundamentally, the diffusion model works by corrupting the data x0 through continuously adding Gaussian noise, and then learning to recover the data by reversing this process of gradually increasing noise. By using the diffusion model to repeatedly denoise randomly sampled noise through the reverse diffusion process, data can be generated.

在反向扩散过程中,扩散模型的输入,除了数据,还可以包括条件。条件可以包括用于表示步数索引t的信息,还可以包括其他信息。该其他信息可以与数据/>相关,从而,扩散模型生成的数据与该其他信息更加相符。In the back diffusion process, the input of the diffusion model, in addition to the data , and may also include conditions. The conditions may include information for indicating the step index t, and may also include other information. The other information may be related to the data /> related, so that the data generated by the diffusion model are more consistent with this other information.

扩散模型其实是一种隐变量模型(latent variable model)，使用马尔可夫链(Markov chain, MC)映射到隐空间(latent space)。通过马尔可夫链，在每一步t中逐渐将噪声添加到数据x0中，以获得后验概率q(x1:T|x0)，其中x1:T表示t分别为1至T情况下的数据x1至xT。x0代表输入的数据，x1:T为隐变量，隐空间与输入数据具有相同维度。The diffusion model is actually a latent variable model that uses a Markov chain (MC) to map to a latent space. Through the Markov chain, noise is gradually added to the data x0 at each step t to obtain the posterior probability q(x1:T|x0), where x1:T denotes the data x1 to xT for t from 1 to T. x0 represents the input data, x1:T are the latent variables, and the latent space has the same dimension as the input data.

在一些实施例中,第一图像I0的随机运动位移场是在时间域进行预测的。In some embodiments, the random motion displacement field of the first image I 0 is predicted in the time domain.

通过自回归的方式,可以在时间域进行随机运动位移场的预测。自回归是根据之前的某些时刻的序列值来预测接下来应该输出的值。Through autoregression, the random motion displacement field can be predicted in the time domain. Autoregression is to predict the value that should be output next based on the sequence value at certain previous moments.
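A small sketch (illustrative, not the patent's model) of autoregression as described above: each new value is predicted from the preceding values, appended to the sequence, and reused for the next prediction. The AR(2) coefficients below are hypothetical and chosen so the model extends a linear trend exactly.

```python
import numpy as np

def ar_predict(history, coeffs):
    """Autoregression: the next value is a weighted sum of the previous p values."""
    p = len(coeffs)
    return float(np.dot(coeffs, history[-p:][::-1]))

coeffs = [2.0, -1.0]            # hypothetical AR(2) coefficients
series = [1.0, 2.0, 3.0]
for _ in range(3):
    # Each prediction becomes part of the history for the next step,
    # mirroring how a predicted frame can be fed back in as the new first image.
    series.append(ar_predict(series, coeffs))
```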

随机运动位移场预测模型410可以对第一图像进行处理,以得到第一图像I0的随机运动位移场。第一图像I0的随机运动位移场对应的时刻为第一图像对应的时刻之后的时刻。The random motion displacement field prediction model 410 may process the first image to obtain the random motion displacement field of the first image I 0. The moment corresponding to the random motion displacement field of the first image I 0 is a moment after the moment corresponding to the first image.

在随机运动位移场预测模型410在时间域预测第一图像I0的随机运动位移场的情况下,随机运动位移场预测模型410的训练数据可以包括训练样本和标签随机运动位移场。随机运动位移场预测模型410的训练数据中的训练样本包括样本图像。When the random motion displacement field prediction model 410 predicts the random motion displacement field of the first image I0 in the time domain, the training data of the random motion displacement field prediction model 410 may include training samples and label random motion displacement field. The training samples in the training data of the random motion displacement field prediction model 410 include sample images.

利用初始随机运动位移场预测模型对训练样本进行处理,可以得到训练随机运动位移场。根据训练随机运动位移场和标签随机运动位移场之间的差异,调整初始随机运动位移场预测模型的参数,可以得到随机运动位移场预测模型410。该差异可以通过损失值表示。随机运动位移场预测模型410可以理解为参数调整后的初始随机运动位移场预测模型。The training sample is processed using the initial random motion displacement field prediction model to obtain a training random motion displacement field. According to the difference between the training random motion displacement field and the label random motion displacement field, the parameters of the initial random motion displacement field prediction model are adjusted to obtain a random motion displacement field prediction model 410. The difference can be represented by a loss value. The random motion displacement field prediction model 410 can be understood as the initial random motion displacement field prediction model after parameter adjustment.

样本图像可以是训练视频中的一帧图像,标签随机运动位移场可以是根据训练视频确定的样本图像在训练时刻的随机运动位移场。训练时刻为训练视频中样本图像对应的时刻之后时刻。标签随机运动位移场,即样本图像在训练时刻的随机运动位移场,表示样本图像中的多个像素在训练时刻相对样本图像对应的时刻的位置。The sample image may be a frame image in a training video, and the label random motion displacement field may be a random motion displacement field of the sample image at a training time determined according to the training video. The training time is a time after the time corresponding to the sample image in the training video. The label random motion displacement field, i.e., the random motion displacement field of the sample image at the training time, indicates the positions of multiple pixels in the sample image at the training time relative to the time corresponding to the sample image.

示例性地,随机运动位移场预测模型410也可以对第一图像和时间信息进行处理,以得到该时间信息表示的时刻的随机运动位移场。时间信息表示的时刻可以是第一图像对应的时刻之后的时刻。随机运动位移场对应的时刻为时间信息表示的时刻。Exemplarily, the random motion displacement field prediction model 410 may also process the first image and the time information to obtain the random motion displacement field at the moment indicated by the time information. The moment indicated by the time information may be a moment after the moment corresponding to the first image. The moment corresponding to the random motion displacement field is the moment indicated by the time information.

在随机运动位移场预测模型410的输入不包括时间信息的情况下,随机运动位移场预测模型410对于不同的第一图像进行处理得到的随机运动位移场对应的时刻,与随机运动位移场预测模型410处理的第一图像对应的时刻,之间的时间间隔可以是相等的。也就是说,随机运动位移场预测模型410对第一图像进行处理,可以得到与第一图像对应的时刻之间为预设时间间隔的时刻对应的随机运动位移场。In the case where the input of the random motion displacement field prediction model 410 does not include time information, the time intervals between the moments corresponding to the random motion displacement fields obtained by the random motion displacement field prediction model 410 for processing different first images and the moments corresponding to the first images processed by the random motion displacement field prediction model 410 may be equal. In other words, the random motion displacement field prediction model 410 processes the first images to obtain the random motion displacement fields corresponding to the moments corresponding to the first images that are a preset time interval apart.

而在随机运动位移场预测模型410的输入包括时间信息的情况下,时间信息可以理解为时间嵌入(embedding)。在时间信息变化的情况下,随机运动位移场预测模型410可以输出对应于不同时间信息表示的时刻的随机运动位移场。也就是说,通过引入时间信息,根据第一图像,可以预测第一图像的时刻之后的不同时刻的随机运动位移场。When the input of the random motion displacement field prediction model 410 includes time information, the time information can be understood as time embedding. When the time information changes, the random motion displacement field prediction model 410 can output the random motion displacement field corresponding to the moment represented by different time information. That is, by introducing time information, the random motion displacement field at different moments after the moment of the first image can be predicted based on the first image.

在随机运动位移场预测模型410的输入包括时间信息的情况下,随机运动位移场预测模型410的训练数据中的训练样本还可以包括训练时间信息。训练时间信息用于表示训练时刻。In the case where the input of the random motion displacement field prediction model 410 includes time information, the training samples in the training data of the random motion displacement field prediction model 410 may also include training time information. The training time information is used to indicate the training moment.

示例性地,输入随机运动位移场预测模型410中的第一图像,也可以是添加噪声后的第一图像。在第一图像上添加噪声,可以理解为在空间域添加的噪声。在输入随机运动位移场预测模型410中的第一图像是添加噪声后的第一图像的情况下,随机运动位移场预测模型410的训练数据中,样本图像可以是对训练视频中的一帧图像添加噪声后得到的图像。Exemplarily, the first image input into the random motion displacement field prediction model 410 may also be the first image after adding noise. Adding noise to the first image may be understood as adding noise in the spatial domain. In the case where the first image input into the random motion displacement field prediction model 410 is the first image after adding noise, in the training data of the random motion displacement field prediction model 410, the sample image may be an image obtained by adding noise to a frame image in the training video.

示例性地，输入随机运动位移场预测模型410中的时间信息，也可以是添加噪声后的时间信息。添加噪声后的时间信息可以表示添加噪声后的训练时刻。在时间信息上添加噪声，可以理解为在时间域添加噪声。在输入随机运动位移场预测模型410中的时间信息是添加噪声后的时间信息的情况下，随机运动位移场预测模型410的训练数据中，训练时间信息可以是对表示训练时刻的信息添加噪声得到的信息。Exemplarily, the time information input into the random motion displacement field prediction model 410 may also be time information after adding noise. The time information after adding noise may represent the training moment after adding noise. Adding noise to the time information may be understood as adding noise in the time domain. In the case where the time information input into the random motion displacement field prediction model 410 is time information after adding noise, in the training data of the random motion displacement field prediction model 410, the training time information may be information obtained by adding noise to the information representing the training moment.

通过在时间域或空间域添加噪声,可以理解为运动加入了随机噪声,为运动增加了随机性。By adding noise in the time domain or space domain, it can be understood that random noise is added to the motion, which increases the randomness of the motion.

根据某个时刻的随机运动位移场,可以确定该时刻的第二图像。该第二图像也可以作为第一图像,利用图像处理系统400再次进行处理,从而可以得到更多的时刻的第二图像。According to the random motion displacement field at a certain moment, the second image at that moment can be determined. The second image can also be used as the first image and processed again by the image processing system 400, so as to obtain second images at more moments.

在样本图像是根据训练视频中的第一帧图像确定的情况下,随机运动位移场预测模型410的训练数据中的训练样本还可以包括训练目标主体区域信息和/或训练运动信息。In the case where the sample image is determined based on the first frame image in the training video, the training sample in the training data of the random motion displacement field prediction model 410 may also include training target subject area information and/or training motion information.

训练目标主体区域信息可以用于指示训练目标主体在训练视频的第一帧图像中所在的区域,该区域也可以称为训练目标主体区域。The training target subject region information may be used to indicate a region where the training target subject is located in the first frame image of the training video, and this region may also be referred to as a training target subject region.

在训练视频开始后的一段时间内,训练目标主体处于运动的状态。示例性地,在训练视频中,训练目标主体可以始终处于运动的状态。For a period of time after the training video starts, the training target subject is in a state of motion. For example, in the training video, the training target subject may always be in a state of motion.

并且,在训练视频开始后的一段时间内,训练视频中训练目标主体之外其他主体均处于静止状态。Moreover, for a period of time after the training video starts, all subjects except the training target subject in the training video are in a stationary state.

训练运动信息可以表示训练目标主体在训练视频中的运动幅度、训练目标主体在训练视频开始时的运动方向等中的一个或多个。The training motion information may represent one or more of the motion amplitude of the training target subject in the training video, the motion direction of the training target subject at the beginning of the training video, and the like.

示例性地,训练视频的时间长度可以是预设的,也可以是随机确定的,还可以是按照预设的规则确定的,还可以是人工确定的。Exemplarily, the duration of the training video may be preset, randomly determined, determined according to a preset rule, or manually determined.

计算机图形学中相关的研究表明，自然界的运动特别是振荡型的运动，可以描述为一组谐振动的叠加，这些谐振动可以由不同的频率、振幅和相位来表示。Relevant research in computer graphics shows that motion in nature, especially oscillatory motion, can be described as the superposition of a set of harmonic vibrations, which can be represented by different frequencies, amplitudes and phases.
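The superposition described above can be sketched directly: an oscillatory motion built as the sum of a few harmonics, each with its own frequency, amplitude and phase. The particular frequencies and amplitudes below are hypothetical.

```python
import numpy as np

t = np.linspace(0.0, 2.0, 400)   # time axis, in seconds

def harmonic(amp, freq_hz, phase):
    """One harmonic vibration with a given amplitude, frequency and phase."""
    return amp * np.sin(2 * np.pi * freq_hz * t + phase)

# A natural oscillation modeled as the superposition of three harmonics.
motion = (harmonic(1.0, 1.0, 0.0)
          + harmonic(0.3, 3.0, 0.5)
          + harmonic(0.1, 7.0, 1.2))
```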

在时间域和/或空间域内添加随机噪声，可能会导致画面不真实或者不稳定。在时间域进行随机运动位移场的预测时，根据随机运动位移场生成的视频在较长的时间段内无法保证具有很好的时间一致性和空间连续性，随着生成的第二图像的数量的增加，包括该多个第二图像的视频可能发生漂移或发散。Adding random noise in the time domain and/or space domain may cause the picture to be unrealistic or unstable. When the random motion displacement field is predicted in the time domain, the video generated according to the random motion displacement field cannot be guaranteed to have good temporal consistency and spatial continuity over a long period of time; as the number of generated second images increases, the video including the multiple second images may drift or diverge.

视频发生漂移和发散,可以理解为较为靠后的帧对应的图像中记录的对象存在重影或模糊,甚至无法辨认。Video drift and divergence can be understood as ghosting or blurring of objects recorded in images corresponding to later frames, or even making them unrecognizable.

在另一些实施例中,随机运动位移场可以是在频域进行预测得到的。In other embodiments, the random motion displacement field may be predicted in the frequency domain.

如图6所示,随机运动位移场预测模型410可以包括第一变换模块411、随机运动纹理预测模型412和第二变换模块413。As shown in FIG. 6 , the random motion displacement field prediction model 410 may include a first transformation module 411 , a random motion texture prediction model 412 , and a second transformation module 413 .

第一变换模块411用于对第一图像I0进行傅里叶变换,将第一图像I0转换至频域,得到图像频域数据。傅立叶变换,可以通过快速傅立叶变换(fast Fourier transform,FFT)实现。The first transform module 411 is used to perform Fourier transform on the first image I 0 , convert the first image I 0 into the frequency domain, and obtain image frequency domain data. The Fourier transform can be implemented by fast Fourier transform (FFT).

随机运动纹理预测模型412用于对图像频域数据进行处理,得到第一图像I0的随机运动纹理S。The random motion texture prediction model 412 is used to process the image frequency domain data to obtain the random motion texture S of the first image I 0 .

随机运动纹理S也可以称为随机运动频谱,包括第一图像I0的全部或部分像素的运动轨迹的频率表示,即频域下的运动轨迹。随机运动纹理S包括多个像素中每个像素的运动轨迹的频率表示。每个像素的运动轨迹的频率表示可以包括在K个频率f0至fK-1中每个频点的分量。The random motion texture S may also be referred to as a random motion spectrum, and includes a frequency representation of the motion trajectories of all or some of the pixels of the first image I0, that is, the motion trajectories in the frequency domain. The random motion texture S includes a frequency representation of the motion trajectory of each pixel in the plurality of pixels. The frequency representation of the motion trajectory of each pixel may include a component at each of the K frequencies f0 to fK-1.

第一图像I0中位置为p的像素的运动轨迹的频率表示,可以表示为Sp。位置为p的像素可以是第一图像I0的某个像素。位置为p的像素,也可以理解为像素p。The frequency representation of the motion trajectory of the pixel at position p in the first image I0 may be expressed as Sp. The pixel at position p may be a pixel of the first image I0. The pixel at position p may also be understood as pixel p.

第二变换模块413用于对第一图像I0的随机运动纹理S进行逆傅立叶变换,从而将随机运动纹理S转换为在第一图像I0对应的时刻之后的至少一个时刻对应的第一图像I0的随机运动位移场D。The second transformation module 413 is used to perform inverse Fourier transform on the random motion texture S of the first image I0 , so as to convert the random motion texture S into a random motion displacement field D of the first image I0 corresponding to at least one time after the time corresponding to the first image I0 .

示例性地,第二变换模块413用于对第一图像I0的随机运动纹理S中像素p的运动轨迹的频率表示Sp进行转换,以得到像素p在至少一个时刻的运动位移Dt(p)。Exemplarily, the second transformation module 413 is used to convert the frequency representation Sp of the motion trajectory of pixel p in the random motion texture S of the first image I0, so as to obtain the motion displacement Dt(p) of pixel p at at least one moment.

也就是说,第一图像I0在某个时刻对应的随机运动位移场D包括随机运动纹理S表示的该多个像素中每个像素p在该时刻的运动位移Dt(p)。That is, the random motion displacement field D corresponding to the first image I0 at a certain moment includes the motion displacement Dt(p), at that moment, of each pixel p in the plurality of pixels represented by the random motion texture S.

每个预设时刻的随机运动位移场D中,像素p在某个时刻的运动位移Dt(p)用于表示像素p在该时刻的位置相对第一图像I0中的位置的变化,即表示像素p在该时刻相对第一图像对应的时刻的位移。也就是说,在该某个时刻,像素p的位置可以表示为p+Dt(p)。因此,结合像素p在第一图像I0中的位置,根据像素p的运动位移Dt(p),可以确定像素p在该时刻的图像中所在的位置。In the random motion displacement field D at each preset moment, the motion displacement Dt(p) of pixel p at a certain moment is used to indicate the change of the position of pixel p at this moment relative to its position in the first image I0, that is, the displacement of pixel p at this moment relative to the moment corresponding to the first image. In other words, at this moment, the position of pixel p can be expressed as p+Dt(p). Therefore, combined with the position of pixel p in the first image I0, the position of pixel p in the image at this moment can be determined according to the motion displacement Dt(p) of pixel p.
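The position update just described can be sketched as follows (the helper name and all numbers are illustrative only): the position of pixel p at a given moment is its position in the first image I0 plus its motion displacement at that moment.

```python
# Toy sketch: warp a pixel position by its motion displacement.
def warp_position(p, d):
    """Return the pixel position at time t, i.e. p + D_t(p)."""
    return (p[0] + d[0], p[1] + d[1])

p = (120, 80)        # position of pixel p in the first image I0 (made up)
d_t = (3.5, -1.25)   # motion displacement D_t(p) at time t (made up)
p_t = warp_position(p, d_t)  # position of pixel p in the image at time t
```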

随机运动纹理预测模型412的训练数据可以包括训练样本和标签随机运动纹理。随机运动纹理预测模型412的训练样本可以包括训练图像频域数据。利用初始随机运动纹理预测模型对训练样本进行处理,得到训练随机运动纹理。根据训练随机运动纹理和标签随机运动纹理之间的差异,对初始随机运动纹理预测模型的参数进行调整,可以得到随机运动纹理预测模型412。随机运动纹理预测模型412即参数调整后的初始随机运动纹理预测模型。对初始随机运动纹理预测模型的参数的调整,可以最小化训练随机运动纹理和标签随机运动纹理之间的差异,或使得该差异逐渐收敛。该差异可以通过损失值表示。The training data of the random motion texture prediction model 412 may include training samples and label random motion textures. The training samples of the random motion texture prediction model 412 may include training image frequency domain data. The training samples are processed using the initial random motion texture prediction model to obtain the training random motion texture. According to the difference between the training random motion texture and the label random motion texture, the parameters of the initial random motion texture prediction model are adjusted to obtain the random motion texture prediction model 412. The random motion texture prediction model 412 is the initial random motion texture prediction model after parameter adjustment. The adjustment of the parameters of the initial random motion texture prediction model can minimize the difference between the training random motion texture and the label random motion texture, or make the difference gradually converge. The difference can be represented by a loss value.

训练图像频域数据可以是对训练视频中的样本图像进行傅里叶变换得到的。样本图像可以是训练视频中的一帧图像。The training image frequency domain data may be obtained by performing Fourier transform on a sample image in the training video. The sample image may be a frame of image in the training video.

标签随机运动纹理包括样本图像中的多个像素中每个像素的运动轨迹的频率表示。标签随机运动纹理可以是对训练视频中样本图像之后的一段视频进行分析得到的。The labeled random motion texture includes a frequency representation of a motion trajectory of each pixel in a plurality of pixels in the sample image. The labeled random motion texture may be obtained by analyzing a segment of video following the sample image in the training video.

通过特征追踪(feature tracking)、光流提取(optical flow)或粒子视频(particle video)的方式对训练视频中样本图像之后的一段视频进行处理,可以得到样本图像中的多个像素中每个像素的运动轨迹。对每个像素点的运动轨迹分别进行傅里叶变换,可以得到该运动轨迹的频域表示,从而得到标签随机运动纹理。By processing a video after the sample image in the training video through feature tracking, optical flow extraction or particle video, the motion trajectory of each pixel in the sample image can be obtained. The motion trajectory of each pixel point is Fourier transformed to obtain the frequency domain representation of the motion trajectory, thereby obtaining the label random motion texture.
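The trajectory-to-frequency step above can be sketched with a plain discrete Fourier transform (our own minimal DFT over a toy trajectory; the patent does not prescribe this implementation):

```python
import cmath

def dft(x):
    """Discrete Fourier transform of a real-valued trajectory."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

# Toy per-frame displacement of a single pixel over 4 frames.
trajectory = [0.0, 1.0, 0.0, -1.0]
# Frequency-domain representation of that trajectory, i.e. the per-pixel
# entry of the labeled random motion texture.
spectrum = dft(trajectory)
```

Doing this for every tracked pixel yields the labeled random motion texture used as ground truth.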

特征跟踪是计算机视觉中的一项关键任务。特征跟踪通过对特征点的跟踪,可以在视频中确定特征点的位置变化。通过检测视频的多个图像中具有共同性质的图像特征点,并将不同图像中相同的特征点进行匹配,可以实现图像的对比和跟踪,确定特征点在视频中不同图像中的位置,实现对特征点的跟踪,从而对特征点进行运动估计。Feature tracking is a key task in computer vision. By tracking feature points, feature tracking can determine the position changes of feature points in a video. By detecting image feature points with common properties in multiple images of a video and matching the same feature points across different images, image comparison and tracking can be achieved, the positions of feature points in different images of the video can be determined, and the feature points can be tracked, thereby estimating the motion of the feature points.

根据特征点的运动估计结果,可以确定该视频中第一帧图像中特征点的运动轨迹。特征点的运动轨迹,也可以理解为像素点的运动轨迹。According to the motion estimation result of the feature point, the motion trajectory of the feature point in the first frame image of the video can be determined. The motion trajectory of the feature point can also be understood as the motion trajectory of the pixel point.

光流(optical flow或optic flow)是视域中的物体运动检测中的概念,用来描述相对于观察者的运动所造成的观测目标、表面或边缘的运动。光流是空间运动物体在观察成像平面上的像素运动的瞬时速度,是利用图像序列中像素在时间域上的变化以及相邻帧之间的相关性来找到上一帧跟当前帧之间存在的对应关系,从而计算出相邻帧之间物体的运动信息的一种方法。Optical flow (or optic flow) is a concept in the detection of object motion in the field of view, used to describe the motion of an observed target, surface or edge caused by motion relative to the observer. Optical flow is the instantaneous speed of the pixel motion of a moving object in space on the observation imaging plane. It is a method that uses the change of pixels in the time domain in an image sequence and the correlation between adjacent frames to find the correspondence between the previous frame and the current frame, thereby calculating the motion information of objects between adjacent frames.

粒子视频(particle video)是一个视频和相应的粒子集。也就是说,粒子视频使用一组粒子来表示视频运动。粒子i具有随时间变化的位置(xi(t), yi(t)),其定义在粒子的开始帧和结束帧之间。通过向前、然后向后扫过视频,可以构建粒子视频。每个粒子都是具有长持续时间轨迹和其他特性的图像点样本。A particle video is a video and a corresponding set of particles. That is, a particle video uses a set of particles to represent video motion. Particle i has a time-varying position (xi(t), yi(t)) defined between the particle's start and end frames. A particle video can be constructed by sweeping the video forward and then backward. Each particle is an image point sample with a long-duration trajectory and other properties.

粒子视频通过增强长程外观一致性和运动一致性,改进了帧到帧的光流。Particle videos improve frame-to-frame optical flow by enforcing long-range appearance consistency and motion consistency.

图7中的(a)、(b)和(c)分别示出了基于特征点追踪、光流估计和粒子视频的方式确定的视频中的像素的位置随时间的变化情况。图7中,横轴表示时间,纵轴表示像素在视频的画面中的横坐标。如图7中的(b)所示,光流估计容易导致光流场过度平滑但轨迹较短,产生的运动纹理不真实,或者出现斑点。如图7中的(a)所示,特征点追踪的方式,提取的运动轨迹长但稀疏。如图7中的(c)所示,粒子视频的方式提取的运动轨迹稠密且较长。因此,可以通过粒子视频的方式,对训练视频进行像素的运动轨迹的提取。(a), (b) and (c) in Figure 7 respectively show how the positions of pixels in a video, determined by feature point tracking, optical flow estimation and particle video, change over time. In Figure 7, the horizontal axis represents time and the vertical axis represents the horizontal coordinate of the pixel in the video frame. As shown in (b) of Figure 7, optical flow estimation tends to produce an over-smoothed optical flow field with short trajectories, resulting in unrealistic motion textures or speckles. As shown in (a) of Figure 7, the trajectories extracted by feature point tracking are long but sparse. As shown in (c) of Figure 7, the trajectories extracted by particle video are dense and long. Therefore, the motion trajectories of pixels can be extracted from the training video by means of particle video.

本申请实施例通过在频率域中表示图像中每个像素的运动纹理,也就是将第一图像I0转换为图像频域数据,并根据图像频域数据确定随机运动纹理,避免生成的视频随着时间增加发生漂移或发散。The embodiment of the present application avoids drift or divergence of the generated video as time increases by representing the motion texture of each pixel in the image in the frequency domain, that is, converting the first image I 0 into image frequency domain data and determining the random motion texture based on the image frequency domain data.

具体处理时,可以将获取的真实视频中的第10帧以及之后的149帧视频作为训练视频,生成多个像素点的运动轨迹。去除运动估计错误的轨迹、由于相机运动导致运动幅度过大的轨迹以及各个时刻所有像素点都运动的轨迹,以得到过滤后的多个像素点的运动轨迹。根据过滤后的多个像素点的运动轨迹,可以得到标签随机运动纹理。标签随机运动纹理可以理解为训练过程中的真值(ground truth)。In specific processing, the 10th frame and the subsequent 149 frames of the acquired real video can be used as the training video to generate the motion trajectories of multiple pixels. Trajectories with incorrect motion estimation, trajectories with excessive motion amplitude caused by camera motion, and trajectories in which all pixels move at every moment are removed to obtain the filtered motion trajectories of multiple pixels. Based on the filtered motion trajectories of multiple pixels, the labeled random motion texture can be obtained. The labeled random motion texture can be understood as the ground truth in the training process.

一般来说,运动纹理在频率上具有特定的分布特性。运动纹理的幅度范围可以从0到100,并且随着频率的增加大致呈指数衰减。扩散模型的输出需要位于0到1之间,模型才能够稳定地训练和去噪。因此,从真实视频中提取的标签随机运动纹理,可以是归一化得到的。Generally speaking, motion textures have a specific distribution characteristic over frequency. The amplitude of a motion texture can range from 0 to 100, and decays roughly exponentially with increasing frequency. The output of the diffusion model needs to lie between 0 and 1 for the model to be trained and denoised stably. Therefore, the labeled random motion texture extracted from the real video may be obtained by normalization.

如果依据图像的宽度和高度将标签随机运动纹理的幅度缩放到[0,1],那么在较高频率处几乎所有的系数都会接近于零。用这样的训练数据训练出来的模型,会导致运动不准确。当归一化的标签随机运动纹理的幅度非常接近于零时,在推理过程中即使是很小的预测误差,在反归一化后也可能导致很大的相对误差。为了解决这个问题,可以采用一种简单但有效的频率自适应归一化技术。具体而言,根据未进行归一化的运动纹理计算统计数据,从而独立地对每个频率处的傅立叶系数进行归一化。If the amplitude of the labeled random motion texture is scaled to [0,1] according to the width and height of the image, almost all coefficients at higher frequencies will be close to zero. A model trained on such training data will produce inaccurate motion. When the amplitude of the normalized labeled random motion texture is very close to zero, even a small prediction error during inference may lead to a large relative error after denormalization. To solve this problem, a simple but effective frequency-adaptive normalization technique can be used. Specifically, statistics are computed from the un-normalized motion textures, and the Fourier coefficients at each frequency are normalized independently.
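A hedged sketch of the frequency-adaptive normalization idea (the per-frequency statistic used here, the maximum magnitude, is our assumption; the text only says statistics are computed from the un-normalized motion textures):

```python
# frequency index -> coefficient magnitudes over pixels (made-up values
# showing the exponential fall-off with frequency).
coeffs = {
    0: [80.0, 100.0, 60.0],
    1: [8.0, 10.0, 6.0],
    2: [0.8, 1.0, 0.6],
}

def normalize_per_frequency(coeffs):
    """Scale each frequency's coefficients independently into [0, 1]."""
    normed = {}
    for k, values in coeffs.items():
        scale = max(values) or 1.0  # per-frequency statistic (assumed: max)
        normed[k] = [v / scale for v in values]
    return normed

normed = normalize_per_frequency(coeffs)
```

With a single global scale, the frequency-2 coefficients would end up near zero; normalizing each frequency by its own statistic keeps all of them in a usable range.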

随机运动纹理预测模型412输出的随机运动纹理S可以是包括4K个通道的2维运动频谱图,其中K为随机运动纹理S中的频率数量,也可以理解为傅立叶系数的数量。在随机运动纹理S中的每个频率上,可以用四个标量来表示x和y维度的复傅立叶变换系数。图像频域数据可以通过图像的形式表示,x和y可以分别表示图像的形式表示的图像频域数据中的坐标。图像的形式表示的图像频域数据中,坐标x和y可以通过复傅立叶变换系数表示。复傅立叶变换系数z可以表示为z=x+iy,i为虚数单位。The random motion texture S output by the random motion texture prediction model 412 can be a 2D motion spectrum diagram including 4K channels, where K is the number of frequencies in the random motion texture S, which can also be understood as the number of Fourier coefficients. At each frequency in the random motion texture S, four scalars can be used to represent the complex Fourier transform coefficients of the x and y dimensions. The image frequency domain data can be represented in the form of an image, and x and y can respectively represent the coordinates in the image frequency domain data represented in the form of an image. In the image frequency domain data represented in the form of an image, the coordinates x and y can be represented by complex Fourier transform coefficients. The complex Fourier transform coefficient z can be expressed as z=x+iy, where i is an imaginary unit.
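The channel layout described above can be illustrated as follows (the packing order of the four scalars is our assumption; the text only states that at each frequency four scalars represent the complex Fourier coefficients of the x and y dimensions):

```python
def pack(zx: complex, zy: complex):
    """Four scalars per frequency: (Re zx, Im zx, Re zy, Im zy)."""
    return [zx.real, zx.imag, zy.real, zy.imag]

K = 16  # number of frequencies (Fourier coefficients)
# Made-up complex coefficients for the x and y dimensions at each frequency.
per_freq = [pack(complex(0.1 * k, -0.2 * k), complex(0.3, 0.4)) for k in range(K)]
channels = [c for freq in per_freq for c in freq]  # flattened: 4*K channels
```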

大多数自然振荡等运动主要由低频分量组成,运动的频谱随着频率增加呈指数下降,这表明大多数自然振动等运动可以由低频频率来很好地表示。在K=16的情况下,傅立叶系数的数量足以在一系列视频和场景中真实地重现原始的自然运动。因此,K可以大于或等于16,且小于预设阈值。Most natural oscillations and other motions are mainly composed of low-frequency components, and the spectrum of the motion decreases exponentially as the frequency increases, which indicates that most natural vibrations and other motions can be well represented by low-frequency frequencies. In the case of K=16, the number of Fourier coefficients is sufficient to realistically reproduce the original natural motion in a series of videos and scenes. Therefore, K can be greater than or equal to 16 and less than a preset threshold.

随机运动纹理预测模型412可以是神经网络模型,例如可以是生成模型,如扩散模型,或潜在扩散模型(latent diffusion model,LDM)等。The random motion texture prediction model 412 may be a neural network model, for example, a generative model such as a diffusion model or a latent diffusion model (LDM).

在随机运动纹理预测模型412的训练过程中,根据训练随机运动纹理和标签随机运动纹理之间的差异,对初始随机运动纹理预测模型的参数进行调整,可以理解为根据每次进行高斯噪声的去除得到的训练去噪数据与该次去噪对应的标签去噪数据之间的差异,对初始随机运动纹理预测模型的参数进行调整。每次去噪对应的标签去噪数据是根据标签随机运动纹理确定的。最后一次去噪对应的标签去噪数据为标签随机运动纹理,其他次去噪对应的标签去噪数据是在标签随机运动纹理的基础上添加高斯噪声得到的。每次去噪对应的标签去噪数据中添加的高斯噪声的程度是不同的。During the training process of the random motion texture prediction model 412, the parameters of the initial random motion texture prediction model are adjusted according to the difference between the training random motion texture and the label random motion texture. It can be understood that the parameters of the initial random motion texture prediction model are adjusted according to the difference between the training denoised data obtained by removing Gaussian noise each time and the label denoised data corresponding to the denoising. The label denoised data corresponding to each denoising is determined based on the label random motion texture. The label denoised data corresponding to the last denoising is the label random motion texture, and the label denoised data corresponding to other denoising is obtained by adding Gaussian noise to the label random motion texture. The degree of Gaussian noise added to the label denoised data corresponding to each denoising is different.

在样本图像是训练视频中的第一帧图像的情况下,随机运动纹理预测模型412的训练数据中的训练样本还可以包括训练目标主体区域信息和/或训练运动信息。In the case where the sample image is the first frame image in the training video, the training sample in the training data of the random motion texture prediction model 412 may also include training target subject area information and/or training motion information.

在一些实施例中,随机运动纹理预测模型412可以是扩散模型。在随机运动纹理预测模型412为扩散模型的情况下,条件可以包括图像频域数据。如果第一图像是待处理图像,则条件还可以包括目标主体区域信息和/或运动信息。In some embodiments, the random motion texture prediction model 412 may be a diffusion model. In the case where the random motion texture prediction model 412 is a diffusion model, the condition may include image frequency domain data. If the first image is an image to be processed, the condition may also include target subject area information and/or motion information.

在随机运动纹理预测模型412的训练过程中,利用初始随机运动纹理预测模型对训练样本进行处理,可以理解为利用初始随机运动纹理预测模型,基于训练样本,对噪声频域数据进行多次高斯噪声的去除,从而得到训练随机运动纹理。During the training process of the random motion texture prediction model 412, the training samples are processed using the initial random motion texture prediction model. This can be understood as using the initial random motion texture prediction model to remove Gaussian noise from the noise frequency domain data multiple times based on the training samples, thereby obtaining the training random motion texture.

噪声频域数据可以理解为频域表示的噪声,即纯高斯噪声数据的频域表示。训练样本可以理解为扩散模型训练过程中的条件。Noise frequency domain data can be understood as noise represented in the frequency domain, that is, the frequency domain representation of pure Gaussian noise data. Training samples can be understood as the conditions in the diffusion model training process.

随机运动纹理S用四个通道表示一个频率。预测具有K个频率的随机运动纹理S的直接方法,是采用扩散模型作为随机运动纹理预测模型412。扩散模型的输出是一个具有4K通道的张量。训练一个含有大量输出通道的模型,往往会导致输出产生过度平滑和准确性降低。另一种方法是向扩散模型注入额外的频率参数来独立预测每个频率的运动频谱,但这会导致频率域中的不相关预测,从而产生不真实的动作。为了避免上述问题,随机运动纹理预测模型412可以采用LDM。The random motion texture S uses four channels to represent one frequency. A direct method to predict a random motion texture S with K frequencies is to use a diffusion model as the random motion texture prediction model 412. The output of the diffusion model is a tensor with 4K channels. Training a model with a large number of output channels often results in over-smoothing of the output and reduced accuracy. Another method is to inject additional frequency parameters into the diffusion model to independently predict the motion spectrum of each frequency, but this will result in unrelated predictions in the frequency domain, resulting in unrealistic motion. In order to avoid the above problems, the random motion texture prediction model 412 can adopt LDM.

在另一些实施例中,随机运动纹理预测模型412可以是LDM。In some other embodiments, the random motion texture prediction model 412 may be an LDM.

LDM在保持生成图像质量的同时,比扩散模型更加高效。An LDM is more efficient than a diffusion model while maintaining the quality of the generated images.

如图8中的(a)所示,LDM 500可以包括三个部分,分别为编码器510、扩散模型(diffusion model,DM)520和解码器530。编码器510和解码器530可以分别为变分自编码器(variational auto-encoder,VAE)中的编码器和解码器。As shown in (a) of FIG. 8, the LDM 500 may include three parts, namely, an encoder 510, a diffusion model (DM) 520, and a decoder 530. The encoder 510 and the decoder 530 may be the encoder and the decoder in a variational auto-encoder (VAE), respectively.

编码器510用于对输入数据进行压缩,也就是将输入数据映射到隐空间。隐空间也可以称为潜在空间。基于编码器510对数据进行压缩的结果,利用扩散模型520进行多次迭代,每次迭代均进行高斯噪声的去除,从而多次迭代的结果即为去噪压缩数据。解码器530用于对去噪压缩数据进行解码,可以得到去噪数据。The encoder 510 is used to compress the input data, that is, to map the input data to a latent space. The latent space can also be called a potential space. Based on the result of the encoder 510 compressing the data, the diffusion model 520 is used to perform multiple iterations, and each iteration removes Gaussian noise, so that the result of multiple iterations is denoised compressed data. The decoder 530 is used to decode the denoised compressed data to obtain denoised data.

在随机运动纹理预测模型412采用LDM 500的情况下,如图8中的(a)所示,LDM 500中编码器510用于对图像频域数据进行压缩,以得到图像压缩数据。编码器510对图像频域数据的压缩,也可以理解为将图像频域数据映射到隐空间。图像压缩数据也可以称为频率嵌入(embedding)。图像压缩数据的数据量小于图像频域数据的数据量。When the random motion texture prediction model 412 adopts the LDM 500, as shown in (a) of FIG8 , the encoder 510 in the LDM 500 is used to compress the image frequency domain data to obtain image compressed data. The compression of the image frequency domain data by the encoder 510 can also be understood as mapping the image frequency domain data to a latent space. The image compressed data can also be referred to as frequency embedding. The amount of the image compressed data is less than the amount of the image frequency domain data.

LDM 500中编码器510还可以用于对噪声频域数据进行压缩,得到压缩后的噪声频域数据。The encoder 510 in the LDM 500 may also be used to compress the noise frequency domain data to obtain compressed noise frequency domain data.

LDM 500中的扩散模型520,用于根据条件,对压缩后的噪声频域数据进行多次去噪处理,以得到压缩数据。LDM 500中的解码器530用于对压缩数据进行解码,以得到第一图像I0的随机运动纹理S。输入扩散模型520的条件包括图像压缩数据。输入扩散模型520的条件还可以包括步数索引t。The diffusion model 520 in the LDM 500 is used to perform multiple denoising processes on the compressed noise frequency domain data according to the conditions to obtain compressed data. The decoder 530 in the LDM 500 is used to decode the compressed data to obtain the random motion texture S of the first image I 0. The conditions input to the diffusion model 520 include the image compression data. The conditions input to the diffusion model 520 may also include the step index t.

也就是说,扩散模型520每一次的去噪处理,是根据条件,对上一次去噪处理得到的数据进行的。上一次去噪处理得到的数据可以理解为数据zt,本次去噪处理得到的数据可以理解为zt-1。数据zt可以是经过扩散模型520对压缩后的噪声频域数据进行T-t次去噪处理得到的,T为扩散模型520进行去噪处理的总次数。That is to say, each denoising process of the diffusion model 520 is performed, according to the conditions, on the data obtained by the previous denoising process. The data obtained by the previous denoising process can be understood as data zt, and the data obtained by the current denoising process can be understood as zt-1. The data zt may be obtained by the diffusion model 520 performing T-t denoising processes on the compressed noise frequency domain data, where T is the total number of denoising processes performed by the diffusion model 520.

噪声频域数据可以是对运动纹理添加频域的高斯噪声得到的。The noise frequency domain data may be obtained by adding Gaussian noise in the frequency domain to the motion texture.

在随机运动纹理预测模型412为LDM的情况下,随机运动纹理预测模型412中的编码器和解码器可以是预训练得到的。对编码器和解码器的预训练,训练数据可以包括标签数据。编码器和解码器的预训练过程中使用的训练数据也可以称为预训练数据。When the random motion texture prediction model 412 is an LDM, the encoder and decoder in the random motion texture prediction model 412 may be pre-trained. For the pre-training of the encoder and decoder, the training data may include label data. The training data used in the pre-training process of the encoder and decoder may also be referred to as pre-training data.

在预训练过程中,可以利用初始编码器对标签数据进行压缩得到训练压缩数据,利用初始解码器对训练压缩数据进行解压缩得到训练解压缩数据。根据标签数据与训练解压缩数据之间的差异,调整初始编码器和初始解码器的参数,调整后的初始编码器可以作为随机运动纹理预测模型412中的编码器510,调整后的初始解码器可以作为随机运动纹理预测模型412中的解码器530。该差异可以通过损失值表示。In the pre-training process, the label data can be compressed by the initial encoder to obtain the training compressed data, and the training compressed data can be decompressed by the initial decoder to obtain the training decompressed data. According to the difference between the label data and the training decompressed data, the parameters of the initial encoder and the initial decoder are adjusted, and the adjusted initial encoder can be used as the encoder 510 in the random motion texture prediction model 412, and the adjusted initial decoder can be used as the decoder 530 in the random motion texture prediction model 412. The difference can be represented by a loss value.
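A toy sketch of this pre-training objective (linear maps stand in for the real encoder and decoder networks; all values are illustrative): the reconstruction loss compares the label data with its decompressed reconstruction, and the encoder/decoder parameters would be adjusted to minimize it.

```python
def encoder(x, w):
    """Compress: a stand-in for mapping input data to the latent space."""
    return [w * v for v in x]

def decoder(z, w):
    """Decompress: a stand-in for the decoder network."""
    return [v / w for v in z]

def reconstruction_loss(x, x_rec):
    """Mean squared difference between label data and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_rec)) / len(x)

label = [0.2, -0.4, 0.8]   # made-up label data
w = 0.5                    # a single "parameter" of the toy autoencoder
loss = reconstruction_loss(label, decoder(encoder(label, w), w))
```

Here the toy decoder exactly inverts the toy encoder, so the loss is zero; for real networks the parameters are adjusted iteratively to drive this loss down.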

如图8中的(b)所示,为了获得LDM 500中扩散模型训练所需的训练数据,可以将标签数据输入编码器510,以得到未添加噪声的标签隐空间特征。之后在隐空间特征中逐渐加入高斯噪声,产生不同噪声程度的标签隐变量特征。训练数据包括未添加噪声的标签隐空间特征和不同噪声程度的标签隐变量特征。其中,噪声程度最高的标签隐变量特征可以理解为纯噪声。As shown in (b) of FIG8 , in order to obtain the training data required for the diffusion model training in LDM 500, the label data can be input into the encoder 510 to obtain the label latent space features without adding noise. Then, Gaussian noise is gradually added to the latent space features to generate label latent variable features with different noise levels. The training data includes the label latent space features without adding noise and the label latent variable features with different noise levels. Among them, the label latent variable features with the highest noise level can be understood as pure noise.

对于应用于本申请的图像处理系统的LDM 500,标签数据可以是标签运动纹理。For the LDM 500 applied to the image processing system of the present application, the label data may be a label motion texture.

训练LDM 500中的扩散模型的过程中,利用初始扩散模型对噪声程度最高的标签隐变量特征进行多次去噪处理,依次得到多个训练隐变量特征。During the process of training the diffusion model in LDM 500, the label latent variable features with the highest noise level are subjected to multiple denoising processes using the initial diffusion model to obtain multiple training latent variable features in turn.

扩散模型进行去噪处理的次数与在隐空间特征中加入噪声的次数相同。该多个训练隐变量特征与除噪声程度最高的标签隐变量特征之外的多个标签隐变量特征一一对应。第一个得到的训练隐变量特征对应的标签隐变量特征为除噪声程度最高的标签隐变量特征之外的多个标签隐变量特征中噪声程度最高的标签隐变量特征,最后一个得到的训练隐变量特征对应的标签隐变量特征为噪声程度最低的标签隐变量特征。每个训练隐变量特征对应的标签隐变量特征中的噪声程度随该多个训练隐变量特征的得到的序号的增加而降低。The number of times the diffusion model performs denoising is the same as the number of times noise is added to the latent space features. The multiple training latent variable features correspond one-to-one to the multiple label latent variable features except the label latent variable feature with the highest degree of noise. The label latent variable feature corresponding to the first obtained training latent variable feature is the label latent variable feature with the highest degree of noise among the multiple label latent variable features except the label latent variable feature with the highest degree of noise, and the label latent variable feature corresponding to the last obtained training latent variable feature is the label latent variable feature with the lowest degree of noise. The degree of noise in the label latent variable feature corresponding to each training latent variable feature decreases as the sequence number of the multiple training latent variable features increases.

根据每个训练隐变量特征与该训练隐变量特征对应的标签隐变量特征之间的差异,调整初始扩散模型的参数,可以得到扩散模型。扩散模型为参数调整后的初始扩散模型。每个训练隐变量特征与该训练隐变量特征对应的标签隐变量特征之间的差异,可以通过损失值表示。According to the difference between each training latent variable feature and the label latent variable feature corresponding to the training latent variable feature, the parameters of the initial diffusion model are adjusted to obtain a diffusion model. The diffusion model is the initial diffusion model after parameter adjustment. The difference between each training latent variable feature and the label latent variable feature corresponding to the training latent variable feature can be represented by a loss value.

如果第一图像是待处理图像,则条件还可以包括目标主体区域信息和/或运动信息。If the first image is an image to be processed, the condition may further include target subject area information and/or motion information.

在随机运动纹理预测模型412为LDM的情况下,对于运动纹理的生成的过程,也可以理解为频域协调去噪的过程。When the random motion texture prediction model 412 is LDM, the process of generating motion texture can also be understood as a process of frequency domain coordinated denoising.

在又一些实施例中,随机运动纹理预测模型412可以包括压缩模型和扩散模型。In yet other embodiments, the random motion texture prediction model 412 may include a compression model and a diffusion model.

压缩模型可以用于对图像频域数据进行压缩,以得到图像压缩数据。扩散模型可以根据条件,对噪声频域数据进行多次去噪处理,以得到第一图像I0的随机运动纹理S。输入扩散模型520的条件可以包括图像压缩数据,还可以包括步数索引t。The compression model can be used to compress the image frequency domain data to obtain image compression data. The diffusion model can perform multiple denoising processes on the noise frequency domain data according to the conditions to obtain the random motion texture S of the first image I 0. The conditions input to the diffusion model 520 can include the image compression data and can also include the step index t.

多次去噪处理中,扩散模型的每一次去噪处理可以理解为根据条件对输入数据进行高斯噪声的去除,得到的数据可以作为扩散模型下一次去噪处理的输入数据。第一次去噪处理的输入数据为噪声频域数据,第一次之后的其他次的去噪处理过程中扩散模型的输入数据为上一次去噪处理扩散模型的输出。扩散模型最后一次去噪处理得到的数据为第一图像I0的随机运动纹理S。In multiple denoising processes, each denoising process of the diffusion model can be understood as removing Gaussian noise from the input data according to conditions, and the obtained data can be used as the input data for the next denoising process of the diffusion model. The input data of the first denoising process is the noise frequency domain data, and the input data of the diffusion model in other denoising processes after the first one is the output of the diffusion model of the previous denoising process. The data obtained by the last denoising process of the diffusion model is the random motion texture S of the first image I 0.

该多次去噪处理的次数为T,噪声频域数据可以理解为频域表示的噪声。也就是说,扩散模型的第一次去噪处理的输入,即ST,可以理解为纯高斯噪声数据的频域表示。The number of the multiple denoising processes is T, and the noise frequency domain data can be understood as noise represented in the frequency domain. That is, the input of the first denoising process of the diffusion model, that is, ST, can be understood as the frequency domain representation of pure Gaussian noise data.

以数据ST为噪声频域数据为例,扩散模型的第i次去噪处理,可以理解为根据条件对输入数据St进行高斯噪声的去除,以得到数据St-1,其中t可以表示为T-i+1。扩散模型最后一次去噪处理得到的数据S0为第一图像I0的随机运动纹理S。Taking the data ST as the noise frequency domain data as an example, the i-th denoising process of the diffusion model can be understood as removing Gaussian noise from the input data St according to the conditions to obtain the data St-1, where t can be expressed as T-i+1. The data S0 obtained by the last denoising process of the diffusion model is the random motion texture S of the first image I0.
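The indexing of this iterative denoising can be sketched as follows (the real denoiser is a trained diffusion network; `denoise_step` here is a dummy that merely halves its input, and all numbers are made up). The i-th step takes the data for step index t = T - i + 1 and produces the next, less noisy data; after T steps the result plays the role of S0, the predicted random motion texture.

```python
T = 4  # total number of denoising steps (illustrative)

def denoise_step(s_t, t, condition):
    # stand-in for one conditioned Gaussian-noise-removal step
    return [v * 0.5 for v in s_t]

s = [1.0, -2.0, 4.0, 0.5]  # stands in for S_T, the pure-noise input
for i in range(1, T + 1):
    t = T - i + 1          # step index fed to the model as a condition
    s = denoise_step(s, t, condition=None)
# s now stands in for S_0, the predicted random motion texture
```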

在随机运动纹理预测模型412包括压缩模型和扩散模型的情况下,压缩模型可以是预训练得到的。对压缩模型的预训练可以参见对编码器510的预训练。压缩模型可以是编码器。In the case where the random motion texture prediction model 412 includes a compression model and a diffusion model, the compression model may be obtained by pre-training. The pre-training of the compression model may refer to the pre-training of the encoder 510. The compression model may be an encoder.

在随机运动纹理预测模型412包括压缩模型和扩散模型的情况下,对随机运动纹理预测模型412的训练,可以理解为对扩散模型的训练。When the random motion texture prediction model 412 includes a compression model and a diffusion model, the training of the random motion texture prediction model 412 can be understood as the training of the diffusion model.

在随机运动纹理预测模型412的训练过程中,利用初始随机运动纹理预测模型对训练样本进行处理,可以理解为,利用VAE对训练样本中的图像频域数据进行压缩得到训练图像压缩数据,并利用初始扩散模型,基于训练样本,对噪声频域数据进行多次高斯噪声的去除,从而得到训练随机运动纹理。训练样本包括训练图像压缩数据。In the training process of the random motion texture prediction model 412, the training samples are processed using the initial random motion texture prediction model. It can be understood that the image frequency domain data in the training samples is compressed using the VAE to obtain training image compression data, and the noise frequency domain data is repeatedly Gaussian noise removed based on the training samples using the initial diffusion model to obtain the training random motion texture. The training samples include the training image compression data.

根据训练随机运动纹理和标签随机运动纹理之间的差异,对初始随机运动纹理预测模型的参数进行调整,可以理解为,根据每次进行高斯噪声的去除得到的训练去噪数据与该次去噪对应的标签去噪数据之间的差异,对初始随机运动纹理预测模型的参数进行调整。每次去噪对应的标签去噪数据是根据标签随机运动纹理确定的。最后一次去噪对应的标签去噪数据为标签随机运动纹理,其他次去噪对应的标签去噪数据是在标签随机运动纹理的基础上添加高斯噪声得到的。每次去噪对应的标签去噪数据中添加的高斯噪声的程度是不同的。According to the difference between the training random motion texture and the label random motion texture, the parameters of the initial random motion texture prediction model are adjusted. It can be understood that the parameters of the initial random motion texture prediction model are adjusted according to the difference between the training denoised data obtained by removing Gaussian noise each time and the label denoised data corresponding to the denoising. The label denoised data corresponding to each denoising is determined based on the label random motion texture. The label denoised data corresponding to the last denoising is the label random motion texture, and the label denoised data corresponding to other denoising is obtained by adding Gaussian noise to the label random motion texture. The degree of Gaussian noise added to the label denoised data corresponding to each denoising is different.

每次进行高斯噪声的去除得到的训练去噪数据,可以作为下一次高斯噪声的去除的基础。也就是说,第一次高斯噪声的去除可以是对噪声频域数据进行的,其他次高斯噪声的去除可以是对上一次高斯噪声的去除得到的训练去噪数据进行的。The training denoising data obtained by each Gaussian noise removal can be used as the basis for the next Gaussian noise removal. In other words, the first Gaussian noise removal can be performed on the noise frequency domain data, and the other Gaussian noise removal can be performed on the training denoising data obtained by the previous Gaussian noise removal.
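The stepwise denoising chain described above can be sketched numerically. This is a toy illustration, not the patented model: the per-step "denoiser" simply blends the current estimate toward that step's label data, and the sizes, noise levels, and step count are invented for the example. It shows how the last step's label is the clean texture, earlier labels carry more Gaussian noise, and each step's output feeds the next step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical clean target: the label random motion texture (toy-sized vector).
label = rng.standard_normal(8)

T = 4
# Per-step label denoised data: the last step's label is the clean texture
# (noise level 0); earlier steps' labels carry progressively more noise.
noise_levels = np.linspace(0.75, 0.0, T)
step_labels = [label + lvl * rng.standard_normal(8) for lvl in noise_levels]

x = rng.standard_normal(8)            # initial noise frequency-domain data
start_mse = float(np.mean((x - label) ** 2))
for t in range(T):
    # Stand-in for one Gaussian-noise-removal pass of the diffusion model;
    # its output becomes the input of the next removal.
    x = 0.5 * (x + step_labels[t])

final_mse = float(np.mean((x - label) ** 2))
```

After the chain, the estimate is much closer to the clean label texture than the initial noise was, which is the behavior the training procedure supervises step by step.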

第二变换模块413对第一图像I0的随机运动纹理S进行处理。第一图像I0的随机运动纹理S经过逆傅立叶变换,可以得到第一图像I0之后的至少一个预设时刻的随机运动位移场。每个预设时刻的随机运动位移场中,第一图像I0中每个像素p的运动位移场是对该像素的运动轨迹的频率表示Sp进行逆傅立叶变换得到的。The second transformation module 413 processes the random motion texture S of the first image I0. Applying an inverse Fourier transform to the random motion texture S of the first image I0 yields a random motion displacement field at each of at least one preset time after the first image I0. In the random motion displacement field at each preset time, the motion displacement field of each pixel p in the first image I0 is obtained by applying the inverse Fourier transform to Sp, the frequency representation of that pixel's motion trajectory.

逆傅立叶变换可以是逆快速傅立叶变换。也就是说,每个像素p的运动位移场为 D_p = F⁻¹(S_p),其中,F⁻¹ 表示逆快速傅立叶变换。预设时刻的随机运动位移场可以表示为 D = {F⁻¹(S_p) | 1 ≤ p ≤ P},其中,P为第一图像I0中像素的数量。The inverse Fourier transform can be an inverse fast Fourier transform. That is, the motion displacement field of each pixel p is D_p = F⁻¹(S_p), where F⁻¹ denotes the inverse fast Fourier transform. The random motion displacement field at a preset time can be expressed as D = {F⁻¹(S_p) | 1 ≤ p ≤ P}, where P is the number of pixels in the first image I0.

根据预设时刻的随机运动位移场中每个像素的运动位移场,可以确定该像素在预设时刻的位置。According to the motion displacement field of each pixel in the random motion displacement field at the preset moment, the position of the pixel at the preset moment can be determined.
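As a concrete illustration of the inverse-FFT step, the sketch below fabricates a frequency representation S_p for one pixel's trajectory (via a forward FFT, purely to make the example self-contained; in the system S_p is predicted by the model), recovers the time-domain displacement with `numpy.fft.ifft`, and derives the pixel's position at each preset time.

```python
import numpy as np

T = 16                                 # number of preset future times
t = np.arange(T)

# Hypothetical ground-truth trajectory of one pixel p: a small oscillation.
true_disp = 0.5 * np.cos(2 * np.pi * 2 * t / T)

# S_p: frequency representation of the pixel's motion trajectory. A forward
# FFT fabricates it here; in the system it is predicted, not computed.
S_p = np.fft.fft(true_disp)

# Inverse (fast) Fourier transform recovers the time-domain displacement.
disp = np.fft.ifft(S_p).real

# Position of the pixel at each preset time = original position + displacement.
x0 = 10.0
positions_x = x0 + disp
```

The recovered displacement matches the original trajectory, and adding it to the pixel's original coordinate gives its position at each preset time, as described above.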

与扩散模型对图像频域数据进行处理相比,LDM 500对图像频域数据进行压缩,并对压缩得到的图像压缩数据进行处理,能够降低整体的数据处理量,提高处理效率。Compared with the diffusion model processing the image frequency-domain data directly, the LDM 500 compresses the image frequency-domain data and processes the resulting compressed image data, which reduces the overall amount of data to be processed and improves processing efficiency.

利用生成式的扩散模型来预测未来时刻的运动位移场,可以对运动的主体进行细粒度的控制。在频域内进行处理以得到运动纹理,不是直接处理原始的RGB像素,既提升了计算效率,又可以使生成的运动表示保持较长时间的一致性。Using a generative diffusion model to predict the motion displacement field at future moments can provide fine-grained control over the subject of motion. Processing in the frequency domain to obtain motion textures instead of directly processing the original RGB pixels not only improves computational efficiency, but also keeps the generated motion representation consistent for a longer period of time.

根据第一图像I0的随机运动位移场确定的各个像素在下一时刻的位置中,多个像素下一时刻的位置可能存在重合。而根据第一图像I0的随机运动位移场确定的各个像素在下一时刻的位置所表示的图像中,也可能存在某个位置位于各个像素p在下一时刻的位置之外,也就是说,根据第一图像I0的随机运动位移场D确定的各个像素在下一时刻的位置确定的图像中可能存在空洞。In the positions of each pixel at the next moment determined according to the random motion displacement field of the first image I 0 , the positions of multiple pixels at the next moment may overlap. In the image represented by the positions of each pixel at the next moment determined according to the random motion displacement field of the first image I 0 , there may also be a position outside the position of each pixel p at the next moment, that is, there may be a hole in the image determined by the positions of each pixel at the next moment determined according to the random motion displacement field D of the first image I 0 .

为了避免上述情况的出现,可以根据第一图像I0的随机运动位移场D的特征对第一图像I0的特征进行调整,并根据调整后的特征生成下一时刻的第二图像,以及下一时刻的音频。下一时刻的音频可以理解为下一时刻的第二图像对应的音频。In order to avoid the above situation, the features of the first image I 0 can be adjusted according to the features of the random motion displacement field D of the first image I 0 , and the second image at the next moment and the audio at the next moment are generated according to the adjusted features. The audio at the next moment can be understood as the audio corresponding to the second image at the next moment.

图像特征提取模型430可以是卷积神经网络,包括多个卷积层。第一特征可以仅包括图像特征提取模型430对第一图像I0进行特征提取过程中最后一个卷积层输出的第一子特征。或者,第一特征可以包括图像特征提取模型430中多个卷积层分别输出的多个第一子特征。输出该多个第一子特征的多个卷积层包括图像特征提取模型430的最后一个卷积层。The image feature extraction model 430 may be a convolutional neural network including multiple convolutional layers. The first feature may only include the first sub-feature output by the last convolutional layer in the process of the image feature extraction model 430 performing feature extraction on the first image I 0. Alternatively, the first feature may include multiple first sub-features respectively output by multiple convolutional layers in the image feature extraction model 430. The multiple convolutional layers that output the multiple first sub-features include the last convolutional layer of the image feature extraction model 430.

当卷积神经网络有多个卷积层的时候,初始的卷积层往往提取较多的一般特征,该一般特征也可以称之为低级别的特征;随着卷积神经网络深度的加深,越往后的卷积层提取到的特征越来越复杂,比如高级别的语义之类的特征,语义越高的特征越适用于待解决的问题。When a convolutional neural network has multiple convolutional layers, the initial convolutional layer often extracts more general features, which can also be called low-level features. As the depth of the convolutional neural network increases, the features extracted by the subsequent convolutional layers become more and more complex, such as high-level semantic features. Features with higher semantics are more suitable for the problem to be solved.

在第一特征包括图像特征提取模型430中多个卷积层分别输出的多个第一子特征的情况下,第一特征可以表示为 {F_1, F_2, …, F_G},其中,G为该多个第一子特征的数量,F_1, …, F_G 分别为该多个第一子特征,G为大于1的正整数。In the case where the first feature includes multiple first sub-features respectively output by multiple convolutional layers in the image feature extraction model 430, the first feature can be expressed as {F_1, F_2, …, F_G}, where G is the number of the first sub-features, F_1, …, F_G are the multiple first sub-features, and G is a positive integer greater than 1.

从而,根据第一特征进行第二图像以及第二图像对应的音频的生成,考虑更多信息的影响,使得第二图像以及第二图像对应的音频更加准确。Thus, the second image and the audio corresponding to the second image are generated according to the first feature, and the influence of more information is taken into consideration, so that the second image and the audio corresponding to the second image are more accurate.

图像特征提取模型430中多个卷积层分别输出的多个第一子特征的尺寸不同。该多个第一子特征也可以理解为图像特征提取模型430对第一图像I0进行多尺度编码得到的第一图像I0的多尺度特征。图像特征提取模型430可以理解为编码器。The sizes of the multiple first sub-features output by the multiple convolutional layers in the image feature extraction model 430 are different. The multiple first sub-features can also be understood as multi-scale features of the first image I 0 obtained by the image feature extraction model 430 performing multi-scale encoding on the first image I 0. The image feature extraction model 430 can be understood as an encoder.

在该多个第一子特征中,图像特征提取模型430越往后的卷积层输出的第一子特征的尺寸越小。因此,图像特征提取模型430也可以称为特征金字塔提取器。Among the multiple first sub-features, the size of the first sub-feature output by the later convolutional layers of the image feature extraction model 430 is smaller. Therefore, the image feature extraction model 430 can also be called a feature pyramid extractor.
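A minimal sketch of such a feature pyramid is given below, using 2x average pooling as a stand-in for stride-2 convolutional layers (the real extractor is a trained CNN). It only demonstrates the property stated above: later layers output smaller sub-features, and the last layer's output is included among the collected sub-features.

```python
import numpy as np

def avg_pool2(x):
    """2x2 average pooling: a stand-in for a stride-2 convolutional layer."""
    return 0.25 * (x[0::2, 0::2] + x[1::2, 0::2] + x[0::2, 1::2] + x[1::2, 1::2])

def feature_pyramid(image, levels=3):
    """Collect the outputs of successive 'layers'; later ones are smaller."""
    feats, x = [], image
    for _ in range(levels):
        x = avg_pool2(x)
        feats.append(x)          # the last layer's output is included
    return feats

img = np.arange(64, dtype=float).reshape(8, 8)
sizes = [f.shape for f in feature_pyramid(img)]   # shrinking spatial sizes
```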

运动特征提取模型420可以是卷积神经网络,包括多个卷积层。运动特征可以仅包括运动特征提取模型420对第一图像I0的随机运动位移场D进行特征提取过程中最后一个卷积层输出的运动子特征。或者,运动特征可以包括运动特征提取模型420中多个卷积层分别输出的多个运动子特征。输出该多个运动子特征的多个卷积层包括运动特征提取模型420的最后一个卷积层。The motion feature extraction model 420 may be a convolutional neural network including a plurality of convolutional layers. The motion feature may only include the motion sub-feature output by the last convolutional layer in the process of extracting the random motion displacement field D of the first image I0 by the motion feature extraction model 420. Alternatively, the motion feature may include a plurality of motion sub-features respectively output by a plurality of convolutional layers in the motion feature extraction model 420. The plurality of convolutional layers that output the plurality of motion sub-features include the last convolutional layer of the motion feature extraction model 420.

运动特征提取模型420中多个卷积层分别输出的多个运动子特征的尺寸不同。在该多个运动子特征中,运动特征提取模型420越往后的卷积层输出的运动子特征的尺寸越小。因此,运动特征提取模型420也可以称为特征金字塔提取器。The sizes of the multiple motion sub-features output by the multiple convolutional layers in the motion feature extraction model 420 are different. Among the multiple motion sub-features, the motion sub-features output by later convolutional layers of the motion feature extraction model 420 are smaller. Therefore, the motion feature extraction model 420 can also be called a feature pyramid extractor.

在第一特征包括多个第一子特征的情况下,不同的第一子特征对应于不同的运动子特征。每个第一子特征的尺寸与该第一子特征对应的运动子特征的尺寸相同。In the case where the first feature includes a plurality of first sub-features, different first sub-features correspond to different motion sub-features. The size of each first sub-feature is the same as the size of the motion sub-feature corresponding to the first sub-feature.

调整模型440可以根据每个第一子特征对应的运动子特征,对该第一子特征进行调整,以得到该第一子特征对应的第二子特征。第二特征包括该多个第一子特征中每个第一子特征对应的第二子特征。The adjustment model 440 can adjust each first sub-feature according to the motion sub-feature corresponding to the first sub-feature to obtain a second sub-feature corresponding to the first sub-feature. The second feature includes the second sub-feature corresponding to each first sub-feature in the plurality of first sub-features.

或者,第一特征可以是对多个第一子特征进行拼接得到的,运动子特征可以是对多个运动子特征进行拼接得到的。也就是说,对多个运动子特征进行拼接的拼接结果,以及对多个第一子特征进行拼接的拼接结果,可以输入调整模型440,从而调整模型440可以输出第二特征。Alternatively, the first feature may be obtained by splicing multiple first sub-features, and the motion sub-feature may be obtained by splicing multiple motion sub-features. In other words, the splicing result of splicing multiple motion sub-features and the splicing result of splicing multiple first sub-features may be input into the adjustment model 440, so that the adjustment model 440 may output the second feature.

根据多个运动子特征,可以确定第一图像I0中每个像素p对应的权重。每个像素p对应的权重可以与该像素p在该多个运动子特征中的幅度正相关。According to the multiple motion sub-features, a weight corresponding to each pixel p in the first image I 0 may be determined. The weight corresponding to each pixel p may be positively correlated with the amplitude of the pixel p in the multiple motion sub-features.

示例性地,像素p对应的权重可以表示为 w_p = (1/G)·Σ_{g=1}^{G} A_g(p),其中,G为该多个运动子特征的数量,A_g(p) 表示像素p在第g个运动子特征中的幅度,g为正整数且g≤G。各个运动子特征的尺寸可以是不同的。像素p在第g个运动子特征中的幅度,可以是通过将第g个运动子特征放缩至与第一图像相同的尺寸得到的。对运动子特征的放缩,可以通过插值实现。For example, the weight corresponding to pixel p can be expressed as w_p = (1/G)·Σ_{g=1}^{G} A_g(p), where G is the number of the motion sub-features and A_g(p) denotes the amplitude of pixel p in the g-th motion sub-feature, g being a positive integer with g ≤ G. The sizes of the motion sub-features can be different. The amplitude of pixel p in the g-th motion sub-feature can be obtained by scaling the g-th motion sub-feature to the same size as the first image. The scaling of the motion sub-features can be implemented through interpolation.
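The weight computation can be sketched as follows, under the assumption (made for this example) that the weight is the average amplitude of the pixel over the G motion sub-features after each sub-feature is scaled up to the image size; nearest-neighbor upscaling via `numpy.kron` stands in for the interpolation.

```python
import numpy as np

H = W = 4                                  # first-image resolution (toy)
rng = np.random.default_rng(0)

def upscale_nearest(f, H, W):
    """Nearest-neighbor stand-in for interpolating a sub-feature to image size."""
    ry, rx = H // f.shape[0], W // f.shape[1]
    return np.kron(f, np.ones((ry, rx)))

# Hypothetical motion sub-features at two scales (their sizes differ).
motion_subs = [rng.standard_normal((4, 4)), rng.standard_normal((2, 2))]

# Per-pixel weight: average amplitude of the pixel across the G sub-features.
G = len(motion_subs)
weights = sum(upscale_nearest(np.abs(f), H, W) for f in motion_subs) / G
```

The resulting `weights` array has one non-negative entry per pixel, larger where the motion sub-features have larger amplitude, consistent with the positive correlation stated above.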

调整模型440可以根据多个第一子特征、多个运动子特征,以及第一图像I0中多个像素对应的权重,得到第二子特征。The adjustment model 440 can obtain the second sub-feature according to the multiple first sub-features, the multiple motion sub-features, and the weights corresponding to the multiple pixels in the first image I0.

调整模型440可以利用激活函数(activation function)对第一特征进行调整。激活函数也可以称为激励函数,是模型整个结构中的非线性扭曲力。激活函数往往是一个无参数的固定的非线性变换,它决定着一个神经元输出的值的范围。The adjustment model 440 can adjust the first feature using an activation function. The activation function can also be called an excitation function, which is a nonlinear distortion force in the entire structure of the model. The activation function is often a fixed nonlinear transformation without parameters, which determines the range of values output by a neuron.

调整模型440利用激活函数对第一特征进行的调整,也可以理解为调整模型440进行扭曲计算。扭曲(Warping)用于描述将数据按照一定的变换规则映射为另一个数据的过程。通过对数据进行几何变换,使得变换后的数据与参考数据对齐或匹配。常见的扭曲计算包括仿射变换、透视变换以及一般的非线性变换。The adjustment model 440 makes an adjustment to the first feature using the activation function, which can also be understood as the adjustment model 440 performing a warping calculation. Warping is used to describe the process of mapping data to another data according to a certain transformation rule. By performing a geometric transformation on the data, the transformed data is aligned or matched with the reference data. Common warping calculations include affine transformation, perspective transformation, and general nonlinear transformation.

调整模型440可以进行直接平均扭曲或基线深度扭曲等。Adjusting the model 440 may perform a direct mean warp or a baseline depth warp, among others.

或者,调整模型440可以利用激活函数softmax,对多个第一子特征、多个运动子特征以及第一图像I0中多个像素p分别对应的权重进行处理,以得到第二特征。也就是说,调整模型440可以利用激活函数softmax,对运动特征、第一图像I0中多个像素p分别对应的权重,以及第一特征进行处理,得到第二特征。第二特征也可以称为扭曲特征。Alternatively, the adjustment model 440 may use the softmax activation function to process the multiple first sub-features, the multiple motion sub-features, and the weights corresponding to the multiple pixels p in the first image I0, to obtain the second feature. In other words, the adjustment model 440 may use the softmax activation function to process the motion feature, the weights corresponding to the multiple pixels p in the first image I0, and the first feature, to obtain the second feature. The second feature may also be called a warped feature.

通过在图像生成过程中采用扭曲策略,避免了生成的图像中出现空洞,也避免了多个原像素映射至同一位置。采用多尺度技术,即根据特征提取模型中多个层的输出进行图像的生成,生成的图像更加细致和准确,有利于准确地预测运动模态,得到更加真实的视频。By adopting a warping strategy in the image generation process, holes in the generated image are avoided, as is the mapping of multiple original pixels to the same position. By adopting a multi-scale technique, that is, generating the image based on the outputs of multiple layers in the feature extraction model, the generated image is more detailed and accurate, which facilitates accurate prediction of the motion modes and yields a more realistic video.

图像生成模型450与音频生成模型460可以是生成模型。图像生成模型450与音频生成模型460的结构可以相同或不同。The image generation model 450 and the audio generation model 460 may be generation models. The structures of the image generation model 450 and the audio generation model 460 may be the same or different.

图像生成模型450与音频生成模型460分别对相同的第二特征进行处理,以得到第二图像和第二图像对应的音频。对于根据相同的第二特征生成得到的第二图像和音频,可以设置相同的标识,不同的第二图像可以设置不同的标识。也就是说,第二图像和第二图像对应的音频可以设置有相同的标识,以表示图像和音频的对应关系。从而,使得在包括多个第二图像和每个第二图像对应的音频的目标视频播放的过程中,在显示任一个第二图像的情况下,可以根据该第二图像的标识,确定该第二图像对应的音频,避免目标视频播放过程中图像和音频的播放速度不一致导致显示的图像与播放的音频的内容之间存在较大差异。The image generation model 450 and the audio generation model 460 process the same second feature respectively to obtain the second image and the audio corresponding to the second image. For the second image and audio generated according to the same second feature, the same identifier can be set, and different identifiers can be set for different second images. In other words, the second image and the audio corresponding to the second image can be set with the same identifier to indicate the correspondence between the image and the audio. Thus, during the playback of a target video including multiple second images and the audio corresponding to each second image, when any second image is displayed, the audio corresponding to the second image can be determined based on the identifier of the second image, thereby avoiding a large difference between the content of the displayed image and the played audio due to inconsistent playback speeds of the image and audio during the playback of the target video.
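The shared-identifier pairing described above can be sketched as a simple lookup table; the frame names and data structure here are illustrative only, not part of the patent.

```python
# Pair each generated second image with its audio via a shared identifier,
# so that during playback the audio for any displayed image can be found
# directly, keeping image and audio in sync.
frames = {}   # identifier -> (image, audio); contents are placeholders
for idx, (img, aud) in enumerate([("img_t1", "aud_t1"), ("img_t2", "aud_t2")]):
    frames[idx] = (img, aud)

def audio_for(identifier):
    """Return the audio that shares the given identifier with an image."""
    return frames[identifier][1]
```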

对第二图像和第二图像对应的音频设置的标识,也可以理解为时间嵌入。The identifiers set for the second image and the audio corresponding to the second image can also be understood as time embeddings.

图像处理系统400中,运动特征提取模型420、图像特征提取模型430、图像生成模型450和音频生成模型460是联合训练得到的。In the image processing system 400, the motion feature extraction model 420, the image feature extraction model 430, the image generation model 450 and the audio generation model 460 are jointly trained.

对于编码器510的训练,可以是利用图像训练数据对初始编码器和初始解码器进行训练。编码器510可以是训练得到的参数调整后的初始编码器。For the training of the encoder 510, the initial encoder and the initial decoder may be trained using image training data. The encoder 510 may be the initial encoder whose parameters have been adjusted through this training.

对于运动特征提取模型420、图像特征提取模型430、图像生成模型450和音频生成模型460的联合训练,可以是利用联合训练数据对初始运动特征提取模型、初始图像特征提取模型、初始图像生成模型和初始音频生成模型进行训练。运动特征提取模型420可以是训练得到的参数调整后的初始运动特征提取模型,图像特征提取模型430可以是训练得到的参数调整后的初始图像特征提取模型,图像生成模型450可以是训练得到的参数调整后的初始图像生成模型,音频生成模型460可以是训练得到的参数调整后的初始音频生成模型。For the joint training of the motion feature extraction model 420, the image feature extraction model 430, the image generation model 450 and the audio generation model 460, the initial motion feature extraction model, the initial image feature extraction model, the initial image generation model and the initial audio generation model may be trained using the joint training data. The motion feature extraction model 420 may be the initial motion feature extraction model after training and parameter adjustment, the image feature extraction model 430 may be the initial image feature extraction model after training and parameter adjustment, the image generation model 450 may be the initial image generation model after training and parameter adjustment, and the audio generation model 460 may be the initial audio generation model after training and parameter adjustment.

联合训练数据可以包括训练样本和标签数据,训练样本包括样本图像和训练随机运动位移场,标签数据包括标签图像和标签音频。初始运动特征提取模型用于对训练随机运动位移场进行特征提取,以得到训练运动特征。初始图像特征提取模型用于对样本图像进行特征提取,得到第一训练特征。调整模型440用于根据训练运动特征对第一训练特征进行调整,得到第二训练特征。初始图像生成模型用于根据第二训练特征,生成训练图像。初始音频生成模型用于根据第二训练特征,生成训练音频。根据训练图像与标签图像之间的差异,以及训练音频与标签音频之间的差异,调整初始运动特征提取模型、初始图像特征提取模型、初始图像生成模型和初始音频生成模型的参数,以得到运动特征提取模型420、图像特征提取模型430、图像生成模型450和音频生成模型460。该差异可以通过损失值表示。The joint training data may include training samples and label data; the training samples include sample images and training random motion displacement fields, and the label data include label images and label audio. The initial motion feature extraction model is used to extract features from the training random motion displacement field to obtain training motion features. The initial image feature extraction model is used to extract features from the sample image to obtain a first training feature. The adjustment model 440 is used to adjust the first training feature according to the training motion features to obtain a second training feature. The initial image generation model is used to generate a training image according to the second training feature. The initial audio generation model is used to generate training audio according to the second training feature. According to the difference between the training image and the label image, and the difference between the training audio and the label audio, the parameters of the initial motion feature extraction model, the initial image feature extraction model, the initial image generation model, and the initial audio generation model are adjusted to obtain the motion feature extraction model 420, the image feature extraction model 430, the image generation model 450, and the audio generation model 460. The differences can be represented by loss values.
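The joint parameter update driven by both differences can be illustrated with a toy system that shares one parameter vector between an "image" head and an "audio" head; the heads, labels, and learning rate are invented for this example and stand in for the jointly trained networks.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(4)        # shared parameters of the toy system
x = rng.standard_normal(4)        # stands in for the encoded training sample
label_image, label_audio = 1.0, -0.5

def forward(w, x):
    img = w @ x                   # toy "training image" (a scalar here)
    aud = 0.5 * img               # toy "training audio" head
    return img, aud

for _ in range(200):
    img, aud = forward(w, x)
    # Both the image difference and the audio difference contribute
    # gradients to the shared parameters, as in the joint training above.
    grad = 2 * (img - label_image) * x + 2 * (aud - label_audio) * 0.5 * x
    w -= 0.02 * grad

img, aud = forward(w, x)
```

Because both loss terms act on the same parameters, the system settles at a compromise between the two label targets rather than matching either one exactly, which is the essence of joint training.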

本申请实施例提供的图像处理系统,可以根据图像,对视频中的多个图像和音频进行生成。根据单张待处理图像,可以实现有声动态视频的生成。该视频可以作为电子设备的壁纸。在一些情况下,生成的视频可以无缝循环播放。The image processing system provided in the embodiment of the present application can generate multiple images and audio in a video based on the image. Based on a single image to be processed, the generation of a dynamic video with sound can be achieved. The video can be used as a wallpaper for an electronic device. In some cases, the generated video can be played in a seamless loop.

图像处理系统可以根据用户的指示的目标主体和目标主体的目标运动趋势进行视频的生成,实现用户对运动的编辑,提高用户的参与度,提高用户体验。根据用户的指示生成的视频,可以理解为可交互的动画。The image processing system can generate a video based on the target subject indicated by the user and the target motion trend of the target subject, so as to enable the user to edit the motion, improve the user's participation and user experience. The video generated based on the user's instructions can be understood as an interactive animation.

在训练过程中使用的训练视频可以是摄像头采集得到的。从而,根据训练视频训练得到的图像处理系统,生成的视频真实性较高,能够实现动力学仿真。应当理解,用户指示的目标运动趋势,可以理解为在目标主体上施加的外力。The training video used in the training process can be acquired by a camera. Therefore, the image processing system trained according to the training video can generate a video with high authenticity and can realize dynamic simulation. It should be understood that the target motion trend indicated by the user can be understood as an external force applied to the target body.

下面结合图9,对图3所示的图像处理方法中使用的图像处理系统的训练方法进行说明。The following describes the training method of the image processing system used in the image processing method shown in FIG. 3 in conjunction with FIG. 9 .

图9是本申请实施例提供的一种图像处理系统的训练方法的示意性流程图。图9所示的方法包括步骤S910至S930。Fig. 9 is a schematic flow chart of a training method for an image processing system provided in an embodiment of the present application. The method shown in Fig. 9 includes steps S910 to S930.

步骤S910,获取训练样本和标签图像、标签音频。Step S910, obtaining training samples, labeled images, and labeled audio.

训练样本可以包括训练视频中的样本图像,标签图像可以是训练视频中样本图像之后的图像,标签音频可以是显示训练视频中标签图像时训练视频中的音频。The training sample may include a sample image in a training video, the label image may be an image following the sample image in the training video, and the label audio may be the audio in the training video when the label image in the training video is displayed.

步骤S920,利用初始图像处理系统对训练样本进行处理,以得到训练图像和训练音频。Step S920: Process the training samples using the initial image processing system to obtain training images and training audio.

步骤S930,根据训练图像和标签图像的第一差异,以及训练音频和标签音频的第二差异,调整初始图像处理系统的参数,调整后的初始图像处理系统为训练得到的图像处理系统。Step S930: adjust the parameters of the initial image processing system according to the first difference between the training image and the label image and the second difference between the training audio and the label audio; the initial image processing system after parameter adjustment is the trained image processing system.

第一差异和第二差异,均可以利用损失值表示。示例性地,第一差异可以通过感知损失表示。感知损失可以是视觉几何组(visual geometry group,VGG)感知损失,或感知图像块相似度(learned perceptual image patch similarity,LPIPS)等。The first difference and the second difference may both be represented by a loss value. For example, the first difference may be represented by a perceptual loss. The perceptual loss may be a visual geometry group (VGG) perceptual loss, or a learned perceptual image patch similarity (LPIPS), etc.

在训练神经网络模型的过程中,因为神经网络模型的输出尽可能的接近真正想要预测的值,所以可以通过比较当前网络的预测值和真正想要的目标值,再根据两者之间的差异情况来更新每一层神经网络的权重向量(当然,在第一次更新之前通常会有初始化的过程,即为神经网络模型中的各层预先配置参数),比如,如果模型的预测值高了,就调整权重向量让它预测低一些,不断地调整,直到神经网络模型能够预测出真正想要的目标值或与真正想要的目标值非常接近的值。因此,就需要预先定义“如何比较预测值和目标值之间的差异”,这便是损失函数(loss function)或目标函数(objective function),它们是用于衡量预测值和目标值的差异的重要方程。其中,以损失函数举例,损失函数的输出值即损失值(loss)越高表示差异越大,那么神经网络模型的训练就变成了尽可能缩小这个loss的过程。In the process of training the neural network model, because the output of the neural network model is as close as possible to the value that we really want to predict, we can compare the predicted value of the current network with the target value that we really want, and then update the weight vector of each layer of the neural network according to the difference between the two (of course, there is usually an initialization process before the first update, that is, pre-configuring parameters for each layer in the neural network model). For example, if the model's predicted value is high, adjust the weight vector to make it predict lower, and keep adjusting until the neural network model can predict the target value that we really want or a value that is very close to the target value that we really want. Therefore, it is necessary to pre-define "how to compare the difference between the predicted value and the target value", which is the loss function or objective function, which are important equations used to measure the difference between the predicted value and the target value. Among them, taking the loss function as an example, the higher the output value of the loss function, that is, the loss value (loss), the greater the difference, so the training of the neural network model becomes a process of minimizing this loss as much as possible.

采用误差反向传播(back propagation,BP)算法可以在训练过程中修正初始的神经网络模型中参数的大小,使得神经网络模型的重建误差损失越来越小。具体地,前向传递输入信号直至输出会产生误差损失,通过反向传播误差损失信息来更新初始的神经网络模型中参数,从而使误差损失收敛。反向传播算法是以误差损失为主导的反向传播运动,旨在得到最优的神经网络模型的参数,例如,权重矩阵。The back propagation (BP) algorithm can be used to correct the size of the parameters in the initial neural network model during the training process, so that the reconstruction error loss of the neural network model becomes smaller and smaller. Specifically, the forward transmission of the input signal to the output will generate error loss, and the parameters in the initial neural network model are updated by backpropagating the error loss information, so that the error loss converges. The backpropagation algorithm is a backpropagation movement dominated by error loss, which aims to obtain the optimal parameters of the neural network model, such as the weight matrix.

可选地,训练样本还可以包括训练时间信息,训练时间信息用于表示训练时间间隔。标签图像可以是训练视频中在样本图像之后、且与样本图像的时间间隔为训练时间间隔的图像,标签音频可以是显示该标签图像时训练视频中的音频。Optionally, the training sample may further include training time information, where the training time information is used to indicate a training time interval. The label image may be an image in the training video that follows the sample image and whose time interval from the sample image is the training time interval, and the label audio may be the audio in the training video when that label image is displayed.

可选地,训练样本还可以包括训练主体区域信息,训练主体区域信息表示训练主体区域,在样本图像中的训练主体区域记录有训练目标主体,标签图像中在训练主体区域之外的其他区域记录的内容与所述样本图像中该其他区域记录的内容相同。Optionally, the training sample may further include training subject area information, which represents the training subject area. The training subject area in the sample image records the training target subject, and the content recorded in other areas outside the training subject area in the label image is the same as the content recorded in the other areas in the sample image.

训练目标主体可以是训练视频中运动的主体。例如,训练目标主体可以是训练视频开始时运动的主体。The training target subject may be a subject moving in the training video. For example, the training target subject may be a subject moving at the beginning of the training video.

可选地,在样本图像是训练视频中的第一帧图像的情况下,训练样本包括训练主体区域信息。Optionally, when the sample image is the first frame image in a training video, the training sample includes training subject region information.

应当理解,图像处理系统的训练样本的数量为多个。在步骤S910,可以获取每个训练样本,以及该训练样本对应的标签图像、该训练样本对应的标签音频。It should be understood that the number of training samples of the image processing system is multiple. In step S910, each training sample, as well as the label image corresponding to the training sample and the label audio corresponding to the training sample can be obtained.

可选地,训练样本还可以包括训练运动信息。在训练样本包括训练运动信息的情况下,相比于样本图像,在标签图像的训练目标主体是按照训练运动信息表示的训练运动趋势运动的。也就是说,训练目标主体的运动符合训练运动信息表示的训练运动趋势。Optionally, the training sample may further include training motion information. When the training sample includes the training motion information, compared to the sample image, the training target subject in the label image moves according to the training motion trend represented by the training motion information. In other words, the motion of the training target subject conforms to the training motion trend represented by the training motion information.

可选地,在样本图像是训练视频中的第一帧图像的情况下,训练样本包括训练运动信息。Optionally, when the sample image is the first frame image in a training video, the training sample includes training motion information.

可选地,初始图像处理系统包括初始特征预测模型、初始图像生成模型和初始音频生成模型。Optionally, the initial image processing system includes an initial feature prediction model, an initial image generation model and an initial audio generation model.

初始特征预测模型用于,对训练样本进行处理,以得到训练预测特征。The initial feature prediction model is used to process the training samples to obtain training prediction features.

初始图像生成模型用于,对训练预测特征进行处理,以得到训练图像。The initial image generation model is used to process the training prediction features to obtain the training image.

初始音频生成模型用于,对训练预测特征进行处理,以得到训练音频。The initial audio generation model is used to process the training prediction features to obtain the training audio.

参数调整后的初始特征预测模型为图像处理系统中的特征预测模型,参数调整后的初始图像生成模型为图像处理系统中的图像生成模型,参数调整后的初始音频生成模型为图像处理系统中的音频生成模型。The initial feature prediction model after parameter adjustment is the feature prediction model in the image processing system, the initial image generation model after parameter adjustment is the image generation model in the image processing system, and the initial audio generation model after parameter adjustment is the audio generation model in the image processing system.

可选地,初始特征预测模型包括运动位移场预测模型、初始运动特征提取模型、初始图像特征提取模型和调整模型。Optionally, the initial feature prediction model includes a motion displacement field prediction model, an initial motion feature extraction model, an initial image feature extraction model and an adjustment model.

运动位移场预测模型用于,对训练样本进行处理,以得到训练运动位移场,训练运动位移场表示样本图像中的多个训练像素在标签图像对应的第二训练时刻相对样本图像对应的第一训练时刻的位移。标签图像对应的第二训练时刻、样本图像对应的第一训练时刻可以分别理解为标签图像和样本图像在训练视频中的时刻。The motion displacement field prediction model is used to process the training samples to obtain the training motion displacement field, which represents the displacement of multiple training pixels in the sample image at the second training moment corresponding to the label image relative to the first training moment corresponding to the sample image. The second training moment corresponding to the label image and the first training moment corresponding to the sample image can be understood as the moments of the label image and the sample image in the training video, respectively.

初始运动特征提取模型用于,对训练运动位移场进行特征提取,以得到训练运动特征。The initial motion feature extraction model is used to extract features from the training motion displacement field to obtain the training motion features.

初始图像特征提取模型用于,对样本图像进行特征提取,以得到训练图像特征。The initial image feature extraction model is used to extract features from sample images to obtain training image features.

调整模型用于,根据训练运动特征,对训练图像特征进行调整,以得到训练预测特征。The adjustment model is used to adjust the training image features according to the training motion features to obtain the training prediction features.

参数调整后的初始运动特征提取模型为图像处理系统中的运动特征提取模型,参数调整后的初始图像特征提取模型为图像处理系统中的图像特征提取模型。The initial motion feature extraction model after parameter adjustment is the motion feature extraction model in the image processing system, and the initial image feature extraction model after parameter adjustment is the image feature extraction model in the image processing system.

运动位移场预测模型可以是预训练得到的神经网络模型。The motion displacement field prediction model may be a pre-trained neural network model.

或者,初始特征预测模型可以包括初始运动特征提取模型、初始图像特征提取模型和调整模型。Alternatively, the initial feature prediction model may include an initial motion feature extraction model, an initial image feature extraction model, and an adjustment model.

通过特征追踪、光流提取或粒子视频的方式对训练样本进行处理,可以得到训练运动位移场。通过特征追踪、光流提取或粒子视频的方式处理得到的训练运动位移场,也可以称为光流场。The training motion displacement field can be obtained by processing the training samples by means of feature tracking, optical flow extraction, or particle video. A training motion displacement field obtained through feature tracking, optical flow extraction, or particle video processing can also be called an optical flow field.

可选地,运动位移场预测模型可以包括第一变换模块、运动纹理预测模型和第二变换模块。Optionally, the motion displacement field prediction model may include a first transformation module, a motion texture prediction model and a second transformation module.

第一变换模块用于对样本图像进行傅里叶变换,以得到训练图像频域数据。The first transformation module is used to perform Fourier transformation on the sample image to obtain frequency domain data of the training image.

运动纹理预测模型用于对训练图像频域数据进行处理,以生成训练运动纹理。The motion texture prediction model is used to process the frequency domain data of the training image to generate the training motion texture.

第二变换模块用于对训练运动纹理进行逆傅里叶变换,以得到训练运动位移场。The second transformation module is used to perform an inverse Fourier transform on the training motion texture to obtain the training motion displacement field.

训练运动纹理可以理解为训练运动位移场的频域表示。通过逆傅里叶变换,可以将频域数据转换为时域数据。The training motion texture can be understood as the frequency-domain representation of the training motion displacement field. Through the inverse Fourier transform, frequency-domain data can be converted into time-domain data.

如上所述，运动位移场预测模型可以是预训练得到的。As described above, the motion displacement field prediction model may be obtained through pre-training.

可选地,运动纹理预测模型可以包括压缩模型和运动纹理生成模型。也就是说,运动纹理预测模型可以是LDM模型。Optionally, the motion texture prediction model may include a compression model and a motion texture generation model. That is, the motion texture prediction model may be an LDM model.

压缩模型用于对训练图像频域数据进行压缩,以得到训练图像压缩数据。The compression model is used to compress the training image frequency domain data to obtain training image compressed data.

运动纹理生成模型用于对训练图像压缩数据进行处理,以生成训练运动纹理。The motion texture generation model is used to process the training image compression data to generate training motion texture.

在训练样本包括训练主体区域信息的情况下,运动纹理生成模型用于对训练图像压缩数据和训练主体区域信息进行处理,以生成训练运动纹理。In the case where the training sample includes training subject region information, the motion texture generation model is used to process the training image compression data and the training subject region information to generate training motion texture.

在训练样本包括训练主体区域信息和训练运动信息的情况下,运动纹理生成模型用于对训练图像压缩数据、训练主体区域信息和训练运动信息进行处理,以生成训练运动纹理。In the case where the training sample includes training subject region information and training motion information, the motion texture generation model is used to process the training image compression data, the training subject region information and the training motion information to generate training motion texture.

运动位移场预测模型的预训练,可以包括对压缩模型的第一预训练和对运动纹理生成模型的第二预训练。The pre-training of the motion displacement field prediction model may include a first pre-training of the compression model and a second pre-training of the motion texture generation model.

第一预训练可以利用第一预训练数据训练得到压缩模型。在第一预训练过程中,可以利用初始压缩模型对第一预训练数据进行压缩处理,以得到预训练压缩数据,并可以利用初始解压缩模型对预训练压缩数据进行解压缩,以得到预训练解压缩数据。预训练压缩数据的数据量小于第一预训练数据的数据量。在第一预训练过程中,根据预训练解压缩数据和第一预训练数据之间的差异,可以调整初始压缩模型和初始解压缩模型的参数,参数调整后的初始压缩模型可以是运动纹理预测模型中的压缩模型。The first pre-training can be performed using the first pre-training data to obtain a compression model. During the first pre-training process, the first pre-training data can be compressed using the initial compression model to obtain pre-training compressed data, and the pre-training compressed data can be decompressed using the initial decompression model to obtain pre-training decompressed data. The amount of data of the pre-training compressed data is less than the amount of data of the first pre-training data. During the first pre-training process, the parameters of the initial compression model and the initial decompression model can be adjusted according to the difference between the pre-training decompressed data and the first pre-training data, and the initial compression model after the parameter adjustment can be the compression model in the motion texture prediction model.
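下面给出第一预训练的一个极简示意：用一对线性压缩/解压缩矩阵根据重建误差调整参数。矩阵形式、尺寸与学习率均为说明性假设，并非本申请的实际模型。A minimal sketch of the first pre-training: a linear compression/decompression pair adjusted from the reconstruction error; the linear form, the sizes, and the learning rate are illustrative assumptions, not the application's actual model.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 16, 4                               # k < d: compressed data is smaller
W_enc = 0.1 * rng.standard_normal((k, d))  # initial compression model (assumed linear)
W_dec = 0.1 * rng.standard_normal((d, k))  # initial decompression model (assumed linear)
x = rng.standard_normal(d)                 # one item of first pre-training data

def recon_loss():
    z = W_enc @ x              # pre-training compressed data (smaller than x)
    x_hat = W_dec @ z          # pre-training decompressed data
    return 0.5 * np.sum((x_hat - x) ** 2)

loss_before = recon_loss()
for _ in range(100):           # adjust both models from the reconstruction error
    z = W_enc @ x
    err = W_dec @ z - x
    W_dec -= 0.01 * np.outer(err, z)
    W_enc -= 0.01 * np.outer(W_dec.T @ err, x)

assert recon_loss() < loss_before
```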

其中,第一预训练数据可以是图像,第一预训练数据也可以是对图像进行傅里叶变换得到的频域数据。The first pre-training data may be an image, or may be frequency domain data obtained by performing Fourier transform on the image.

应当理解,第一预训练过程中,第一预训练数据的数量可以是一个或多个。It should be understood that in the first pre-training process, the number of the first pre-training data may be one or more.

第二预训练的预训练数据可以包括预训练图像频域数据和标签运动纹理。预训练图像频域数据可以是对预训练视频中的预训练图像进行傅里叶变换得到的。标签运动纹理可以是对预训练位移场进行傅里叶变换得到的，也就是说，标签运动纹理可以理解为预训练位移场的频域表示。预训练位移场表示预训练视频中预训练图像中多个像素在预训练视频中预训练图像之后的至少一个图像中的位移。预训练位移场可以是利用粒子视频的方式对预训练视频进行处理得到的。The pre-training data of the second pre-training may include pre-training image frequency-domain data and a label motion texture. The pre-training image frequency-domain data may be obtained by performing a Fourier transform on a pre-training image in a pre-training video. The label motion texture may be obtained by performing a Fourier transform on a pre-training displacement field; that is, the label motion texture can be understood as the frequency-domain representation of the pre-training displacement field. The pre-training displacement field represents the displacements of multiple pixels of the pre-training image in at least one image after the pre-training image in the pre-training video. The pre-training displacement field may be obtained by processing the pre-training video by means of particle video.

应当理解，第二预训练过程中使用的预训练数据的数量可以是一个或多个。It should be understood that the number of pre-training data used in the second pre-training process may be one or more.

第二预训练过程中,可以利用压缩模型对预训练图像频域数据进行压缩以得到预训练图像压缩数据,利用初始运动纹理生成模型对预训练图像压缩数据进行处理,以得到预训练运动纹理。根据预训练运动纹理和标签运动纹理之间的差异,调整初始运动纹理生成模型的参数,参数调整后的初始运动纹理生成模型可以是运动纹理预测模型中的运动纹理生成模型。In the second pre-training process, the pre-training image frequency domain data can be compressed using the compression model to obtain pre-training image compressed data, and the pre-training image compressed data can be processed using the initial motion texture generation model to obtain the pre-training motion texture. According to the difference between the pre-training motion texture and the label motion texture, the parameters of the initial motion texture generation model are adjusted, and the initial motion texture generation model after parameter adjustment can be the motion texture generation model in the motion texture prediction model.

训练数据可以包括或不包括训练主体区域信息。在对运动纹理生成模型的第二预训练的过程中，可以考虑主体区域的影响，或者主体区域和运动趋势两者的影响。The training data may or may not include the training subject region information. In the second pre-training of the motion texture generation model, the influence of the subject region, or of both the subject region and the motion trend, may be taken into account.

第二预训练的预训练数据可以包括预训练主体区域信息。预训练主体区域信息表示预训练图像中预训练主体所在的预训练主体区域。在标签运动纹理表示的预训练位移场中,预训练主体区域之外的像素的位移为0。The pre-training data of the second pre-training may include pre-training subject area information. The pre-training subject area information indicates a pre-training subject area where the pre-training subject is located in the pre-training image. In the pre-training displacement field represented by the label motion texture, the displacement of pixels outside the pre-training subject area is 0.

第二预训练的预训练数据也可以包括预训练主体区域信息和预训练运动信息。预训练运动信息表示预训练主体的预训练运动趋势。在标签运动纹理表示的预训练位移场中，预训练主体区域中的像素的位移符合预训练主体的预训练运动趋势。The pre-training data of the second pre-training may also include pre-training subject region information and pre-training motion information. The pre-training motion information indicates the pre-training motion trend of the pre-training subject. In the pre-training displacement field represented by the label motion texture, the displacements of the pixels in the pre-training subject region conform to the pre-training motion trend of the pre-training subject.

可选地,运动纹理生成模型可以包括扩散模型和解压缩模型。解压缩模型可以是经过第一预训练调整后的初始解压缩模型。Optionally, the motion texture generation model may include a diffusion model and a decompression model. The decompression model may be an initial decompression model adjusted through the first pre-training.

运动纹理生成模型用于根据训练图像压缩数据，对压缩后的频域噪声数据进行多次去噪处理，以得到训练压缩运动纹理。The motion texture generation model is used to perform multiple denoising passes on the compressed frequency-domain noise data, conditioned on the training image compressed data, to obtain the training compressed motion texture.

压缩后的频域噪声数据可以是压缩模型对频域噪声数据进行压缩得到的,也可以是预设的。The compressed frequency domain noise data may be obtained by compressing the frequency domain noise data with a compression model, or may be preset.

解压缩模型可以用于对训练压缩运动纹理进行解压缩,以得到训练运动纹理。The decompression model can be used to decompress the training compressed motion texture to obtain the training motion texture.

初始运动纹理生成模型可以包括解压缩模型和初始扩散模型。在对运动纹理生成模型的第二预训练的过程中，还可以利用压缩模型对频域噪声数据进行压缩，以得到压缩后的频域噪声数据。初始运动纹理生成模型可以用于根据预训练图像压缩数据，对压缩后的频域噪声数据进行多次去噪处理，以得到预训练压缩运动纹理。解压缩模型可以用于对预训练压缩运动纹理进行解压缩，以得到预训练运动纹理。The initial motion texture generation model may include the decompression model and an initial diffusion model. In the second pre-training of the motion texture generation model, the frequency-domain noise data may further be compressed using the compression model to obtain compressed frequency-domain noise data. The initial motion texture generation model may be used to perform multiple denoising passes on the compressed frequency-domain noise data according to the pre-training image compressed data, to obtain a pre-training compressed motion texture. The decompression model may be used to decompress the pre-training compressed motion texture to obtain the pre-training motion texture.
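下面给出"多次去噪处理"的一个纯示意流程，其中 denoise_step 为虚构的占位去噪器，并非本申请训练得到的扩散模型。A purely schematic version of the "multiple denoising passes", where denoise_step is a made-up placeholder denoiser, not the trained diffusion model of this application:

```python
import numpy as np

rng = np.random.default_rng(0)
cond = rng.standard_normal(8)   # stands in for the compressed image data (condition)
z = rng.standard_normal(8)      # compressed frequency-domain noise data

def denoise_step(z, cond, t):
    # Toy denoiser: move the latent a fraction of the way toward a
    # conditioning-dependent target (here, the condition itself).
    return z + (cond - z) / (t + 1)

for t in reversed(range(10)):   # t = 9, 8, ..., 0: multiple denoising passes
    z = denoise_step(z, cond, t)

# At t = 0 the toy step lands exactly on the target latent.
assert np.allclose(z, cond)
```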

通过本申请实施例提供的方法训练得到的图像处理系统可以应用在图3所示的图像处理方法中。The image processing system trained by the method provided in the embodiment of the present application can be applied to the image processing method shown in FIG. 3 .

上文结合图3至图9,详细描述了本申请实施例的图像处理方法,下面将结合图10和图11,详细描述本申请的装置实施例。应理解,本申请实施例中的图像处理装置可以执行前述本申请实施例的各种图像处理方法,即以下各种产品的具体工作过程,可以参考前述方法实施例中的对应过程。The above describes in detail the image processing method of the embodiment of the present application in conjunction with Figures 3 to 9, and the device embodiment of the present application will be described in detail below in conjunction with Figures 10 and 11. It should be understood that the image processing device in the embodiment of the present application can perform the various image processing methods of the aforementioned embodiments of the present application, that is, the specific working processes of the following various products can refer to the corresponding processes in the aforementioned method embodiments.

图10是本申请实施例提供的一种系统架构的示意性结构图。FIG. 10 is a schematic structural diagram of a system architecture provided in an embodiment of the present application.

在图10所示的系统架构1100中，数据采集设备1160用于采集训练数据。本申请实施例中训练数据包括：训练样本、标签图像和标签音频。训练样本可以包括训练视频中的样本图像。标签图像可以是训练视频中样本图像之后的图像，标签音频可以是显示训练视频中标签图像时训练视频中的音频。In the system architecture 1100 shown in FIG. 10, the data acquisition device 1160 is used to collect training data. In the embodiment of the present application, the training data includes training samples, label images, and label audio. The training samples may include sample images in a training video. The label image may be an image after the sample image in the training video, and the label audio may be the audio in the training video when the label image is displayed.

数据采集设备1160还可以用于采集第一预训练数据和第二预训练数据。The data collection device 1160 may also be used to collect first pre-training data and second pre-training data.

本申请实施例中，第一预训练数据可以是图像，也可以是对图像进行傅里叶变换得到的频域数据。In the embodiment of the present application, the first pre-training data may be an image, or may be frequency-domain data obtained by performing a Fourier transform on an image.

本申请实施例中第二预训练数据可以包括预训练图像频域数据和标签运动纹理。第二预训练数据还可以包括预训练主体区域信息。第二预训练数据还可以包括预训练运动信息。In the embodiment of the present application, the second pre-training data may include pre-training image frequency domain data and label motion texture. The second pre-training data may also include pre-training subject area information. The second pre-training data may also include pre-training motion information.

将训练数据存入数据库1130,训练设备1120基于数据库1130中维护的训练数据训练得到目标模型/规则1101。The training data is stored in the database 1130 , and the training device 1120 obtains the target model/rule 1101 through training based on the training data maintained in the database 1130 .

在本申请提供的实施例中,该图像处理系统是通过训练得到的。训练设备1120如何基于训练数据得到目标模型/规则1101的详细介绍可以参见图9的说明。也就是说,训练设备1120可以用于执行图9所示的图像处理系统的训练方法。In the embodiment provided in the present application, the image processing system is obtained through training. A detailed description of how the training device 1120 obtains the target model/rule 1101 based on the training data can be found in the description of FIG9 . That is, the training device 1120 can be used to execute the training method of the image processing system shown in FIG9 .

目标模型/规则1101可以是图4所示的图像处理系统400。该目标模型/规则1101能够用于实现本申请实施例提供图3所示的图像处理方法,即将待处理图像输入目标模型/规则1101,可以得到目标视频。本申请实施例中的目标模型/规则1101具体可以为图像处理系统。The target model/rule 1101 may be the image processing system 400 shown in FIG4. The target model/rule 1101 may be used to implement the image processing method shown in FIG3 provided in the embodiment of the present application, that is, the image to be processed is input into the target model/rule 1101, and the target video may be obtained. The target model/rule 1101 in the embodiment of the present application may specifically be an image processing system.

需要说明的是,在实际的应用中,所述数据库1130中维护的训练数据不一定都来自于数据采集设备1160的采集,也有可能是从其他设备接收得到的。另外需要说明的是,训练设备1120也不一定完全基于数据库1130维护的训练数据进行目标模型/规则1101的训练,也有可能从云端或其他地方获取训练数据进行模型训练,上述描述不应该作为对本申请实施例的限定。It should be noted that, in actual applications, the training data maintained in the database 1130 may not all come from the data acquisition device 1160, but may also be received from other devices. It should also be noted that the training device 1120 may not train the target model/rule 1101 entirely based on the training data maintained by the database 1130, but may also obtain training data from the cloud or other places for model training. The above description should not be used as a limitation on the embodiments of the present application.

根据训练设备1120训练得到的目标模型/规则1101可以应用于不同的系统或设备中,如应用于图10所示的执行设备1110,所述执行设备1110可以是终端,如手机终端,平板电脑,笔记本电脑,AR/VR,车载终端等,还可以是服务器或者云端等。在附图10中,执行设备1110配置有I/O接口1112,用于与外部设备进行数据交互,用户可以通过客户设备1140向I/O接口1112输入数据,所述输入数据在本申请实施例中可以包括待处理图像,还可以包括待处理图像中用户指示的目标主体、用户指示的目标主体的目标运动趋势等。The target model/rule 1101 obtained by training with the training device 1120 can be applied to different systems or devices, such as the execution device 1110 shown in FIG. 10 , which can be a terminal such as a mobile phone terminal, a tablet computer, a laptop computer, AR/VR, a vehicle terminal, etc., or a server or a cloud, etc. In FIG. 10 , the execution device 1110 is configured with an I/O interface 1112 for data interaction with an external device, and the user can input data to the I/O interface 1112 through the client device 1140. The input data can include the image to be processed in the embodiment of the present application, and can also include the target subject indicated by the user in the image to be processed, the target motion trend of the target subject indicated by the user, etc.

预处理模块1113和预处理模块1114用于根据I/O接口1112接收到的输入数据(如待处理图像)进行预处理。The preprocessing module 1113 and the preprocessing module 1114 are used to perform preprocessing according to the input data (such as the image to be processed) received by the I/O interface 1112 .

在本申请实施例中,也可以没有预处理模块1113和预处理模块1114(也可以只有其中的一个预处理模块),而直接采用计算模块1111对输入数据进行处理。In the embodiment of the present application, the preprocessing module 1113 and the preprocessing module 1114 may be omitted (or only one of the preprocessing modules may be provided), and the calculation module 1111 may be directly used to process the input data.

在执行设备1110对输入数据进行预处理,或者在执行设备1110的计算模块1111执行计算等相关的处理过程中,执行设备1110可以调用数据存储系统1150中的数据、代码等以用于相应的处理,也可以将相应处理得到的数据、指令等存入数据存储系统1150中。When the execution device 1110 preprocesses the input data, or when the computing module 1111 of the execution device 1110 performs calculations and other related processing, the execution device 1110 can call the data, code, etc. in the data storage system 1150 for corresponding processing, and can also store the data, instructions, etc. obtained from the corresponding processing into the data storage system 1150.

最后，I/O接口1112将处理结果，如上述得到的目标视频，返回给客户设备1140，从而提供给用户。Finally, the I/O interface 1112 returns the processing result, such as the target video obtained above, to the client device 1140, so as to provide it to the user.

值得说明的是,训练设备1120可以针对不同的目标或称不同的任务,基于不同的训练数据生成相应的目标模型/规则1101,该相应的目标模型/规则1101即可以用于实现上述目标或完成上述任务,从而为用户提供所需的结果。It is worth noting that the training device 1120 can generate corresponding target models/rules 1101 based on different training data for different goals or different tasks. The corresponding target models/rules 1101 can be used to achieve the above goals or complete the above tasks, thereby providing users with the desired results.

在附图10中所示情况下,用户可以手动给定输入数据,该手动给定可以通过I/O接口1112提供的界面进行操作。另一种情况下,客户设备1140可以自动地向I/O接口1112发送输入数据,如果要求客户设备1140自动发送输入数据需要获得用户的授权,则用户可以在客户设备1140中设置相应权限。用户可以在客户设备1140查看执行设备1110输出的结果,具体的呈现形式可以是显示、声音、动作等具体方式。客户设备1140也可以作为数据采集端,采集如图所示输入I/O接口1112的输入数据及输出I/O接口1112的输出结果作为新的样本数据,并存入数据库1130。当然,也可以不经过客户设备1140进行采集,而是由I/O接口1112直接将如图所示输入I/O接口1112的输入数据及输出I/O接口1112的输出结果,作为新的样本数据存入数据库1130。In the case shown in FIG. 10 , the user can manually give input data, and the manual giving can be operated through the interface provided by the I/O interface 1112. In another case, the client device 1140 can automatically send input data to the I/O interface 1112. If the client device 1140 is required to automatically send input data and needs to obtain the user's authorization, the user can set the corresponding authority in the client device 1140. The user can view the results output by the execution device 1110 on the client device 1140, and the specific presentation form can be a specific method such as display, sound, action, etc. The client device 1140 can also be used as a data acquisition terminal to collect the input data of the input I/O interface 1112 and the output results of the output I/O interface 1112 as shown in the figure as new sample data, and store them in the database 1130. Of course, it is also possible not to collect through the client device 1140, but the I/O interface 1112 directly stores the input data of the input I/O interface 1112 and the output results of the output I/O interface 1112 as new sample data in the database 1130.

图11至图14示出了本申请实施例提供的图像生成方法的应用场景。11 to 14 illustrate application scenarios of the image generation method provided in the embodiments of the present application.

图11中的(a)示出了电子设备的一种图形用户界面(graphical user interface，GUI)，该电子设备可以是客户设备1140，该GUI为电子设备的桌面1010。在检测到用户点击桌面1010上的相册应用(application，APP)的图标1011的启动操作的情况下，电子设备可以启动相册应用，显示如图11中的(b)所示的另一GUI。图11中的(b)所示的GUI可以称为第一相册界面1020。相册界面1020可以包括多个缩略图。FIG. 11(a) shows a graphical user interface (GUI) of an electronic device, which may be the client device 1140; the GUI is a desktop 1010 of the electronic device. When detecting a launch operation in which the user taps the icon 1011 of the album application (APP) on the desktop 1010, the electronic device may launch the album application and display another GUI as shown in FIG. 11(b). The GUI shown in FIG. 11(b) may be referred to as a first album interface 1020. The album interface 1020 may include multiple thumbnails.

在检测到用户点击相册界面1020上的任一个缩略图的情况下,电子设备可以显示图12中的(a)所示的GUI,该GUI可以称为第二相册界面1030。第二相册界面1030可以包括视频生成图标1031和用户点击的缩略图对应的待处理图像1032。在检测到用户点击视频生成图标1031的情况下,可以显示如图12中的(b)所示的视频编辑界面1040。视频编辑界面1040包括待处理图像1032,以及包括文字信息“请选取目标,并指示目标的运动”。When it is detected that the user clicks on any thumbnail on the album interface 1020, the electronic device may display the GUI shown in (a) of FIG. 12, which may be referred to as the second album interface 1030. The second album interface 1030 may include a video generation icon 1031 and an image to be processed 1032 corresponding to the thumbnail clicked by the user. When it is detected that the user clicks on the video generation icon 1031, a video editing interface 1040 as shown in (b) of FIG. 12 may be displayed. The video editing interface 1040 includes the image to be processed 1032, and includes the text message "Please select a target and indicate the movement of the target".

用户手指可以轻触目标主体并在屏幕上进行滑动。用户轻触的位置所在的主体为目标主体。The user may touch the target subject with a finger and slide on the screen. The subject at the position touched by the user is the target subject.

用户手指滑动的方向可以是目标主体的起始运动方向。或者,用户手指停止滑动的位置可以是目标主体运动幅度最大时的位置。又或者,用户手指滑动的速度为目标主体的起始速度或最大速度。目标主体的目标运动趋势可以包括目标主体的起始运动方向、运动幅度最大时的位置、起始速度或最大速度等中的一个或多个。The direction in which the user's finger slides may be the target subject's starting direction of motion. Alternatively, the position where the user's finger stops sliding may be the target subject's position at the maximum motion amplitude. Alternatively, the speed at which the user's finger slides may be the target subject's starting speed or maximum speed. The target motion trend of the target subject may include one or more of the target subject's starting direction of motion, the position at the maximum motion amplitude, the starting speed or the maximum speed, etc.
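下面给出从滑动采样点推导目标运动趋势的一个示意实现，函数名与字段名均为假设，并非本申请限定的实现方式。An illustrative sketch of deriving the target motion trend from finger-slide sample points; the function and field names are hypothetical assumptions, not an implementation defined by this application:

```python
import math

def motion_trend(samples):
    # samples: list of (x, y, t) touch samples from the slide gesture
    (x0, y0, t0), (x1, y1, t1) = samples[0], samples[-1]
    dx, dy = x1 - x0, y1 - y0
    return {
        "start_direction": math.atan2(dy, dx),   # starting motion direction
        "end_position": (x1, y1),                # position at maximum amplitude
        "speed": math.hypot(dx, dy) / (t1 - t0)  # average slide speed
    }

trend = motion_trend([(10, 10, 0.0), (40, 50, 0.5)])
assert trend["speed"] == 100.0
assert trend["end_position"] == (40, 50)
```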

在检测到用户手指的滑动结束的情况下,电子设备可以执行图3所示的方法,生成目标视频。在目标视频生成之后,电子设备可以显示图13所示的视频界面1050。视频界面1050包括目标视频的播放图标1051。在检测到用户点击播放图标1051的情况下,电子设备可以播放目标视频。When the sliding of the user's finger is detected to end, the electronic device can execute the method shown in FIG3 to generate the target video. After the target video is generated, the electronic device can display the video interface 1050 shown in FIG13. The video interface 1050 includes a play icon 1051 of the target video. When the user clicks the play icon 1051, the electronic device can play the target video.

在另一些实施例中,用户也可以通过语音或写入的方式,输入目标主体的目标运动趋势。In other embodiments, the user may also input the target movement trend of the target subject by voice or writing.

图14示出了电子设备的一种图形用户界面(graphical user interface,GUI),该GUI为电子设备的锁屏界面1410。锁屏界面1410显示的壁纸图像可以理解为待处理图像。Fig. 14 shows a graphical user interface (GUI) of an electronic device, where the GUI is a lock screen interface 1410 of the electronic device. The wallpaper image displayed on the lock screen interface 1410 can be understood as an image to be processed.

在检测到用户点击待处理图像中的任一个主体的情况下,电子设备可以将用户点击的主体作为目标主体,执行图3所示的方法,生成目标视频,并显示目标视频。When it is detected that the user clicks on any subject in the image to be processed, the electronic device may take the subject clicked by the user as the target subject, execute the method shown in FIG. 3 , generate a target video, and display the target video.

或者，在检测到用户在显示屏上的滑动结束的情况下，电子设备可以将滑动起始位置所在的主体作为目标主体，根据该滑动操作表示的目标运动趋势，生成目标视频，并显示目标视频。Alternatively, when detecting that the user's slide on the display ends, the electronic device may take the subject at the starting position of the slide as the target subject, generate the target video according to the target motion trend represented by the slide operation, and display the target video.

从而，通过本申请提供的图像处理方法，能够实现生成个性化动态壁纸的应用，并且能够生成与动态壁纸中的各个图像对应的音频并同时播放。Thus, with the image processing method provided in the present application, an application that generates personalized dynamic wallpapers can be implemented, and audio corresponding to each image in the dynamic wallpaper can be generated and played at the same time.

动态壁纸中进行运动的目标主体可以是用户指示的,目标主体的运动可以是根据用户的指示进行的。The target subject moving in the dynamic wallpaper may be instructed by the user, and the movement of the target subject may be performed according to the instruction of the user.

也就是说,用户不需要太多的专业知识,只需要从图库中选择一张精美图片作为壁纸,点击图片中自己感兴趣的运动主体,并适当拖动该主体,就可以看到期望的动态有声壁纸。In other words, users do not need too much professional knowledge. They only need to select a beautiful picture from the gallery as the wallpaper, click on the moving subject they are interested in in the picture, and drag the subject appropriately to see the desired dynamic sound wallpaper.

值得注意的是,附图10仅是本发明实施例提供的一种系统架构的示意图,图中所示设备、器件、模块等之间的位置关系不构成任何限制,例如,在附图10中,数据存储系统1150相对执行设备1110是外部存储器,在其它情况下,也可以将数据存储系统1150置于执行设备1110中。It is worth noting that FIG10 is only a schematic diagram of a system architecture provided by an embodiment of the present invention, and the positional relationship between the devices, components, modules, etc. shown in the figure does not constitute any limitation. For example, in FIG10, the data storage system 1150 is an external memory relative to the execution device 1110. In other cases, the data storage system 1150 can also be placed in the execution device 1110.

如图10所示,根据训练设备1120训练得到目标模型/规则1101,该目标模型/规则1101在本申请实施例中可以是图像处理系统。具体的,本申请实施例提供的图像处理系统可以包括特征预测模型、图像生成模型和音频生成模型。特征预测模型包括运动位移场预测模型、运动特征提取模型、图像特征提取模型和调整模型。图像生成模型和音频生成模型,以及特征预测模型中的运动位移场预测模型、运动特征提取模型、图像特征提取模型,可以均为卷积神经网络。As shown in Figure 10, the target model/rule 1101 is obtained by training according to the training device 1120. The target model/rule 1101 can be an image processing system in the embodiment of the present application. Specifically, the image processing system provided in the embodiment of the present application may include a feature prediction model, an image generation model, and an audio generation model. The feature prediction model includes a motion displacement field prediction model, a motion feature extraction model, an image feature extraction model, and an adjustment model. The image generation model and the audio generation model, as well as the motion displacement field prediction model, the motion feature extraction model, and the image feature extraction model in the feature prediction model, can all be convolutional neural networks.

如前文的基础概念介绍所述,卷积神经网络是一种带有卷积结构的深度神经网络,是一种深度学习(deep learning)架构,深度学习架构是指通过机器学习的算法,在不同的抽象层级上进行多个层次的学习。作为一种深度学习架构,CNN是一种前馈(feed-forward)人工神经网络,该前馈人工神经网络中的各个神经元可以对输入其中的数据作出响应。As mentioned in the previous basic concept introduction, convolutional neural network is a deep neural network with convolution structure, which is a deep learning architecture. Deep learning architecture refers to multiple levels of learning at different abstract levels through machine learning algorithms. As a deep learning architecture, CNN is a feed-forward artificial neural network, in which each neuron can respond to the data input into it.

运动位移场预测模型可以是预训练得到的。训练设备1120基于数据库1130中维护的第一预训练数据和第二预训练数据,可以训练得到运动位移场预测模型。The motion displacement field prediction model may be obtained by pre-training. The training device 1120 may train the motion displacement field prediction model based on the first pre-training data and the second pre-training data maintained in the database 1130 .

图15是本申请实施例提供的图像处理装置的示意图。FIG. 15 is a schematic diagram of an image processing device provided in an embodiment of the present application.

图像处理装置1500包括获取单元1510和处理单元1520。The image processing device 1500 includes an acquisition unit 1510 and a processing unit 1520 .

在一些实施例中,图像处理装置1500可以执行图3所示的图像处理方法。图像处理装置1500可以是执行设备1110。In some embodiments, the image processing apparatus 1500 may execute the image processing method shown in FIG3 . The image processing apparatus 1500 may be the execution device 1110 .

获取单元1510用于,获取待处理图像。The acquisition unit 1510 is used to acquire the image to be processed.

处理单元1520用于,根据所述待处理图像,生成目标视频中的N个目标图像,以及所述目标视频中的N个目标音频,所述N个目标图像与所述N个目标音频一一对应,N为大于1的整数,每个目标音频用于在所述目标音频对应的目标图像被显示的情况下播放。The processing unit 1520 is used to generate N target images in the target video and N target audios in the target video according to the image to be processed, wherein the N target images correspond one-to-one to the N target audios, N is an integer greater than 1, and each target audio is used to be played when the target image corresponding to the target audio is displayed.

可选地,处理单元1520用于,利用图像处理系统依次对多个第一图像进行处理,以得到每个第一图像对应的至少一个第二图像和每个第二图像对应的所述目标音频,所述多个目标图像包括每个第一图像对应的所述至少一个第二图像,每个第一图像对应的所述至少一个第二图像均为所述目标视频中所述第一图像之后的图像,所述图像处理系统处理的第一个所述第一图像为所述待处理图像,所述图像处理系统处理的第i个所述第一图像为所述图像处理系统处理的第i-1个所述第一图像对应的至少一个第二图像中的图像,i为大于1的正整数,所述图像处理系统包括训练得到的神经网络模型。Optionally, the processing unit 1520 is used to use an image processing system to process multiple first images in sequence to obtain at least one second image corresponding to each first image and the target audio corresponding to each second image, the multiple target images include the at least one second image corresponding to each first image, the at least one second image corresponding to each first image is an image after the first image in the target video, the first first image processed by the image processing system is the image to be processed, the i-th first image processed by the image processing system is an image in the at least one second image corresponding to the i-1-th first image processed by the image processing system, i is a positive integer greater than 1, and the image processing system includes a trained neural network model.
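上述自回归式处理流程可示意如下，其中 process() 为图像处理系统的虚构占位实现：第一个第一图像为待处理图像，之后每个第一图像取自上一步输出的第二图像。The autoregressive processing flow described above can be sketched as follows, where process() is a made-up placeholder for the image processing system: the first "first image" is the image to be processed, and each later first image is taken from the previous step's second images.

```python
def process(first_image):
    # Placeholder for the image processing system:
    # returns (second images, one target audio per second image).
    second_images = [first_image + 1, first_image + 2]   # toy "frames"
    target_audios = [f"audio-{img}" for img in second_images]
    return second_images, target_audios

def generate_target_video(image_to_process, steps):
    frames, audios = [], []
    first_image = image_to_process
    for _ in range(steps):
        seconds, step_audios = process(first_image)
        frames.extend(seconds)
        audios.extend(step_audios)
        first_image = seconds[-1]   # i-th first image from step i-1's output
    return frames, audios

frames, audios = generate_target_video(0, 3)
assert frames == [1, 2, 3, 4, 5, 6]
assert len(audios) == len(frames)   # one target audio per target image
```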

可选地,获取单元1510还用于,获取所述待处理图像中用户指示的目标主体。Optionally, the acquisition unit 1510 is further configured to acquire a target subject indicated by a user in the image to be processed.

在所述图像处理系统处理的所述第一图像为所述待处理图像的情况下,所述图像处理系统用于对所述第一图像和目标主体区域信息进行处理,以得到所述第一图像对应的至少一个第二图像,以及每个第二图像对应的所述目标音频,所述目标主体区域信息表示目标主体区域,在所述第一图像中的所述目标主体区域记录有所述目标主体,每个第二图像中所述目标主体区域之外的其他区域记录的内容与所述第一图像中所述其他区域记录的内容相同。In the case where the first image processed by the image processing system is the image to be processed, the image processing system is used to process the first image and target subject area information to obtain at least one second image corresponding to the first image, and the target audio corresponding to each second image, the target subject area information represents a target subject area, the target subject is recorded in the target subject area in the first image, and the contents recorded in other areas outside the target subject area in each second image are the same as the contents recorded in other areas in the first image.

可选地,在所述图像处理系统处理的所述第一图像为所述待处理图像的情况下,所述图像处理系统用于对所述第一图像和目标主体区域信息进行处理,以得到所述第一图像对应的至少一个第二图像,以及每个第二图像对应的所述目标音频。Optionally, when the first image processed by the image processing system is the image to be processed, the image processing system is used to process the first image and the target subject area information to obtain at least one second image corresponding to the first image, and the target audio corresponding to each second image.

可选地,获取单元1510还用于,获取所述用户指示的所述目标主体的目标运动趋势。Optionally, the acquisition unit 1510 is further configured to acquire a target motion trend of the target subject indicated by the user.

在所述图像处理系统处理的所述第一图像为所述待处理图像的情况下,所述图像处理系统用于对所述第一图像、目标主体区域信息和运动信息进行处理,以得到所述至少一个第二时刻中每个第二时刻对应的所述目标图像,以及每个第二时刻对应的所述目标音频,在所述至少一个第二时刻对应的至少一个所述目标图像中所述目标主体是按照所述运动信息表示的所述目标运动趋势进行运动的。In the case where the first image processed by the image processing system is the image to be processed, the image processing system is used to process the first image, target subject area information and motion information to obtain the target image corresponding to each second moment in the at least one second moment, and the target audio corresponding to each second moment, wherein the target subject in at least one target image corresponding to the at least one second moment moves according to the target motion trend represented by the motion information.

可选地,所述图像处理系统包括特征预测模型、图像生成模型和音频生成模型。Optionally, the image processing system includes a feature prediction model, an image generation model and an audio generation model.

所述特征预测模型用于,对第一图像进行处理,以得到至少一个预测特征,不同的预测特征对应的第二时刻不同,每个第二时刻均是所述目标视频中所述第一图像对应的第一时刻之后的时刻。The feature prediction model is used to process the first image to obtain at least one prediction feature, different prediction features correspond to different second moments, and each second moment is a moment after the first moment corresponding to the first image in the target video.

所述图像生成模型用于,分别对所述至少一个所述预测特征进行处理,以得到每个第二时刻对应的第二图像。The image generation model is used to process the at least one prediction feature respectively to obtain a second image corresponding to each second moment.

所述音频生成模型用于,分别对所述至少一个所述预测特征进行处理,以得到每个第二时刻对应的目标音频。The audio generation model is used to process the at least one predicted feature respectively to obtain the target audio corresponding to each second moment.

可选地,所述特征预测模型包括运动位移场预测模型、运动特征提取模型、图像特征提取模型和调整模型。Optionally, the feature prediction model includes a motion displacement field prediction model, a motion feature extraction model, an image feature extraction model and an adjustment model.

所述运动位移场预测模型用于,对所述第一图像进行处理,以得到每个第二时刻对应的运动位移场,每个第二时刻对应的所述运动位移场表示所述第一图像中的多个像素在所述第二时刻相对所述第一时刻的位移。The motion displacement field prediction model is used to process the first image to obtain a motion displacement field corresponding to each second moment, and the motion displacement field corresponding to each second moment represents the displacement of multiple pixels in the first image at the second moment relative to the first moment.

所述运动特征提取模型用于,分别对所述至少一个运动位移场进行特征提取,以得到每个第二时刻对应的运动特征。The motion feature extraction model is used to perform feature extraction on the at least one motion displacement field respectively to obtain the motion feature corresponding to each second moment.

所述图像特征提取模型用于,对所述第一图像进行特征提取,以得到图像特征。The image feature extraction model is used to extract features from the first image to obtain image features.

所述调整模型用于,根据每个第二时刻对应的所述运动特征,对所述图像特征进行调整,以得到所述第二时刻对应的所述预测特征。The adjustment model is used to adjust the image feature according to the motion feature corresponding to each second moment, so as to obtain the prediction feature corresponding to the second moment.

Optionally, the acquisition unit 1510 is further configured to acquire a target subject indicated by a user in the image to be processed.

The motion displacement field prediction model is configured to process the first image and target subject region information to obtain the motion displacement field corresponding to each second moment, where the target subject region information represents a target subject region, and the target subject is recorded in the target subject region of the image to be processed.

The motion displacement field corresponding to each second moment indicates that the displacement of out-of-region pixels is 0, where the out-of-region pixels are pixels of the first image located outside the target subject region.
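The zeroing of displacements outside the target subject region can be sketched as a simple mask multiply; this is a hypothetical illustration, and the mask representation is assumed:

```python
# Hypothetical sketch: constrain a predicted motion displacement field so
# that pixels outside the target subject region have zero displacement.
# subject_mask[i] is 1 for pixels inside the target subject region, else 0.

def mask_displacement_field(field, subject_mask):
    return [
        (dx, dy) if inside else (0, 0)
        for (dx, dy), inside in zip(field, subject_mask)
    ]

field = [(3, 1), (2, -2), (5, 0), (1, 1)]
mask = [1, 0, 1, 0]  # only pixels 0 and 2 belong to the target subject
masked = mask_displacement_field(field, mask)
# masked == [(3, 1), (0, 0), (5, 0), (0, 0)]
```

This is what keeps the content of the regions outside the target subject identical between the first image and each generated second image: pixels with zero displacement simply stay where they are.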

Optionally, the acquisition unit 1510 is further configured to acquire a target motion trend, indicated by the user, of the target subject.

In the case where the first image processed by the image processing system is the image to be processed, the motion displacement field prediction model is configured to process the first image, the target subject region information, and motion information to obtain the motion displacement field corresponding to each second moment, where the target displacement, represented by the motion displacement field corresponding to each second moment, of target subject pixels at the second moment relative to the first moment conforms to the target motion trend represented by the motion information, and the target subject pixels are pixels of the first image located on the target subject.

In some other embodiments, the image processing apparatus 1500 may execute the training method of the image processing system shown in FIG. 9. The image processing apparatus 1500 may be the training device 1120.

The acquisition unit 1510 is configured to acquire a training sample, a label image, and label audio.

The processing unit 1520 is configured to process the training sample using an initial image processing system to obtain a training image and training audio.

The processing unit 1520 is further configured to adjust parameters of the initial image processing system according to a first difference between the training image and the label image and a second difference between the training audio and the label audio, where the adjusted initial image processing system is the trained image processing system.
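The training update just described (a joint image-and-audio loss driving one parameter adjustment) might be sketched as follows; the scalar stand-in "system" and the squared-error losses are assumptions for illustration, not the patented training procedure:

```python
# Hypothetical training step: the initial image processing system is a
# single scalar gain; the loss combines the image difference (first
# difference) and the audio difference (second difference), and gradient
# descent adjusts the parameter.

def training_step(gain, sample, label_image, label_audio, lr=0.01):
    train_image = [gain * p for p in sample]        # training image
    train_audio = gain * sum(sample) / len(sample)  # training audio
    # First difference (image) and second difference (audio) as squared errors.
    d1 = sum((a - b) ** 2 for a, b in zip(train_image, label_image))
    d2 = (train_audio - label_audio) ** 2
    loss = d1 + d2
    # Analytic gradient of the joint loss with respect to the gain.
    mean = sum(sample) / len(sample)
    grad = sum(2 * (gain * p - lp) * p for p, lp in zip(sample, label_image))
    grad += 2 * (gain * mean - label_audio) * mean
    return gain - lr * grad, loss

gain = 0.0
sample, label_image, label_audio = [1.0, 2.0], [2.0, 4.0], 3.0
losses = []
for _ in range(200):
    gain, loss = training_step(gain, sample, label_image, label_audio)
    losses.append(loss)
assert losses[-1] < losses[0]  # the joint loss decreases during training
```

Summing the two differences into one loss is the simplest way to train the image branch and the audio branch jointly; a weighted sum would work equally well under this sketch.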

It should be noted that the image processing apparatus 1500 described above is embodied in the form of functional units. The term "unit" here may be implemented in software and/or hardware, which is not specifically limited.

For example, a "unit" may be a software program, a hardware circuit, or a combination of the two that implements the above functions. The hardware circuit may include an application-specific integrated circuit (ASIC), an electronic circuit, a processor (such as a shared processor, a dedicated processor, or a group processor) and memory for executing one or more software or firmware programs, merged logic circuits, and/or other suitable components that support the described functions.

Therefore, the units of the examples described in the embodiments of this application can be implemented in electronic hardware or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of this application.

This application further provides a chip, which includes a data interface and one or more processors. When executing instructions, the one or more processors read, through the data interface, instructions stored in a memory, so as to implement the image processing method and/or the training method of the image processing system described in the above method embodiments.

The one or more processors may be general-purpose processors or special-purpose processors. For example, the one or more processors may be a central processing unit (CPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other programmable logic devices such as discrete gates, transistor logic devices, or discrete hardware components.

The chip may be a component of a terminal device or another electronic device. For example, the chip may be located in the electronic device 100.

The processor and the memory may be provided separately or integrated together. For example, the processor and the memory may be integrated on a system on chip (SoC) of a terminal device. That is, the chip may further include a memory.

The memory may store a program, and the program may be run by the processor to generate instructions, so that the processor executes, according to the instructions, the image processing method and/or the training method of the image processing system described in the above method embodiments.

Optionally, the memory may also store data. Optionally, the processor may read the data stored in the memory; the data may be stored at the same storage address as the program, or at a different storage address from the program.

Exemplarily, the memory may be used to store a program related to the image processing method provided in the embodiments of this application, and the processor may be used to call the program related to the image processing method stored in the memory to implement the image processing method of the embodiments of this application.

For example, the processor may be configured to: acquire an image to be processed; and generate, according to the image to be processed, N target images in a target video and N target audios in the target video, where the N target images correspond one-to-one to the N target audios, N is an integer greater than 1, and each target audio is played while the target image corresponding to that target audio is displayed.

Exemplarily, the memory may be used to store a program related to the training method of the image processing system provided in the embodiments of this application, and the processor may be used to call the program related to the training method of the image processing system stored in the memory to implement the training method of the image processing system of the embodiments of this application.

For example, the processor may be configured to: acquire a training sample, a label image, and label audio; process the training sample using an initial image processing system to obtain a training image and training audio; and adjust parameters of the initial image processing system according to a first difference between the training image and the label image and a second difference between the training audio and the label audio, where the adjusted initial image processing system is the trained image processing system.

The chip may be arranged in an electronic device.

This application further provides a computer program product which, when executed by a processor, implements the image processing method and/or the training method of the image processing system described in any method embodiment of this application.

The computer program product may be stored in a memory, for example as a program, which after preprocessing, compilation, assembly, linking, and other processing is finally converted into an executable object file that can be executed by a processor.

This application further provides a computer-readable storage medium storing a computer program which, when executed by a computer, implements the image processing method and/or the training method of the image processing system described in any method embodiment of this application. The computer program may be a high-level language program or an executable object program.

The computer-readable storage medium is, for example, a memory. The memory may be a volatile memory or a non-volatile memory, or may include both. The non-volatile memory may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), used as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM).

The embodiments of this application may involve the use of user data. In practical applications, user-specific personal data may be used in the solutions described herein to the extent permitted by applicable laws and regulations, provided that the requirements of the applicable laws and regulations of the user's country are met (for example, the user's explicit consent is obtained, the user is effectively notified, etc.).

In the description of this application, the terms "first", "second", and the like are used only for descriptive purposes and should not be understood as indicating or implying relative importance or a specific order or sequence. A person of ordinary skill in the art can understand the specific meanings of the above terms in this application according to specific circumstances.

In this application, "at least one" means one or more, and "a plurality of" means two or more. "At least one of the following" or a similar expression refers to any combination of the listed items, including any combination of a single item or a plurality of items. For example, at least one of a, b, and c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, where each of a, b, and c may be single or plural.

It should be understood that, in the various embodiments of this application, the sequence numbers of the above processes do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of this application.

A person of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of this application.

A person skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the systems, apparatuses, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments, and details are not repeated here.

In the several embodiments provided in this application, it should be understood that the disclosed apparatuses and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for example, the division of the units is only a logical function division, and there may be other division methods in actual implementation; for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be implemented through some interfaces, and the indirect couplings or communication connections between apparatuses or units may be electrical, mechanical, or in other forms.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments.

In addition, the functional units in the embodiments of this application may be integrated into one processing unit, or the units may exist physically separately, or two or more units may be integrated into one unit.

The above is only a specific implementation of this application, but the protection scope of this application is not limited thereto. Any change or substitution readily conceivable by a person skilled in the art within the technical scope disclosed in this application shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims (12)

1. An image processing method, the method comprising:
Acquiring an image to be processed;
Generating N target images in a target video and N target audios in the target video according to the image to be processed, wherein the N target images are in one-to-one correspondence with the N target audios, N is an integer greater than 1, and each target audio is configured to be played while the target image corresponding to the target audio is displayed.
2. The method of claim 1, wherein generating N target images in a target video and N target audios in the target video according to the image to be processed comprises:
And sequentially processing a plurality of first images using an image processing system to obtain at least one second image corresponding to each first image and the target audio corresponding to each second image, wherein the N target images comprise the at least one second image corresponding to each first image, the at least one second image corresponding to each first image is an image after the first image in the target video, the 1st first image processed by the image processing system is the image to be processed, the i-th first image processed by the image processing system is an image in the at least one second image corresponding to the (i-1)-th first image processed by the image processing system, i is a positive integer greater than 1, and the image processing system comprises a neural network model obtained through training.
3. The method according to claim 2, wherein the method further comprises: acquiring a target subject indicated by a user in the image to be processed;
And when the first image processed by the image processing system is the image to be processed, the image processing system is configured to process the first image and target subject region information to obtain at least one second image corresponding to the first image and the target audio corresponding to each second image, wherein the target subject region information represents a target subject region, the target subject is recorded in the target subject region of the first image, and the content recorded in regions of each second image other than the target subject region is the same as the content recorded in those regions of the first image.
4. The method according to claim 3, wherein, in the case where the first image processed by the image processing system is the image to be processed, the image processing system is configured to process the first image and the target subject region information to obtain at least one second image corresponding to the first image and the target audio corresponding to each second image.
5. The method according to claim 3 or 4, wherein the method further comprises: acquiring a target motion trend, indicated by the user, of the target subject;
And in the case where the first image processed by the image processing system is the image to be processed, the image processing system is configured to process the first image, the target subject region information, and motion information to obtain the target image corresponding to each of the at least one second moment and the target audio corresponding to each second moment, wherein the target subject in the at least one target image corresponding to the at least one second moment moves according to the target motion trend represented by the motion information.
6. The method of claim 2, wherein the image processing system comprises a feature prediction model, an image generation model, and an audio generation model;
The feature prediction model is configured to process the first image to obtain at least one predicted feature, wherein different predicted features correspond to different second moments, and each second moment is a moment after the first moment corresponding to the first image in the target video;
The image generation model is configured to process each of the at least one predicted feature to obtain a second image corresponding to each second moment;
The audio generation model is configured to process each of the at least one predicted feature to obtain the target audio corresponding to each second moment.
7. The method of claim 6, wherein the feature prediction model comprises a motion displacement field prediction model, a motion feature extraction model, an image feature extraction model, and an adjustment model;
The motion displacement field prediction model is configured to process the first image to obtain a motion displacement field corresponding to each second moment, wherein the motion displacement field corresponding to each second moment represents the displacements of a plurality of pixels of the first image at the second moment relative to the first moment;
The motion feature extraction model is configured to perform feature extraction on each of the at least one motion displacement field to obtain the motion feature corresponding to each second moment;
The image feature extraction model is configured to perform feature extraction on the first image to obtain an image feature;
The adjustment model is configured to adjust the image feature according to the motion feature corresponding to each second moment to obtain the predicted feature corresponding to that second moment.
8. The method of claim 7, wherein the method further comprises: acquiring a target subject indicated by a user in the image to be processed;
The motion displacement field prediction model is configured to process the first image and target subject region information to obtain the motion displacement field corresponding to each second moment, wherein the target subject region information represents a target subject region, and the target subject is recorded in the target subject region of the image to be processed;
And the motion displacement field corresponding to each second moment indicates that the displacement of out-of-region pixels is 0, wherein the out-of-region pixels are pixels of the first image located outside the target subject region.
9. The method of claim 8, wherein the method further comprises: acquiring a target motion trend, indicated by the user, of the target subject;
And in the case where the first image processed by the image processing system is the image to be processed, the motion displacement field prediction model is configured to process the first image, the target subject region information, and the motion information to obtain the motion displacement field corresponding to each second moment, wherein the target displacement, represented by the motion displacement field corresponding to each second moment, of target subject pixels at the second moment relative to the first moment conforms to the target motion trend represented by the motion information, and the target subject pixels are pixels of the first image located on the target subject.
10. An electronic device, the electronic device comprising: one or more processors, and memory;
The memory being coupled to the one or more processors, the memory being for storing computer program code comprising computer instructions that the one or more processors invoke to cause the electronic device to perform the method of any of claims 1-9.
11. A chip system for application to an electronic device, the chip system comprising one or more processors to invoke computer instructions to cause the electronic device to perform the method of any of claims 1 to 9.
12. A computer readable storage medium comprising instructions that, when run on an electronic device, cause the electronic device to perform the method of any one of claims 1 to 9.
CN202410339775.5A 2024-03-25 2024-03-25 Image processing method and electronic equipment Active CN118101856B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410339775.5A CN118101856B (en) 2024-03-25 2024-03-25 Image processing method and electronic equipment


Publications (2)

Publication Number Publication Date
CN118101856A true CN118101856A (en) 2024-05-28
CN118101856B CN118101856B (en) 2025-01-17

Family

ID=91154833


Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118651557A (en) * 2024-08-19 2024-09-17 苏州市伏泰信息科技股份有限公司 A garbage bin classification supervision method

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160148057A1 (en) * 2014-11-26 2016-05-26 Hanwha Techwin Co., Ltd. Camera system and operating method of the same
CN115061770A (en) * 2022-08-10 2022-09-16 荣耀终端有限公司 Method and electronic device for displaying dynamic wallpaper
CN115359156A (en) * 2022-07-31 2022-11-18 荣耀终端有限公司 Audio playing method, device, equipment and storage medium
CN116071248A (en) * 2021-11-02 2023-05-05 华为技术有限公司 Image processing method and related equipment
CN116781992A (en) * 2023-06-27 2023-09-19 北京爱奇艺科技有限公司 Video generation method, device, electronic equipment and storage medium
CN117177025A (en) * 2023-08-14 2023-12-05 科大讯飞股份有限公司 Video generation method, device, equipment and storage medium




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant