WO2024012289A1 - Video generation method and apparatus, electronic device and medium - Google Patents

Video generation method and apparatus, electronic device and medium

Info

Publication number
WO2024012289A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature information
image
video
image feature
classification
Application number
PCT/CN2023/105161
Other languages
French (fr)
Chinese (zh)
Inventor
李宇
Original Assignee
维沃移动通信有限公司
Application filed by 维沃移动通信有限公司 (Vivo Mobile Communication Co., Ltd.)
Publication of WO2024012289A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 11/00 2D [Two Dimensional] image generation
    • G06T 11/001 Texturing; Colouring; Generation of texture or colour
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V 10/765 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects using rules for classification or partitioning the feature space
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Abstract

A video generation method and apparatus, an electronic device and a medium, relating to the technical field of artificial intelligence. The video generation method comprises: obtaining a first image set, inputting the first image set into a multi-classification model for classification, and outputting M classification results corresponding to the first image set; determining a target video template from at least one video template corresponding to the M classification results; and generating a target video on the basis of the first image set and the target video template, wherein M is an integer greater than 1.

Description

Video generation method, apparatus, electronic device and medium
Cross-reference to related applications
This application claims priority to Chinese patent application No. 202210834501.4, filed in China on July 14, 2022, the entire content of which is incorporated herein by reference.
Technical field
This application belongs to the field of artificial intelligence technology, and specifically relates to a video generation method, apparatus, electronic device and medium.
Background
With the widespread availability of network bandwidth, Internet video has been growing rapidly, and the most important factor in whether an Internet video is popular with users is the quality of the video itself.
In the related art, the video classification network used to produce videos typically splits a video into multiple single-frame images, classifies each frame separately, tallies the classification result of each frame, and then generates the video the user needs based on the final statistics.
However, because the video classification network used in the above scheme must classify each frame in turn, the network's latency is high and classification takes a long time, which in turn makes video generation inefficient.
Summary of the invention
The purpose of the embodiments of this application is to provide a video generation method, apparatus, electronic device and medium that can solve the problem of low video generation efficiency.
In a first aspect, embodiments of this application provide a video generation method. The method includes: acquiring a first image set, inputting the first image set into a multi-classification model for classification, and outputting M classification results corresponding to the first image set; determining a target video template from at least one video template corresponding to the M classification results; and generating a target video based on the first image set and the target video template, where M is an integer greater than 1.
In a second aspect, embodiments of this application provide a video generation apparatus. The apparatus includes an acquisition unit, a classification unit, a determination unit and a generation unit. The acquisition unit is used to acquire a first image set; the classification unit is used to input the first image set acquired by the acquisition unit into a multi-classification model for classification and to output M classification results corresponding to the first image set; the determination unit is used to determine a target video template from at least one video template corresponding to the M classification results obtained by the classification unit; and the generation unit is used to generate a target video based on the first image set acquired by the acquisition unit and the target video template determined by the determination unit, where M is an integer greater than 1.
In a third aspect, embodiments of this application provide an electronic device. The electronic device includes a processor and a memory. The memory stores programs or instructions that can be run on the processor, and when executed by the processor, the programs or instructions implement the steps of the method described in the first aspect.
In a fourth aspect, embodiments of this application provide a readable storage medium. Programs or instructions are stored on the readable storage medium, and when executed by a processor, the programs or instructions implement the steps of the method described in the first aspect.
In a fifth aspect, embodiments of this application provide a chip. The chip includes a processor and a communication interface. The communication interface is coupled to the processor, and the processor is used to run programs or instructions to implement the method described in the first aspect.
In a sixth aspect, embodiments of this application provide a computer program product. The program product is stored in a storage medium and is executed by at least one processor to implement the method described in the first aspect.
In the embodiments of this application, when making a video, the electronic device can first acquire a first image set and input it into a multi-classification model for classification, outputting M classification results corresponding to the first image set; then determine a target video template from at least one video template corresponding to the M classification results; and finally generate a target video based on the first image set and the target video template, where M is an integer greater than 1. Because this application classifies the entire first image set as a whole, the multi-classification model needs only a single forward pass to obtain the M classification results for the whole set. This improves the classification capability of the multi-classification model and thus the overall efficiency of video generation.
Description of drawings
Figure 1 is the first schematic flowchart of a video generation method provided by an embodiment of this application;
Figure 2 is the first processing flowchart of a multi-classification model provided by an embodiment of this application;
Figure 3 is the second processing flowchart of a multi-classification model provided by an embodiment of this application;
Figure 4 is a schematic diagram of a token downsampling module provided by an embodiment of this application;
Figure 5 is the second schematic flowchart of a video generation method provided by an embodiment of this application;
Figure 6 is the third schematic flowchart of a video generation method provided by an embodiment of this application;
Figure 7 is a schematic structural diagram of a video generation apparatus provided by an embodiment of this application;
Figure 8 is a schematic structural diagram of an electronic device provided by an embodiment of this application;
Figure 9 is a hardware schematic diagram of an electronic device provided by an embodiment of this application.
Detailed description
The technical solutions in the embodiments of this application are described clearly below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this application. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art fall within the scope of protection of this application.
The terms "first", "second" and the like in the description and claims of this application are used to distinguish similar objects, not to describe a specific order or sequence. It should be understood that terms used in this way are interchangeable where appropriate, so that the embodiments of this application can be practiced in orders other than those illustrated or described here. Objects distinguished by "first", "second" and the like are usually of one type, and their number is not limited; for example, there may be one first object or multiple first objects. In addition, "and/or" in the description and claims indicates at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the associated objects.
The video generation method, apparatus, electronic device and medium provided by the embodiments of this application are described in detail below through specific embodiments and application scenarios, with reference to the accompanying drawings.
In the related art, when a user needs to make a video, the mobile video classification network used classifies every frame of the video to obtain per-frame classification results and then derives an overall video classification result from them. This scheme requires multiple forward passes, so it takes a long time to run once deployed on an embedded platform; in addition, when the scheme runs on a mobile device, frame-by-frame processing puts heavy load on the device, which ultimately lowers the quality of the generated video.
In the video generation method, apparatus, electronic device and medium provided by the embodiments of this application, a brand-new video classification model is provided. When a user needs to make a video, the images or video frames input by the user can be fed into the video classification model as a whole, in a single pass, so the model needs only one forward pass. This reduces the model's latency and improves its classification efficiency, improving the classification capability of the model while reducing the computational cost. The classification results produced by the model can then be combined with a recommendation algorithm and video templates to generate a video for the user with one click, thereby improving video generation efficiency.
The execution subject of the video generation method provided in this embodiment may be a video generation apparatus, which may be an electronic device or a control or processing module in the electronic device. The technical solutions provided by the embodiments of this application are described below taking an electronic device as an example.
An embodiment of this application provides a video generation method. As shown in Figure 1, the video generation method may include the following steps 201 to 204:
Step 201: The electronic device acquires a first image set.
In this embodiment of the application, the first image set includes N frames of images, where N is an integer greater than 1.
In a possible embodiment, the N frames of images in the first image set may be N still images.
For example, when the electronic device obtains at least one image input by the user or pre-stored on the electronic device, it can pad a predetermined number of black frames before and after each of those images to form the first image set.
For example, "the electronic device acquires a first image set" in step 201 may include step 201a:
Step 201a: The electronic device obtains N images input by the user to obtain the first image set.
For example, the N images may include images pre-stored on the electronic device and/or images input by the user.
In another possible embodiment, the N frames of images in the first image set may be N video frames of a first video.
It should be noted that the video classification models in the related art are weak at modeling temporal order. For videos with a strong temporal structure, frame-by-frame processing cannot take the temporal order between frames into account, which reduces classification accuracy and fails to meet user needs.
To address this, after acquiring the N video frames of the first video, the electronic device can sort the N video frames according to the time order of each frame, thereby generating the first image set.
In this way, when the user needs to make a video, the N video frames are sorted in time order and then fed as a whole, in a single pass, into the multi-classification model provided by this application, which improves the classification accuracy of the multi-classification model and thus the quality of the final generated video.
For example, "the electronic device acquires a first image set" in step 201 may include step 201b:
Step 201b: The electronic device extracts N video frames from the first video to obtain the first image set.
For example, the N video frames may be key frames of the first video. A key frame is a video frame that carries key information in the first video, for example the frame that captures the key action in an object's movement or change, or another video frame that plays a decisive role.
For example, when extracting N video frames from the first video, the electronic device can sample them uniformly over the duration of the first video, ensuring that the extracted N video frames reflect the variety of video features in the first video.
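As a concrete illustration, the following is a minimal sketch of uniform frame sampling under the assumption that OpenCV is available; the function name and parameters are illustrative, not taken from the patent:

```python
import cv2
import numpy as np

def sample_frames_uniformly(video_path: str, n_frames: int = 16):
    """Evenly sample n_frames frames across the full duration of a video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced frame indices over the whole video, preserving time order.
    indices = np.linspace(0, total - 1, n_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames
```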
Step 202: The electronic device inputs the first image set into a multi-classification model for classification, and outputs M classification results corresponding to the first image set.
In this embodiment of the application, the multi-classification model may be a multiclass video-classification model (MVM), that is, a classification model that can jointly analyze multiple frames of images.
In this embodiment of the application, the M classification results may include the classification categories corresponding to the first image set and the names of those categories. For example, the categories may be action, scene, object, emotion, and so on.
Step 203: The electronic device determines a target video template from at least one video template corresponding to the M classification results.
Here, M is an integer greater than 1.
In this embodiment of the application, the at least one video template is one or more video templates in the electronic device's video template library. The library pre-stores multiple video templates, each corresponding to at least one template category. A video template is a pre-edited, reusable video in a fixed format; it generally may include the video layout, color scheme, background, soundtrack, fonts, and so on.
In this embodiment of the application, one classification result can correspond to one or more video templates, and different classification results can correspond to the same video template or to different video templates.
Step 204: The electronic device generates a target video based on the first image set and the target video template.
In this embodiment of the application, after determining the target video template, the electronic device can fuse the N frames of images in the first image set with the target video template to generate the target video.
Optionally, in this embodiment of the application, when the N frames of images in the first image set are N video frames of the first video, "the electronic device generates a target video based on the first image set and the target video template" in step 204 may include step 204b:
Step 204b: The electronic device fuses the first video with the target video template to generate the target video.
In this embodiment of the application, when fusing the first video with the target video template, the electronic device can align the start of the first video's timeline with the start of the template's timeline and then perform the fusion to generate the target video.
It should be noted that when the template's timeline is shorter than the first video's, the template can be reused after its timeline reaches its end point, repeating until the whole first video has been fused.
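A minimal sketch of this timeline alignment, assuming only that the template is repeated whole; the durations are made-up example values:

```python
import math

video_duration = 73.0      # seconds of the first video (illustrative)
template_duration = 20.0   # seconds of the target video template (illustrative)

# Align both timelines at their start; if the template is shorter, reuse it
# end-to-end until the whole video is covered.
repeats = math.ceil(video_duration / template_duration)  # 4 passes here
```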
In the video generation method provided by the embodiments of this application, when making a video, the electronic device can first acquire a first image set and input it into a multi-classification model for classification, outputting M classification results corresponding to the first image set; then determine a target video template from at least one video template corresponding to the M classification results; and finally generate a target video based on the first image set and the target video template, where M is an integer greater than 1. Because the first image set is classified as a whole, the multi-classification model needs only a single forward pass to obtain the M classification results for the entire set, which improves the model's classification capability and thus the overall efficiency of video generation.
Optionally, in this embodiment of the application, "the electronic device inputs the first image set into a multi-classification model for classification, and outputs M classification results corresponding to the first image set" in step 202 may include the following steps A1 to A4:
Step A1: After inputting the first image set into the multi-classification model, the electronic device converts the N frames of images in the first image set into first image feature information of X image blocks based on the multi-classification model.
Here, X is an integer greater than 1.
In this embodiment of the application, the first image feature information of an image block may include a first image feature vector of the image block, for example the token corresponding to the image block.
In this embodiment of the application, the electronic device can input the first image set into an image feature information conversion module (e.g., a tokenization module) and output the tokens of the X image blocks corresponding to the N frames of images.
Further optionally, in this embodiment of the application, "converting the N frames of images in the first image set into first image feature information of X image blocks based on the multi-classification model" in step A1 may include the following steps A11 and A12:
Step A11: Based on the image feature information conversion module in the multi-classification model, split the N frames of images in the first image set to obtain X image blocks.
In this embodiment of the application, any frame of image in the first image set may correspond to multiple image blocks.
Step A12: Extract feature information from the X image blocks through a convolutional neural network to obtain the first image feature information of the X image blocks.
In this embodiment of the application, the electronic device can first cut each frame of the first image set into individual image blocks and then extract each block's image features separately through a convolutional neural network (CNN), thereby obtaining the first image feature information of each image block.
Step A2: Determine first key image feature information from the first image feature information of the X image blocks.
In this embodiment of the application, the first key image feature information may be first image feature information whose pixel features meet a predetermined condition, or first image feature information whose spatial features meet a predetermined condition.
Further optionally, in this embodiment of the application, "determining first key image feature information from the first image feature information of the X image blocks" in step A2 may include the following steps A21 and A22:
Step A21: Based on the image feature information selection module in the multi-classification model, the electronic device selects second key image feature information from the first image feature information of the X image blocks, and transforms the arrangement of the first image feature information of the X image blocks to obtain second image feature information.
In this embodiment of the application, transforming the arrangement means adjusting the positions in which the first image feature information of the X image blocks is arranged.
It should be noted that transforming the arrangement does not change the actual content of the first image feature information of the X image blocks.
Step A22: Fuse the second key image feature information with the second image feature information to obtain the first key image feature information.
In this embodiment of the application, the image feature information selection module may be a token selection module (e.g., a TokenSelect module).
In this embodiment of the application, the electronic device can use the token selection module to select the few most important pieces of key image feature information from the image feature information of the X image blocks, reducing the amount of image feature information and thus the computation of the multi-classification model.
In this embodiment of the application, the electronic device can select, from the image feature information of the X image blocks, the image feature information of the image blocks that contain key information.
Step A3: Extract the high-level semantic information corresponding to at least one piece of key image feature information.
In this embodiment of the application, high-level semantic information refers to abstract feature information in an image, for example the expressions or ages of the people in the image.
Further optionally, in this embodiment of the application, "extracting the high-level semantic information corresponding to at least one piece of key image feature information" in step A3 may include the following steps A31 to A34:
Step A31: Based on the basic feature module in the multi-classification model, the electronic device normalizes the first key image feature information to obtain third key image feature information.
Step A32: Extract the basic image feature information from the third key image feature information.
Step A33: Fuse the first key image feature information with the basic image feature information to obtain target key image feature information.
Step A34: Extract the high-level semantic information corresponding to the target key image feature information.
In this embodiment of the application, the basic feature module is used to perform feature extraction on the first key image feature information determined by the token selection module, to obtain the high-level semantic information corresponding to that first key image feature information.
Step A4: Obtain the M classification results corresponding to the first image set based on the high-level semantic information corresponding to the first key image feature information.
In this embodiment of the application, the electronic device can feed the obtained high-level semantic information into the fully connected layer of the multi-classification model to obtain the M classification results corresponding to the first image set.
In this embodiment of the application, the fully connected layer is used to convert the input high-level semantic information into multiple classification results for output.
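A hypothetical sketch of such a head, assuming the token-level semantics are mean-pooled into one vector per video before the fully connected layer; the dimensions are illustrative:

```python
import torch
import torch.nn as nn

embedding, num_classes = 512, 400          # assumed sizes
head = nn.Linear(embedding, num_classes)   # the fully connected layer

tokens = torch.randn(1, 128, embedding)    # [bs, selected tokens, embedding]
video_feature = tokens.mean(dim=1)         # aggregate tokens into one vector
scores = head(video_feature)               # [bs, num_classes] class scores
```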
Example 1:
As an example, take a first image set consisting of 16 video frames to illustrate the classification process of the multi-classification model.
For example, taking the MVM model as the multi-classification model, its classification process is as follows. First, 16 frames (this parameter is adjustable) are sampled uniformly in time from the input video and arranged, in their order in the original video, into a multi-dimensional matrix (e.g., [bs*16, 3, 224, 224]), denoted input (assuming a single input video, the batch size bs is 1). The input is then converted into tokens through CNN convolution (tokenization means dividing the picture or video into image blocks; each block's information is extracted separately by the CNN into a 1*1*embedding feature vector, where embedding is the dimension of the feature vector after tokenization). Next, the tokens pass through the TokenSelect module, which selects the most important tokens and thereby reduces the model's computation. The basic feature module then extracts the tokens' high-level semantic information, and finally a fully connected layer produces the classification results for the video's multiple labels.
Specifically, the classification process of the MVM model includes the following steps S1 and S2:
Step S1 (processing of the tokenization module): First, the 16-frame image set is arranged into multi-dimensional matrix 1, e.g., [bs*16, 3, 224, 224], denoted input. Then, through several CNN convolution operations, matrix 1 is transformed into multi-dimensional matrix 2, e.g., [bs*16, embedding, 224/16, 224/16], where embedding is the dimension parameter of the token feature vector and can be 512, 768, 1024, etc. The 224/16 arises because a convolution with a 16*16 kernel (meaning each frame is divided into 16*16-pixel image blocks from which tokens are extracted) and a stride of 16 reduces the input height and width (224*224) to [224/16, 224/16] = [14, 14]. For example, an input of size [3, 224, 224], after a CNN convolution with stride 16 and a 16*16 kernel, becomes [512, 224/16, 224/16], taking embedding = 512 as an example.
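A minimal sketch of this tokenization step, matching the shapes in the text; collapsing the "several" convolutions into a single 16*16, stride-16 convolution is a simplification:

```python
import torch
import torch.nn as nn

bs, n_frames, embedding = 1, 16, 512
# Each 16*16 pixel block becomes one token of dimension `embedding`.
tokenize = nn.Conv2d(3, embedding, kernel_size=16, stride=16)

frames = torch.randn(bs * n_frames, 3, 224, 224)  # multi-dimensional matrix 1
tokens = tokenize(frames)                         # [16, 512, 14, 14], matrix 2
```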
Step S2 (processing of the token selection module): After the CNN convolution operations produce multi-dimensional matrix 2 (e.g., [bs*16, 512, 224/16, 224/16]), there are bs*16*14*14 tokens. As shown in Figure 2, matrix 2 is transformed along two paths:
On one path, matrix 2 first passes through a 2D convolution conv1 (3*3 kernel, 512 output channels; the channel count is adjustable) and an activation function (ReLU), and then through another 2D convolution conv2 (3*3 kernel, 128 output channels, where 128 is the number of tokens the token selection module ultimately needs to select, i.e., the at least one piece of key image feature information above). This selects multi-dimensional matrix 3, e.g., [bs*128, 14*14], from matrix 2. Finally, an activation function (sigmoid) adjusts the confidence of the tokens to be selected, and an unsqueeze operation extends the output dimension, expanding matrix 3 into multi-dimensional matrix 4, e.g., [bs*128, 14*14, 1].
On the other path, a reshape operation first turns matrix 2 [bs*512, 14, 14] into multi-dimensional matrix 5, e.g., [bs*1, 512, 14*14], and a transpose operation then converts matrix 5 into multi-dimensional matrix 6, e.g., [bs*1, 14*14, 512], changing the shape of the matrix.
Finally, the results of the two paths are multiplied elementwise, giving an output of [bs*128, 14*14, 512]. Averaging this output over its second-to-last dimension (the 14*14 dimension) yields the final output of the token selection module, [bs*128, 512], where 128 is the number of tokens to select and 512 is the token feature vector dimension.
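A runnable sketch of this two-path flow; for simplicity it treats one 14*14 token grid per forward call, so the handling of the frame dimension is an assumption:

```python
import torch
import torch.nn as nn

class TokenSelect(nn.Module):
    def __init__(self, embedding: int = 512, n_select: int = 128):
        super().__init__()
        self.conv1 = nn.Conv2d(embedding, embedding, 3, padding=1)
        self.conv2 = nn.Conv2d(embedding, n_select, 3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: [B, 512, 14, 14]
        b, c, h, w = x.shape
        # Path 1: conv1 + ReLU, then conv2, then sigmoid -> a per-position
        # confidence map for each of the n_select tokens, expanded by unsqueeze.
        att = torch.sigmoid(self.conv2(torch.relu(self.conv1(x))))
        att = att.reshape(b, -1, h * w).unsqueeze(-1)        # [B, 128, 196, 1]
        # Path 2: reshape + transpose lays the tokens out as [B, 1, 196, 512].
        feat = x.reshape(b, c, h * w).transpose(1, 2).unsqueeze(1)
        # Elementwise product, then average over the 14*14 positions.
        return (att * feat).mean(dim=2)                      # [B, 128, 512]

selected = TokenSelect()(torch.randn(1, 512, 14, 14))        # -> [1, 128, 512]
```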
In this way, when the electronic device classifies the entire first image set as a whole through the multi-classification model, each frame of image is converted into the image feature information of multiple image blocks; some important key image feature information is selected from that information; high-level semantic information is extracted from the important key image feature information; and finally the M classification results of the first image set are obtained from the high-level semantic information. This reduces the computation of the multi-classification model and further improves classification efficiency.
Example 2:
Regarding the basic feature module of the multi-classification model in step A31 above:
For example, as shown in Figure 3, the main components of the basic feature module are: a token normalization layer, a token pooling layer, a random token-drop layer, a token residual-connection layer, and a token downsampling module.
Regarding the token normalization layer in the basic feature module:
For example, the token normalization layer is used to limit the range of the tokens, e.g., to (0, 1).
For example, the token normalization layer uses a normalization module (torch.nn.LayerNorm) to apply layer normalization to the input tokens. The main purpose of this layer normalization is to normalize each token, computed as follows:
$y = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} \cdot \gamma + \beta$
Here, the expectation E[x] is the mean of the input x, the variance Var[x] is the variance of the input x, ε = 1e-6 prevents the denominator from being 0, and the other parameters (γ and β) are learnable offsets.
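For reference, a one-line sketch of this normalization in PyTorch; only the 1e-6 epsilon is taken from the formula above (PyTorch's default is 1e-5):

```python
import torch

ln = torch.nn.LayerNorm(512, eps=1e-6)  # normalizes each token over its 512 dims
tokens = torch.randn(128, 512)          # 128 selected tokens, embedding 512
normed = ln(tokens)
```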
Regarding the token pooling layer in the basic feature module:
For example, the token pooling layer is used to learn the relationships between different tokens.
For example, continuing Example 1, the token pooling layer mainly pools the 128 tokens with a 3*1 pooling kernel. For an input of [128, 512], the pooling kernel moves 3 pixels per row and 1 pixel per column to generate the new pooling result. This pooling layer mainly fuses information between different tokens.
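A sketch of this pooling; average pooling and the shape-preserving padding are assumptions, since the text only specifies the 3*1 kernel:

```python
import torch
import torch.nn as nn

# Each position is averaged with its neighbours along the token axis,
# fusing information across adjacent tokens.
pool = nn.AvgPool2d(kernel_size=(3, 1), stride=1, padding=(1, 0))
tokens = torch.randn(1, 1, 128, 512)   # [bs, channel, tokens, embedding]
pooled = pool(tokens)                  # shape preserved: [1, 1, 128, 512]
```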
Regarding the random token-drop layer in the basic feature module:
For example, the random token-drop layer is used to improve the recognition capability of the multi-classification model.
For example, continuing Example 1, the random token-drop layer selects a drop ratio t (0 <= t < 1) so that among the 128 input tokens, t*100% of them are randomly set to 0, discarding their original values. This lets the subsequent video classification handle a wider range of tokens.
Regarding the token residual-connection layer in the basic feature module:
For example, the token residual-connection layer is used to increase the processing depth of the multi-classification model.
For example, continuing the examples above, the token residual-connection layer mainly adds the 128 input tokens to the (1-t)*100% of tokens output by the random token-drop layer, thereby preserving the original information.
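A combined sketch of the random token drop and the residual link just described, assuming an independent per-token draw (the text does not fix the sampling scheme):

```python
import torch

def random_token_drop(tokens: torch.Tensor, t: float) -> torch.Tensor:
    """Randomly zero roughly t*100% of the tokens."""
    keep = (torch.rand(tokens.shape[:-1], device=tokens.device) >= t).float()
    return tokens * keep.unsqueeze(-1)

tokens = torch.randn(1, 128, 512)
dropped = random_token_drop(tokens, t=0.1)
# Residual connection: add the original tokens back, so the randomly
# discarded information is not lost entirely.
out = tokens + dropped
```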
Regarding the token downsampling module in the basic feature module:
For example, the downsampling module is used to further reduce the number of output tokens and adjust the output dimension.
For example, as shown in Figure 4, the downsampling module includes a linear transformation (fully connected, FC) layer, an activation function layer (e.g., a ReLU activation) and a dropout layer. Specifically, the FC layer first changes the dimension of the output tokens, the activation function layer is then applied, and the dropout layer finally turns a random part of the tokens in the output into 0.
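A sketch of this module; the target dimension and dropout rate are assumptions for illustration:

```python
import torch
import torch.nn as nn

downsample = nn.Sequential(
    nn.Linear(512, 256),   # FC layer: change the token dimension
    nn.ReLU(),             # activation function layer
    nn.Dropout(p=0.1),     # randomly zero part of the output
)
out = downsample(torch.randn(1, 128, 512))   # -> [1, 128, 256]
```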
In this way, the electronic device feeds the key image feature information selected by the token selection module into the basic feature module to obtain the corresponding high-level semantic information, allowing the multi-classification model of this application to produce more accurate classification results.
Optionally, in this embodiment of the application, the M classification results include a classification score for each category, and "the electronic device determines a target video template from at least one video template corresponding to the M classification results" in step 203 may include the following steps 203a and 203b:
Step 203a: The electronic device determines a target classification result from the M classification results corresponding to the first image set.
In this embodiment of the application, the target classification result is the classification result with the highest classification score among the M classification results.
In this embodiment of the application, after obtaining the M classification results, the electronic device can sort them according to the classification score of each category included in the results and determine the highest-scoring classification result as the target classification result.
Example 3: Taking a first image set that includes N video frames of video A as an example, the electronic device can rank by score the M classification results the multi-classification model outputs for video A and take the top three, denoted A: [Aclass1, Ascore1; Aclass2, Ascore2; Aclass3, Ascore3]. Sorting by the categories' score values gives a sorted category sequence AS: [Aclass1, Aclass2, Aclass3], and the highest-scoring classification result Aclass1 is selected as the target classification result of video A.
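A minimal sketch of this ranking step, with made-up scores:

```python
# Sort (class, score) pairs by score and keep the best as the target result.
results = [("Aclass1", 0.91), ("Aclass2", 0.66), ("Aclass3", 0.43)]
ranked = sorted(results, key=lambda r: r[1], reverse=True)
target_class = ranked[0][0]   # "Aclass1"
```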
Example 4: As shown in Figure 5, taking a first image set that includes N video frames of video A and of video B as an example, the electronic device can rank by score the M classification results the multi-classification model outputs for each of video A and video B, taking the top three for each, denoted A: [Aclass1, Ascore1; Aclass2, Ascore2; Aclass3, Ascore3] and B: [Bclass1, Bscore1; Bclass2, Bscore2; Bclass3, Bscore3]. A and B are then combined into one matching chain AB: [Aclass1, Ascore1, Aclass2, Ascore2, Aclass3, Ascore3; Bclass1, Bscore1, Bclass2, Bscore2, Bclass3, Bscore3]. Sorting AB by the categories' score values gives a sorted category sequence ABS: [Aclass1, Aclass2, Aclass3, Bclass1, Bclass2, Bclass3], and the highest-scoring classification results Aclass1 and Bclass1 are selected as the target classification results of video A and video B, respectively.
Step 203b: Determine the target video template from the video templates that match the target classification result.
In this embodiment of the application, the electronic device can first select, from the video template library, at least one video template that matches the target classification result, and then determine from that at least one template the target video template that best fits the target classification result.
In this way, the electronic device sorts the multiple classification results obtained by the multi-classification model according to each result's classification score, takes the highest-scoring result as the final target classification result, and then determines the final target video template from the multiple video templates that match that result. This makes the determined target video template a better match for the first video and improves the video quality of the final generated video.
Optionally, in this embodiment of the application, the classification results also include a classification type name. Before step 203b, the video generation method provided by the embodiments of this application may further include the following steps 203b1 and 203b2:
Step 203b1: Calculate the similarity value between the classification type name in the target classification result and the name of each video template in the video template library.
In this embodiment of the application, the electronic device can convert both the text of the target classification result's classification type name and the text of each video template's name in the video template library into vector values, and then compute a score for each video template from those vector values, thereby obtaining the similarity between the target classification result and each video template.
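A sketch of this matching step; cosine similarity and the embed_text helper are assumptions, since the text does not specify how names are turned into vectors:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_template(target_name, template_names, embed_text):
    """Return the template whose name embedding is most similar to the
    target classification type name. embed_text is a hypothetical
    text-embedding function supplied by the caller."""
    target_vec = embed_text(target_name)
    scores = [cosine_similarity(target_vec, embed_text(name))
              for name in template_names]
    return template_names[int(np.argmax(scores))]
```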
步骤203b2:将相似度值满足第一条件的视频模板,确定为与目标分类结果匹配的视频模板。Step 203b2: Determine the video template whose similarity value satisfies the first condition as the video template that matches the target classification result.
在本申请实施例中,上述第一条件可以为:与目标分类结果中的分类类型名称间的相似度值最高的视频模板。In this embodiment of the present application, the first condition may be: the video template with the highest similarity value to the classification type name in the target classification result.
在本申请实施例中,电子设备还可以将按照相似度值的高低对视频模板进行排序,将排名靠前的视频模板推送给用户,以供用户可以手动选择所要融合的视频模板,提 高了视频生成的灵活性。In this embodiment of the present application, the electronic device can also sort the video templates according to the similarity value, and push the top-ranked video templates to the user so that the user can manually select the video template to be merged, providing Improved flexibility of video generation.
示例5,结合上述示例3,在得到上述类别序列AS之后,电子设备可以使用相同的方法,对视频模板库中的视频模板进行类别序列生成操作,视频模板的类别按照模板中的视频出现的类别的先后顺序排列,记为DataSetSi,i∈[0,DataSet],通过遍历DataSet中的每个元素的类别,计算与AS的相似度,获得与视频A相似度最高的视频模板,然后将视频A与该视频模板进行融合,生成目标视频。Example 5, combined with the above Example 3, after obtaining the above category sequence AS, the electronic device can use the same method to perform a category sequence generation operation on the video template in the video template library. The category of the video template is according to the category of the video in the template. The sequential arrangement is recorded as DataSetSi, i∈[0, DataSet]. By traversing the category of each element in the DataSet, calculating the similarity with AS, the video template with the highest similarity to video A is obtained, and then video A is Fusion with the video template to generate the target video.
示例6,结合上述示例4,如图5所示,在得到上述类别序列ABS之后,电子设备可以使用相同的方法,对视频模板库中的视频模板进行类别序列生成操作,视频模板的类别按照模板中的视频出现的类别的先后顺序排列,记为DataSetSi,i∈[0,DataSet],通过遍历DataSet中的每个元素的类别,计算与ABS的相似度,分别获得与视频A、视频B相似度最高的两个视频模板,然后将视频A与自己相似度最高的视频模板融合、将视频B与自己相似度最高的模板融合,分别得到目标视频A*和目标视频B*,最后再将目标视频A*和目标视频B*进行简单拼接,生成目标视频。Example 6, combined with the above Example 4, as shown in Figure 5, after obtaining the above category sequence ABS, the electronic device can use the same method to perform a category sequence generation operation on the video templates in the video template library. The category of the video template is according to the template The order of the categories in which the videos appear is recorded as DataSetSi, i∈[0, DataSet]. By traversing the category of each element in the DataSet, the similarity with ABS is calculated, and the similarity with video A and video B is obtained respectively. The two video templates with the highest similarity, then fuse video A with the video template with the highest similarity, and fuse video B with the template with the highest similarity to get the target video A* and target video B* respectively, and finally merge the target video Video A* and target video B* are simply spliced to generate the target video.
In this way, by performing the category sequence generation operation on the video templates in the video template library, the electronic device computes the similarity value between the classification type name of the determined target classification result and the name of each video template in the video template library, so as to obtain the video template with the highest similarity to the target classification result. The electronic device can therefore determine the video template more accurately.

The video generation method provided by the embodiments of this application is illustrated below by way of example:

Illustratively, taking the case where the first image set contains N video frames of video A, as shown in Figure 6, the video generation method provided by this application may include the following steps P1 to P5:

Step P1: when the user inputs video A, a frame extraction operation is performed on video A. Specifically, N video frames may be extracted uniformly according to the duration of video A.

Step P2: the extracted N video frames are sorted in the temporal order of video A to form the first image set, which is used as the input to the MVM model.
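A hedged sketch of steps P1 and P2, assuming OpenCV (opencv-python) is available; sampling by evenly spaced frame indices is one reasonable reading of "extracted uniformly according to the duration".

    import cv2  # assumption: opencv-python is installed

    def extract_first_image_set(video_path, n):
        cap = cv2.VideoCapture(video_path)
        total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        frames = []
        for k in range(n):
            idx = int(k * total / n)              # evenly spaced frame indices (step P1)
            cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
            ok, frame = cap.read()
            if ok:
                frames.append(frame)              # appended in temporal order (step P2)
        cap.release()
        return frames                             # the first image set fed to the MVM model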
Step P3: after the first image set is input into the MVM model, M classification results of video A are obtained through inference. Specifically, the N video frames are converted into tokens through CNN convolution; the Token selection module then selects the important tokens from those converted from the N video frames; the basic feature blocks extract high-level semantic information from these important tokens; finally, the extracted high-level semantic information is passed through a fully connected layer to obtain the M classification results of video A.
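The internals of the MVM model are not disclosed in detail; the PyTorch sketch below is a stand-in that mirrors the stages just described: CNN convolution to tokens, selection of important tokens, a basic feature block for high-level semantics, and a fully connected layer. Token importance scored by L2 norm and a single transformer layer as the basic feature block are assumptions.

    import torch
    import torch.nn as nn

    class MVMSketch(nn.Module):
        def __init__(self, num_classes_m, dim=128, keep=64):
            super().__init__()
            self.to_tokens = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # frame -> patch tokens
            self.block = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
            self.head = nn.Linear(dim, num_classes_m)                      # fully connected layer
            self.keep = keep

        def forward(self, frames):
            # frames: (N, 3, H, W) -- the whole first image set in one forward pass
            tok = self.to_tokens(frames).flatten(2).transpose(1, 2)        # (N, P, dim)
            tok = tok.reshape(1, -1, tok.shape[-1])                        # treat the set as one sequence
            score = tok.norm(dim=-1)                                       # token importance (assumed)
            idx = score.topk(min(self.keep, tok.shape[1]), dim=1).indices
            tok = tok.gather(1, idx.unsqueeze(-1).expand(-1, -1, tok.shape[-1]))
            sem = self.block(tok).mean(dim=1)                              # high-level semantics
            return self.head(sem)                                          # M classification logits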
Step P4: a video template is matched from the video template library using the M classification results of video A. Specifically, the M classification results may be sorted by their corresponding scores to obtain the highest-scored classification result. The text information of the classification type name of that result is then converted into a vector value, the text information of the name of each video template in the video template library is also converted into a vector value, the similarity between each video template and the highest-scored classification result is computed, and the video template with the highest similarity is taken as the best-matching template.
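A short sketch of step P4, building on the assumed score_templates helper shown earlier; the (class_name, score) pair format for the M results is likewise an assumption.

    def match_template(results, template_names):
        # results: (class_name, score) pairs -- the M classification results of video A.
        best_label = max(results, key=lambda r: r[1])[0]       # highest-scored classification
        scored = score_templates(best_label, template_names)   # assumed helper from the earlier sketch
        return max(scored, key=lambda s: s[1])[0]              # best-matching template name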
Step P5: the video A input by the user is fused with the video template matched in step P4 to generate the final target video.

In this way, the video frames extracted from the video are first sorted in chronological order, which takes the temporal order between frames into account and improves the classification accuracy of the MVM model; the sorted frames are then input into the MVM model as a whole for processing to obtain multiple classification results of the video, which improves the classification speed of the MVM model; next, the video template most similar to the highest-scored of those classification results is matched from the video template library; finally, the video is fused with that template to obtain the final video. This not only improves the classification capability of the MVM model but also ensures the quality of the finally generated video.

For the video generation method provided by the embodiments of this application, the execution subject may be a video generation apparatus. In the embodiments of this application, a video generation apparatus executing the video generation method is taken as an example to describe the video generation apparatus provided by the embodiments of this application.

An embodiment of this application provides a video generation apparatus. As shown in Figure 7, the video generation apparatus 400 includes: an acquisition unit 401, a classification unit 402, a determination unit 403 and a generation unit 404, wherein: the acquisition unit 401 is configured to acquire a first image set; the classification unit 402 is configured to input the first image set acquired by the acquisition unit 401 into a multi-classification model for classification and output M classification results corresponding to the first image set; the determination unit 403 is configured to determine a target video template from at least one video template corresponding to the M classification results obtained by the classification unit 402; and the generation unit 404 is configured to generate a target video based on the first image set acquired by the acquisition unit 401 and the target video template determined by the determination unit 403; where M is an integer greater than 1.

Optionally, in this embodiment of the application, the classification unit 402 is specifically configured to: after the first image set acquired by the acquisition unit 401 is input into the multi-classification model, convert the N frames of images in the first image set into first image feature information of X image blocks based on the multi-classification model; determine first key image feature information from the first image feature information of the X image blocks; extract high-level semantic information corresponding to the first key image feature information; and obtain, based on the high-level semantic information, the M classification results corresponding to the first image set; where N and X are integers greater than 1.

Optionally, in this embodiment of the application, the classification unit 402 is specifically configured to: split the N frames of images in the first image set into X image blocks based on the image feature information conversion module in the multi-classification model; and extract feature information from the X image blocks through a convolutional neural network to obtain the first image feature information of the X image blocks.
Optionally, in this embodiment of the application, the classification unit 402 is specifically configured to: select second key image feature information from the first image feature information of the X image blocks based on the image feature information selection module in the multi-classification model, and transform the arrangement of the first image feature information of the X image blocks to obtain second image feature information; and fuse the second key image feature information with the second image feature information to obtain the first key image feature information.
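As a hedged illustration of the image feature information selection module just described, the PyTorch sketch below selects second key features by an assumed L2-norm importance score, produces second image features by a toy re-arrangement of the first image features, and fuses the two with a linear layer; all three choices are assumptions, as the embodiment does not specify them.

    import torch
    import torch.nn as nn

    class FeatureSelectSketch(nn.Module):
        def __init__(self, dim=128, keep=64):
            super().__init__()
            self.keep = keep
            self.fuse = nn.Linear(dim * 2, dim)  # fusion of key and re-arranged features

        def forward(self, feats):
            # feats: (1, X, dim) -- first image feature information of the X image blocks
            k = min(self.keep, feats.shape[1])
            idx = feats.norm(dim=-1).topk(k, dim=1).indices                # pick important blocks
            key = feats.gather(1, idx.unsqueeze(-1).expand(-1, -1, feats.shape[-1]))  # second key features
            rearranged = feats.flip(1)[:, :k]                              # second image features (toy re-arrangement)
            return self.fuse(torch.cat([key, rearranged], dim=-1))         # first key image feature information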
Optionally, in this embodiment of the application, the classification unit 402 is specifically configured to: perform a normalization operation on the first key image feature information based on the basic feature module in the multi-classification model to obtain third key image feature information; extract basic image feature information from the third key image feature information; fuse the first key image feature information with the basic image feature information to obtain target key image feature information; and extract the high-level semantic information corresponding to the target key image feature information.
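Similarly, a minimal sketch of the basic feature module described here, assuming PyTorch: layer normalization yields the third key features, an assumed MLP extracts the basic features, and a residual addition performs the fusion into the target key features.

    import torch.nn as nn

    class BasicFeatureSketch(nn.Module):
        def __init__(self, dim=128):
            super().__init__()
            self.norm = nn.LayerNorm(dim)                  # normalization operation
            self.extract = nn.Sequential(                  # assumed base-feature extractor
                nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

        def forward(self, first_key):
            third_key = self.norm(first_key)               # third key image feature information
            base = self.extract(third_key)                 # basic image feature information
            return first_key + base                        # fusion -> target key image feature information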
Optionally, in this embodiment of the application, the acquisition unit 401 is specifically configured to extract N video frames from a first video to obtain the first image set; and the generation unit 404 is specifically configured to fuse the first video with the target video template to generate the target video.

In the video generation apparatus provided by the embodiments of this application, when producing a video, the apparatus may first acquire a first image set; then input the first image set into a multi-classification model for classification, so as to output M classification results corresponding to the first image set; then determine a target video template from at least one video template corresponding to the M classification results; and finally generate a target video based on the first image set and the target video template, where M is an integer greater than 1. Since this application classifies the entire first image set as a whole, the multi-classification model needs only a single forward pass to obtain the M classification results for the whole first image set; this improves the classification capability of the multi-classification model and thereby the overall efficiency of video generation.

The video generation apparatus in the embodiments of this application may be an electronic device, or a component in an electronic device, such as an integrated circuit or a chip. The electronic device may be a terminal or a device other than a terminal. For example, the electronic device may be a mobile phone, a tablet computer, a notebook computer, a handheld computer, a vehicle-mounted electronic device, a Mobile Internet Device (MID), an augmented reality (AR)/virtual reality (VR) device, a robot, a wearable device, an ultra-mobile personal computer (UMPC), a netbook or a personal digital assistant (PDA), and may also be a server, a Network Attached Storage (NAS), a personal computer (PC), a television (TV), a teller machine or a self-service machine, which is not specifically limited in the embodiments of this application.

The video generation apparatus in the embodiments of this application may be an apparatus with an operating system. The operating system may be an Android operating system, an iOS operating system, or another possible operating system, which is not specifically limited in the embodiments of this application.

The video generation apparatus provided by the embodiments of this application can implement the processes implemented by the method embodiments of Figures 1 to 6; to avoid repetition, details are not described here again.

Optionally, as shown in Figure 8, an embodiment of this application further provides an electronic device 600, including a processor 601 and a memory 602, the memory 602 storing a program or instructions executable on the processor 601; when executed by the processor 601, the program or instructions implement the steps of the above video generation method embodiments and can achieve the same technical effect, which is not repeated here to avoid duplication.

It should be noted that the electronic devices in the embodiments of this application include the mobile electronic devices and non-mobile electronic devices described above.
Figure 9 is a schematic diagram of the hardware structure of an electronic device implementing an embodiment of this application.

The electronic device 100 includes, but is not limited to: a radio frequency unit 101, a network module 102, an audio output unit 103, an input unit 104, a sensor 105, a display unit 106, a user input unit 107, an interface unit 108, a memory 109, a processor 110, and other components.

Those skilled in the art will understand that the electronic device 100 may further include a power supply (such as a battery) supplying power to the components; the power supply may be logically connected to the processor 110 through a power management system, so that functions such as charge management, discharge management and power consumption management are implemented through the power management system. The structure of the electronic device shown in Figure 9 does not constitute a limitation on the electronic device; the electronic device may include more or fewer components than shown, combine certain components, or use a different arrangement of components, which is not repeated here.
The processor 110 is configured to: acquire a first image set; input the acquired first image set into a multi-classification model for classification, and output M classification results corresponding to the first image set; determine a target video template from at least one video template corresponding to the M classification results; and generate a target video based on the first image set and the target video template; where M is an integer greater than 1.

Optionally, in this embodiment of the application, the processor 110 is specifically configured to: after inputting the acquired first image set into the multi-classification model, convert the N frames of images in the first image set into first image feature information of X image blocks based on the multi-classification model; determine first key image feature information from the first image feature information of the X image blocks; extract high-level semantic information corresponding to the first key image feature information; and obtain, based on the high-level semantic information, the M classification results corresponding to the first image set; where N and X are integers greater than 1.

Optionally, in this embodiment of the application, the processor 110 is specifically configured to: split the N frames of images in the first image set into X image blocks based on the image feature information conversion module in the multi-classification model; and extract feature information from the X image blocks through a convolutional neural network to obtain the first image feature information of the X image blocks.

Optionally, in this embodiment of the application, the processor 110 is specifically configured to: select second key image feature information from the first image feature information of the X image blocks based on the image feature information selection module in the multi-classification model, and transform the arrangement of the first image feature information of the X image blocks to obtain second image feature information; and fuse the second key image feature information with the second image feature information to obtain the first key image feature information.

Optionally, in this embodiment of the application, the processor 110 is specifically configured to: perform a normalization operation on the first key image feature information based on the basic feature module in the multi-classification model to obtain third key image feature information; extract basic image feature information from the third key image feature information; fuse the first key image feature information with the basic image feature information to obtain target key image feature information; and extract the high-level semantic information corresponding to the target key image feature information.

Optionally, in this embodiment of the application, the processor 110 is specifically configured to: extract N video frames from a first video to obtain the first image set; and fuse the first video with the target video template to generate the target video.

In the electronic device provided by the embodiments of this application, when producing a video, the electronic device may first acquire a first image set, then input the first image set into a multi-classification model for classification so as to output M classification results corresponding to the first image set, then determine a target video template from at least one video template corresponding to the M classification results, and finally generate a target video based on the first image set and the target video template, where M is an integer greater than 1. Since this application classifies the entire first image set as a whole, the multi-classification model needs only a single forward pass to obtain the M classification results for the whole first image set; this improves the classification capability of the multi-classification model and thereby the overall efficiency of video generation.
It should be understood that, in this embodiment of the application, the input unit 104 may include a graphics processing unit (GPU) 1041 and a microphone 1042; the graphics processor 1041 processes image data of still pictures or video obtained by an image capture device (such as a camera) in video capture mode or image capture mode. The display unit 106 may include a display panel 1061, which may be configured in the form of a liquid crystal display, an organic light-emitting diode, or the like. The user input unit 107 includes a touch panel 1071 and at least one of other input devices 1072. The touch panel 1071, also called a touch screen, may include two parts: a touch detection device and a touch controller. The other input devices 1072 may include, but are not limited to, a physical keyboard, function keys (such as volume control keys and switch keys), a trackball, a mouse and a joystick, which are not described here again.

The memory 109 may be used to store software programs and various data. The memory 109 may mainly include a first storage area storing programs or instructions and a second storage area storing data, where the first storage area may store an operating system and the applications or instructions required for at least one function (such as a sound playback function or an image playback function). In addition, the memory 109 may include volatile memory or non-volatile memory, or both. The non-volatile memory may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM) or a flash memory. The volatile memory may be a random access memory (RAM), a static random access memory (SRAM), a dynamic random access memory (DRAM), a synchronous dynamic random access memory (SDRAM), a double data rate synchronous dynamic random access memory (DDR SDRAM), an enhanced synchronous dynamic random access memory (ESDRAM), a synch-link dynamic random access memory (SLDRAM) or a direct rambus random access memory (DRRAM). The memory 109 in the embodiments of this application includes, but is not limited to, these and any other suitable types of memory.

The processor 110 may include one or more processing units. Optionally, the processor 110 integrates an application processor and a modem processor, where the application processor mainly handles operations related to the operating system, the user interface, applications and the like, and the modem processor, such as a baseband processor, mainly handles wireless communication signals. It can be understood that the modem processor may alternatively not be integrated into the processor 110.
An embodiment of this application further provides a readable storage medium storing a program or instructions; when executed by a processor, the program or instructions implement the processes of the above video generation method embodiments and can achieve the same technical effect, which is not repeated here to avoid duplication.

The processor is the processor in the electronic device described in the above embodiments. The readable storage medium includes a computer-readable storage medium, such as a computer read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.

An embodiment of this application further provides a chip, including a processor and a communication interface, the communication interface being coupled to the processor, the processor being configured to run a program or instructions to implement the processes of the above video generation method embodiments and achieve the same technical effect, which is not repeated here to avoid duplication.

It should be understood that the chip mentioned in the embodiments of this application may also be called a system-on-chip, a system chip, a chip system or a system-on-a-chip.

An embodiment of this application provides a computer program product stored in a storage medium; the program product is executed by at least one processor to implement the processes of the above video generation method embodiments and can achieve the same technical effect, which is not repeated here to avoid duplication.
It should be noted that, as used herein, the terms "comprise", "include" or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or apparatus that includes a list of elements includes not only those elements but also other elements not expressly listed or inherent to such a process, method, article or apparatus. Without further limitation, an element defined by the statement "comprises a ..." does not exclude the presence of additional identical elements in the process, method, article or apparatus that includes that element. In addition, it should be pointed out that the scope of the methods and apparatuses in the embodiments of this application is not limited to performing functions in the order shown or discussed; it may also include performing functions in a substantially simultaneous manner or in the reverse order depending on the functions involved. For example, the described methods may be performed in an order different from the one described, and steps may be added, omitted or combined. Moreover, features described with reference to certain examples may be combined in other examples.

Through the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments may be implemented by means of software plus the necessary general-purpose hardware platform, and of course also by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part contributing to the prior art, may be embodied in the form of a computer software product stored in a storage medium (such as a ROM/RAM, a magnetic disk or an optical disc), including several instructions for causing a terminal (which may be a mobile phone, a computer, a server, a network device or the like) to perform the methods described in the embodiments of this application.

The embodiments of this application have been described above with reference to the accompanying drawings, but this application is not limited to the specific implementations described above; the specific implementations described above are merely illustrative rather than restrictive. Inspired by this application, a person of ordinary skill in the art may devise many other forms without departing from the purpose of this application and the scope protected by the claims, all of which fall within the protection of this application.

Claims (17)

1. A video generation method, the method comprising:
    acquiring a first image set;
    inputting the first image set into a multi-classification model for classification, and outputting M classification results corresponding to the first image set;
    determining a target video template from at least one video template corresponding to the M classification results; and
    generating a target video based on the first image set and the target video template;
    wherein M is an integer greater than 1.

2. The method according to claim 1, wherein the inputting the first image set into a multi-classification model for classification and outputting M classification results corresponding to the first image set comprises:
    after inputting the first image set into the multi-classification model, converting N frames of images in the first image set into first image feature information of X image blocks based on the multi-classification model;
    determining first key image feature information from the first image feature information of the X image blocks;
    extracting high-level semantic information corresponding to the first key image feature information; and
    obtaining, based on the high-level semantic information, the M classification results corresponding to the first image set;
    wherein N and X are integers greater than 1.

3. The method according to claim 2, wherein the converting N frames of images in the first image set into first image feature information of X image blocks based on the multi-classification model comprises:
    splitting the N frames of images in the first image set into X image blocks based on an image feature information conversion module in the multi-classification model; and
    extracting feature information from the X image blocks through a convolutional neural network to obtain the first image feature information of the X image blocks.

4. The method according to claim 2, wherein the determining first key image feature information from the first image feature information of the X image blocks comprises:
    selecting second key image feature information from the first image feature information of the X image blocks based on an image feature information selection module in the multi-classification model, and transforming the arrangement of the first image feature information of the X image blocks to obtain second image feature information; and
    fusing the second key image feature information with the second image feature information to obtain the first key image feature information.

5. The method according to claim 2, wherein the extracting high-level semantic information corresponding to the first key image feature information comprises:
    performing a normalization operation on the first key image feature information based on a basic feature module in the multi-classification model to obtain third key image feature information;
    extracting basic image feature information from the third key image feature information;
    fusing the first key image feature information with the basic image feature information to obtain target key image feature information; and
    extracting high-level semantic information corresponding to the target key image feature information.

6. The method according to claim 1, wherein the acquiring a first image set comprises:
    extracting N video frames from a first video to obtain the first image set;
    and the generating a target video based on the first image set and the target video template comprises:
    fusing the first video with the target video template to generate the target video.
7. A video generation apparatus, the apparatus comprising: an acquisition unit, a classification unit, a determination unit and a generation unit, wherein:
    the acquisition unit is configured to acquire a first image set;
    the classification unit is configured to input the first image set acquired by the acquisition unit into a multi-classification model for classification, and output M classification results corresponding to the first image set;
    the determination unit is configured to determine a target video template from at least one video template corresponding to the M classification results obtained by the classification unit; and
    the generation unit is configured to generate a target video based on the first image set acquired by the acquisition unit and the target video template determined by the determination unit;
    wherein M is an integer greater than 1.

8. The apparatus according to claim 7, wherein the classification unit is specifically configured to:
    after the first image set acquired by the acquisition unit is input into the multi-classification model, convert N frames of images in the first image set into first image feature information of X image blocks based on the multi-classification model;
    determine first key image feature information from the first image feature information of the X image blocks;
    extract high-level semantic information corresponding to the first key image feature information; and
    obtain, based on the high-level semantic information, the M classification results corresponding to the first image set;
    wherein N and X are integers greater than 1.

9. The apparatus according to claim 8, wherein the classification unit is specifically configured to:
    split the N frames of images in the first image set into X image blocks based on an image feature information conversion module in the multi-classification model; and
    extract feature information from the X image blocks through a convolutional neural network to obtain the first image feature information of the X image blocks.

10. The apparatus according to claim 8, wherein the classification unit is specifically configured to:
    select second key image feature information from the first image feature information of the X image blocks based on an image feature information selection module in the multi-classification model, and transform the arrangement of the first image feature information of the X image blocks to obtain second image feature information; and
    fuse the second key image feature information with the second image feature information to obtain the first key image feature information.

11. The apparatus according to claim 8, wherein the classification unit is specifically configured to:
    perform a normalization operation on the first key image feature information based on a basic feature module in the multi-classification model to obtain third key image feature information;
    extract basic image feature information from the third key image feature information;
    fuse the first key image feature information with the basic image feature information to obtain target key image feature information; and
    extract high-level semantic information corresponding to the target key image feature information.

12. The apparatus according to claim 7, wherein
    the acquisition unit is specifically configured to extract N video frames from a first video to obtain the first image set; and
    the generation unit is specifically configured to fuse the first video with the target video template to generate a target video.

13. An electronic device, comprising a processor and a memory, the memory storing a program or instructions executable on the processor, wherein the program or instructions, when executed by the processor, implement the steps of the video generation method according to any one of claims 1 to 6.

14. A readable storage medium, storing a program or instructions which, when executed by a processor, implement the steps of the video generation method according to any one of claims 1 to 6.

15. A computer program product, executed by at least one processor to implement the video generation method according to any one of claims 1 to 6.

16. An electronic device, configured to perform the video generation method according to any one of claims 1 to 6.

17. A chip, comprising a processor and a communication interface, the communication interface being coupled to the processor, the processor being configured to run a program or instructions to implement the video generation method according to any one of claims 1 to 6.
PCT/CN2023/105161 2022-07-14 2023-06-30 Video generation method and apparatus, electronic device and medium WO2024012289A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210834501.4 2022-07-14
CN202210834501.4A CN115222838A (en) 2022-07-14 2022-07-14 Video generation method, device, electronic equipment and medium

Publications (1)

Publication Number Publication Date
WO2024012289A1 true WO2024012289A1 (en) 2024-01-18

Family

ID=83611607

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/105161 WO2024012289A1 (en) 2022-07-14 2023-06-30 Video generation method and apparatus, electronic device and medium

Country Status (2)

Country Link
CN (1) CN115222838A (en)
WO (1) WO2024012289A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115222838A (en) * 2022-07-14 2022-10-21 维沃移动通信有限公司 Video generation method, device, electronic equipment and medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108710902A (en) * 2018-05-08 2018-10-26 江苏云立物联科技有限公司 A kind of sorting technique towards high-resolution remote sensing image based on artificial intelligence
CN110837579A (en) * 2019-11-05 2020-02-25 腾讯科技(深圳)有限公司 Video classification method, device, computer and readable storage medium
CN111757149A (en) * 2020-07-17 2020-10-09 商汤集团有限公司 Video editing method, device, equipment and storage medium
US20210081671A1 (en) * 2019-09-12 2021-03-18 Beijing Xiaomi Mobile Software Co., Ltd. Video processing method and device, and storage medium
CN113094552A (en) * 2021-03-19 2021-07-09 北京达佳互联信息技术有限公司 Video template searching method and device, server and readable storage medium
CN115222838A (en) * 2022-07-14 2022-10-21 维沃移动通信有限公司 Video generation method, device, electronic equipment and medium


Also Published As

Publication number Publication date
CN115222838A (en) 2022-10-21


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23838781

Country of ref document: EP

Kind code of ref document: A1