CN114584803A

CN114584803A - Video generation method and computer equipment

Info

Publication number: CN114584803A
Application number: CN202011387442.8A
Authority: CN
Inventors: 药欣; 马瑞; 曹芝勇; 周树荣; 毛明海
Original assignee: Shenzhen TCL Digital Technology Co Ltd
Current assignee: Shenzhen TCL Digital Technology Co Ltd
Priority date: 2020-12-01
Filing date: 2020-12-01
Publication date: 2022-06-03
Anticipated expiration: 2040-12-01
Also published as: CN114584803B

Abstract

The present invention provides a video generation method and computer equipment. The video generation method includes: acquiring an image set to be processed and audio to be processed, wherein the image set includes several images; determining, based on the image set and the audio, the target beat point corresponding to each of the several images; and inserting, at each target beat point in the audio, the image corresponding to that target beat point to generate a video. In the present invention, each image is inserted at a target beat point of the audio, which can improve the expressiveness of the video and yield a video of better quality.

Description

A method for generating video and computer equipment

Technical Field

The present application relates to the technical field of video processing, and in particular to a video generation method and computer equipment.

Background

Video production is a complex process. Producing a high-quality video requires organizing the images, selecting suitable audio, and then determining the insertion point of each image. This process places high demands on the technical ability of the producer.

Ordinary people without video production skills usually select images manually, add background audio, and play the result as a slideshow. A video obtained in this way has poor expressiveness and low quality.

Therefore, the existing technology needs to be improved.

Summary of the Invention

The present invention provides a video generation method and computer equipment, so that similar images are played consecutively and each image is inserted at a beat point of the audio, which can improve the expressiveness of the video and yield a video of better quality.

In a first aspect, an embodiment of the present invention provides a video generation method, including:

acquiring an image set to be processed and audio to be processed, wherein the image set includes several images;

determining, based on the image set and the audio, the target beat point corresponding to each of the several images;

inserting, at each target beat point in the audio, the image corresponding to that target beat point to generate a video.

In a further improvement, determining, based on the image set and the audio, the target beat point corresponding to each of the several images specifically includes:

sorting the several images based on the similarity between any two images in the image set to obtain an image insertion sequence;

obtaining several beat points of the audio;

determining the target beat point corresponding to each image in the image insertion sequence according to the image insertion sequence, the several beat points and the audio, wherein a target beat point is one of the several beat points of the audio that is used for inserting an image.

In a further improvement, sorting the several images based on the similarity between any two images in the image set to obtain an image insertion sequence specifically includes:

selecting a starting image in the image set, and setting the insertion sequence number of the starting image to the first sequence number;

determining a non-starting image set corresponding to the starting image, wherein the non-starting image set includes several non-starting images;

determining a candidate image corresponding to the starting image based on the similarity between each non-starting image and the starting image, and setting the insertion sequence number of the candidate image to the sequence number following that of the starting image;

taking the candidate image as the starting image, and continuing to perform the step of determining the non-starting image set corresponding to the starting image, until the insertion sequence numbers of all images in the image set are determined;

determining the image insertion sequence corresponding to the image set according to the insertion sequence numbers of all images in the image set.

In a further improvement, determining the non-starting image set corresponding to the starting image specifically includes:

for the starting image, selecting from the image set all images whose insertion sequence numbers have not been determined, to obtain the non-starting image set corresponding to the starting image.

In a further improvement, determining the candidate image corresponding to the starting image based on the similarity between each non-starting image in the non-starting image set and the starting image specifically includes:

calculating the similarity between each non-starting image in the non-starting image set and the starting image to obtain a similarity set;

selecting the maximum similarity from the similarity set, and taking the image corresponding to the maximum similarity as the candidate image corresponding to the starting image.
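The ordering steps above amount to a greedy nearest-neighbour chain: each next image is the not-yet-placed image most similar to the current starting image. The following is a minimal sketch of that idea, not the patented implementation. It assumes each image is already represented by a feature vector and that similarity is cosine similarity; the text here does not fix the metric, and the names `cosine_similarity` and `build_insertion_sequence` are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two feature vectors (an assumed similarity metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_insertion_sequence(features: Dict[str, Sequence[float]], start: str) -> List[str]:
    """Order images greedily, starting from `start`."""
    sequence = [start]
    # Non-starting image set: every image without an insertion sequence number yet.
    remaining = [name for name in features if name != start]
    current = start
    while remaining:
        # Candidate image: the remaining image with maximum similarity to the current one.
        candidate = max(
            remaining,
            key=lambda name: cosine_similarity(features[name], features[current]),
        )
        sequence.append(candidate)   # it receives the next insertion sequence number
        remaining.remove(candidate)
        current = candidate          # the candidate becomes the new starting image
    return sequence


if __name__ == "__main__":
    demo = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.5, 0.5]}
    print(build_insertion_sequence(demo, "a"))
```

Because each step only looks at the current image, the chain is cheap to compute but is not guaranteed to be the globally smoothest ordering.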

In a further improvement, determining the target beat point corresponding to each image in the image insertion sequence according to the image insertion sequence, the several beat points and the audio specifically includes:

obtaining the audio duration of the audio and the number of images in the several images;

determining the image insertion point corresponding to each image in the image insertion sequence according to the number of images and the audio duration, wherein the duration between two adjacent image insertion points is determined from the audio duration and the number of images;

determining the target beat points corresponding to the several images according to the several image insertion points and the several beat points.

In a further improvement, determining the target beat points corresponding to the several images according to the several image insertion points and the several beat points specifically includes:

for each image insertion point, determining, among the several beat points, the beat point closest to that image insertion point, and taking that closest beat point as the target beat point of the image corresponding to the image insertion point.
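As an illustration of the two steps above, the sketch below spaces the image insertion points evenly and then snaps each one to the closest beat. It rests on two assumptions that the text does not state: the spacing is the audio duration divided by the number of images, and the first insertion point is at time 0. All function names are hypothetical.

```python
from typing import List, Sequence


def image_insertion_points(audio_duration: float, image_count: int) -> List[float]:
    """Evenly spaced insertion points; spacing = duration / count (assumed rule)."""
    interval = audio_duration / image_count
    return [i * interval for i in range(image_count)]


def nearest_beat(point: float, beats: Sequence[float]) -> float:
    """The beat point whose time is closest to the given insertion point."""
    return min(beats, key=lambda beat: abs(beat - point))


def target_beat_points(audio_duration: float, image_count: int,
                       beats: Sequence[float]) -> List[float]:
    """One target beat per image, in insertion-sequence order."""
    points = image_insertion_points(audio_duration, image_count)
    return [nearest_beat(point, beats) for point in points]


if __name__ == "__main__":
    beats = [0.4, 1.1, 2.8, 3.5, 5.9, 7.2, 9.3, 11.0]
    print(target_beat_points(12.0, 4, beats))
```

With sparse beats, two insertion points can snap to the same beat; this sketch does not resolve such collisions.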

In a further improvement, inserting, at each target beat point in the audio, the image corresponding to that target beat point to generate a video specifically includes:

for each target beat point, when the playback time of the audio reaches that target beat point, inserting the image corresponding to that target beat point, and using the image as the image frame played between that target beat point and the next target beat point.
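The playback rule above can be pictured as a timeline of (image, start, end) segments, where each image is shown from its own target beat point until the next one. The sketch below builds such a timeline. Holding the last image until the end of the audio is an assumption, and `build_timeline` is a hypothetical helper; rendering the segments into an actual video file is out of scope here.

```python
from typing import List, Sequence, Tuple


def build_timeline(images: Sequence[str], target_beats: Sequence[float],
                   audio_duration: float) -> List[Tuple[str, float, float]]:
    """Return (image, start, end) segments, one per target beat point."""
    # Each segment ends where the next one starts; the last ends with the audio (assumed).
    ends = list(target_beats[1:]) + [audio_duration]
    return [(image, start, end) for image, start, end in zip(images, target_beats, ends)]


if __name__ == "__main__":
    for segment in build_timeline(["a.jpg", "b.jpg", "c.jpg"], [0.4, 2.8, 5.9], 8.0):
        print(segment)
```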

In a further improvement, before acquiring the image set to be processed and determining the image insertion sequence based on the similarity between any two images in the image set, the method further includes:

acquiring an original image set, wherein the original image set includes multiple original images, and the multiple original images include at least one template image;

determining the target feature map corresponding to each original image in the original image set;

dividing the original image set into original image subsets of different categories based on all the determined target feature maps;

taking any one original image subset that includes the template image as the image set to be processed.

In a second aspect, the present invention provides a video generation apparatus, including:

an acquisition unit, configured to acquire an image set to be processed and audio to be processed, wherein the image set includes several images;

a target beat point determination unit, configured to determine, based on the image set and the audio, the target beat point corresponding to each of the several images;

a video generation unit, configured to insert, at each target beat point in the audio, the image corresponding to that target beat point to generate a video.

In a third aspect, an embodiment of the present invention provides a computer device, including a memory and a processor, wherein the memory stores a computer program, and the processor implements the following steps when executing the computer program:

In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the following steps:

Compared with the prior art, the embodiments of the present invention have the following advantages:

The present invention provides a video generation method, including: acquiring an image set to be processed and audio to be processed, wherein the image set includes several images; determining, based on the image set and the audio, the target beat point corresponding to each of the several images; and inserting, at each target beat point in the audio, the image corresponding to that target beat point to generate a video. In the present invention, each image is inserted at a beat point of the audio, which can improve the expressiveness of the video and yield a video of better quality.

Description of Drawings

In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some of the embodiments recorded in the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic flowchart of a video generation method in an embodiment of the present invention;

FIG. 2 is a schematic diagram of a feature extraction network in an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of a video generation apparatus in an embodiment of the present invention;

FIG. 4 is an internal structural diagram of a computer device in an embodiment of the present invention.

Detailed Description

In order to make the objectives, technical solutions and effects of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the present invention, not to limit it.

Those skilled in the art will understand that, unless expressly stated otherwise, the singular forms "a", "an", "said" and "the" used herein may also include the plural forms. It should be further understood that the word "comprising" used in the description of the present invention refers to the presence of the stated features, integers, steps, operations, elements and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof. It should be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may be present. Furthermore, "connected" or "coupled" as used herein may include wireless connection or wireless coupling. As used herein, the term "and/or" includes all or any unit and all combinations of one or more of the associated listed items.

Those skilled in the art will understand that, unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It should also be understood that terms such as those defined in general dictionaries should be understood to have meanings consistent with their meanings in the context of the prior art and, unless specifically defined as herein, will not be interpreted in an idealized or overly formal sense.

Through research, the inventors found that video production is a complex process. Producing a high-quality video requires organizing the images, selecting suitable audio, and then determining the insertion point of each image. This process places high demands on the technical ability of the producer. Ordinary people without video production skills usually select images manually, add background audio, and play the result as a slideshow. A video obtained in this way has poor expressiveness and low quality.

To solve the above problem, in the embodiments of the present invention, an image set to be processed and audio to be processed are acquired, wherein the image set includes several images; the target beat point corresponding to each of the several images is determined based on the image set and the audio; and at each target beat point in the audio, the image corresponding to that target beat point is inserted to generate a video. In the present invention, each image is inserted at a beat point of the audio, which can improve the expressiveness of the video and yield a video of better quality.

The video generation method provided by the embodiments of the present invention can be applied to an electronic device, which may be a PC, a television, a server, a mobile phone, a tablet computer, a handheld computer, a personal digital assistant (PDA), or the like.

The content of the invention is further explained below through the description of embodiments, with reference to the accompanying drawings.

Referring to FIG. 1, this embodiment provides a video generation method, including:

S1. Acquire an image set to be processed and audio to be processed, wherein the image set includes several images.

In this embodiment, the image set to be processed includes several images used for generating a video. The image set to be processed may be collected by the terminal executing the video generation method, or obtained from a third-party device, or partly obtained from a third-party device and partly collected by the terminal.

In this embodiment, the audio is used for generating the video. To obtain a better video effect, the audio may be the audio of a piece of music, so that the audio serves as the background music of the video.

In this embodiment, so that the images in the image set to be processed are similar in style and the resulting video looks more harmonious, the images in the image set may be restricted to a single category. Specifically, images belonging to the same category may be selected from an original image set, and the selected images are used as the image set to be processed.

Specifically, before step S1, the method includes:

M1. Acquire an original image set, wherein the original image set includes multiple original images, and the multiple original images include at least one template image.

In this embodiment, the original image set includes multiple original images. Likewise, the original images may be obtained from a network or collected by the terminal. The multiple original images include at least one template image. When images for generating the video are selected from the multiple original images, template images have higher priority than non-template images, where a non-template image is any of the multiple original images other than a template image; that is, given one template image and one non-template image, the template image can be selected as an image for generating the video.

The template image may be chosen by the user, that is, the user designates a template image among the multiple original images. The template image may be the user's favorite original image, indicating that the user most wants it to appear in the video. There may be more than one template image.

M2. Determine the target feature map corresponding to each original image in the original image set.

In this embodiment, the target feature map corresponding to each original image can be determined by a neural network model. The target feature map is a multi-channel image with an image size of 1*1 (a single pixel). It can be represented as a vector whose dimension equals the number of channels of the target feature map; the value of each component of the vector is the pixel value of that pixel in the corresponding channel.

An original image is input into the neural network model, and the output of the model is the target feature map corresponding to that original image. The neural network model includes a feature extraction module and a fully connected module: the original image is passed through the feature extraction module to obtain a feature map of the original image, the feature map is input into the fully connected module, and the output of the fully connected module is the target feature map.

In this embodiment, the image size of an original image input into the neural network model must meet the input requirements of the model, so all original images need to be resized to a preset size in advance, for example 224*224.

In a specific implementation, the neural network model may be a VGG16 network model. The feature extraction network used to generate the feature map set is shown in FIG. 2. The neural network model includes five feature extraction modules and one fully connected module; the five feature extraction modules are a first, a second, a third, a fourth and a fifth feature extraction module.

The first feature extraction module includes a first convolution layer c1, a second convolution layer c2 and a first pooling layer p1. The convolution kernels of c1 and c2 are 3*3*3 in size, c1 and c2 each have 64 convolution kernels, and the parameter of p1 is 2*2. The input of the first feature extraction module is an original image; the module extracts features of the original image to obtain a first feature map.

The second feature extraction module includes a third convolution layer c3, a fourth convolution layer c4 and a second pooling layer p2. The convolution kernels of c3 and c4 are 3*3*3 in size, c3 and c4 each have 128 convolution kernels, and the parameter of p2 is 2*2. The input of the second feature extraction module is the first feature map; the module extracts features of the first feature map to obtain a second feature map.

The third feature extraction module includes a fifth convolution layer c5, a sixth convolution layer c6, a seventh convolution layer c7 and a third pooling layer p3. The convolution kernels of c5, c6 and c7 are 3*3*3 in size, c5, c6 and c7 each have 256 convolution kernels, and the parameter of p3 is 2*2. The input of the third feature extraction module is the second feature map; the module extracts features of the second feature map to obtain a third feature map.

The fourth feature extraction module includes an eighth convolution layer c8, a ninth convolution layer c9, a tenth convolution layer c10 and a fourth pooling layer p4. The convolution kernels of c8, c9 and c10 are 3*3*3 in size, c8, c9 and c10 each have 512 convolution kernels, and the parameter of p4 is 2*2. The input of the fourth feature extraction module is the third feature map; the module extracts features of the third feature map to obtain a fourth feature map.

The fifth feature extraction module includes an eleventh convolution layer c11, a twelfth convolution layer c12, a thirteenth convolution layer c13 and a fifth pooling layer p5. The convolution kernels of c11, c12 and c13 are 3*3*3 in size, c11, c12 and c13 each have 512 convolution kernels, and the parameter of p5 is 2*2. The input of the fifth feature extraction module is the fourth feature map; the module extracts features of the fourth feature map to obtain a fifth feature map.

The fully connected module includes a first fully connected layer fc1, a second fully connected layer fc2 and a third fully connected layer fc3. The parameter of fc1 is 1*1*4096, the parameter of fc2 is 1*1*4096, and the parameter of fc3 is 1*1*1000. The fifth feature map is input into the fully connected module to obtain the target feature map corresponding to the original image. The target feature map can be expressed as a vector, for example {x1, x2, …, xn}.
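To make the layer description easier to follow, the sketch below traces the feature-map size through the five feature extraction modules for a 224*224 input. It assumes the conventions of the standard VGG16 network, which the text does not spell out: convolutions preserve the spatial size, and each 2*2 pooling layer halves it. `BLOCKS`, `FC_UNITS` and `trace_shapes` are hypothetical names used only for this illustration.

```python
from typing import List, Tuple

# (number of convolution layers, number of kernels) for modules one to five.
BLOCKS: List[Tuple[int, int]] = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]
# Output sizes of fc1, fc2 and fc3.
FC_UNITS: List[int] = [4096, 4096, 1000]


def trace_shapes(input_size: int = 224) -> List[Tuple[int, int, int]]:
    """Return (height, width, channels) of the feature map after each module."""
    shapes = []
    size = input_size
    for _conv_count, channels in BLOCKS:
        size //= 2  # assumed: 2*2 pooling with stride 2 halves height and width
        shapes.append((size, size, channels))
    return shapes


if __name__ == "__main__":
    for index, shape in enumerate(trace_shapes(), start=1):
        print(f"feature map {index}: {shape}")
    print("target feature map dimension:", FC_UNITS[-1])
```

Under these assumptions the fifth feature map is 7*7*512, and the output of fc3 is a 1000-dimensional vector, consistent with the 1*1 multi-channel target feature map described earlier.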

M3. Divide the original image set into original image subsets of different categories based on all the determined target feature maps.

In this embodiment, a classification method is used to divide the original image set into multiple original image subsets, each corresponding to a different category. For example, the original image set can be divided into original image subsets of different categories by the K-means algorithm. The original images in one subset are similar in style.

Specifically, k target feature maps are randomly selected from all target feature maps as initial centroids; for example, with k set to 3, the initial centroids are u1, u2 and u3. The target feature maps other than the initial centroids are denoted as feature maps to be classified. For each feature map to be classified, the distance between it and each of the initial centroids u1, u2 and u3 is calculated, and the feature map is placed in the same class as the initial centroid at the minimum distance, yielding several classification sets.

For example, suppose there are 10 target feature maps t1, t2, …, t10, and three of them, t1, t2 and t3, are randomly selected; t1 is taken as the initial centroid u1, t2 as the initial centroid u2, and t3 as the initial centroid u3. The feature maps to be classified are t4, t5, …, t10. For each feature map to be classified, the distance between it and each initial centroid is calculated. For t4, the distance between t4 and u1 is d41, the distance between t4 and u2 is d42, and the distance between t4 and u3 is d43; if d43 is the smallest, the original image corresponding to t4 and the original image corresponding to u3 are placed in the same class. Performing this calculation for all feature maps to be classified yields 3 classification sets.

In this embodiment, for each of the several classification sets, the classification centroid of that set is then determined. For each of the feature maps to be classified, the distance between it and each classification centroid is calculated again, and the feature map is placed in the same class as the classification centroid at the minimum distance, yielding several updated classification sets.

The classification centroid corresponding to a classification set can be determined by formula (1):

$$u_j = \frac{1}{|C_j|} \sum_{t \in C_j} t \tag{1}$$

where Cj is the classification set, t is a target feature map belonging to the classification set Cj, |Cj| is the number of target feature maps in Cj, and uj is the classification centroid.

In this embodiment, the steps of determining the classification centroid of each classification set and reassigning the feature maps to be classified are repeated until the classification centroids are the same as those calculated in the previous iteration; the several classification sets are then taken as the original image subsets of different categories.
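The iteration described in step M3 can be sketched as a plain K-means loop: assign, recompute centroids, and stop when the centroids no longer change. This is an illustrative sketch rather than the patented implementation, and it makes several assumptions. The distance is Euclidean, each centroid is the mean of its members (the usual K-means choice), the initial centroids are passed in explicitly instead of being drawn at random so that the result is reproducible, and every vector is reassigned on each pass, including those used as initial centroids.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def centroid(cluster: List[Vector]) -> List[float]:
    """Component-wise mean of the vectors in one classification set."""
    count = len(cluster)
    return [sum(column) / count for column in zip(*cluster)]


def k_means(vectors: List[Vector], initial: List[Vector], max_iter: int = 100) -> List[int]:
    """Return, for each vector, the index of the classification set it ends up in."""
    centroids = [list(c) for c in initial]
    labels: List[int] = []
    for _ in range(max_iter):
        # Assignment step: each vector joins the centroid at minimum distance.
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(v, centroids[j]))
            for v in vectors
        ]
        # Update step: recompute each centroid; an empty set keeps its old centroid.
        updated = []
        for j, old in enumerate(centroids):
            members = [v for v, label in zip(vectors, labels) if label == j]
            updated.append(centroid(members) if members else old)
        if updated == centroids:  # same centroids as the previous pass: converged
            break
        centroids = updated
    return labels


if __name__ == "__main__":
    data = [[0, 0], [0, 1], [10, 10], [10, 11], [20, 0], [21, 0]]
    print(k_means(data, [data[0], data[2], data[4]]))
```

Grouping the original images by the returned labels gives the original image subsets of different categories.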

M4. Take any original image subset that includes the template image as the image set to be processed.

In the embodiment of the present invention, the original image subsets that include a template image are identified among the original image subsets. If more than one subset includes a template image, one of them is chosen at random as the image set to be processed. Alternatively, each template image has its own preference level, which may be set by the user, and the original image subset containing the template image with the highest preference level is taken as the image set to be processed. As a further alternative, the image set to be processed is chosen among the subsets that include a template image by combining the preference level with the number of original images in each subset.
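The first two selection rules can be illustrated with a short sketch. It assumes images are identified by hashable ids and that preference levels are plain numbers. The function name `choose_image_set` and its parameters are hypothetical, and the third rule (combining preference with subset size) is left out because the text does not say how the two are weighted.

```python
import random


def choose_image_set(subsets, template_ids, preference=None, seed=None):
    """Pick the image set to process from subsets containing a template image.

    subsets: list of sets of image ids; template_ids: set of template image ids;
    preference: optional dict mapping template id -> preference level.
    """
    candidates = [s for s in subsets if s & template_ids]
    if not candidates:
        return None
    if preference:
        # Take the subset holding the template image with the highest preference.
        return max(
            candidates,
            key=lambda s: max(preference.get(i, 0) for i in s & template_ids),
        )
    # Otherwise pick one qualifying subset at random.
    return random.Random(seed).choice(candidates)
```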

S2. Based on the image set and the audio, determine the target beat point corresponding to each of the several images.

In the embodiment of the present invention, analyzing the signal corresponding to the audio yields the beat points of the audio. A beat point is the moment at which an instrument plays a particular note, such as a drum hit. The audio contains a large number of beat points, and any two distinct beat points correspond to different playback times. A target beat point is itself a beat point of the audio: it is one of the audio's beat points that is used for inserting an image.

In the embodiment of the present invention, the playback order of the images is determined first, and the target beat points are then determined following that order. The playback order can be based on the similarity between any two images, so that similar images are played consecutively and adjacent images in the video do not differ too abruptly in style.

Specifically, step S2 includes:

S21. Sort the several images based on the similarity between any two images in the image set to obtain an image insertion sequence.

In the embodiment of the present invention, the image insertion sequence includes the several images and the insertion sequence number corresponding to each image, and the images in the sequence are arranged by insertion sequence number. When the video is generated, the images are inserted into the audio in ascending order of insertion sequence number.

In the embodiment of the present invention, for two adjacent images in the image insertion sequence, namely a first image and a second image with the first image arranged before the second, the similarity between the first image and the second image is greater than the similarity between the first image and any image arranged after the second image.

Specifically, step S21 includes:

S211. Select a starting image from the image set and set its insertion sequence number to the first sequence number.

In the embodiment of the present invention, the starting image may be selected at random. Its insertion sequence number is set to the first sequence number, which may be represented by a number, for example the number 1.

S212. Determine the non-starting image set corresponding to the starting image, where the non-starting image set includes several non-starting images.

In the embodiment of the present invention, for the starting image, all images in the image set whose insertion sequence numbers have not yet been determined are selected as non-starting images, and together they form the non-starting image set corresponding to the starting image. In other words, the non-starting image set consists of the images for which no insertion sequence number has been set.

For example, the image set includes images r1, r2, r3, ..., r8. The preceding step has set the insertion sequence number of r1 to the first sequence number, while r2, r3, ..., r8 have no insertion sequence number yet. Therefore r2, r3, ..., r8 are non-starting images, and the non-starting image set is {r2, r3, ..., r8}.

S213. Determine the candidate image corresponding to the starting image based on the similarity between each non-starting image and the starting image, and set the insertion sequence number of the candidate image to the number following that of the starting image.

In the embodiment of the present invention, "setting the insertion sequence number of the candidate image to the number following that of the starting image" means that the candidate image occupies the position immediately after the starting image in the image insertion sequence. For each non-starting image, the similarity between that image and the starting image is calculated.

Specifically, step S213 includes:

S2131. Calculate the similarity between each non-starting image in the non-starting image set and the starting image to obtain a similarity set.

In the embodiment of the present invention, for each non-starting image, the target feature map of that image and the target feature map of the starting image are obtained, and the similarity between the two images is calculated from them. The similarities of all non-starting images form the similarity set.

Specifically, the similarity between a non-starting image and the starting image can be calculated by formula (2).

where the feature of the starting image rx is tx = {x1, x2, ..., xn}, the feature of the non-starting image ry is ty = {y1, y2, ..., yn}, and SIM(x, y) is the similarity between the starting image rx and the non-starting image ry.
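Formula (2) is not reproduced in this text, so the exact similarity measure is unknown. Purely as a stand-in, the sketch below uses cosine similarity over the two feature vectors tx and ty, chosen because a larger value then means a closer match. Both the choice of measure and the function name `cosine_similarity` are assumptions.

```python
import math


def cosine_similarity(tx, ty):
    """Cosine similarity of two equal-length feature vectors.

    Stand-in for SIM(x, y); the source's actual formula (2) may differ.
    """
    dot = sum(x * y for x, y in zip(tx, ty))
    norm = math.sqrt(sum(x * x for x in tx)) * math.sqrt(sum(y * y for y in ty))
    # Treat an all-zero feature vector as having no similarity.
    return dot / norm if norm else 0.0
```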

S2132. Select the maximum similarity in the similarity set and take the image corresponding to it as the candidate image of the starting image.

In the embodiment of the present invention, the similarity is expressed as a numerical value: the smaller the value, the lower the similarity, and the larger the value, the higher the similarity. The maximum similarity is the similarity with the largest value in the set, and the image corresponding to it is taken as the candidate image of the starting image.

For example, r1 is the starting image and r2, r3, ..., r8 are non-starting images. If the similarity between r3 and r1 is the maximum in the similarity set, r3 is the candidate image of r1, and the insertion sequence number of r3 is the number following that of r1. If the insertion sequence number of r1 is 1, meaning r1 is the first image inserted, then the insertion sequence number of r3 is 2, meaning r3 is the second image inserted. In the image insertion sequence, r3 comes immediately after r1.

S214. Take the candidate image as the starting image and continue with the step of determining the non-starting image set corresponding to the starting image, until the insertion sequence numbers of all images in the image set have been determined.

In the embodiment of the present invention, after steps S211 to S213 only two images have insertion sequence numbers, and the numbers of the remaining images in the image set to be processed still have to be determined. For ease of description, the candidate image is treated as the new starting image and step S212 is executed again to determine its candidate image.

For example, after steps S211 to S213, the insertion sequence number of r1 is 1 and that of r3 is 2. The candidate image of r3, that is, the image played right after r3, must now be determined. r3 is taken as the starting image, and its non-starting images are determined first. As explained above, the non-starting images are all images without an insertion sequence number; since r1 and r3 already have one, the non-starting images of r3 are r2, r4, r5, ..., r8. The similarity between r3 and each of these images is calculated to determine the candidate image of r3. If that candidate is r7, the insertion sequence number of r7 is the number following that of r3, and r7 comes immediately after r3 in the image insertion sequence. Step S212 is then executed again to determine the candidate image of r7, and so on until every image in the image set has an insertion sequence number.

S215. Determine the image insertion sequence of the image set according to the insertion sequence numbers of all images in the image set.

In the embodiment of the present invention, once all insertion sequence numbers are determined, the images are sorted by insertion sequence number to obtain the image insertion sequence. For any image in the sequence, its similarity to the image immediately after it is greater than its similarity to the image two positions after it.

For example, the image set includes r1, r2, r3, ..., r8. Steps S211 to S214 give each of these images an insertion sequence number, and the resulting image insertion sequence is r1, r3, r7, r6, r2, r8, r5, r4. Here the similarity between r7 and r6 is greater than the similarity between r7 and any of r2, r8, r5 and r4.
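Steps S211 to S215 amount to a greedy nearest-neighbor chain, which can be sketched as below. The sketch assumes images are represented by feature vectors and addressed by index. The `similarity` function used here (negative squared distance) is only a placeholder so that the example runs on its own, and the names `build_insertion_sequence` and `similarity` are hypothetical.

```python
def similarity(a, b):
    """Placeholder similarity: larger means more alike (assumption)."""
    return -sum((x - y) ** 2 for x, y in zip(a, b))


def build_insertion_sequence(features, start=0, sim=similarity):
    """Return image indices ordered by insertion sequence number."""
    order = [start]  # S211: the starting image gets the first number
    remaining = [i for i in range(len(features)) if i != start]  # S212
    while remaining:
        current = order[-1]
        # S213: the unnumbered image most similar to the current starting image
        best = max(remaining, key=lambda i: sim(features[current], features[i]))
        order.append(best)
        remaining.remove(best)  # S214: the candidate becomes the new starting image
    return order  # S215: position in the list is the insertion order
```

With one-dimensional features `[[0], [10], [1], [9], [2]]` and start index 0, the chain visits indices 0, 2, 4, 3, 1, so images with close feature values end up adjacent.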

In the embodiment of the present invention, the greater the similarity between two images, the more alike their styles. Playing the images in the order of an image insertion sequence derived from the pairwise similarities lets images of similar style follow one another, which makes the video more harmonious.

S22. Acquire several beat points of the audio.

In the embodiment of the present invention, analyzing the signal corresponding to the audio yields the beat points of the audio. A beat point is the moment at which an instrument plays a particular note, such as a drum hit. The audio contains a large number of beat points, and any two distinct beat points correspond to different playback times.

Specifically, step S22 includes:

S221. Acquire the initial signal corresponding to the audio.

In the embodiment of the present invention, the initial signal is the time-domain signal of the audio; its abscissa is time and its ordinate is the energy of the audio signal.

In the embodiment of the present invention, the initial signal contains some noise. To obtain more accurate beat points in later steps and to reduce the amount of data those steps process, the initial signal may first be preprocessed to remove interference and reduce the data volume.

The preprocessing of the initial signal includes:

Multiple central moments are determined in the initial signal. For each central moment, the sum of all energy values within its neighborhood is calculated and taken as the amplitude at that central moment. This yields the preprocessed initial signal, which replaces the initial signal in the subsequent steps. The neighborhoods of any two central moments may overlap.

In the embodiment of the present invention, the neighborhood of a central moment may extend from one neighborhood duration before the central moment to one neighborhood duration after it. For example, if the central moment is t0, its neighborhood is [t0-tr, t0+tr], where the neighborhood duration tr may be 10 ms.

Specifically, the initial signal is preprocessed by formula (3).

where, in the preprocessed initial signal, Wt is the amplitude at central moment t; tr is the neighborhood duration, which may be set to 10 ms; An is the amplitude at time n; and t = t + 10 ms, that is, an amplitude is determined at every 10 ms step of t. For example, the amplitude at 10 ms in the preprocessed initial signal is determined from the amplitudes of all sampling points in [0, 20 ms]; since t advances by 10 ms, the amplitude at 20 ms is then determined from the amplitudes of all sampling points in [10 ms, 30 ms].
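Formula (3) itself is not reproduced in this text. From the description (the amplitude at a central moment is the accumulated energy over its neighborhood) and the symbols defined here, it plausibly has the following form; this reconstruction is an assumption:

```latex
W_t = \sum_{n = t - t_r}^{t + t_r} A_n , \qquad t \leftarrow t + 10\,\mathrm{ms}
```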

The amplitude at time n can be located through the sampling frequency f₀: the sampling point corresponding to time n is n*f₀, and the amplitude of the (n*f₀)-th sampling point is taken as the amplitude at time n. The sampling frequency may be 50 Hz.
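The neighborhood accumulation can be sketched as follows. The sketch assumes the initial signal is a list of non-negative energy samples and that time n maps to sample index n*f₀ as stated above. The function name `neighborhood_energy` and its parameter names are hypothetical, and the 1000 Hz rate in the usage note is chosen only so that a 10 ms neighborhood spans whole samples.

```python
def neighborhood_energy(samples, sample_rate_hz, tr_s=0.010, step_s=0.010):
    """Sum the energy in [t - tr, t + tr] around central moments spaced step_s apart.

    Returns a list of (central_moment_seconds, accumulated_amplitude).
    """
    half = round(tr_s * sample_rate_hz)           # neighborhood half-width in samples
    step = max(1, round(step_s * sample_rate_hz))  # spacing of central moments
    result = []
    # The first central moment sits at tr so that its neighborhood starts at 0.
    for center in range(half, len(samples) - half, step):
        window = samples[center - half:center + half + 1]
        result.append((center / sample_rate_hz, sum(window)))
    return result
```

For 50 samples of constant energy 1 at 1000 Hz, the central moments fall at 10 ms, 20 ms and 30 ms, and each accumulated amplitude is 21 (the samples from index center-10 to center+10).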

S222. Apply low-pass filtering to the initial signal to obtain the first signal corresponding to the audio.

In the embodiment of the present invention, the beat in music is, in theory, usually produced by low-frequency instruments, such as percussion (bass drum, hand drum, etc.). Beat points are therefore easier to determine from the low-frequency part of the audio. The initial signal is first low-pass filtered to remove components above a low-frequency threshold, which may be 200 Hz. A Gaussian low-pass filter may be used to process the initial signal and obtain the first signal.
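A Gaussian low-pass filter can be approximated in the time domain by convolving the signal with a normalized Gaussian kernel. The sketch below is one way to do this under stated assumptions: the threshold is interpreted as the -3 dB point of the filter, the kernel is truncated at three standard deviations, and edge samples are handled by clamping. None of these choices comes from the source, and the function names are hypothetical.

```python
import math


def gaussian_kernel(sample_rate_hz, cutoff_hz):
    """Normalized Gaussian kernel whose -3 dB point is cutoff_hz (assumed convention)."""
    sigma = math.sqrt(math.log(2)) * sample_rate_hz / (2 * math.pi * cutoff_hz)
    radius = max(1, math.ceil(3 * sigma))  # truncate at about three sigma
    weights = [math.exp(-(k * k) / (2 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def gaussian_low_pass(samples, sample_rate_hz, cutoff_hz=200.0):
    """Convolve samples with the Gaussian kernel, clamping indices at both ends."""
    kernel = gaussian_kernel(sample_rate_hz, cutoff_hz)
    radius = len(kernel) // 2
    last = len(samples) - 1
    filtered = []
    for i in range(len(samples)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            j = min(max(i + k - radius, 0), last)  # repeat the edge sample
            acc += weight * samples[j]
        filtered.append(acc)
    return filtered
```

A constant input passes through essentially unchanged, while a rapidly alternating input is strongly attenuated, which is the behavior expected of a low-pass stage.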

S223. Determine several target amplitudes according to the first signal.

In the embodiment of the present invention, the first signal is divided into multiple signal segments and the maximum amplitude of each signal segment is obtained. This step of obtaining the maximum amplitude of each signal segment is executed a preset number of times. For each signal segment, if the maximum amplitudes obtained in all of these executions are the same, that maximum amplitude is taken as the target amplitude of the segment.

Each execution of obtaining the maximum amplitude of each signal segment proceeds as follows:

Signal segments are determined by a window of preset duration: multiple signal segments are determined according to the preset duration, and each segment lasts for the preset duration. For example, if the preset duration of the window is L, each signal segment has duration L.

The window slides over the first signal with a preset step to determine the signal segments, and the maximum amplitude within each segment is determined. The preset step is a step in the time dimension. It may be greater than or equal to the preset duration, in which case no two signal segments overlap; or it may be smaller than the preset duration, in which case any two adjacent signal segments overlap.

In the embodiment of the present invention, the preset number may be set to 20. That is, for each signal segment the maximum amplitude is determined the preset number of times. If all of the maximum amplitudes so obtained are the same, that value is taken as the target amplitude of the segment; if any two of them differ, the segment has no target amplitude.

S224. For each of the several target amplitudes, take the moment corresponding to the target amplitude as the beat point corresponding to that target amplitude.

In the embodiment of the present invention, a target amplitude is the maximum amplitude within a signal segment of the first signal (the low-frequency signal left after the high frequencies are filtered out). In the low-frequency signal the beat points are the loudest points (largest amplitude), so beat points can be determined from the target amplitudes. A beat point is the moment at which an instrument plays a particular note, and the moment corresponding to a target amplitude is taken as the beat point corresponding to that target amplitude.

S23. Determine the target beat point of each image in the image insertion sequence according to the image insertion sequence, the several beat points and the audio.

In the embodiment of the present invention, the image insertion sequence contains the images arranged in insertion order, and the target beat point of each image is determined from the beat points, the audio and the image insertion sequence. For any two adjacent images in the sequence, namely a first image played earlier and a second image played later, the first image is inserted when the playback time of the video reaches the target beat point of the first image and is shown until the playback time reaches the target beat point of the second image, at which point the second image is inserted.

Specifically, step S23 includes:

S231. Acquire the audio duration of the audio and the number of images; determine the image insertion point of each image in the image insertion sequence according to the number of images and the audio duration, where the duration between two adjacent image insertion points is determined from the audio duration and the number of images.

In the embodiment of the present invention, the audio duration is the length of the audio; once the video is generated, its playback duration equals the audio duration. The number of images is the number of images in the image insertion sequence, that is, in the image set to be processed. The ratio of the audio duration to the number of images gives the average playback duration of each image, and the image insertion point of each image can be determined from the average playback duration and the audio duration. The image insertion point of the first image in the sequence may be set to the moment audio playback starts.

For example, if the audio duration is 20 seconds and the image insertion sequence includes 4 images, each image plays for 5 seconds. Suppose the image insertion sequence is g1, g2, g3 and g4. The image insertion point of g1 is then t = 0 seconds, that of g2 is t = 5 seconds, that of g3 is t = 10 seconds, and that of g4 is t = 15 seconds.
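The evenly spaced insertion points in this example follow directly from the average playback duration. A minimal sketch, with the hypothetical function name `image_insertion_points` and the first insertion point fixed at the start of playback:

```python
def image_insertion_points(audio_duration_s, image_count):
    """Evenly spaced insertion points, starting at the beginning of the audio."""
    if image_count <= 0:
        raise ValueError("image_count must be positive")
    average = audio_duration_s / image_count  # average playback duration per image
    return [k * average for k in range(image_count)]
```

Calling `image_insertion_points(20, 4)` gives `[0.0, 5.0, 10.0, 15.0]`, matching the insertion points of g1 to g4 above.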

S232. Determine the target beat point of each image according to the image insertion points and the beat points.

In the embodiment of the present invention, for each image insertion point, the beat point closest to that insertion point is determined among the beat points and taken as the target beat point of the image corresponding to that insertion point.

In the embodiment of the present invention, "closest" means that the time difference between the moment of the target beat point and the moment of the image insertion point is the smallest.

For example, let the beat points be {j1, j2, j3, ..., j20} and the image insertion points be {c1, c2, ..., c5}, with 1 ≤ m ≤ 5 and 1 ≤ n ≤ 20. For an insertion point cm, if some beat point jn equals cm, jn is taken as the target beat point of the image corresponding to cm; otherwise the jn closest to cm is found and taken as the target beat point of that image.

For example, the image insertion point of g3 is t = 10 seconds. If there is a beat point at t = 10 seconds, it is taken as the target beat point of g3. If instead no beat point falls at t = 10 seconds and the beat point closest to it is t = 11 seconds, then t = 11 seconds is taken as the target beat point of g3.
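Snapping each insertion point to its closest beat point is a one-line minimum over time differences. The sketch below uses hypothetical function names. It resolves ties in favor of the earlier beat and does not prevent two images from landing on the same beat, since the text does not address either case.

```python
def nearest_beat(insertion_point, beats):
    """Beat point with the smallest time difference to the insertion point."""
    return min(beats, key=lambda beat: abs(beat - insertion_point))


def target_beat_points(insertion_points, beats):
    """Target beat point for every image, in insertion order."""
    return [nearest_beat(point, beats) for point in insertion_points]
```

For beats at 0, 5.5, 9.5, 11 and 16 seconds and insertion points at 0, 5, 10 and 15 seconds, the targets are 0, 5.5, 9.5 and 16 seconds.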

S3. Insert, at each target beat point in the audio, the image corresponding to that target beat point to generate a video.

In the embodiment of the present invention, for each target beat point, when the playback time of the audio reaches that target beat point, the image corresponding to it is inserted and used as the image frame played between that target beat point and the next target beat point.

For example, the audio duration is 20 seconds and the image insertion sequence includes 4 images, so the average playback duration of each image is 5 seconds. Suppose the image insertion sequence is g1, g2, g3 and g4, and the target beat points are determined as t = 0 seconds for g1, t = 5.5 seconds for g2, t = 9.5 seconds for g3 and t = 16 seconds for g4. When the audio playback time is 0 seconds, g1 is inserted, and g1 is inserted in every frame from 0 to 5.5 seconds. When the playback time is 5.5 seconds, g2 is inserted, and g2 is inserted in every frame from 5.5 to 9.5 seconds. When the playback time is 9.5 seconds, g3 is inserted, and g3 is inserted in every frame from 9.5 to 16 seconds. When the playback time is 16 seconds, g4 is inserted, and g4 is inserted in every frame from 16 to 20 seconds.

In the embodiment of the present invention, the generated video behaves as follows during playback: the same image is shown continuously between two target beat points, and every target beat point is a beat point. In the example above, g1 is shown from 0 to 5.5 seconds, the picture switches to g2 at 5.5 seconds, and g2 is shown from 5.5 to 9.5 seconds. Since 5.5 seconds is a beat point, the displayed image switches on a beat.
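The playback behavior in this example can be captured as a timeline of (image, start, end) entries. The sketch below is illustrative only: it models which image is shown at a given time and leaves actual frame rendering and encoding aside. The function names are hypothetical.

```python
def build_timeline(images, target_beats, audio_duration_s):
    """Pair each image with the interval from its beat to the next beat."""
    # The last image is shown until the audio ends.
    ends = list(target_beats[1:]) + [audio_duration_s]
    return list(zip(images, target_beats, ends))


def image_at(timeline, t):
    """Image shown at playback time t, or None outside the audio."""
    for image, start, end in timeline:
        if start <= t < end:
            return image
    return None
```

With images g1 to g4 and target beat points 0, 5.5, 9.5 and 16 seconds in a 20-second audio, `image_at` returns g1 just before 5.5 seconds and g2 from 5.5 seconds on.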

基于上述一种视频的生成方法，参见图3，本发明实施例还提供了一种视频生成装置，包括：Based on the above-mentioned method for generating a video, referring to FIG. 3 , an embodiment of the present invention further provides a video generating apparatus, including:

在一个实施例中，本发明提供了一种计算机设备，该设备可以是终端，内部结构如图4所示。该计算机设备包括通过系统总线连接的处理器、存储器、网络模型接口、显示屏和输入装置。其中，该计算机设备的处理器用于提供计算和控制能力。该计算机设备的存储器包括非易失性存储介质、内存储器。该非易失性存储介质存储有操作系统和计算机程序。该内存储器为非易失性存储介质中的操作系统和计算机程序的运行提供环境。该计算机设备的网络模型接口用于与外部的终端通过网络模型连接通信。该计算机程序被处理器执行时以实现视频的生成方法。该计算机设备的显示屏可以是液晶显示屏或者电子墨水显示屏，该计算机设备的输入装置可以是显示屏上覆盖的触摸层，也可以是计算机设备外壳上设置的按键、轨迹球或触控板，还可以是外接的键盘、触控板或鼠标等。In one embodiment, the present invention provides a computer device, the device may be a terminal, and the internal structure is shown in FIG. 4 . The computer equipment includes a processor, memory, a network model interface, a display screen, and an input device connected by a system bus. Among them, the processor of the computer device is used to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium, an internal memory. The nonvolatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the execution of the operating system and computer programs in the non-volatile storage medium. The network model interface of the computer equipment is used to communicate with external terminals through the network model connection. The computer program, when executed by a processor, implements a method of generating a video. The display screen of the computer equipment may be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment may be a touch layer covered on the display screen, or a button, a trackball or a touchpad set on the shell of the computer equipment , or an external keyboard, trackpad, or mouse.

Those skilled in the art will understand that FIG. 4 is only a block diagram of the part of the structure relevant to the solution of the present application and does not limit the computer device to which the solution is applied. A specific computer device may include more or fewer components than shown in the figure, combine certain components, or arrange the components differently.

An embodiment of the present invention provides a computer device including a memory and a processor, wherein the memory stores a computer program and the processor implements the following steps when executing the computer program:

An embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored, wherein the following steps are implemented when the computer program is executed by a processor:

The technical features of the above embodiments may be combined arbitrarily. For brevity, not every possible combination of the technical features in the above embodiments is described. However, as long as a combination of these technical features involves no contradiction, it should be considered to fall within the scope of this specification.

The above embodiments represent only several implementations of the present application. Although they are described specifically and in detail, they should not be construed as limiting the scope of the invention patent. It should be noted that those of ordinary skill in the art may make several modifications and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent application shall be subject to the appended claims.

Claims

1. A method for generating a video, characterized by comprising:

acquiring a set of images to be processed and audio to be processed, wherein the set of images includes several images;

determining, based on the image set and the audio, the target beat point corresponding to each image in the several images; and

inserting, at each target beat point in the audio, the image corresponding to that target beat point to generate a video.

2. The method for generating a video according to claim 1, wherein determining the target beat point corresponding to each image in the several images based on the image set and the audio specifically comprises:

sorting the several images based on the similarity between any two images in the image set to obtain an image insertion sequence;

obtaining several beat points of the audio; and

determining, according to the image insertion sequence, the several beat points and the audio, the target beat point corresponding to each image in the image insertion sequence, wherein the target beat point is a beat point, among the several beat points of the audio, at which an image is inserted.

3. The method for generating a video according to claim 2, wherein sorting the several images based on the similarity between any two images in the image set to obtain an image insertion sequence specifically comprises:

selecting a starting image in the image set, and setting the insertion sequence number of the starting image to the first sequence number;

determining a non-starting image set corresponding to the starting image, wherein the non-starting image set includes several non-starting images;

determining, based on the similarity between each non-starting image and the starting image, a candidate image corresponding to the starting image, and setting the insertion sequence number of the candidate image to the sequence number immediately following the insertion sequence number of the starting image;

taking the candidate image as the starting image, and continuing to perform the step of determining the non-starting image set corresponding to the starting image, until the respective insertion sequence numbers of all the images in the image set are determined; and

determining the image insertion sequence corresponding to the image set according to the insertion sequence numbers of all the images in the image set.

4. The method for generating a video according to claim 3, wherein determining the non-starting image set corresponding to the starting image specifically comprises:

for the starting image, selecting from the image set all images whose insertion sequence numbers have not been determined, to obtain the non-starting image set corresponding to the starting image.

5. The method for generating a video according to claim 3, wherein determining, based on the similarity between each non-starting image in the non-starting image set and the starting image, the candidate image corresponding to the starting image specifically comprises:

calculating the similarity between each non-starting image in the non-starting image set and the starting image to obtain a similarity set; and

selecting the maximum similarity from the similarity set, and using the image corresponding to the maximum similarity as the candidate image corresponding to the starting image.

6. The method for generating a video according to claim 2, wherein determining the target beat point corresponding to each image in the image insertion sequence according to the image insertion sequence, the several beat points and the audio specifically comprises:

obtaining the audio duration of the audio and the number of images in the several images;

determining an image insertion point corresponding to each image in the image insertion sequence according to the number of images and the audio duration, wherein the duration between two adjacent image insertion points is determined according to the audio duration and the number of images; and

determining the target beat points corresponding to the several images according to the several image insertion points and the several beat points.

7. The method for generating a video according to claim 6, wherein determining the target beat points corresponding to the several images according to the several image insertion points and the several beat points specifically comprises:

for each image insertion point, determining, among the several beat points, the beat point closest to the image insertion point, and using that beat point as the target beat point of the image corresponding to the image insertion point.

8. The method for generating a video according to claim 1, wherein inserting, at each target beat point in the audio, the image corresponding to that target beat point to generate a video specifically comprises:

for each target beat point, when audio playback reaches the target beat point, inserting the image corresponding to the target beat point, and using that image as the image frame played between the target beat point and the next target beat point.

9. The method for generating a video according to any one of claims 1 to 8, wherein, before acquiring the image set to be processed and determining the image insertion sequence based on the similarity between any two images in the image set, the method further comprises:

acquiring an original image set, wherein the original image set includes multiple original images, and the multiple original images include at least one template image;

determining the target feature map corresponding to each original image in the original image set;

dividing the original image set into original image subsets of different categories based on all the determined target feature maps; and

using any original image subset that includes the template image as the image set to be processed.

10. A video generation device, comprising:

an acquisition unit, configured to acquire a set of images to be processed and audio to be processed, wherein the set of images includes several images;

a target beat point determination unit, configured to determine the target beat point corresponding to each of the several images based on the image set and the audio; and

a video generation unit, configured to insert, at each target beat point in the audio, the image corresponding to that target beat point to generate a video.

11. A computer device, comprising a memory and a processor, wherein the memory stores a computer program, and the processor implements the steps of the method for generating a video according to any one of claims 1 to 9 when executing the computer program.

12. A computer-readable storage medium on which a computer program is stored, characterized in that, when the computer program is executed by a processor, the steps of the method for generating a video according to any one of claims 1 to 9 are implemented.
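By way of non-limiting illustration only, the ordering steps recited in claims 3 to 5 and the beat-matching steps recited in claims 6 and 7 can be sketched as follows. The function names `order_by_similarity` and `snap_to_beats`, the caller-supplied `similarity` function, the choice of the first image as the starting image, and the evenly spaced insertion points beginning at 0 seconds are all assumptions of this sketch; the claims do not fix any of them.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def order_by_similarity(
    images: Sequence[T],
    similarity: Callable[[T, T], float],
    start: int = 0,
) -> List[T]:
    """Greedy ordering: each next image is the not-yet-ordered image that is
    most similar to the current starting image (assumed: start index 0)."""
    if not images:
        return []
    remaining = list(range(len(images)))
    current = remaining.pop(start)
    order = [current]
    while remaining:
        # Candidate = non-starting image with the maximum similarity.
        best = max(remaining, key=lambda j: similarity(images[current], images[j]))
        remaining.remove(best)
        order.append(best)
        current = best  # the candidate becomes the new starting image
    return [images[i] for i in order]


def snap_to_beats(
    num_images: int,
    audio_duration: float,
    beats: Sequence[float],
) -> List[float]:
    """Place one insertion point per image and snap each to its nearest beat.

    Assumed: insertion points are evenly spaced, audio_duration / num_images
    apart, with the first one at 0 seconds.
    """
    if num_images <= 0 or not beats:
        return []
    step = audio_duration / num_images
    points = [i * step for i in range(num_images)]
    return [min(beats, key=lambda b: abs(b - p)) for p in points]
```

This sketch does not handle the case in which two insertion points snap to the same beat point; an implementation would need a rule for that situation, which the claims leave open.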