WO2021259322A1 - System and method for generating video


Info

Publication number
WO2021259322A1
Authority
WO
WIPO (PCT)
Prior art keywords
video, segment, frame, subject, initial
Application number
PCT/CN2021/101816
Other languages
French (fr)
Chinese (zh)
Inventor
陈万锋
李韶辉
谢统玲
吴庆宁
殷焦元
Original Assignee
广州筷子信息科技有限公司
Priority claimed from CN202010578632.1A (CN111815645B)
Priority claimed from CN202010738213.XA (CN111918146B)
Priority claimed from CN202010741962.8A (CN111739128B)
Priority claimed from CN202110503297.3A (CN112989116B)
Application filed by 广州筷子信息科技有限公司
Publication of WO2021259322A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 Information retrieval of video data
    • G06F 16/73 Querying
    • G06F 16/735 Filtering based on additional data, e.g. user or group profiles
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/11 Region-based segmentation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; operations thereof
    • H04N 21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; client middleware
    • H04N 21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N 21/80 Generation or processing of content or additional data by content creator independently of the distribution process; content per se
    • H04N 21/83 Generation or processing of protective or descriptive data associated with content; content structuring
    • H04N 21/845 Structuring of content, e.g. decomposing content into time segments

Definitions

  • This application relates to video processing, and in particular to a system and method for generating video.
  • One aspect of the present application provides a system for generating a video. The system includes at least one storage medium storing a set of instructions and at least one processor configured to communicate with the at least one storage medium, wherein when the set of instructions is executed, the at least one processor is directed to perform one or more operations, the operations including: obtaining multiple video clips.
  • The operations further include: obtaining video configuration information, where the video configuration information includes one or more configuration features of at least some of the multiple video clips, the configuration features including at least one of a content feature or an arrangement feature; and generating a target video based on the at least part of the video clips and the video configuration information.
  • Another aspect of the present application provides a method of generating a video. The method is executed by a processing device including at least one memory and at least one processor, and includes: acquiring a plurality of video clips; obtaining video configuration information, where the video configuration information includes one or more configuration features of at least some of the multiple video clips, the configuration features including at least one of a content feature or an arrangement feature; and generating a target video based on the at least part of the video clips and the video configuration information.
  • Another aspect of the present application provides a non-transitory computer-readable medium including at least one set of instructions, wherein when executed by one or more processors of a computing device, the at least one set of instructions causes the computing device to execute a method including: acquiring a plurality of video clips; obtaining video configuration information; and generating a target video.
  • In some embodiments, acquiring the multiple video clips includes: acquiring at least one of an initial image or an initial video; and performing editing processing on the initial image or initial video to obtain the multiple video clips.
  • In some embodiments, performing editing processing on the initial image or the initial video to obtain the plurality of video clips includes: acquiring features of each pair of adjacent images or video frames in the initial image or the initial video; determining the similarity of each pair of adjacent images or video frames; identifying segment boundaries based on the similarities; and dividing the initial image or the initial video at the segment boundaries to obtain the multiple video segments.
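  • As a minimal sketch of this similarity-based boundary detection, assuming OpenCV is available and taking grayscale-histogram correlation as the similarity measure (the patent does not fix a particular feature or metric), adjacent frames whose similarity falls below a threshold mark a segment boundary:

```python
import cv2

def split_into_shots(video_path, threshold=0.7):
    """Split a video into shot segments by comparing adjacent-frame histograms."""
    cap = cv2.VideoCapture(video_path)
    boundaries, prev_hist, index = [0], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        hist = cv2.calcHist([gray], [0], None, [64], [0, 256])
        hist = cv2.normalize(hist, hist).flatten()
        if prev_hist is not None:
            # Low correlation between adjacent frames marks a segment boundary.
            if cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL) < threshold:
                boundaries.append(index)
        prev_hist, index = hist, index + 1
    cap.release()
    boundaries.append(index)
    # Each (start_frame, end_frame) pair delimits one shot segment.
    return list(zip(boundaries[:-1], boundaries[1:]))
```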
  • each video segment of the plurality of video segments is a shot segment.
  • In some embodiments, editing the initial image or the initial video to obtain the multiple video clips includes: determining subject information of the initial image or the initial video, the subject information including at least a subject and a subject position; and performing editing processing on the initial image or initial video based on the subject information to obtain the multiple video clips.
  • In some embodiments, determining the subject information in the initial image or the initial video includes: obtaining a subject information determination model; and determining the subject information by inputting the initial image or the initial video into the subject information determination model.
  • In some embodiments, editing the initial image or the initial video based on the subject information includes: recognizing the outer contour of the subject in the initial image or the initial video based on the subject information; and cropping or zooming the initial image or initial video according to the outer contour of the subject.
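  • A sketch of such subject-aware cropping, assuming a detection model that returns a bounding box around the subject's outer contour; `detect_subject` below is a hypothetical stand-in, since the patent leaves the model unspecified:

```python
def crop_to_subject(frame, detect_subject, margin=0.1):
    """Crop an image (H x W x C array) around the subject's bounding box."""
    h, w = frame.shape[:2]
    x0, y0, x1, y1 = detect_subject(frame)  # hypothetical model call
    # Expand the box by a margin so the subject's outer contour is not clipped.
    dx, dy = int((x1 - x0) * margin), int((y1 - y0) * margin)
    x0, y0 = max(0, x0 - dx), max(0, y0 - dy)
    x1, y1 = min(w, x1 + dx), min(h, y1 + dy)
    return frame[y0:y1, x0:x1]
```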
  • the video configuration information includes a first preset condition and a second preset condition.
  • In some embodiments, generating a target video based on the at least part of the video clips and the video configuration information includes: obtaining one or more candidate video clips from the multiple video clips based on the first preset condition; grouping the one or more candidate video clips to determine at least one segment set; and generating a target video based on each segment set in the at least one segment set.
  • In some embodiments, the first preset condition is related to at least one of a plurality of elements, the plurality of elements including: whether the target video contains a specific object; whether the target video contains a specific subject; the total duration of the target video; the number of shots included in the target video; the specific shot pictures contained in the target video; the number of repetitions of a specific subject in the target video; or the focusing time of a specific subject in the target video.
  • In some embodiments, the first preset condition includes that the value of the at least one element is greater than a corresponding threshold.
  • In some embodiments, the first preset condition further includes element constraint conditions between two or more specific elements of the plurality of elements.
  • In some embodiments, the first preset condition includes a binding condition of shot pictures in the target video, where the binding condition reflects an association relationship of at least two specific shot pictures in the target video.
  • In some embodiments, obtaining one or more candidate video clips from the plurality of video clips based on the first preset condition includes: determining, from the plurality of video clips, the video clips containing the specified shot pictures; and combining, based on the binding condition, the video clips containing the specified shot pictures to serve as one candidate video clip.
  • In some embodiments, the at least one segment set includes two or more segment sets that satisfy the second preset condition, and the second preset condition is related to the degree of combination difference of the candidate video clips between the two or more segment sets.
  • In some embodiments, grouping the one or more candidate video clips to determine the at least one segment set includes: determining the degree of combination difference of candidate video clips between each of the two or more segment sets and the other segment sets; and taking a segment set whose combination difference from the other segment sets is higher than a preset threshold as the at least one segment set.
  • In some embodiments, determining the degree of combination difference of candidate video segments between each of the two or more segment sets and the other segment sets includes: assigning an identifying character to each of the one or more candidate video segments; determining, based on the identifying characters, a character string corresponding to each segment set and the other segment sets; and taking the edit distance between the character strings corresponding to a segment set and another segment set as the degree of combination difference of the candidate video segments between the two sets.
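  • A minimal sketch of this string-based measure, assuming each candidate clip has already been mapped to one character; the Levenshtein distance below is a standard formulation, not code from the patent:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings of clip-identifying characters."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,            # deletion
                                     dp[j - 1] + 1,        # insertion
                                     prev + (ca != cb))    # substitution
    return dp[-1]

# Two segment sets encoded as clip strings; the distance is their
# combination difference (here 2: "B" and "C" are swapped).
print(edit_distance("ABCD", "ACBD"))
```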
  • In some embodiments, determining the degree of combination difference of candidate video segments between each of the two or more segment sets and the other segment sets includes: generating, based on a trained feature extraction model, a segment feature corresponding to each candidate video segment in the two or more segment sets; generating, based on the segment features, a set feature vector corresponding to each segment set; determining the degree of similarity between each segment set and the other segment sets based on a trained discriminant model and the set feature vectors; and determining, based on the degree of similarity, the degree of combination difference between each segment set and the other segment sets.
  • In some embodiments, determining the degree of similarity between each segment set and the other segment sets based on the trained discriminant model and the set feature vector corresponding to each segment set includes: clustering the set feature vectors based on a clustering algorithm to obtain multiple clusters.
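  • A sketch of the clustering step, assuming scikit-learn's KMeans as the clustering algorithm and randomly generated stand-in vectors in place of the model-produced set features; the patent leaves both the algorithm and the aggregation unspecified:

```python
import numpy as np
from sklearn.cluster import KMeans

# One aggregated feature vector per candidate segment set, e.g. the mean of
# the per-clip features produced by the extraction model (stand-in data here).
set_features = np.random.rand(20, 128)  # 20 segment sets, 128-dim vectors

labels = KMeans(n_clusters=5, n_init=10).fit_predict(set_features)

# Sets sharing a cluster are treated as similar; keeping at most one set per
# cluster leaves a collection with high pairwise combination difference.
selected = {label: idx for idx, label in enumerate(labels)}.values()
print(sorted(selected))
```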
  • In some embodiments, the feature extraction model is a sequence-based machine learning model.
  • In some embodiments, generating the segment feature corresponding to each candidate video segment based on the trained feature extraction model and the candidate video segments in the two or more segment sets includes: obtaining the multiple video frames contained in each candidate video segment; determining one or more image features corresponding to each video frame; and processing, based on the trained feature extraction model, the image features of the multiple video frames and the relationships between them to determine the segment feature corresponding to the candidate video segment.
  • In some embodiments, the image features corresponding to a video frame include at least one of: the shape information of an object in the video frame; the positional relationship information between multiple objects in the video frame; the color information of an object in the video frame; the integrity of an object in the video frame; or the brightness of the video frame.
  • In some embodiments, generating a target video based on the at least part of the video segments and the video configuration information includes: generating a plurality of candidate segment sets based on the plurality of video segments, the plurality of candidate segment sets satisfying the second preset condition; selecting at least one target segment set from the plurality of candidate segment sets based on the first preset condition; and generating a target video based on each segment set in the at least one target segment set.
  • In some embodiments, the video configuration information further includes sequence information, and generating a target video based on each segment set in the at least one segment set includes: sorting and combining the candidate video clips in each segment set based on the sequence information to generate a target video.
  • In some embodiments, the video configuration information further includes beautification parameters, and the beautification parameters include at least one of filter parameters, animation parameters, or layout parameters.
  • In some embodiments, the method further includes: obtaining a text layer, a background layer, or a decoration layer together with loading parameters based on the video configuration information; and determining the layout of the text layer, the background layer, and the decoration layer in the target video according to the loading parameters.
  • the method further includes: normalizing the plurality of video clips.
  • In some embodiments, the method further includes: obtaining initial audio; marking the initial audio based on its rhythm to obtain at least one audio segmentation point; determining at least one video segmentation point of the target video based at least in part on the video configuration information; matching the at least one audio segmentation point with the at least one video segmentation point; and synthesizing the segmented audio with the target video based on the matching result.
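  • A sketch of rhythm-based marking and point matching, assuming librosa's beat tracker supplies the rhythm points and nearest-neighbor snapping serves as the matching rule; both choices and the file name are assumptions, since the patent prescribes neither:

```python
import librosa

# Mark rhythm points in the initial audio; beat times serve as the
# candidate audio segmentation points.
y, sr = librosa.load("initial_audio.wav")  # hypothetical input file
tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
audio_points = librosa.frames_to_time(beat_frames, sr=sr)

def match_points(video_points, audio_points):
    """Snap each video segmentation point to its nearest audio beat."""
    return [min(audio_points, key=lambda t: abs(t - v)) for v in video_points]

# Video segmentation points (seconds) derived from the video configuration.
print(match_points([3.0, 7.5, 12.0], list(audio_points)))
```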
  • In some embodiments, the method further includes: performing post-processing on the target video so that it satisfies at least one video output condition, the at least one video output condition being related to a playback medium of the target video.
  • In some embodiments, the at least one video output condition includes a video size condition, and post-processing the target video includes: cropping the frames of the target video according to the video size condition.
  • In some embodiments, cropping the frames of the target video according to the video size condition includes: obtaining cropping subject information of each video segment included in the target video, where the cropping subject information reflects a specific cropping subject and the position information of the specific cropping subject; and cropping the frames of each video segment included in the target video according to the preset frame size corresponding to the video size condition and the cropping subject information.
  • In some embodiments, cropping the frames of each video segment included in the target video according to the preset frame size corresponding to the video size condition and the cropping subject information includes: for each video segment included in the target video, determining the size and initial position of a cropping frame for at least one video frame in the video segment according to the cropping subject information and the preset frame size; processing the initial position of the cropping frame of the at least one video frame to determine the final position of the cropping frame; and cropping each video frame of the video segment according to the final position of the cropping frame, so as to preserve the picture inside the cropping frame.
  • In some embodiments, processing the initial position of the cropping frame of the at least one video frame and determining the final position of the cropping frame includes: smoothing, over time, the initial coordinate information of a reference point of the cropping frame of each video frame; determining the final coordinate information of the reference point according to the result of the smoothing; and determining the position of the reference point based on the final coordinate information.
  • In some embodiments, smoothing the initial coordinate information of the reference point includes: performing linear regression on the initial coordinates of the reference point to obtain a linear regression equation and its slope.
  • In some embodiments, determining the final coordinate information of the reference point according to the result of the smoothing includes: comparing the absolute value of the slope with a slope threshold; in response to the absolute value of the slope being less than the slope threshold, taking the position of the midpoint of the trend line of the linear regression equation as the final position of the reference point of the cropping frame of each video frame; and in response to the absolute value of the slope being greater than or equal to the slope threshold, taking, for each video frame, the position on the trend line corresponding to that frame's time point as the final position of the reference point of that frame's cropping frame.
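  • A minimal sketch of this smoothing rule for one coordinate, assuming NumPy's polyfit as the linear regression; the slope-threshold value is illustrative, not from the patent:

```python
import numpy as np

def smooth_crop_positions(times, xs, slope_threshold=0.5):
    """Smooth one coordinate of the crop-frame reference point over time."""
    slope, intercept = np.polyfit(times, xs, deg=1)  # linear regression
    trend = slope * np.asarray(times, dtype=float) + intercept
    if abs(slope) < slope_threshold:
        # Nearly static subject: pin every frame to the trend-line midpoint.
        return np.full_like(trend, trend[len(trend) // 2])
    # Moving subject: follow the trend line at each frame's time point.
    return trend

print(smooth_crop_positions([0, 1, 2, 3], [100, 102, 98, 101]))
```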
  • In some embodiments, determining the size and position of the cropping frame of at least one video frame in the video clip according to the cropping subject information and the preset frame size includes: determining, according to the subject information of the target video and the cropping subject information, the degree of correlation between one or more specific cropping subjects in the cropping subject information and the subject information; determining, according to the preset frame size and the specific cropping subjects, at least one candidate cropping frame corresponding to the at least one video frame; scoring the at least one candidate cropping frame according to the cropping subject information and the degree of correlation; and determining the size and position of the cropping frame of the at least one video frame based on the scoring results.
  • In some embodiments, obtaining the cropping subject information of each video segment included in the target video includes: obtaining candidate cropping subjects in each video segment using a machine learning model; and selecting the one or more specific cropping subjects from the candidate cropping subjects according to the subject information of the target video.
  • In some embodiments, the method further includes: delivering the target video to a specific audience group, where the specific audience group meets specific demographic conditions.
  • In some embodiments, generating a target video based on the at least part of the video clips and the video configuration information includes: obtaining the audience acceptance of the plurality of video clips; and determining, from the plurality of video segments and according to the corresponding demographic conditions, candidate segments whose audience acceptance is higher than a threshold, so as to generate the target video.
  • In some embodiments, the method further includes: obtaining delivery effect feedback of the target video, and adjusting at least one of the demographic conditions or the audience acceptance according to the delivery effect feedback, where the delivery effect feedback is related to at least one of the completion rate, the number of replays, or the number of viewers of the target video.
  • In some embodiments, the target video includes a creative advertisement, and the method further includes determining the estimated effect data of the target video based on an element effect parameter of at least one advertisement element of the creative advertisement.
  • In some embodiments, determining the estimated effect data of the target video based on the element effect parameter of at least one advertisement element of the creative advertisement includes: obtaining an advertisement element effect prediction model; inputting the advertisement element marked with at least one element tag into the advertisement element effect prediction model to determine the element effect parameter of the advertisement element, the at least one element tag including the relationship between the advertisement element and the creative advertisement; determining, based on the element effect parameters, the advertisement elements that meet expectations among the at least one advertisement element, where the element effect parameter of an advertisement element that meets expectations is greater than a parameter threshold; determining the proportion of the advertisement elements that meet expectations among the at least one advertisement element in the creative advertisement; and determining the estimated effect data of the target video based on the proportion.
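  • A sketch of the final proportion step; the element names, parameter values, and threshold below are illustrative stand-ins for the prediction model's outputs:

```python
def estimated_effect(element_params, param_threshold=0.6):
    """Share of ad elements whose effect parameter meets expectations."""
    meets = [p for p in element_params.values() if p > param_threshold]
    # The proportion is used as the estimated effect data of the target video.
    return len(meets) / len(element_params)

# Hypothetical per-element effect parameters from the prediction model.
print(estimated_effect({"logo": 0.8, "slogan": 0.4, "product_shot": 0.9}))
```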
  • Another aspect of this specification provides a system for generating a video. The system includes: an acquisition module for acquiring a plurality of video clips; a configuration module for obtaining video configuration information, the video configuration information including one or more configuration features of at least some of the multiple video clips, and the configuration features including at least one of a content feature or an arrangement feature; and a generating module for generating a target video based on the at least part of the video clips and the video configuration information.
  • Another aspect of this specification provides a computer-readable storage medium storing computer instructions; after a computer reads the computer instructions in the storage medium, the computer executes the aforementioned method.
  • Fig. 1 is a schematic diagram of a scene of a system for generating a video according to some embodiments of the present application;
  • Fig. 2 is an exemplary flowchart of generating a video according to some embodiments of the present application;
  • Fig. 3 is an exemplary flowchart of a method for determining a video segment according to some embodiments of the present application;
  • Fig. 4 is an exemplary flowchart of a method for editing an initial image or an initial video according to some embodiments of the present application;
  • Fig. 5 is an exemplary flowchart of a method for editing an initial image or an initial video according to other embodiments of the present application;
  • Fig. 6 is an exemplary flowchart of a method for generating a video according to other embodiments of the present application;
  • Fig. 7 is an exemplary flowchart of a method for generating a video according to other embodiments of the present application;
  • Fig. 8 is an exemplary flowchart of a method for generating a target video according to some embodiments of the present application;
  • Fig. 9 is an exemplary flowchart of a method for determining a segment set according to some embodiments of the present application;
  • Fig. 10 is an exemplary flowchart of a method for determining a degree of combination difference according to some embodiments of the present application;
  • Fig. 11 is an exemplary flowchart of another method for determining the degree of combination difference according to some embodiments of the present application;
  • Fig. 12 is an exemplary flowchart of a method for generating a video according to other embodiments of the present application;
  • Fig. 13 is an exemplary flowchart of adding audio material according to some embodiments of the present application;
  • Fig. 14 is an application scene diagram of a frame cropping method according to some embodiments of the present application;
  • Fig. 15 is a schematic diagram of a smoothing method according to some embodiments of the present application;
  • Fig. 16 is a flowchart of a method for determining the size and position of the cropping frame of each video frame according to some embodiments of the present application;
  • Fig. 17 is a flowchart of a method for generating a target video based on an audience according to some embodiments of the present application;
  • Fig. 18 is a flowchart of determining the estimated effect data of the target video according to some embodiments of the present application;
  • Fig. 19 is an exemplary flowchart of a model training process according to some embodiments of the present application;
  • Figs. 20A to 20E are schematic diagrams of a video synthesis system according to some embodiments of the present application.
  • It should be understood that terms such as "system", "device", "unit", and/or "module" used herein are a way of distinguishing different components, elements, parts, or assemblies of different levels; if other words can achieve the same purpose, these words can be replaced by other expressions.
  • a multimedia system can obtain multiple video clips and video configuration information.
  • the video configuration information may be generated based on script information and/or video templates.
  • the video configuration information may be used to determine one or more configuration features of at least some of the multiple video clips.
  • the configuration feature includes at least one of content feature or arrangement feature.
  • In some embodiments, the content features may include that a video clip or the finally generated video (also known as the target video) contains a specific subject (object), a specific theme, a specific shot picture, specific audio, and so on.
  • the arrangement feature may include the size of the video segment, the layout of the target object in the video segment, the duration of the video segment, the specific position of the video segment in the target video, the visual effect characteristics of the video segment, and the like.
  • In some embodiments, the multimedia system may generate a target video based on the at least part of the video segments and the video configuration information. The multimedia system can automatically process material and generate target videos, achieving high efficiency and saving labor costs.
  • Fig. 1 is a schematic diagram of a scene of a multimedia system according to some embodiments of the present application.
  • In some embodiments, the multimedia system 100 can be used in media, advertising, the Internet, etc., and can quickly generate targeted videos for delivery.
  • the multimedia system 100 may include a server 110, a network 120, a terminal device 130, a database 140, and other data sources 150.
  • the server 110 and the terminal device 130 may be connected through the network 120 or directly; the database 140 may be connected to the server 110 through the network 120, or may be directly connected to the server 110 or located inside the server 110.
  • the database 140 and other data sources 150 can be connected to the network 120 to communicate with one or more components of the multimedia system 100.
  • One or more components of the multimedia system 100 can access data or instructions stored in the terminal device 130, the database 140, and other data sources 150 through the network 120.
  • In some embodiments, the various components of the multimedia system 100 can be integrated in the same device, with the above-mentioned communication relationships realized through the device's internal bus. Alternatively, at least some of the components can reside in separate devices connected through each device's communication port, so that the various components of the multimedia system 100 remain communicatively connected and the aforementioned communication relationships are realized.
  • the server 110 may be used to manage resources and process data and/or information from at least one component of the system or an external data source (for example, a cloud data center).
  • the server 110 may be a single server or a group of servers.
  • In some embodiments, the server group may be centralized or distributed (for example, the server 110 may be a distributed system); it may be dedicated, or services may be provided by other devices or systems at the same time.
  • the server 110 may be regional or remote.
  • the server 110 may be implemented on a cloud platform or provided in a virtual manner.
  • the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an internal cloud, a multi-layer cloud, etc., or any combination thereof.
  • the server 110 may include a processing device 112.
  • the processing device 112 may process data and/or information obtained from other devices or system components.
  • the processor may execute program instructions based on these data, information, and/or processing results to perform one or more functions described in this application.
  • In some embodiments, the processing device 112 may include one or more sub-processing devices (for example, a single-core processing device or a multi-core processing device).
  • In some embodiments, the processing device 112 may include a central processing unit (CPU), an application-specific integrated circuit (ASIC), an application-specific instruction-set processor (ASIP), a graphics processing unit (GPU), a physics processing unit (PPU), a digital signal processor (DSP), a field-programmable gate array (FPGA), a programmable logic device (PLD), a controller, a microcontroller unit, a reduced instruction set computer (RISC), a microprocessor, etc., or any combination of the above.
  • the network 120 may connect various components of the system and/or connect the system and external resource parts.
  • the network 120 enables communication between various components and with other parts outside the system, and facilitates the exchange of data and/or information.
  • the network 120 may be any one or more of a wired network or a wireless network.
  • In some embodiments, the network 120 may include a cable network, a fiber-optic network, a telecommunications network, the Internet, a local area network (LAN), a wide area network (WAN), a wireless local area network (WLAN), a metropolitan area network (MAN), a public switched telephone network (PSTN), a Bluetooth network, a ZigBee network, near field communication (NFC), a device bus, a device line, a cable connection, etc., or any combination thereof.
  • The network connection between the various parts may use any one of the above-mentioned ways, or several of them.
  • In some embodiments, the network may use various topological structures such as point-to-point, shared, and centralized, or a combination of multiple topologies.
  • the network 120 may include one or more network access points.
  • In some embodiments, the network 120 may include wired or wireless network access points, such as base stations and/or network exchange points 120-1, 120-2, ..., through which one or more components of the system can connect to the network 120 to exchange data and/or information.
  • the terminal device 130 refers to one or more terminal devices or software used for data query and/or multimedia display.
  • In some embodiments, the terminal device 130 may be used by one or more users, including users who directly use the service as well as other related users.
  • the terminal device 130 may be one or any combination of the mobile device 130-1, the tablet computer 130-2, the laptop computer 130-3 and other devices with input and/or output functions.
  • the terminal device 130 may also include a user terminal that can be used to input and/or obtain data or information.
  • In some embodiments, the user may generate or obtain the original video or original image through the user terminal.
  • For example, the user can use the camera of the user terminal to record video or take pictures and store them as the original video or original image, or download the original video from video software through the user terminal.
  • the user may input the constraint condition of the target video (for example, video configuration information) through the user terminal.
  • the user can obtain or browse the synthesized target video through the user terminal.
  • the database 140 may be used to store data and/or instructions.
  • In some embodiments, the database 140 may be implemented on a single central server, on multiple servers connected through communication links, or on multiple personal devices.
  • the database 140 may include mass memory, removable memory, volatile read-write memory (for example, random access memory RAM), read-only memory (ROM), etc., or any combination thereof.
  • the mass storage device may include magnetic disks, optical disks, solid-state disks, and the like.
  • the database 140 may be implemented on a cloud platform.
  • Other data sources 150 may be one or more sources used to provide other information for the system.
  • In some embodiments, the other data sources 150 can be one or more devices, such as a camera device that directly obtains the initial image or initial video; one or more application program interfaces; one or more database query interfaces; one or more protocol-based information acquisition interfaces; other means of acquiring information; or a combination of two or more of the above.
  • The information provided by an information source may already exist when the information is extracted, may be generated temporarily at extraction time, or may be a combination of both.
  • other data sources 150 may be used to provide multimedia information such as pictures, videos, and music to the system.
  • Fig. 2 is an exemplary flow chart of generating a video according to some embodiments of the present application.
  • In some embodiments, one or more steps in the process 200 may be stored as instructions in a storage device (for example, the database 140) and called and/or executed by a processing device (for example, the processing device 112 in the server 110).
  • the process 200 may include:
  • Step 210 Obtain multiple video clips.
  • a video segment may refer to a video composed of video frames.
  • Each video segment may be a sub-sequence in the sequence of images constituting the video.
  • a video clip can be a short video of 3 seconds, 4 seconds, or 5 seconds.
  • In some embodiments, a video frame can be understood as one of the images obtained by decomposing a continuous picture according to a time interval.
  • For example, the time interval between frames can be set to 1/24 s (that is, 24 frames of images are obtained within 1 second).
  • a video segment may be or include one or more shot segments.
  • In some embodiments, a shot segment may refer to a continuous picture composed of the video frames between two editing points; that is, a shot segment can be the sum of the pictures taken continuously by a camera from start to stop.
  • For example, if the first picture in a video file is a seaside, the picture then switches to a girl drinking yogurt, and then to a girl surfing on the sea, the girl drinking yogurt constitutes one shot segment, the seaside before it another, and the girl surfing after it a third.
  • In the following, a video segment consisting of one shot segment will be used as an example for description.
  • the database 140 and/or other data sources 150 may store the multiple video clips, and step 210 may be implemented by directly obtaining the multiple video clips from the database 140 and/or other data sources 150.
  • the database 140 and/or other data sources 150 may store unprocessed initial data.
  • the initial data may include an initial image (also may be referred to as a to-be-processed image) and/or an initial video (also may be referred to as a to-be-processed video).
  • Step 210 can obtain the multiple video clips by processing the initial data (for example, determining a cutting point and editing).
  • a video segment may be generated based on a shot segment contained in a video file (for example, an initial video and/or an initial image). For example, if a video file contains 5 shots, 5 video clips can be generated.
  • the video file may be split manually or by machine to generate multiple video clips.
  • For example, the user may manually edit based on the number of shots in the video file, or a trained machine learning model may split the video file into multiple video clips according to preset conditions (such as the number of shots, duration, etc.).
  • In some embodiments, the processing device 112 may also obtain multiple video clips intercepted from the video file based on a time window; this specification does not limit the means of splitting video clips.
  • For the method of determining the multiple video clips, reference may be made to Figs. 3-7 and their related descriptions.
  • Step 220 Obtain video configuration information.
  • In some embodiments, the video configuration information refers to information related to the configuration of the finally generated video (also referred to as the target video) and of each video segment composing the target video.
  • the video configuration information may reflect requirements on the content or form of the target video and the video segments that make up the target video.
  • Each video segment composing the target video may be at least a part of the multiple video segments.
  • the video configuration information may include one or more configuration features of each video segment (that is, the at least part of the video segment) constituting the target video.
  • the configuration features may include content features and/or arrangement features.
  • the content feature is related to the content of the at least part of the video clip.
  • the content feature may include that the video clip or target video contains a specific subject (object), a specific theme, a specific shot, a specific audio, and so on.
  • the arrangement feature is related to the presentation form of the at least part of the video segment.
  • the arrangement feature may include the size of the video segment, the duration of the video segment, the specific position of the video segment in the target video, the visual effect feature of the video segment, and so on.
  • the specific subject may also be referred to as a specific object.
  • the specific subject may be a specific object included in each video clip, or may be a specific object related to a specific theme of the target video among multiple objects.
  • the specific subject may be products (electronic products, daily necessities, decorations, etc.), creatures (humans, animals, etc.), signs (for example, trademarks, regional signs, etc.), or landscapes (mountains, houses, etc.), etc.
  • the specific topic may be the main content of the video clip.
  • the topic can be determined by keyword information in the title or introduction of the video clip, tag information, user-defined information, or information stored in a database.
  • In some embodiments, the specific theme may be composed of specific content and a video type. For example, in the theme "perfume advertisement", perfume is the specific content and advertisement is the video type.
  • In some embodiments, the specific content may also include specific activities (such as live broadcasts, evening parties, etc.), specific dates (such as Valentine's Day, Children's Day, Double Eleven, etc.), and so on.
  • In some embodiments, the specific theme may be user-defined or selected by the user from a list.
  • For example, the user directly enters the specific theme of the target video as "advertising for car interiors".
  • As another example, the user may first select the target video type as advertisement video, and then select "Shampoo" from the product category list.
  • the specific shot picture refers to a video frame or a sequence of video frames containing a specific picture.
  • specific shots can include children drinking milk, models using cosmetics, playing volleyball on the beach, and browsing mobile shopping pages on the Double Eleven Shopping Festival.
  • In some embodiments, the specific shot picture may be related to a specific subject or a specific theme.
  • For example, a promotional video may include shots of the specific action of "brushing teeth", and an advertisement video may include a shot of the specific action of "spraying perfume".
  • the specific audio may include a specific sound.
  • the specific audio may include dialogue, monologue, theme music, background music, or other specific sounds (for example, wind, rain, bird song, brake sound, impact sound, etc.).
  • the theme music and/or background music can be of different types, such as soothing, brisk, focused, aggressive, and so on.
  • the size of the video segment may be the width and height of the video frame in the video segment.
  • the size of the video frame in the video segment is 512 ⁇ 384, 1024 ⁇ 768, 1024 ⁇ 1024, 1920 ⁇ 1080, 2560 ⁇ 2560, etc.
  • the size of the video segment may also be the aspect ratio of the video frame in the video segment.
  • the aspect ratio of the video frame in the video segment is 9:16, 1:1, 5:4, 4:3, 16:9, etc.
  • the duration of the video segment refers to the length of time required to play the video segment. For example, 3 seconds, 5 seconds, 30 seconds, 2 minutes, etc.
  • the specific position of the video segment in the target video may refer to a specific range of the video segment in the target video.
  • the specific range may be expressed in terms of time. For example, a certain video segment can be in the target video from 2 minutes 15 seconds to 2 minutes 30 seconds.
  • the specific range may also be represented by a video frame range. For example, a certain video segment can be in the position of frames 1 to 50 in the target video.
  • the specific range may also be represented by a position relative to other video clips. For example, a certain video segment may be located between the third video segment and the fifth video segment in the target video.
  • the visual effect characteristics of the video segment may be used to describe operations performed on the video segment and related to its visual effect.
  • the operations may include beautification, normalization, template decoration, and so on.
  • the beautification may include operations such as filters and animations to enhance the video effect.
  • the video configuration information may be determined by the user, or determined according to system default settings.
  • In some embodiments, the database 140 and/or other data sources 150 may store the video configuration information, and step 220 may be implemented by obtaining the video configuration information directly from them; for example, after the specific subject or specific theme of the target video is determined, the corresponding video configuration information is obtained from the database 140 and/or other data sources 150 according to that subject or theme.
  • the video configuration information may be determined based on script information and/or video templates.
  • the multimedia system 100 determines the video configuration information by analyzing script information and/or video templates.
  • In some embodiments, the script information and/or video template may be pre-stored in the database 140 and/or other data sources 150, and after the user inputs relevant information of the target video, the corresponding script information and/or video template can be obtained or determined automatically. For example, after the user inputs a picture of a specific perfume, the system 100 can automatically recognize the specific theme "perfume advertisement" of the target video and call the script information and/or video templates related to perfume advertisements from the database 140 and/or other data sources 150.
  • In some embodiments, the script information refers to information determining, for each video segment and/or video frame in a video (for example, the target video), the screen content (for example, a specific subject, a specific background, a specific action, etc., or a combination thereof), the screen duration, the screen scene (for example, panorama, medium shot, close-up, etc.), lens changes, appearance times, and so on.
  • Script information can define the content and/or arrangement of video clips. For example, for a target video of the advertisement type, the script information can determine that the target video includes the following video clips: (1) viewer interaction, (2) use experience, (3) product selling points, (4) usage method, (5) function and effect, and (6) action guidance.
  • In some embodiments, the script information may further specify whether the target video contains a specific object or a specific theme, the total duration of the target video, the number of shots contained in the target video, the specific shot pictures contained in the target video, the number of repetitions of a specific theme in the target video, the focus time of a specific theme in the target video, and so on.
  • In some embodiments, the script information may further include the sequence of the shots. For example, the video clips included in the script information may be arranged in a preset order, such as the order (1) to (6) above, and the target video can be generated based on that sequence.
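  • One way such script information might be represented as configuration data; all field names and durations below are illustrative, not from the patent:

```python
# Illustrative script information for an advertisement-type target video.
script_info = {
    "theme": "perfume advertisement",
    "total_duration_s": 30,
    "clips": [
        {"role": "viewer_interaction",     "order": 1, "max_duration_s": 4},
        {"role": "use_experience",         "order": 2, "max_duration_s": 6},
        {"role": "product_selling_points", "order": 3, "max_duration_s": 6},
        {"role": "usage_method",           "order": 4, "max_duration_s": 5},
        {"role": "function_and_effect",    "order": 5, "max_duration_s": 5},
        {"role": "action_guidance",        "order": 6, "max_duration_s": 4},
    ],
}
```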
  • the script information can be of different types.
  • In some embodiments, the script information may be general-purpose script information, applicable to different subjects (for example, products) or themes.
  • For example, the general-purpose script information may in turn include video clips or shots such as audience interaction, use experience, product selling points, usage method, function and effect, and action guidance; or a specific theme, product selling points, usage method, use experience, function and effect, and product cost performance; or applicable scenes/crowds, product selling points, efficacy, product cost performance, and action guidance.
  • In some embodiments, the script information may be related to the subject (e.g., a product) or theme.
  • For example, for one class of products, the corresponding script information may in turn include specific themes, efficacy, design, product ingredients, and action guidance; or audience interaction, product/brand recommendation, applicable scenes/crowds, efficacy 1, efficacy 2, product ingredients, and audience interaction; or target group pain point 1, target group pain point 2, product/brand recommendation, product ingredients, efficacy, a specific theme, use experience, and a specific theme.
  • For food products, the corresponding script information may in turn include eating experience, food attributes, cooking method/process, product ingredients, and eating experience; or product recommendation, cooking method/process, eating experience, brand promotion, and product utility; or brand promotion, product recommendation, packaging design, cooking method/process, eating experience, and action guidance.
  • For other products, the corresponding script information may include specific themes, appearance introduction, efficacy, product attribute 1, production process, product attribute 2, action guidance, and appearance introduction; or product cost performance, product/brand recommendation, applicable scenes/crowds, product texture, and product/brand recommendation; or video clips or shots such as audience interaction, product attributes, efficacy, and product/brand recommendation.
  • the video template may refer to the form packaging of the video.
  • In some embodiments, the video template may include an opening sequence, an ending sequence, watermarks, subtitles, titles, borders, filters, and so on.
  • The opening/ending sequences refer to audio-visual material added at the beginning and end of the video to create an atmosphere, build momentum, attract attention, and present the title of the work, the production team, and product information.
  • A watermark refers to a pattern attached to the video that reflects company or product information or a personalized design.
  • Subtitles refer to non-visual content such as dialogue and product/topic introductions displayed in the video in the form of text.
  • The title is a short sentence indicating the content of the video.
  • A border refers to one or more patterns of specific shapes (for example, strips) surrounding the video page, and a filter refers to an operation used to achieve a special display effect on the image.
  • In some embodiments, the video template may be a template material in Adobe After Effects (AE), which is commonly used software in the field of video production and will not be detailed here.
  • the video template may be related to the subject (e.g., product, model, etc.), theme (e.g., charity, entertainment, education, life, shopping, etc.), video effect (e.g., promotion/advertising effect), etc.
  • For example, the database 140 and/or other data sources 150 may be preset with multiple video templates corresponding to different subjects, themes, video effects, etc.; after the specific subject, specific theme, and video effect of the target video are determined, the corresponding video template is called from the database 140 and/or other data sources 150.
  • As another example, the user may determine the video template by presetting its opening, ending, watermark, subtitles, title, border, and/or filters based on different subjects, themes, and/or video effects.
  • the video configuration information may include at least one preset condition (for example, a first preset condition, a second preset condition, etc.) related to the content of the video segment included in the target video.
  • the first preset condition may be related to at least one of the multiple elements of the content feature in the configuration feature.
  • In some embodiments, the multiple elements may include: whether the target video contains a specific object; whether the target video contains a specific theme; the total duration of the target video; the number of shots contained in the target video; the specific shot pictures contained in the target video; the number of repetitions of a specific theme in the target video; or the focus time of a specific theme in the target video.
  • In some embodiments, the first preset condition may implement screening of the multiple video clips by constraining at least one of the multiple elements.
  • For example, the first preset condition may require that the last shot of the target video be a product display shot, so that a video clip containing a product display shot is selected from the multiple video clips as the last video clip of the target video; a sketch of this kind of screening follows.
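  • A minimal sketch of this element-based screening; the tag vocabulary and clip structure are hypothetical, since the patent does not specify how clip content is annotated:

```python
def pick_last_clip(clips):
    """Screen clips with a first preset condition: product display shot last."""
    return [c for c in clips if "product_display" in c["tags"]]

clips = [
    {"id": "A", "tags": ["use_experience"]},
    {"id": "B", "tags": ["product_display"]},
]
print([c["id"] for c in pick_last_clip(clips)])  # ['B']
```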
  • For the specific content of the first preset condition, refer to process 800 and its related description.
  • the second preset condition may be related to the difference degree feature of the content feature in the configuration feature.
  • the second preset condition may be related to the difference degree of the segment set.
  • The segment set refers to one or more sets formed by grouping video segments that meet specific conditions (for example, video segments that meet the first preset condition, also called candidate video segments).
  • In some embodiments, the second preset condition may implement the screening of segment sets by restricting the degree of difference between segment sets.
  • the target video may be generated based on a filtered set of segments.
  • For the specific content of the second preset condition, refer to process 900 and its related descriptions.
  • the video configuration information includes sequence information.
  • In some embodiments, the sequence information may be related to the arrangement features of the configuration features; that is, the sequence information can determine the arrangement of each video segment in the target video. For example, when generating the target video, the candidate video clips in a clip collection may be sorted based on the sequence information. For the specific content of the sequence information, refer to process 800 and its related descriptions.
  • the video configuration information includes beautification parameters.
  • In some embodiments, the beautification parameters may be related to the arrangement features of the configuration features; the target video can be beautified using the beautification parameters to obtain better visual effects.
  • the beautification parameters may include at least one of filter parameters, animation parameters, and layout parameters.
  • the beautification parameter may be used to beautify at least part of the video clips in the plurality of video clips.
  • the beautification parameter may also be directly used for the target video, initial image, initial video, and so on.
  • Step 230 Generate a target video based on at least part of the video segment and the video configuration information.
  • In some embodiments, the multimedia system 100 may determine at least one segment set from the multiple video segments based on the video configuration information to generate the target video. For example, the multimedia system 100 may obtain one or more candidate video segments from the multiple video segments based on the first preset condition, group the one or more candidate video segments to determine at least one segment set satisfying the second preset condition, and generate, for each segment set in the at least one segment set, a target video according to the corresponding sequence information.
  • Alternatively, the multimedia system 100 may generate a plurality of candidate segment sets based on the plurality of video segments, the plurality of candidate segment sets satisfying the second preset condition; filter out at least one segment set from the plurality of candidate segment sets based on the first preset condition; and generate, for each segment set in the at least one target segment set, a target video according to the corresponding sequence information.
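  • Putting these orders of operations together, a compact sketch of the overall generation loop; the data model and the three predicate/sequencing arguments are hypothetical stand-ins for the operations described above, not the patent's implementation:

```python
from itertools import permutations

def generate_target_videos(clips, first_ok, sets_differ, sequence_key):
    # 1. Screen candidate clips with the first preset condition.
    candidates = [c for c in clips if first_ok(c)]
    # 2. Group candidates into segment sets (here: all 2-clip orderings).
    kept = []
    for s in permutations(candidates, 2):
        # 3. Keep a set only if it differs enough from already-kept sets
        #    (the second preset condition).
        if all(sets_differ(s, k) for k in kept):
            kept.append(s)
    # 4. Sort each kept set by the sequence information; one video per set.
    return [sorted(s, key=sequence_key) for s in kept]

clips = [{"id": "A", "dur": 4}, {"id": "B", "dur": 6}, {"id": "C", "dur": 9}]
videos = generate_target_videos(
    clips,
    first_ok=lambda c: c["dur"] <= 6,                     # duration constraint
    sets_differ=lambda a, b: {c["id"] for c in a} != {c["id"] for c in b},
    sequence_key=lambda c: c["id"],                       # fixed ordering
)
print([[c["id"] for c in v] for v in videos])
```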
  • the type of the target video may include, but is not limited to, advertising video, promotional video, video web log (vlog), entertainment short video, and the like.
  • For the specific method of generating the target video, reference may be made to process 800 or 1200 and their related descriptions.
  • the multimedia system 100 may beautify at least a part of the video clips or the target video in the plurality of video clips based on the beautification parameters to obtain a better visual effect.
  • the multimedia system 100 may also perform conventional video processing on at least part of the video clips or the target video in the plurality of video clips, for example, cropping, zooming, editing based on templates, and so on.
  • the multimedia system 100 may further perform post-processing on the target video to satisfy at least one video output condition.
  • the at least one video output condition is related to a playback medium of the target video.
  • the at least one video output condition is a video size condition.
  • the video size condition may be determined based on the size of the video playback medium.
  • Post-processing the target video may include cropping a frame of the target video according to the video size condition.
  • Fig. 3 is an exemplary flowchart of a method for determining a video segment according to an initial image or an initial video according to some embodiments of the present application.
  • In some embodiments, one or more steps in the process 300 may be stored as instructions in a storage device (for example, the database 140) and called and/or executed by a processing device (for example, the processing device 112 in the server 110).
  • the process 300 may include:
  • Step 310 Obtain at least one of an initial image or an initial video.
  • the initial video may refer to a dynamic image, and the dynamic image may be composed of a series of video frames.
  • the initial video may include video files in various formats such as MPEG, AVI, ASF, MOV, 3GP, WMV, DivX, XviD, RM, RMVB, FLV/F4V, etc.
  • the initial video may also include audio files (audio tracks) corresponding to the moving images.
  • the initial video may include promotional videos, personal recording videos, audiovisual images, network videos, advertisement clips, product demos, movies, and short films or movies containing related products and models.
• The initial image can refer to a static image; for example, the initial image can include picture files in various formats such as bmp, jpg, png, tif, gif, pcx, tga, exif, fpx, svg, psd, cdr, pcd, dxf, ai, raw, etc.
  • the initial image may include photos taken by the camera, advertising images, product renderings, posters, and the like.
• The initial video and the initial image may be captured by a camera or video/image processing equipment and stored in the database 140 and/or the other data source 150. Specifically, step 310 may be implemented by retrieving the corresponding initial video and/or initial image from the database 140 and/or the other data source 150.
  • the initial video and the initial image may be network public material resources, such as image resources and video resources in various open databases, and step 310 may be implemented by obtaining public materials.
  • the multimedia system 100 may also obtain the initial video and the initial image through other direct or indirect methods. For example, the multimedia system 100 directly obtains a video file or an image file uploaded by a user, or obtains a video file or an image file based on a link input by the user.
  • Step 320 Perform editing processing on the initial image or the initial video to obtain multiple video clips.
• Step 320 may determine the segment boundaries of the initial video or the initial image, segment or group the initial video or initial image based on the segment boundaries to determine multiple shot segments of the initial video or initial image, and then take the shot segments as the multiple video segments. For example, if the initial video is a cooking video, the cooking steps, dish production steps, and tasting steps in the initial video can be divided into different shots, and each shot can then be used as one of the multiple video clips. For details on determining video segments based on the initial video, reference may be made to Fig. 4 of this application and its related description.
  • each video segment may be a shot segment.
  • the initial video or the initial image often includes multiple shots, and therefore, the initial image or the initial video needs to be split. It is understandable that a video clip may also include multiple shots.
• For a video segment containing multiple shot segments, the initial video or the initial image can be split directly according to the number of shot segments.
• Alternatively, the video segment may first be split into multiple video segments each containing only one shot segment, and the multiple video segments can then be spliced into one video segment according to the constraint conditions (for example, binding conditions) of the video segments.
  • an initial video may include only one shot segment, and the entire initial video is treated as a shot segment and output as a video segment.
• An initial video may be formed by splicing multiple shots, and one or more consecutive video frames at the junction of two adjacent shots may be called a segment boundary (also called a shot boundary frame).
• The initial video may be segmented in units of shot segments to obtain each video segment. When splitting the initial video into multiple shot segments, the splitting may be performed at the segment boundaries; that is, the segment boundaries are used as cutting points to split the initial video into multiple video segments.
  • Fig. 4 is an exemplary flowchart of a method for editing an initial image or an initial video according to some embodiments of the present application, which specifically involves splitting the initial image or the initial video.
• One or more steps in the process 400 may be stored as instructions in a storage device (for example, the database 140), and called and/or executed by a processing device (for example, the processing device 112 in the server 110).
  • the process 400 may specifically include the following steps:
  • Step 410 Obtain the characteristics of each pair of adjacent images or video frames in the initial image or the initial video.
• An image embedding model may be used to obtain the feature information of each pair of adjacent images or video frames in the initial image or the initial video (for example, an advertisement video).
  • the processing method adopted for the initial image is similar to that of the initial video.
• The multimedia system 100 can input the initial video into the image embedding model.
• The image embedding model can extract the image of each video frame that constitutes the initial video, extract the features of the images of the video frames, and generate a vector corresponding to the image of each video frame.
• Alternatively, the images of the video frames that have already been extracted may be input into the image embedding model, and the image embedding model outputs the vector corresponding to the image of each video frame.
• The feature information of the video frames can be obtained based on a mobilenet model (such as a mobilenetV2 model) pre-trained on the ImageNet image library.
  • the mobilenetV2 model can extract the image features of each video frame more accurately and quickly.
  • each video frame can be input into the mobilenetV2 model, and the normalized 1280-dimensional vector corresponding to each video frame can be output through the mobilenetV2 model.
• Other machine learning models with similar functions can also be used to obtain the feature information of the video frames, such as the GoogLeNet model, the VGG model, the ResNet model, etc., which is not limited in this application.
• In this way, the shot boundary frames can be determined more accurately, so as to realize accurate segmentation of the shot segments, making the subsequent cropping of the initial image or the initial video easier to operate and avoiding cutting off the main information of the initial image or initial video.
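• As a non-limiting illustration, the following sketch shows how such per-frame feature vectors could be extracted with an ImageNet-pretrained MobileNetV2; the use of torchvision, the 224×224 preprocessing, and pooling the backbone output into a 1280-dimensional vector are assumptions of this sketch, not requirements of the embodiments.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# ImageNet-pretrained MobileNetV2 backbone, as suggested above.
model = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.IMAGENET1K_V1)
model.eval()

# Standard ImageNet preprocessing (an assumption of this sketch).
preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def frame_embedding(frame_rgb):
    """Map one RGB video frame (H x W x 3, uint8) to a normalized 1280-dim vector."""
    x = preprocess(frame_rgb).unsqueeze(0)
    with torch.no_grad():
        feat = model.features(x)  # (1, 1280, 7, 7) feature map
        vec = torch.nn.functional.adaptive_avg_pool2d(feat, 1).flatten()
    return vec / vec.norm()  # L2-normalize, matching the 1280-dim vector above
```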
  • Step 420 Determine the similarity of each pair of adjacent images or video frames.
• This can be achieved by calculating, according to the feature information of the video frames, the similarity between each video frame and a selected video frame among the multiple video frames.
  • the inner product of the feature vectors of two video frames may be used as the similarity between the two video frames.
• Calculating the similarity between each pair of adjacent images may be calculating the similarity between each video frame and its preceding and/or following adjacent video frames; the similarity can also be calculated between each video frame and the video frame a preset number of frames before and/or after it.
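• A minimal sketch of the similarity computation described above, assuming the frame embeddings are already L2-normalized so that the inner product equals cosine similarity:

```python
import numpy as np

def pairwise_similarity(embeddings, gap=1):
    """Inner product between frame i and frame i + gap.

    gap=1 compares adjacent frames; a larger gap supports the
    preset-interval comparison described above.
    """
    E = np.asarray(embeddings)
    return np.sum(E[:-gap] * E[gap:], axis=1)  # sim[i] = <E[i], E[i+gap]>
```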
  • Step 430 Identify the segment boundary based on the similarity of each pair of adjacent images or video frames.
  • the segment boundary may include a hard-cut boundary frame or a soft-cut boundary frame.
• Identifying the segment boundary may include determining the hard-cut boundary frames of the shot segments. If no transition effect is used between two adjacent shot segments, and the two adjacent video frames of the two adjacent shot segments jump directly, the two adjacent video frames can be understood as hard-cut boundary frames. In the process of determining hard-cut boundary frames, the similarity between each video frame and its preceding and/or following adjacent video frame can be calculated; if the similarity between two adjacent video frames is lower than a similarity threshold, the two adjacent video frames are determined to be hard-cut boundary frames.
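• For illustration, hard-cut boundary pairs could then be found by thresholding the adjacent-frame similarities; the threshold value here is illustrative only:

```python
def hard_cut_boundaries(similarities, threshold=0.8):
    """Return indices i where frames i and i + 1 are dissimilar enough
    to be treated as a hard-cut boundary frame pair."""
    return [i for i, s in enumerate(similarities) if s < threshold]
```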
• Identifying the segment boundary may include determining the soft-cut boundary frames of the shot segments. If a transition effect is used between two adjacent shot segments, so that the adjacent video frames of the two adjacent shot segments do not jump directly, the several video frames used for the transition between the two shot segments can be understood as soft-cut boundary frames.
  • the soft cut boundary frame can be determined by the following methods:
  • the candidate segmentation area can be determined by calculating the similarity between each video frame and the video frame with a preset number of frames before and/or after.
• The preset interval frame number can be set to 2 frames, 3 frames, 5 frames, and so on. If the calculated similarity between the two video frames is less than the preset threshold, the video frames between the two video frames are used as a candidate segmentation area, and the two video frames are used as the boundary frames of the candidate segmentation area. For example, if the preset interval frame number is 2 frames, the similarity between the 10th frame and the 14th frame can be calculated; if the similarity is less than the similarity threshold, the 12th and 13th frames are used as a candidate segmentation area, and the 10th and 14th frames are the boundary frames of the candidate segmentation area.
• Overlapping candidate segmentation areas can then be merged together. If the 12th and 13th frames form a candidate segmentation area, and the 13th and 14th frames also form a candidate segmentation area, the 12th, 13th, and 14th frames are merged into one candidate segmentation area.
  • the candidate segmentation regions can be further screened.
  • the candidate segmentation area may be filtered based on the similarity S1 within the candidate segmentation area and the similarity S2 outside the candidate segmentation area.
• The method of calculating the similarity S1 within the candidate segmentation area may be: calculating the similarity between each boundary frame of the candidate segmentation area and the video frame located in the candidate segmentation area and separated from that boundary frame by a preset number of frames, and taking the minimum of the calculated similarities as the similarity S1 within the area.
• The method of calculating the similarity S2 outside the candidate segmentation area may be: calculating the similarity between the video frame at the front of the candidate segmentation area and the video frame a preset number of frames before it, calculating the similarity between the video frame at the back of the candidate segmentation area and the video frame a preset number of frames after it, and taking the minimum of the two similarities as the similarity S2 outside the area. For example, if the candidate segmentation area is the 12th and 13th frames and the preset interval frame number is 2, the similarity between the 10th frame and the 12th frame and the similarity between the 13th frame and the 15th frame are calculated, and the minimum of the two similarities is taken as the similarity S2 outside the area. If the value of S2 is greater than S1 by more than a threshold, the candidate segmentation area is deemed to be a final segmentation area, and the shot segmentation operation is performed based on the final segmentation area.
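• The following sketch shows one plausible reading of the soft-cut procedure above (candidate areas from interval comparison, merging of overlapping areas, and S1/S2 screening); the exact definitions of S1 and S2, the gap, and the margin are simplified assumptions of this sketch:

```python
import numpy as np

def soft_cut_regions(embeddings, gap=2, sim_thresh=0.8, margin=0.1):
    """Candidate soft-cut areas from interval comparison, merged and then
    screened by comparing similarity outside the area (S2) against
    similarity across its boundary (S1)."""
    E = np.asarray(embeddings)
    sim = lambda a, b: float(np.dot(E[a], E[b]))

    # 1. Candidate areas: frames strictly between a dissimilar (i, i+gap) pair.
    candidates = [(i + 1, i + gap - 1)
                  for i in range(len(E) - gap)
                  if sim(i, i + gap) < sim_thresh]

    # 2. Merge overlapping or adjacent candidate areas.
    merged = []
    for lo, hi in candidates:
        if merged and lo <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))

    # 3. Screening: keep areas where the outside similarity S2 clearly
    #    exceeds the boundary similarity S1.
    final = []
    for lo, hi in merged:
        if lo - gap < 0 or hi + gap >= len(E):
            continue
        s1 = min(sim(lo - 1, lo), sim(hi, hi + 1))
        s2 = min(sim(lo - gap, lo - 1), sim(hi + 1, hi + gap))
        if s2 - s1 > margin:
            final.append((lo, hi))
    return final
```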
  • Step 440 Based on the segment boundaries, divide the initial image or the initial video to obtain multiple video segments.
  • the initial video can be split by using the segment boundary as the cutting point, so that each split video segment contains a shot segment.
• The initial images can also be processed based on the method shown in Fig. 4 to obtain the images determined to be non-repetitive among the initial images as corresponding video segments, or subjected to other editing processing to obtain corresponding video segments.
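• A minimal sketch of the splitting in step 440, assuming the segment boundaries have been reduced to cut indices in the frame sequence:

```python
def split_at_boundaries(frames, cut_points):
    """Split a frame sequence into shot segments, using each segment
    boundary index as a cutting point so that every resulting segment
    contains a single shot."""
    segments, start = [], 0
    for cut in sorted(cut_points):
        segments.append(frames[start:cut + 1])
        start = cut + 1
    segments.append(frames[start:])
    return [s for s in segments if s]
```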
  • the present application may also determine a specific subject (also referred to as an object) related to the specific theme according to the specific theme of the target video.
• The specific subject here can be understood as the object or main object contained in the video frames/shot segments related to the specific theme (which may also be expressed as the target video theme, cropping theme, etc.), and can include living things (humans, animals, etc.), merchandise (cars, daily necessities, decorations, cosmetics, etc.), background (mountains, roads, bridges, houses, etc.), and so on. For example, when the specific theme of the target video is advertising, the specific object can include a combination of one or more of people, objects, or logos; specifically, the person can be an event/product spokesperson, the item can be a corresponding product, and the logo can be a product trademark, a regional mark, etc.
• In different descriptions, specific objects may be referred to by different names: a specific object can be expressed as a specific subject, as a subject, or as the cropped subject.
• The process of determining a specific subject based on a specific theme can be achieved by determining the specific subject from candidate subjects based on the specific theme.
• The specific subject can be automatically selected based on the degree of relevance between the multiple candidate subjects included in the video clip and the specific theme. For example, the relevance of each candidate subject to the specific theme is ranked, and the top X candidate subjects are then selected.
  • X can be set to 1, 2, or 4, etc.
  • the top X candidate subjects can be determined as the specific subjects of the video clip.
• For the degree of association between a candidate subject and a specific theme, reference may be made to the description in the process 1600 of the correlation between one or more cropped subjects in the cropped-subject information and the theme information.
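• As an illustration of the top-X selection described above, assuming relevance scores between candidate subjects and the specific theme are already available (e.g., from semantic analysis):

```python
def select_specific_subjects(candidates, relevance, top_x=2):
    """Rank candidate subjects by relevance to the specific theme and keep
    the top X. `relevance` maps each candidate to a score in [0, 1]; how
    the score is produced (e.g., semantic analysis) is left open."""
    ranked = sorted(candidates, key=lambda c: relevance[c], reverse=True)
    return ranked[:top_x]

# e.g. for a "lipstick" theme (illustrative scores):
# select_specific_subjects(
#     ["mouth", "lipstick", "tree", "road"],
#     {"mouth": 0.9, "lipstick": 0.95, "tree": 0.1, "road": 0.05},
#     top_x=2)  # -> ["lipstick", "mouth"]
```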
  • the candidate subject may be a subject set in the database 140 in advance.
  • Candidate subjects can be set specifically for a specific type of subject.
• The candidate subject may be a commodity or a human face (including facial features such as the eyes, nose, and mouth).
  • the candidate subject can also be the subject in each video frame/shot segment.
• The process of obtaining candidate subjects can be achieved by determining the candidate subjects of each video segment through a machine learning model. For example, the machine learning model determines the main content in the video frames of each video clip, the main content is used as a candidate subject, the correlation between the candidate subject and the specific theme is then determined by semantic analysis, and each specific subject is determined accordingly. This method can give the selected specific subjects a higher degree of relevance to the specific theme.
• The specific subjects of a video clip can be determined based on a specific theme. For example, the theme information of the target video is lipstick.
• The video clip may contain one or more subjects.
• The subjects of each video clip identified by the machine learning model are used as candidate subjects, which may include the nose of a human face, the mouth of a human face, the eyes of a human face, a commodity (lipstick), trees, roads, and houses. Based on the specific theme being lipstick, the mouth of the face and the commodity (lipstick), i.e., the candidate subjects with a high correlation with lipstick, can be further selected and become the specific subjects of the corresponding video clips.
  • the target video may include multiple video themes.
• For example, a tooth-protection promotion video may be spliced before the advertisement video of a mouthwash.
  • different video themes may contain different specific subjects.
• The number of overlaps of a specific theme can be determined according to the specific subjects contained in the video clip and the correspondence between the specific subjects and the specific themes (different video themes), for example, the aforementioned mouthwash advertisement with a tooth-protection promotion.
• A shot related to the effect display can be related to two specific themes at the same time.
• For example, the shots related to the effect display can include the effect of not brushing, the effect of brushing with a toothbrush, and the effect of using mouthwash; while promoting tooth protection, they also highlight the cleaning effect of the mouthwash.
• The focus time of a specific theme in the target video can be determined by counting the focus time corresponding to the specific theme (i.e., the duration of the video clips that take the specific theme as their focus or main content).
  • Fig. 5 is an exemplary flowchart of a method for editing an initial image or an initial video according to some embodiments of the present application.
• One or more steps in the process 500 may be stored as instructions in a storage device (for example, the database 140), and called and/or executed by a processing device (for example, the processing device 112 in the server 110).
  • Step 510 Determine the subject information of the initial image or the initial video.
  • the subject information includes at least the subject and the position of the subject.
• The initial image or initial video usually includes one or more subjects to be highlighted.
  • the subject may specifically be the object most relevant to the target video theme among the various objects appearing in the video clip.
• The subject may also be the object occupying the largest proportion of the video frame.
  • the subject may be one or more of products (electronic products, daily necessities, decorations, etc.), creatures (humans, animals, etc.), or landscapes (mountains, houses, etc.), and the like.
• For example, there is one subject, and the subject is a model.
  • the subject may be manually imported.
  • the user may select the subject from the database 140 or the terminal device 130.
  • the model can be used as the main body of the target video (specific theme).
  • the corresponding video clip, initial video, and initial image should contain the subject.
• For example, the user selects an image of the model from the database 140, and the processor further processes the initial image or initial video to determine the position of the model in each video frame.
• The selection of the subject can be achieved by uploading a specific image, manually selecting a specific image in a video frame, semantic recognition, and similar methods. For example, after the user enters the text content "model", the multimedia system 100 can automatically recognize the "model image" in each initial video and initial image as the subject through semantic recognition.
  • the subject information includes at least the presence or absence of the subject and the subject position of the corresponding subject.
  • the subject information can also include the subject's color information, size information, name information, category information, or facial recognition data.
  • the position of the subject is understood as the information about the position of the subject in the picture and/or video, for example, it may be the information of the coordinates of the reference point.
  • the size information of the main body may include information about the actual size of the main body and information about the proportion of the main body in the size of the screen of the advertisement video.
  • the category information of the subject can be understood as the category of the subject.
• The category information of the subject includes the information that the subject is classified as a product or a model, and it can be further refined into a certain category of product information, for example, that the product is a mobile device.
  • the subject information can be characterized by tags (such as tag ID and tag value).
• Tags can be added to the video frames in the initial video, and a tag can represent the name of a subject included in the image or video. If a poster includes product A, product B, and model A, tags for product A, product B, and model A can be added to the poster (for example, the tag IDs corresponding to product A, product B, and model A are added, and the corresponding tag values are set to 1).
  • each image or video to be selected in the database 140 may hold a tag.
• In this way, the database 140 can automatically match the poster associated with the aforementioned product A, product B, and model A, and extract the poster as the initial image.
• Alternatively, the video content may be directly processed further (for example, the video frames of each video are analyzed through the subject information determination model), thereby obtaining tags containing the subject and the subject position.
  • the method of filtering video frames can also be implemented by a machine learning algorithm, that is, a machine learning model is used to identify whether each video segment contains a specific object.
• For example, the theme of the target video can be tooth-protection promotion, and the corresponding specific objects can be “teeth”, “doctor”, “dental appliance”, and other specific objects related to the theme (tooth protection).
• Based on the determined specific objects, the machine learning algorithm can be used to determine whether each video segment contains a specific object.
  • the initial image or the initial video may be processed by a subject information determination model (for example, a machine learning model) to obtain the subject and the location of the subject.
• The subject information determination model can be a generative model, a discriminative model, or a deep learning model in machine learning; for example, it can be a deep learning model using algorithms such as the YOLO series algorithms, the Faster R-CNN algorithm, or the EfficientDet algorithm.
  • the subject information determination model can perform subject information confirmation alone, or can be combined with other processes/steps (such as process 400) to determine the subject and the location of the subject.
  • the subject information determination model can be trained separately or together with other steps.
• For example, when a deep learning model is used for the editing processing (for example, the process 400), manually labeled object positions and categories can be used as training samples to train the model, so that the model can accurately label the subject in the initial video.
• The image embedding model can further be used as the subject information determination model to extract the images that make up each video frame in the initial video and extract the image features of the video frames, so as to determine the subject of the initial video.
  • the initial image may be processed by the subject information determination model to obtain the position of the subject.
  • the image embedding model can continue to be used to determine the position of the subject. It is understandable that a single image in a video frame can be regarded as a picture, and the image embedding model that can process multiple video frames can also process the initial image.
  • the image embedding model for initial image and initial video processing can be trained separately or together.
  • the determination of the subject position in the initial image can also be used in the initial video.
• The deep learning model may be, for example, a deep learning model using algorithms such as the YOLO series algorithms, the R-CNN algorithm, or the EfficientDet algorithm.
  • the subject information determination model can perform the operation of determining subject information alone; the subject information determination model can also be combined with other operations to realize the determination of subject information during the execution of other operations.
  • the initial image or initial video may be directly input to the subject information determination model, so that the subject information determination model marks the corresponding subject and subject position in the corresponding video frame, thereby determining the subject information of the initial image or initial video .
• One or more video clips may be obtained after a long initial video is edited as shown in the process 400.
• The subject information determination model can be combined with the related operations of the process 400, so that video segments with a specific subject are retained during the splitting or editing of the process 400; for example, the retained subject can be a specific object related to a specific theme of the target video.
  • the subject information determination model may be used to label and crop the subject to ensure that the subject is included in the cropped video.
  • the subject information determination model can be combined with the related operations of step 310.
• For example, the image embedding model extracts the image features of the subject (that is, the image features of the specific object determined as the subject, such as the image features of a "model"), a series of video frames containing the subject are determined in the database 140, and the shot segment composed of that series of video frames is the initial video or initial image containing the subject.
  • Step 520 Perform editing processing on the initial image or the initial video based on the subject information to obtain multiple video clips.
  • the editing process can avoid the range of the subject according to the determined position of the subject, so as to generate a video clip that meets the requirements.
• In order to improve processing accuracy, the outer contour of the subject can be used to avoid the influence of the editing processing on the subject; that is, the outer contour distinguishes the subject from the background part of the video.
  • the information of the subject may also include color information and size information, etc. Obviously, the outer contour of the subject can be determined more quickly and efficiently based on the color information and size information on the basis of the position of the subject.
  • the outer contour of the main body can be determined by the size of the main body.
• The smallest rectangular box containing the subject can be determined according to the size of the subject, and the smallest rectangular box can be used as the outer contour of the subject.
• The outer contour of the subject can also be determined by the edge of the subject, where the edge refers to the boundary between the subject and the image background. For example, after the position of the subject is determined, an image recognition algorithm (such as an edge detection algorithm) is used to determine the edge of the subject, and the edge of the subject is used as the outer contour of the subject.
  • the area obtained by preset processing of the edge of the subject may also be used as the outer contour of the subject.
  • the preset processing may include one or more combinations of smoothing, area scaling, and the like.
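• A sketch of the two outer-contour options above using OpenCV; the (x, y, w, h) box format and the Canny thresholds are assumptions of this sketch:

```python
import cv2
import numpy as np

def subject_outline(image_bgr, subject_box):
    """Two outer-contour options from the text: fall back to the smallest
    rectangle containing the subject, or tighten it around detected edges.
    `subject_box` is an (x, y, w, h) box from the subject information model."""
    x, y, w, h = subject_box
    roi = image_bgr[y:y + h, x:x + w]
    edges = cv2.Canny(cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY), 100, 200)
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return subject_box  # no edges found: keep the rectangular contour
    bx, by, bw, bh = cv2.boundingRect(np.vstack(contours))
    return (x + bx, y + by, bw, bh)  # tightest rectangle around the edges
```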
  • the initial image or initial video is cropped or zoomed according to the outer contour of the subject to obtain a video segment that meets the requirements.
• A video clip that meets the requirements may be a video clip obtained by editing the initial image or initial video without affecting the subject. This can be achieved by cropping the initial image or initial video while avoiding the outer contour of the subject, and by scaling the initial image or initial video while maintaining the aspect ratio within the outer contour of the subject.
• The editing processing may also include any of the editing methods mentioned in this application. For example, when performing beautification operations, blurring can be applied only to the image outside the outer contour of the subject, so as to highlight the subject.
  • cutting the initial image or the initial video to avoid the outer contour of the subject can be achieved by matting.
  • the outer contour of the subject in the initial image or initial video has been identified, and the matting algorithm can be used to avoid the outer contour of the subject and separate the subject from the initial image or initial video.
• The processing methods for the separated subject include, but are not limited to, locking it or creating a new layer; after the subject is locked or a new layer is created, further processing can be performed on the background part.
• The matting algorithm may be a matting algorithm based on deep learning, such as learning-based digital matting, K-nearest-neighbor matting (KNN matting), and so on.
• The matting algorithm may also be at least one of cluster-based sampling matting (Cluster-Based Sampling matting, CBS) and iterative transductive learning for alpha matting (Iterative Transductive Learning for alpha matting, ITL).
• Scaling the initial image or initial video while maintaining the aspect ratio within the outer contour of the subject can be achieved by separating the subject from the background. Specifically, in order to avoid deformation and distortion of the subject during the zooming process, the subject and the background part are zoomed separately, and the aspect ratio within the outer contour of the subject is maintained during zooming.
• For example, the initial image can be a poster with a pixel size of 800×600, in which the subject is a mobile phone with a pixel size of 150×330 (the aspect ratio of the subject is 5:11), and the size of the target video or video clip is 1200×800, so the initial image needs to be scaled to 1200×800.
• If the whole image were scaled directly, the scaled size of the subject would be 225×440, i.e., an aspect ratio of 5:9.8, and the subject would be deformed.
• The deformation of the subject in the target video or video clip may adversely affect the effect of the video and the customer's perception of the product.
• The method of maintaining the aspect ratio within the outer contour of the subject may be to obtain the scale ratios of the initial image or initial video to the target video size (or video clip size) in the width direction and the length direction separately; in this example, the initial image is scaled by 1.5 times in the width direction and by about 1.33 times in the length direction, while the aspect ratio within the outer contour of the subject is kept unchanged.
• The outer contour of the subject may not be a rectangle, and the above scaling method is still applicable.
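• The arithmetic above can be summarized in a short sketch; scaling the subject by the smaller of the two background scale factors is one possible way (an assumption of this sketch) of keeping its aspect ratio while the background is stretched:

```python
def scale_keeping_subject(bg_size, subject_size, target_size):
    """Reproduce the arithmetic above: the background is stretched to the
    target size, while the subject is scaled uniformly so its aspect
    ratio is preserved."""
    sx = target_size[0] / bg_size[0]  # 1200 / 800  = 1.5
    sy = target_size[1] / bg_size[1]  # 800  / 600 ~= 1.33
    s = min(sx, sy)                   # uniform factor for the subject
    return (sx, sy), (subject_size[0] * s, subject_size[1] * s)

# (sx, sy), subject = scale_keeping_subject((800, 600), (150, 330), (1200, 800))
# background stretched 1.5x / 1.33x; the subject stays at 5:11
# instead of being deformed to 225x440 (about 5:9.8).
```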
• The image processing method is similar to the video processing method and will not be repeated here.
• When the size ratio of the background of the initial image or the initial video is inconsistent with the size of the target video or video segment, directly scaling may cause the ratio to change; in this case, the background part can be cropped first and then zoomed.
  • Fig. 6 is an exemplary flow chart of generating a video according to some embodiments of the present application.
• One or more steps in the process 600 may be stored as instructions in a storage device (for example, the database 140), and called and/or executed by a processing device (for example, the processing device 112 in the server 110).
  • the process 600 specifically includes the following steps:
  • Step 610 Acquire at least one of an initial video and an initial image.
  • the initial video and the initial image may also be referred to as the to-be-processed video and the to-be-processed image, respectively.
  • Step 610 can be implemented with reference to step 310 in the process 300, and step 610 can also be implemented with reference to step 510 in the process 500.
  • Step 620 Obtain the main body of the target video in the initial video.
• The subject of the target video here can be understood as each specific object corresponding to the specific theme of the target video. For example, for the aforementioned target video with "tooth-protection promotion" as its theme, the subject of the target video (a specific object corresponding to the specific theme) may be a specific object related to the theme (tooth protection) such as "teeth", "doctor", or "dental appliance".
• Step 630 Crop, zoom, and/or edit the initial video based on the preset size of the target video and the subject, to obtain video materials that all include the subject.
• The video material may be a video segment determined from the initial video, and video materials that all include the subject refer to video segments that are determined based on the initial video and all contain the subject (specific object) of the target video.
  • the target video may be preset with a preset size as required.
• When the initial video size does not match the target video size, or the size ratio does not match, the initial video may be cropped, zoomed, and/or edited.
• For example, the target video size is FHD (Full High Definition, 1920×1080).
• When the size ratios match, the initial video can be scaled to obtain a video with the same size as the target video, i.e., a 1920×1080 video.
• Otherwise, a cropping target size of, for example, 1024×768 is determined according to the target video size ratio; that is, the initial video is first cropped frame by frame, and the cropped 1024×768 video is then enlarged to 1920×1080 in equal proportions.
• When the initial video size is larger than the target video size, for example, when the initial video size is 2560×2560, the initial video can be directly cropped to the target video size of 1920×1080, or, following the above steps, first cropped to 2560×1440 and then scaled in equal proportions.
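• A sketch of the frame-by-frame crop-then-scale described above; center cropping is an assumption of this sketch, whereas the embodiments crop around the subject:

```python
import cv2

def crop_then_scale(frame, target_w=1920, target_h=1080):
    """Center-crop a frame to the target aspect ratio, then resize to the
    target size (e.g., FHD)."""
    h, w = frame.shape[:2]
    target_ratio = target_w / target_h
    if w / h > target_ratio:          # frame too wide: crop width
        new_w = int(h * target_ratio)
        x0 = (w - new_w) // 2
        frame = frame[:, x0:x0 + new_w]
    else:                             # frame too tall: crop height
        new_h = int(w / target_ratio)
        y0 = (h - new_h) // 2
        frame = frame[y0:y0 + new_h, :]
    return cv2.resize(frame, (target_w, target_h))
```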
  • the method of cropping the video frame by frame in this step can refer to the image cropping method shown in FIG. 7 of this application.
  • the method of cropping the video frame by frame in step 630 can also refer to the video cropping method shown in FIG. 14 of the application.
• When the initial video size matches the target video size, or matches it after cropping and scaling, an initial video with a longer duration (for example, more than 15 seconds or 20 seconds) can be edited.
• Generally, one video material corresponds to one scene (shot segment), and playing the same scene for a long time may make the viewer feel bored.
  • the duration of the video material can also be adjusted by interpolation or sampling.
• Step 640 Crop and/or scale the initial image based on the preset size of the target video to obtain image materials that all include the subject.
• The image material may be determined from the initial image, and image materials that all include the subject refer to images that are determined based on the initial image and all contain the subject (specific object) of the target video.
• The image files among the initial images whose sizes do not match the target video size are cropped or zoomed; continuing with the example of an FHD target video, cropping and/or zooming yields an image with a size of 1920×1080 that includes the subject.
• Since at least one of the initial image and the initial video is acquired in step 610: when both are acquired, step 630 and step 640 are both performed, and there is no required sequence between the two steps; when only the initial video is acquired, step 630 may be performed without performing step 640; and when only the initial image is acquired, step 630 may be skipped and step 640 performed.
• Step 640 may be implemented by the method shown in Fig. 7, where Fig. 7 is an exemplary flowchart of generating a video according to some embodiments of the present application; one or more sub-steps in step 640 may be stored in a storage device (for example, the database 140) as instructions, and called and/or executed by a processing device (for example, the processing device 112 in the server 110).
  • Step 642 Obtain the subject information of the target video in the initial image.
  • the initial image may be processed by a subject information determination model (for example, a machine learning model) to obtain subject information.
• The subject information determination model can be a generative model, a discriminative model, or a deep learning model in machine learning; for example, it can be a deep learning model using algorithms such as the YOLO series algorithms, the Faster R-CNN algorithm, or the EfficientDet algorithm.
  • the initial image or initial video may be directly input to the subject information determination model, so that the subject information determination model marks the corresponding subject and subject position in the corresponding video frame, thereby determining the subject information of the initial image or initial video .
  • reference may be made to the related description of step 510 in the process 500.
  • Step 644 Identify the outer contour of the subject based on the subject information.
  • the outer contour of the subject is determined based on the subject position, so as to distinguish the subject from the background part in the initial image. It should be noted that in some other embodiments, the information of the subject may also include color information and size information, etc. Obviously, the outer contour of the subject can be determined more quickly and efficiently based on the color information and size information on the basis of the position of the subject.
  • the outer contour of the main body can be determined by the size of the main body.
• The smallest rectangular box containing the subject can be determined according to the size of the subject, and the smallest rectangular box can be used as the outer contour of the subject.
• The outer contour of the subject can also be determined by the edge of the subject, where the edge refers to the boundary between the subject and the image background. For example, after the position of the subject is determined, an image recognition algorithm (such as an edge detection algorithm) is used to determine the edge of the subject, and the edge of the subject is used as the outer contour of the subject.
  • the area obtained by preset processing of the edge of the subject may also be used as the outer contour of the subject.
  • the preset processing may include one or more combinations of smoothing, area scaling, and the like.
  • Step 646 Crop the initial image avoiding the outer contour of the subject.
  • the initial image can be cropped by avoiding the outline of the subject.
  • cropping of the initial image by avoiding the outer contour of the subject can be achieved by matting.
  • the outline of the subject can be avoided by the matting algorithm and the subject can be separated from the initial image.
  • the processing methods for the separated subject include but are not limited to locking or creating a new layer. After the subject is locked or a new layer is created, the background part can be further processed.
• The matting algorithm may be a matting algorithm based on deep learning, such as learning-based digital matting, K-nearest-neighbor matting (KNN matting), and so on.
• The matting algorithm may also be at least one of cluster-based sampling matting (Cluster-Based Sampling matting, CBS) and iterative transductive learning for alpha matting (Iterative Transductive Learning for alpha matting, ITL).
• Step 648 Zoom the image to be processed while maintaining the aspect ratio within the outer contour of the subject.
• Scaling the initial image while maintaining the aspect ratio within the outer contour of the subject can be achieved by separating the subject from the background. Specifically, in order to avoid deformation and distortion of the subject during the zooming process, the subject and the background part are zoomed separately, and the aspect ratio within the outer contour of the subject is maintained during zooming.
• For example, the initial image can be a poster with a pixel size of 800×600, in which the subject is a mobile phone with a pixel size of 150×330 (the aspect ratio of the subject is 5:11), and the size of the target video or video clip is 1200×800, so the initial image needs to be scaled to 1200×800.
• If the whole image were scaled directly, the scaled size of the subject would be 225×440, i.e., an aspect ratio of 5:9.8, and the subject would be deformed.
• The deformation of the subject in the target video or video clip may adversely affect the effect of the video and the customer's perception of the product.
• The method of maintaining the aspect ratio within the outer contour of the subject may be to obtain the scale ratios of the initial image to the target video size (or video segment size) in the width direction and the length direction separately; in this example, the initial image is scaled by 1.5 times in the width direction and by about 1.33 times in the length direction, while the aspect ratio within the outer contour of the subject is kept unchanged.
• For step 648, reference may also be made to step 520 in the process 500.
• Step 650 Splice the video clips based on the video configuration information to generate the target video.
  • the video clip in step 650 may include video material each including the subject and/or image material including the subject.
  • the video configuration information may be determined based on script information and/or video templates.
  • the video template may include the overall video template of the target video, and may also include the fragment video template of each video segment that composes the target video.
  • the video template can include at least a time parameter.
  • the time parameter at least reflects the length of the target video or video segment (the total duration of the target video).
• Processing the initial image and/or the initial video yields video segments (including image material and/or video material) consistent with the target video size; therefore, the splicing can play the video clips in an orderly manner based on the time parameter, either randomly or according to predetermined rules.
  • the predetermined rule may be that image materials and video materials are alternately spliced, or image materials are concentrated in the middle of the target video for playback, etc., or they may be played in a preset order corresponding to each video segment.
• The display time of a single picture (such as 3 seconds, 5 seconds, or 10 seconds) can be defined in the splicing, and the next material is switched to when the display time is reached.
• The duration of a video clip may be different from the time parameter in the clip video template; one or more methods, such as cutting the video clip, merging it with other video clips, sampling the video frames for playback, or interpolating between video frames for playback, can be combined to adjust the duration of the video clip.
  • the video template may also be used for steps such as obtaining an initial video or an initial image, and generating a video clip.
• That is, the object of the video template in this step may be the target video, a video clip, an initial image, or an initial video.
• The video template may also include beautification parameters, and the target video can be beautified by the beautification parameters to obtain a better effect.
• Alternatively, the aforementioned beautification parameters may not be included in the video template, and may instead be acquired before video rendering, before determining a video segment, and/or before generating a target video.
• The beautification methods described below can also be applied to the target video, video segments, initial images, and initial videos.
  • the beautification parameters may include at least one of filter parameters, animation parameters, and layout parameters.
  • the filter parameter can be to add a global effect filter (such as black and white, retro, bright, etc.) to the target video;
• The animation parameter can specify that, when the target video is spliced from multiple video clips, an animation effect is added between the video clips to make the target video more natural;
• Because the position of the subject differs from segment to segment, the subject position information can be marked in the material (for example, the subject is located in the upper left, upper right, lower left, lower right, etc. of the entire image/video), and the layout parameters combine and arrange the subject position information to make the target video smoother and the subject more prominent.
  • the beautification parameter may also include removing watermark or adding watermark, etc.
• A text layer, a background layer, or a decoration layer and the corresponding loading parameters may also be obtained based on the video configuration information. The layout of the text layer, the background layer, and the decoration layer in the target video is determined according to the loading parameters, and the text layer, the decoration layer, and/or the background layer are embedded into the video clips or the target video according to the layout during the splicing and rendering process.
  • the text layer may be subtitles or additional text introductions.
• The image material sometimes has a transparent background, and a background layer may be required. It is understandable that the above text layer and background layer are added according to the actual situation of the target video.
  • the text layer, the decoration layer, and the background layer may be included in the video template.
  • the initial image and the initial video may come from different sources, and the color difference may be relatively large. Therefore, when the video clip is generated, the difference between the video clips may also be relatively large, so normalization processing is required.
  • the normalization process may be performed when the video segment is determined, that is, the initial image and/or the initial video are normalized to generate a video segment with a uniform style.
  • the normalization process may be performed during splicing rendering, that is, the normalization process is performed on the video segment to generate a target video with a uniform style. Since a video frame can be regarded as an image, the normalization of an image refers to the process of performing a series of standard processing transformations on the image to transform it into a fixed standard form.
  • the standard image is called a normalized image.
  • the grayscale or Gamma value of the video segment, the initial image, and/or the initial video may be normalized.
• The image histogram of the image or video frame may be obtained first, the histogram is subjected to at least equalization processing, and the grayscale or gamma value of the image or video frame is adjusted based on the processed histogram to realize image normalization.
• The normalization processing may also be based on one or more of zoom normalization and rotation normalization of the target shot subject.
• The normalization processing may also be normalization of the brightness, hue, saturation, etc. of the video clips, the initial images, and/or the initial videos.
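• A minimal sketch of the histogram-based normalization above, assuming it is applied per frame to the luminance channel only:

```python
import cv2

def normalize_frame(frame_bgr):
    """Equalize the luminance histogram of one frame so that clips from
    different sources get a uniform look."""
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb)
    ycrcb[:, :, 0] = cv2.equalizeHist(ycrcb[:, :, 0])  # luma channel only
    return cv2.cvtColor(ycrcb, cv2.COLOR_YCrCb2BGR)
```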
  • Fig. 8 is an exemplary flow chart of generating a target video according to some embodiments of the present application.
• One or more steps in the process 800 may be stored as instructions in a storage device (for example, the database 140), and called and/or executed by a processing device (for example, the processing device 112 in the server 110).
  • Step 810 Obtain an initial image or an initial video, and generate multiple video clips based on the initial image or the initial video.
• For step 810, reference may be made to the related description of step 310 in the process 300, which will not be repeated here.
  • step 810 may also be omitted in the process 800, for example, the multiple video clips may be directly obtained.
  • Step 820 Obtain one or more candidate video clips from the multiple video clips based on a first preset condition.
  • the first preset condition may be used to reflect the requirements for the content and/or duration of the target video.
  • the first preset condition may include requirements in multiple elements.
  • the multiple elements include, but are not limited to, the target video contains a specific object (F), the target video contains a specific theme (S), the duration of the video segment (tp), the total duration of the target video (ta), and the target video contains a specific Shot picture (P), the number of shot pictures contained in the target video (Pn), the number of overlaps of a specific subject in the target video (Fn), the focusing time of a specific subject in the target video (St), etc.
• For different target videos, the number of shot pictures (Pn) contained can be the same or different.
  • the focus time (St) of a specific topic refers to the length of time occupied by the content on the specific topic in the video segment or the target video.
  • the number of overlaps of a specific topic (Fn) refers to the number of occurrences of content on a specific topic in a video clip or target video.
  • the number of overlaps (Fn) of a specific topic and the focus time (St) of a specific topic can be related to the degree of prominence of the specific topic. The greater the number of overlaps (Fn) of a specific theme, the longer the focusing time (St) of the specific theme, and the higher the degree of highlighting the specific theme.
  • the first preset condition may be specified by the user, or automatically determined by the multimedia system 100 (for example, the processing device 112) based on the video configuration information, the promotional effect that the target video needs to produce, and the like.
  • the first preset condition may be a constraint on at least one of the above-mentioned multiple elements.
  • the constraints may include qualitative constraints (for example, whether to include a specific object (F), a specific theme (S), a specific shot (P), etc.) or quantitative constraints (for example, the total duration of the target video (ta), the target video contains The number of shots (Pn), the number of overlaps of a specific subject in the target video (Fn), the focusing time of a specific subject in the target video (St), etc.).
  • the screening of video clips can be achieved by limiting the value corresponding to the element (for example, comparing with a preset corresponding threshold).
  • the first preset condition may be that the value of the corresponding element is greater than 0.
  • the first preset condition can be that the value of the corresponding element exceeds the corresponding threshold (for example, the video duration is less than 30 seconds, the topic focus time exceeds 15 seconds, etc.) to filter out those that meet the needs (for example, those that meet the specific group Interest or request) video clips.
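• For illustration, such threshold-based screening against the first preset condition could look like the following, where the clip attributes and threshold values are assumptions of this sketch:

```python
def candidate_clips(clips, max_duration=30, min_topic_focus=15):
    """Keep the clips whose element values satisfy the thresholds of the
    first preset condition (e.g., duration under 30 s, theme focus time
    over 15 s)."""
    return [c for c in clips
            if c["duration"] < max_duration
            and c["topic_focus_time"] > min_topic_focus]

# candidate_clips([{"duration": 15, "topic_focus_time": 20},
#                  {"duration": 45, "topic_focus_time": 20}])
# -> only the first clip survives
```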
• A plurality of video clips that satisfy the first preset condition may be selected from the multiple video clips and determined to be candidate video clips.
• The first preset condition may include constraints on the total duration (ta) of the target video and the number of shot pictures (Pn) contained in the target video; that is, candidate video clips may be determined according to the total duration of the video segments and the number of shot pictures. Exemplarily, if the first preset condition is that the target video needs to contain 3 shot pictures with a total duration of 40 seconds, then 3 video clips that each contain different shot pictures and total 40 seconds can be filtered out, such as a 15-second video segment 1, a 15-second video segment 2, and a 10-second video segment 3, as candidate video segments.
  • the first preset condition may include a constraint on a specific shot picture (P) of the video clip, that is, the candidate video clip may be determined according to the shot picture contained in the video clip.
• For example, if the target video needs to include the shot picture of “surfing on the sea”, one or more video clips containing the shot picture of “surfing on the sea” may be filtered out as candidate video clips.
  • the target video can be made to contain specific content to meet the interests or needs of users.
• The first preset condition may simultaneously include constraints on the target video containing a specific object (F), the total duration of the target video (ta), and the number of shot pictures (Pn) contained in the target video; that is, candidate video segments are determined according to the included objects and the video durations.
• Exemplarily, if the first preset condition is that the target video needs to contain 3 shot pictures including the logo of the “の area”, and the total duration of the target video cannot exceed 70 seconds, the video clip 4 marked with the “の area” logo can first be filtered out as a candidate video clip; then, according to the duration of video clip 4, such as 20 seconds, two further video clips whose durations keep the total video duration within 70 seconds, such as clips of 30 seconds and 20 seconds respectively, are selected as candidate video clips.
• In this way, a specific shot picture has a predetermined degree of exposure, so as to highlight the specific object.
• The first preset condition may include constraints on the number of overlaps (Fn) of a specific theme in the target video and the focus time (St) of a specific theme in the target video; that is, candidate video clips may be determined according to the number of occurrences of the theme and the focus duration.
• Exemplarily, if the first preset condition is that the target video needs to contain two specific themes (the same or different) and the focus time of the specific theme in the target video exceeds 1 minute, then a video clip that contains this number of specific themes and whose focus time exceeds 1 minute, or multiple video clips that together contain this number of specific themes and whose cumulative focus duration exceeds 1 minute, can be filtered out as candidate video clips.
• The first preset condition may include constraints on the total duration of the target video (ta), the number of shot pictures contained in the target video (Pn), and the focus time of a specific theme in the target video (St); that is, candidate video clips may be determined based on the video durations, the number of shot pictures, and the focus time of the specific theme.
• Exemplarily, if the first preset condition is that the total duration of the target video is 30 seconds, 3 shot pictures need to be included, and the focus time of a specific theme in the target video is not less than 15 seconds, then 3 shot pictures whose content on the specific theme totals no less than 15 seconds (for example, 16 seconds, 18 seconds, 20 seconds, etc.) and whose total duration is 30 seconds can be screened out; for example, a 15-second video clip 1 (with 10 seconds of content on the specific theme), a 10-second video clip 2 (with 6 seconds of content on the specific theme), and a 5-second video clip 3 (with 3 seconds of content on the specific theme) are screened out as candidate video clips.
• In this way, the specific theme has a predetermined degree of focus in the target video, so as to better highlight the specific theme.
• the process of determining candidate video segments to generate the target video can be expressed as a model: f(a, b, c, …, n) → y, where f is the first preset condition, a, b, c, …, n are multiple video clips, and y is the target video.
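• A minimal sketch of such a model, assuming f is the duration/shot-count/focus-time condition from the example above (the Clip type and all names are illustrative, not from the patent):

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Tuple

@dataclass(frozen=True)
class Clip:
    name: str
    duration: float        # seconds
    subject_focus: float   # seconds in which the specific subject appears

def f(clips: Tuple[Clip, ...], total: float = 30.0,
      shots: int = 3, min_focus: float = 15.0) -> bool:
    """First preset condition: shot count, total duration, and focus time."""
    return (len(clips) == shots
            and sum(c.duration for c in clips) <= total
            and sum(c.subject_focus for c in clips) >= min_focus)

def candidates(pool: List[Clip], shots: int = 3) -> List[Tuple[Clip, ...]]:
    """Enumerate combinations y of clips for which f holds."""
    return [combo for combo in combinations(pool, shots) if f(combo)]

pool = [Clip("clip1", 15, 10), Clip("clip2", 10, 6), Clip("clip3", 5, 3)]
print(candidates(pool))  # the 15s/10s/5s example above satisfies f
```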
  • a candidate video segment can be determined, and a target video can be generated based on the candidate video segment.
  • the process of generating the target video may also determine a similar model.
  • the model may be executed by the processing device 112 to automatically generate a target video based on multiple video clips.
  • a machine learning model can be used to learn and train f in the model to determine the first preset condition or video configuration information that meets a specific requirement (for example, a playback effect).
  • the candidate video clips meeting the first preset condition may be one or more groups.
• the first preset condition may include constraints on each shot segment (video clip) in the target video; a video clip that meets the constraints of a specific shot segment may accordingly be assigned to that shot segment.
• the first preset condition is related to at least one of the multiple elements, and may be characterized as requirements on those elements (which can also be understood as requirements on the target video). In the case of multiple element constraints, the video clips meeting each requirement can be selected in any reasonable order; this specification imposes no restriction.
  • the processing device 112 may use a machine learning model to determine candidate video segments. For example, a trained machine learning model may be used to determine whether each of the multiple video clips contains a specific object, and a video clip that contains the specific object among the multiple video clips is determined as a candidate video clip.
• the input of the trained machine learning model may be a video clip, and the output may be the objects contained in the video clip, or a binary classification result indicating whether the video clip contains a specific object; this specification does not limit this.
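• A hedged sketch of the classifier-based screening just described; `classify` stands in for any trained binary classifier, and all names are illustrative:

```python
from typing import Callable, List

def screen_candidates(clips: List[str],
                      classify: Callable[[str], bool]) -> List[str]:
    """Keep the clips the trained model classifies as containing the object."""
    return [clip for clip in clips if classify(clip)]

# Usage with a dummy classifier that "detects" the object from the clip name:
clips = ["beach_surfing.mp4", "city_street.mp4", "sea_surfing.mp4"]
print(screen_candidates(clips, classify=lambda c: "surfing" in c))
```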
  • the candidate video segment may be determined in other feasible ways, which is not limited in this specification.
  • the first preset condition further includes element constraint conditions of two or more specific elements in the plurality of elements.
  • the two or more specific elements may be different elements.
• the element constraint condition may involve the priority of the two or more specific elements. For example, based on the difficulty of judgment, the priority of the specific object (F) may be higher than that of the specific subject (S); based on the highlighting effect of the subject, the priority of the subject focus time (St) may be higher than that of the number of repetitions of the subject (Sn); and so on.
  • the element constraint conditions may involve the order in which different elements are considered.
• the element constraint condition may also involve compatibility and matching between different elements. For example, a 15-second total video duration (ta) is not compatible with a 20-second subject focus time (St).
• the first preset condition may include binding conditions on the shot frames in the target video. A binding condition may reflect an association relationship between at least two specified shot frames in the target video. For example, the binding condition (which can also be understood as an association relationship) may be that the specified shot frame a and the specified shot frame b must both appear in the target video, or that the specified shot frame a must appear before the specified shot frame b in the target video, etc.
  • the processing device 112 may determine a video clip containing a specified shot frame from a plurality of video clips, and combine the video clip containing the specified shot frame based on a binding condition to serve as a candidate video clip.
• For example, if the binding condition is that shot frame a must appear before shot frame b in the target video, the video clips containing shot frame a and shot frame b may be combined in that order and used as one candidate video segment.
• Combining shot frames that meet the binding conditions into one candidate video segment helps to process them as a whole in subsequent processing, so as to improve the efficiency of video synthesis. Alternatively, the shot frames that meet the binding condition may not be combined into one candidate video segment, but instead exist separately in continuous or discontinuous (for example, spaced-apart) candidate video segments; a combination sketch follows below.
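• A small sketch of binding-condition handling, under the assumption that clips are plain strings and a binding is an ordered pair meaning "a must precede b":

```python
from typing import List, Tuple

def apply_binding(clips: List[str], binding: Tuple[str, str]) -> List[str]:
    """Merge the two bound clips into one ordered candidate segment,
    keeping the remaining clips as individual candidates."""
    a, b = binding
    rest = [c for c in clips if c not in (a, b)]
    return [f"{a}+{b}"] + rest  # "a+b" is processed as a whole downstream

print(apply_binding(["shot_a", "shot_c", "shot_b"], ("shot_a", "shot_b")))
# ['shot_a+shot_b', 'shot_c']
```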
  • Step 830 Group one or more candidate video clips to determine at least one clip set.
  • the process 800 may be used to generate multiple (eg, target number) target videos at the same time.
  • the grouping can be understood as a combination of candidate video segments, and the corresponding step 830 can include combining the candidate video segments to generate a target number of segment sets.
• Each segment set in the at least one segment set is combined from one or more candidate video segments and simultaneously meets the first preset condition and the other conditions in the video configuration information.
  • other conditions in the video configuration information may be the second preset condition.
  • the second preset condition is related to the content difference of the segment set/video segment.
  • the second preset condition may specifically include a constraint on the combination difference degree of the segment sets, that is, the difference degree of the combination of candidate videos of each segment set in at least one segment set is greater than a preset threshold.
• For the judgment of the second preset condition, refer to Fig. 9 below and the related descriptions.
  • other conditions in the video configuration information may also include, but are not limited to, requirements for one or more combinations of frame, subtitle, hue, saturation, background music, etc. of the target video.
• For example, the at least one segment set may include segment set 1, combined from the aforementioned video segment 1, video segment 2, and video segment 3, and segment set 2, combined from video segment 4, video segment 5, and video segment 6, among others.
  • Step 840 Generate a target video based on each segment set in the at least one segment set.
  • step 840 may include synthesizing a target video based on the set of segments for each set of segments.
• For each segment set in the at least one segment set (or the target number of segment sets), a target video can be synthesized based on that segment set; the at least one segment set can thus yield a corresponding number of target videos.
  • the video clips may be sorted and combined according to the sequence information included in the video configuration information to generate the target video.
  • the target video may be randomly synthesized based on the cohesion of the shots of each video clip in the clip set. For example, a video clip whose presentation content is daytime may be placed before a video clip whose presentation content is night.
  • the target video may be synthesized based on the promotional copy of the product or information to be promoted.
• For example, for a video used to promote garbage classification, the promotional copy may first show the possible consequences of not classifying, then show the benefits of classification, and finally show the classification method; the target video can then be synthesized by arranging the video segments in the segment set in the sequence of the content presented in the promotional copy.
  • the synthesized target number of target videos may be delivered in batches or at the same time.
  • the first preset condition, the second preset condition, and/or other conditions may be adjusted based on the promotion effect of each target video.
• the publicity effect can be obtained based on user feedback, broadcast volume, evaluations, and feedback results (such as product sales, garbage classification results, etc.). This specification does not limit this; for details, refer to the related descriptions of Fig. 17 and Fig. 18 of this application.
  • the second preset condition may include that the difference between any two segment sets in the at least one segment set is greater than a preset difference threshold.
  • the difference degree between any two fragment sets may also be referred to as the candidate video fragment combination difference degree between any two fragment sets.
  • Fig. 9 is an exemplary flowchart of determining a segment set according to some embodiments of the present application. Specifically, it relates to a method of screening out at least one segment set based on a second preset condition.
  • one or more steps in the process 900 may be stored as instructions in a storage device (for example, the database 140), and called by a processing device (for example, the processing device 112 in the server 110) and/or implement.
  • Step 910 Determine a combination difference degree of candidate video segments between each of the two or more segment sets and other segment sets.
  • the multimedia system 100 may randomly select multiple sets of candidate video clips that meet the first preset condition, and randomly combine them to form two or more clip sets.
• one or more candidate video clips may be randomly selected from each group of candidate video clips corresponding to the same element to form a clip set; by repeating this random selection, two or more clip sets can be formed. Video clips with qualitative restrictions can be randomly selected first, and the other video clips selected afterwards. For example, suppose the video configuration information requires that the last shot be a product shot, while the other shots have no order requirements but do have content requirements; the candidate shots can then be grouped according to the content requirements of the other shots, and a video segment determined from each group, to obtain a combination of candidate video segments.
• In step 910, the combination difference degree of candidate video segments between each of the two or more segment sets and the other segment sets can be determined by comparing the combinations of the segment sets, for example, by comparing the difference rate of the combinations, that is, the ratio of video clips that differ between the clip sets.
  • step 910 can also be implemented by assigning values to different video clips or using machine learning algorithms. For specific descriptions, reference may be made to subsequent related descriptions in FIG. 10 and FIG. 11.
• the degree of difference between any two fragment sets may include the combination difference degree of candidate video fragments between the two fragment sets, and/or the combination difference degree of the candidate video fragments together with other conditions such as frames, subtitles, and hues between the two fragment sets.
• For example, segment set 1 may include video segment 1, video segment 2, frame 1, and subtitle 1, while segment set 5 may include video segment 1, video segment 2, frame 2, and subtitle 2, the difference between the two lying in the frame and the subtitle; counting only the video segments, the difference rate is 0%, while counting the frame and subtitle as well, the difference rate is 33%. In this case, the difference rate can be used to characterize the degree of difference; a sketch of one possible difference-rate computation follows below.
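• One plausible reading of the difference rate is a Jaccard-style ratio of non-shared elements; the patent does not give the exact formula (its 33% figure suggests a different counting), so the following is only an illustrative sketch:

```python
def difference_rate(set_a, set_b):
    """Fraction of elements (clips, frame, subtitle, ...) not shared by both sets."""
    a, b = set(set_a), set(set_b)
    union = a | b
    return len(union - (a & b)) / len(union) if union else 0.0

s1 = ["seg1", "seg2", "frame1", "subtitle1"]
s5 = ["seg1", "seg2", "frame2", "subtitle2"]
print(difference_rate(s1, s5))  # 4 differing of 6 distinct elements ≈ 0.67
```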
  • Step 920 Use a segment set whose combination difference with other segment sets is higher than a preset threshold as at least one segment set.
  • Each fragment set can be filtered according to the combination difference degree of each fragment set in the aforementioned two or more fragment sets.
• the filtering condition may be that the combination difference degree of each fragment set in the selected at least one fragment set is higher than a preset threshold (for example, a difference rate higher than 50%).
• the screening order of the first preset condition and the second preset condition can be exchanged. For example, multiple candidate clip sets can first be generated based on the multiple video clips such that they meet the second preset condition; then at least one segment set is selected from the multiple candidate segment sets based on the first preset condition; and a target video is generated based on each segment set in the at least one target segment set. For details, refer to Fig. 12 and its description.
• the beneficial effects that the process 900 may bring include, but are not limited to: (1) the target number of fragment sets is determined based on the differences between fragment sets, so multiple target videos with different content presentation effects can be determined, improving the variety of the generated target videos; (2) no manual operation is required to generate the target videos, which improves the efficiency of video synthesis. It should be noted that different embodiments may have different beneficial effects; in different embodiments, the beneficial effects may be any one or a combination of the above, or any other beneficial effects that may be obtained.
  • Fig. 10 is an exemplary flow chart for determining the degree of difference in a combination of fragment sets according to some embodiments of the present application.
  • one or more steps in the process 1000 may be stored as instructions in a storage device (for example, the database 140), and called by a processing device (for example, the processing device 112 in the server 110) and/or implement.
  • Step 1010 Assign an identification character to each of one or more candidate video clips.
  • different identification characters may be assigned to each candidate video segment.
• the identification character can be determined according to the number of candidate video fragments. For example, when the number of candidate video fragments is small, an English letter can be assigned as the identification character; when the number of candidate video clips is large, an ASCII code can be assigned as the identification character. In particular, no identification character need be assigned to specific candidate video clips that have special requirements or must appear in the target video: for example, if the last shot of the target video must be a product display video clip and there is only one such clip, it can be assigned no identification character.
  • Step 1020 based on the identification characters of one or more candidate video segments, determine character strings corresponding to the segment set and other segment sets.
  • the character string of the segment set is a set of identification characters of each candidate video segment in the segment set. For example, if the identifying character of video segment 1 is A, the identifying character of video segment 2 is B, the identifying character of video segment 3 is C, and the identifying character of video segment 4 is D, then the character string corresponding to segment set 1 is ABC.
  • the fragment set may also include order requirements. For example, the fragment set 2 is a combination with the same content as the fragment set 1 in a different order, and the corresponding character string may be CAB.
  • Step 1030 Determine the edit distance of the character string corresponding to the segment set and other segment sets as the combined difference degree of the candidate video segments between the segment set and the other segment sets.
• Edit distance can reflect the number of differing characters between two strings: the smaller the edit distance, the fewer the differing characters and the smaller the difference between the two fragment sets. For example, if the character string corresponding to segment set 1 is ABC and the character string corresponding to segment set 3 is ABCD, the edit distance between segment set 1 and segment set 3 is 1, that is, the degree of difference is 1. For strings with order requirements, characters in different orders can also be counted in the edit distance; in addition, to avoid double counting, an order difference can be counted once as a whole. For example, the edit distance between segment set 1 and segment set 2 in step 1020 can be 1. A sketch of the standard edit distance follows below.
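• As an illustration, a minimal sketch of the standard Levenshtein edit distance over the identification strings (the patent's variant that counts a reordering once as a whole is not shown):

```python
def edit_distance(s: str, t: str) -> int:
    """Dynamic-programming edit distance (insert/delete/substitute each cost 1)."""
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # delete
                           dp[i][j - 1] + 1,        # insert
                           dp[i - 1][j - 1] + cost) # substitute
    return dp[m][n]

print(edit_distance("ABC", "ABCD"))  # 1, matching the example above
```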
  • the number of fragment sets can be any positive integer, for example, 1, 3, 5, 8, 10, etc.
  • the candidate video fragments may be randomly combined into N candidate fragment sets, and a target number of fragment sets are selected from the N candidate fragment sets based on the second preset condition. By screening the target number of segment sets that meet the second preset condition, a target number of target videos with different content presentation effects can be generated based on the target number of segment sets, thereby achieving different promotional effects.
  • Fig. 11 is an exemplary flow chart for determining the degree of difference in the combination of fragment sets according to some embodiments of the present application.
  • one or more steps in the process 1100 may be stored as instructions in a storage device (for example, the database 140), and called by a processing device (for example, the processing device 112 in the server 110) and/or implement.
  • Step 1110 Generate a segment feature corresponding to each candidate video segment based on the trained feature extraction model and candidate video segments in two or more segment sets.
  • the feature extraction model may also be referred to as a shot feature extraction model.
• the candidate video segments in the two or more segment sets may be processed based on the trained feature extraction model to generate the corresponding segment features. Specifically, this includes obtaining the multiple video frames included in each candidate video segment and determining one or more image features corresponding to each video frame, and then processing, based on the trained feature extraction model, the image features of the multiple video frames and the relationships between them, to determine the segment feature corresponding to the candidate video segment.
  • the feature extraction processing can refer to processing the original information and extracting feature data, and the feature extraction processing can improve the expression of the original information to facilitate subsequent tasks.
• the image features corresponding to a video frame may include at least one of the shape information of an object (for example, the subject) in the video frame, the positional relationship information between multiple objects in the video frame, the color information of objects in the video frame, the completeness of objects in the video frame, or the brightness of the video frame.
  • the feature extraction process may use statistical methods (such as principal component analysis methods), dimensionality reduction techniques (such as linear discriminant analysis methods), feature normalization, data bucketing, and other methods.
• For example, in data bucketing, a brightness value within 0-80 can correspond to [1,0,0], a brightness value within 80-160 can correspond to [0,1,0], and values above 160 can correspond to [0,0,1], as sketched below.
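• A minimal sketch of this one-hot bucketing:

```python
def bucket_brightness(value: float) -> list:
    """Map brightness to [1,0,0] for 0-80, [0,1,0] for 80-160, [0,0,1] above 160."""
    if value < 80:
        return [1, 0, 0]
    if value < 160:
        return [0, 1, 0]
    return [0, 0, 1]

print(bucket_brightness(120))  # [0, 1, 0]
```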
  • the feature extraction process can also rely on machine learning methods (such as using feature extraction models), which can automatically learn the collected information to form a predictable model, thereby obtaining higher accuracy.
  • the feature extraction model can be a generative model, a decision model, or a deep learning model in machine learning.
• it can be a deep learning model using algorithms such as the YOLO series, Faster R-CNN, or EfficientDet.
  • the machine learning model can detect the set objects that need attention in each video frame.
  • the objects that need attention can include creatures (humans, animals, etc.), commodities (cars, decorations, cosmetics, etc.), backgrounds (mountains, roads, bridges, houses, etc.), etc.
  • the object to be paid attention to can be further set, for example, it can be set as a character or a product.
  • Multiple candidate video clips can be input into the machine learning model, and the machine learning model can output image features such as position information and brightness of objects in each candidate video clip.
• the feature extraction model can be a GoogLeNet model, a VGG model, a ResNet model, etc.
• By using the machine learning model to extract features from the video frames, the image features can be determined more accurately.
• the trained feature extraction model can be a sequence-based machine learning model, which can convert a variable-length input into a fixed-length vector expression for output. It can be understood that, since different shots have different durations, the numbers of corresponding video frames also differ; after processing by the trained feature extraction model, each shot can be converted into a fixed-length vector for characterization, which is helpful for subsequent processing.
• the sequence-based machine learning model may be a deep neural network (DNN) model, a recurrent neural network model such as the long short-term memory (LSTM) model, a bi-directional LSTM (Bi-LSTM) model, a gated recurrent unit (GRU) model, etc., or a combination thereof, which is not limited in this specification.
• the image features corresponding to the obtained video frames (for example, the features of frames 1, 2, 3, …, n) and their relationships (such as sequential and/or chronological order) are input into the feature extraction model, which can output the sequence of encoded hidden states at each time step (such as h_1 to h_n), where h_n contains all the information of the shot during that period of time. In other words, the feature extraction model can convert the multiple image features within a period of time (such as a candidate video segment corresponding to a certain shot) into a fixed-length vector expression h_n (i.e., the shot feature).
• the above steps and methods can be applied to different candidate video clips separately to obtain their respective clip features. For example, suppose the candidate video segments are 1, 2, 3, …, m, and the corresponding segment features obtained are R_{c,1}, R_{c,2}, …, R_{c,i}, …, R_{c,m}; the following description follows this setting. A sketch of such a sequence encoder follows below.
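• A hedged sketch of such a sequence encoder: an LSTM that turns a variable number of per-frame feature vectors into one fixed-length shot feature h_n; PyTorch is an assumed choice, as the patent names no framework:

```python
import torch
import torch.nn as nn

class ShotEncoder(nn.Module):
    def __init__(self, frame_dim: int = 128, shot_dim: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(frame_dim, shot_dim, batch_first=True)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, frame_dim); num_frames may vary per shot
        _, (h_n, _) = self.lstm(frames)
        return h_n[-1]  # (batch, shot_dim): fixed-length shot feature

encoder = ShotEncoder()
short_shot = torch.randn(1, 30, 128)   # 30 frames
long_shot = torch.randn(1, 120, 128)   # 120 frames
print(encoder(short_shot).shape, encoder(long_shot).shape)  # both (1, 64)
```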
  • Step 1120 Generate a set feature vector corresponding to each set of fragments based on the characteristics of the fragments.
  • the collection feature vector corresponding to each segment collection may be generated based on the acquisition sequence and the segment characteristics of each candidate video segment in the segment collection. Exemplarily, vector splicing, vector concatenation, etc. may be used to obtain the set feature vector corresponding to the fragment set.
• For example, the set feature vector corresponding to segment set c is R_c = {R_{c,1}, R_{c,2}, …, R_{c,i}, …, R_{c,m}}^T, where the superscript T denotes the matrix transpose.
  • Step 1130 Determine the degree of similarity between each fragment set and other fragment sets based on the trained discriminant model and the set feature vector corresponding to each fragment set.
  • the multimedia system 100 may determine the similarity degree of the set feature vectors corresponding to any two fragment sets based on the trained discriminant model, so as to determine the degree of similarity between each fragment set and other fragment sets.
• For example, assume segment sets a, b, c, …, k, each having a corresponding set feature vector R_a, R_b, R_c, …, R_k. One of the k segment sets can be selected (for example, segment set c), and the similarity between its set feature vector R_c and the set feature vectors of the other (k-1) segment sets can be calculated; the similarity comparison result is regarded as the degree of similarity.
  • step 1130 may also use a vector similarity coefficient to determine the degree of similarity between two fragment sets.
  • the similarity coefficient refers to the calculation of the similarity between samples using a calculation formula. The smaller the value of the similarity coefficient, the smaller the similarity between individuals and the greater the difference. When the similarity coefficient between the two fragment sets is large, it can be judged that the similarity between the two fragment sets is relatively high.
  • the similarity coefficient used includes but is not limited to simple matching similarity coefficient, Jaccard similarity coefficient, cosine similarity, adjusted cosine similarity, Pearson correlation coefficient, and the like.
  • Step 1140 Determine the degree of combined difference between each segment set and other segment sets based on the degree of similarity.
• As noted above, the smaller the similarity coefficient, the greater the difference between individuals; accordingly, when the similarity coefficient between two fragment sets is large, the difference between the two fragment sets is small. In some embodiments, when the similarity coefficient is a real number within [0,1], the combined difference degree can be taken as its reciprocal, its negative, 1 minus the similarity coefficient, etc., as sketched below.
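• A minimal sketch using cosine similarity and the 1-minus-similarity conversion (cosine similarity is one of the options listed above, not the patent's prescribed measure):

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def combined_difference(u, v):
    """Difference degree as 1 - similarity coefficient."""
    return 1.0 - cosine_similarity(u, v)

R_a = [0.2, 0.9, 0.4]
R_b = [0.1, 0.8, 0.5]
print(combined_difference(R_a, R_b))
```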
  • At least one segment set may also be generated based on the set feature vector.
  • the multiple set feature vectors may be clustered based on a clustering algorithm to obtain multiple set cluster clusters, and at least one segment set is generated based on the clustering result.
• For example, suppose the number of video segments required for a segment set is P and the number of clusters actually obtained is Q. If P is less than or equal to Q, P clusters are selected and one candidate video segment is taken from each cluster; if P is greater than Q, several candidate videos far from the cluster center can be selected from each cluster, and P candidate video segments randomly selected from them to form a segment set.
• In some embodiments, a density-based clustering algorithm (such as the DBSCAN density clustering algorithm) may be used to obtain multiple clusters. Specifically, determine the preset value of the required fragment sets, that is, the number of fragment sets (denoted P); further, determine the neighborhood parameters (ε, MinPts) of the clustering, where ε corresponds to the radius of a cluster in the vector space and MinPts corresponds to the minimum number of samples required to form a cluster, the number of clusters obtained being Q. The neighborhood parameters (ε, MinPts) are adjusted multiple times and the video feature vectors re-clustered until the obtained number of clusters Q is greater than or equal to the preset value P; at that point, a number of clusters equal to the preset value P are randomly selected, and the segment sets are determined based on those clusters. A sketch of this loop follows below.
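• A sketch of the parameter-adjusting loop using scikit-learn's DBSCAN; the toy vectors, the value of P, and the eps-shrinking schedule are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import DBSCAN

vectors = np.random.rand(60, 2)   # toy set feature vectors (2-D for clarity)
P = 3                              # required number of fragment sets
eps, min_pts = 0.3, 3

Q = 0
for _ in range(20):                            # bounded parameter search
    labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(vectors)
    Q = len(set(labels) - {-1})                # label -1 is noise, not a cluster
    if Q >= P:
        break
    eps *= 0.8                                 # tighten neighborhood, re-cluster

clusters = sorted(set(labels) - {-1})
if Q >= P:
    chosen = np.random.choice(clusters, size=P, replace=False)
    print(f"Q={Q}, randomly chosen clusters: {chosen}")
```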
  • the process 1100 may also be used to determine the similarity of multiple generated target videos (for example, the multiple target videos in the process 800 or 1200), and recommend the target video based on the similarity.
  • Fig. 12 is an exemplary flow chart of generating a video according to some embodiments of the present application.
  • one or more steps in the process 1200 may be stored in a storage device (for example, the database 140) as instructions, and called by a processing device (for example, the processing device 112 in the server 110) and/or implement.
  • step 1210 and step 1240 are the same as or similar to step 810 and step 840 in the process 800, and reference may be made to FIG. 8 and its related descriptions, which will not be repeated here. Steps 1220 and 1230 in the process 1200 are different from the process 800.
  • Step 1220 Randomly generate multiple candidate segment sets based on multiple video segments.
  • the multimedia system may randomly combine multiple video clips to generate a set of candidate clips.
• For example, the processing device 112 may randomly combine some of the video clips obtained in step 810 to generate M (M greater than or equal to the target number) candidate clip sets. Alternatively, the processing device 112 may combine all the video clips obtained in step 810, and select M (M greater than or equal to the target number) combinations from them as the candidate clip sets, or determine all the combinations as candidate clip sets.
  • a candidate segment set may include one or more video segments.
  • multiple candidate segment sets satisfy the second preset condition.
  • the second preset condition includes that the video segment combination difference between any two candidate segment sets in the plurality of candidate segment sets is greater than the preset difference threshold.
  • the preset difference degree threshold can be any positive integer greater than 0, for example, 1, 2, and so on. For more content about the second preset condition, please refer to FIG. 9, FIG. 10 and related descriptions, which will not be repeated here.
  • Step 1230 Filter out at least one fragment set from the multiple candidate fragment sets based on the first preset condition.
• the first preset condition may include, but is not limited to, the total duration of the target video, the number of shots included in the target video, the target video containing specified shots, the target video containing a specific object, or a combination of one or more of these.
• the processing device may determine at least one (for example, a target number of) segment set based on each candidate segment set as a whole. For example, candidate fragment sets whose total video duration and/or number of shots meet the first preset condition may be filtered out as one or more of the target number of fragment sets.
• the processing device 112 may determine the target number of fragment sets based on the content contained in the video fragments of each candidate fragment set. For example, a trained machine learning model can be used to determine whether a video clip in a candidate clip set contains a specific object, and the candidate clip sets containing the specific object screened out based on the determination result.
• the input of the trained machine learning model can be a candidate fragment set, or a video fragment in the candidate fragment set, and the output can be whether the candidate fragment set contains a specific object, whether the video fragment contains a specific object, etc.; this specification does not restrict this.
  • the target video that meets the requirements can be synthesized through one or more combination operations such as splitting, filtering, combining, cropping, and beautifying the initial video or the initial image.
• the server 110 (for example, the processing device 112) may obtain multiple video clips by splitting the initial video or initial image, filter out one or more candidate video segments from the multiple video clips based on constraints such as video duration and video content, combine the candidate video segments to obtain at least one segment set (or a target number of segment sets) whose mutual differences meet a preset threshold, and generate the target number of target videos based on the segment sets.
• Alternatively, the server 110 may randomly combine the multiple video clips obtained by splitting to obtain multiple candidate clip sets whose mutual differences meet a preset threshold, filter out the target number of fragment sets from the candidate clip sets based on constraints such as video duration and video content, and generate the target number of target videos based on the target number of fragment sets.
• the video synthesis method and/or system provided in the embodiments of this specification can be used to synthesize promotional videos. For example, pre-shot video files for products, culture, public welfare, etc. can be processed through splitting, screening, beautifying, compositing, and other operations to generate diversified promotional videos.
  • the target video usually has background music or theme music.
• Background music or theme music is a kind of music used to adjust the atmosphere; it can be inserted into the video to enhance the expression of emotion and give the audience an immersive experience.
  • background music or theme music has time attributes, and elements such as the duration and rhythm of the background music or theme music can be used as time parameters in some embodiments of the present application.
  • Fig. 13 is an exemplary flowchart of adding audio material according to some embodiments of the present application.
  • one or more steps in the process 1300 may be stored as instructions in a storage device (for example, the database 140), and called by a processing device (for example, the processing device 112 in the server 110) and/or implement.
  • Step 1310 Obtain initial audio.
  • the initial audio (also referred to as music to be processed) may be imported by the user or selected by the user in the database 140.
  • the initial audio can have different types, such as warm, soothing, brisk, focused, agitated, angry, scary, etc.
  • the multimedia system 100 (for example, the processing device 112) may select different types of initial audio based on the subject, theme, video effect, or video configuration information of the target video.
  • the initial audio of a public welfare promotion video can be a warm and soothing initial audio.
  • multiple initial audios can be selected and the audios are connected end to end.
  • only part of the initial audio (such as a chorus) can be selected.
  • Step 1320 Mark the initial audio based on the rhythm to obtain at least one audio segmentation point.
• Marking based on rhythm can be based on the structure of the entire song, such as marking the intro, verse, and chorus, or at a finer granularity, such as marking segmentation points based on drum beats or beats.
• the marking granularity of the initial audio may be determined by the number of initial images and/or initial videos. Just as an example, suppose the number of image and video materials is moderate: if, after marking the initial audio by drum beats, some cut points cannot be matched with material, the initial audio can first be marked as intro, verse, and chorus, and then only the chorus part marked by drum beats, so as to obtain the right number of cut points.
  • marking the initial audio based on the rhythm can be implemented by software (such as Adobe Audition, FL Studio, etc.) or plug-ins (such as audio wave plugin based on Vue.js, etc.).
  • the automatic marking of the initial audio can be realized by an audio rhythm analysis algorithm based on signal analysis. It should be noted that there are various audio tag processing methods, which are not limited in this embodiment.
• Step 1330 Determine at least one video segmentation point of the target video based at least in part on the video configuration information.
  • the video configuration information may be used to determine the video segmentation point of the target video.
  • different themes, different shots, and splicing methods can be determined according to the video configuration information.
  • At least one video segmentation point of the target video is determined according to different themes, different shots, and splicing methods, where the splicing time point of each joint can be used as the video segmentation point.
• For example, if the target video involves shots including racing cars, tire changing, an award ceremony, a tire display, etc., at least one video segmentation point of the target video may be determined based on these different shots and used as optional editing points.
• At a single optional editing point, one can choose to add a video clip or not to add material; whether to add material depends on the number of optional editing points and the time interval between two optional editing points. Just as an example, if no material is added at a single optional editing point, the duration of the previous or next material can be appropriately extended. Since the optional editing points are associated with the rhythm, adding material through them makes it easy to arrange the material while providing a good rhythm pattern and improving the effect of the target video. In some other embodiments, an optional editing point may also be used as the starting or ending point of the target video.
  • Step 1340 Match at least one audio segmentation point with at least one video segmentation point.
  • the matching of the video clip with the optional cut points may be performed according to the interval time between the two optional cut points.
• In some cases, the interval between two editing points may be only a few seconds. A threshold can therefore be set: for example, if the interval between two editing points is less than the threshold (such as 3 seconds or 5 seconds), a static material or a material with a short duration is inserted at that point.
• Because the lengths of the video clips differ, it may happen that some video clips cannot be matched to the optional editing points because of their durations. In this case, the video can be split or speed-changed. For example, a video clip with a duration of 15s can be split into a 10s material and a 5s material, and the split materials matched with the optional editing points.
• For another example, if the duration of the video clip is 22s and the interval between two optional editing points is 20s, the video clip can be played at an accelerated rate; after the duration is shortened to 20s, it is inserted between the two optional editing points.
• In some embodiments, a threshold (such as ±5% or ±10%) may be set for the speed change of a video clip, and video clips whose required speed change exceeds the threshold may be processed by splitting instead, as sketched below.
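• A sketch of this duration-to-interval matching with a bounded speed change, following the 22s-into-20s example; the structure and the 10% tolerance are assumptions:

```python
def fit_clip(clip_len: float, interval: float, tolerance: float = 0.10):
    """Return ('speed', factor) if a <=tolerance speed change fits the clip
    into the interval, else ('split', remainder) to split the clip instead."""
    factor = clip_len / interval          # >1 means the clip must be sped up
    if abs(factor - 1.0) <= tolerance:
        return ("speed", factor)
    return ("split", clip_len - interval)

print(fit_clip(22, 20))   # ('speed', 1.1): play 10% faster
print(fit_clip(15, 10))   # ('split', 5): split into 10s + 5s materials
```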
• In some embodiments, a video clip may include its original audio track (such as background sound, monologue, etc.). According to actual needs, the original audio track can be removed from the video clip, or kept and played simultaneously with the target video; this application does not restrict this.
  • the process 1300 may be completed by an audio matching model.
• By inputting the target video into the audio matching model, audio can be added to the target video; that is, the audio matching model can complete the operations of steps 1310-1340 in the process 1300.
  • the audio matching model may be a machine learning model, such as a neural network model, a deep learning model, etc.
  • the generated target video can be simultaneously played on different playback media.
  • exemplary playback media may include a horizontal video interface of a video website, a vertical video interface in a short video application of a mobile phone, an outdoor electronic large screen, and the like.
  • the method provided in this application can also post-process the target video after the target video is generated.
• the target video is post-processed to meet at least one video output condition, and the at least one video output condition is related to the playback medium of the target video.
  • the at least one video output condition may include a video size condition
  • the post-processing may include cropping a picture of the target video according to the video size condition.
  • FIG. 14 is a schematic diagram of target video post-processing (screen cropping) shown in some embodiments of the present application.
• In some approaches, the post-processing (screen cropping) of the target video sets a corresponding cropping frame based on the target video at each different video size, makes the center of the cropping frame coincide with the center of the target video's picture, and then crops based on the cropping frame. Such a screen cropping method may cause the main information of the target video to be missing after cropping (for example, the picture of the advertised product is cropped away).
  • the target video post-processing system of the present application can achieve the purpose of converting the size of the video screen by executing the method of target video cropping.
• the target video screen cropping system can crop the video picture based on the information of the cropping subject and the preset picture size, which makes the main information of the target video less likely to be missing after cropping (for example, the picture of the advertised product is retained).
  • the to-be-processed video 14 in FIG. 14 may be a target video, or may be one or more of an initial video, an initial image, and a video segment.
  • the process 1400 can be used alone, and the to-be-processed video 14 can be any video that needs to be processed (size changed).
  • the system for cropping the image of the to-be-processed video 14 can be integrated in the system 100 for generating the video, and the post-processing of the target video can be achieved through the processing device 112 (or a processing terminal).
  • the processing device 112 can be used to convert the screen size of the video in a variety of application scenarios.
  • the processing device 112 may be used to convert the screen size of the target video originally placed on an outdoor electronic screen to make it suitable for placement on a subway TV screen.
  • the processing device 112 may be used to adjust the screen size of a video shot by a mobile phone or a camera to a preferred size for playing on a video website.
  • the processing device 112 may be used to convert a horizontal screen video into a vertical screen video.
• For example, the processing device 112 can obtain the to-be-processed video 14 (a horizontal screen video) and split it into multiple video segments 16 based on the model 12; the processing device 112 can identify the subject 15 in the to-be-processed video 14; based on the subject 15 and the preset picture size of the vertical screen video, the processing device 112 can configure a cropping frame 13 for the to-be-processed video 14, crop the pictures 17 of the multiple video clips 16 according to the cropping frame 13, and then re-splice the cropped video clips 16 into a complete video to obtain the vertical screen video.
  • the model 12 may be included in the processing device 112.
  • the processing device 112 obtains the cropped subject 15 and/or the video clip 16 based on the model 12.
• the model 12 may be a machine learning model used to identify the cropped subject 15 in a video clip 16. For example, the cropped subject 15 can be the aforementioned specific object and/or subject; the cropped subject 15 can be a person, a car, cosmetics, etc.
  • the model 12 may be stored in the processing device 112, and when the related functions of the model 12 need to be used, the processing device 112 performs an operation of calling the model 12.
  • the model 12 may refer to a collection of several methods performed based on the processing device 112. These methods can include a large number of parameters.
  • the parameters in the model 12 can be preset or can be dynamically adjusted. Some parameters can be obtained through training, and some parameters can be obtained during execution. For the specific description of the model in this manual, please refer to the relevant part of this manual.
  • the processing device 112 may perform post-processing on the target video by setting the cropping frame 13.
  • the cropping frame 13 can be understood as a cropping boundary determined according to the target size of the video to be processed to be converted.
  • the screens in the cropping frame 13 can be retained and the screens outside the cropping frame 13 can be deleted, so that the target video can be cropped to the size corresponding to each output requirement.
• In some embodiments, the processing device 112 may itself be the playback device that plays the target video to be cropped; in that case, the playback device can obtain the target video, crop its picture based on the device's own playback size, and automatically play the cropped target video. In some embodiments, the processing device 112 may be a smart device (such as a computer, a mobile phone, a smart wearable device, etc.) capable of performing the screen cropping operation, and the smart device may send the cropped target video to the playback device that plays it.
  • the process of cropping the target video screen as shown in FIG. 14 may be executed by the processing device 112.
  • the method for cropping a target video frame may be stored in a storage device (such as a storage device or memory of a processing terminal) in the form of a program or instruction.
  • the process 1400 of the method for cropping a target video frame may include the following steps:
  • Step 1410 Obtain the target video to be cropped.
• In some embodiments, the target video can be used as an advertising video. An advertising video can be understood as video content that uses flexible creativity to engage the audience group associated with it, so as to disseminate information to or market products to that audience group.
  • the target video may be presented to the audience in the form of a TV, an outdoor advertising screen, a web page or a pop-up window of an electronic device (such as a mobile phone, a computer, a smart wearable device, etc.).
• Screen cropping can be understood as cropping a video picture according to a preset picture size. Specifically, a cropping frame can be set based on the main information in the picture and the preset picture size (which can be understood as the target picture size), and the picture cropped based on the cropping frame.
  • the main information in the screen can include scenes, characters, commodities, and so on.
  • the target video generated in the foregoing steps can be directly obtained.
  • the process 1400 can run independently, that is, videos from other channels can be obtained as the to-be-processed video 14.
  • Step 1420 Determine one or more shot segments based on the target video.
  • Step 1430 Obtain the cropping subject information of each video segment contained in the target video, and the cropping subject information reflects the specific cropping subject of the video segment and the position information of the specific cropping subject.
  • the cropped subject may also be one of the aforementioned specific object and subject, and the method of obtaining may refer to the corresponding description.
  • the tailoring subject information can be determined by a machine learning method.
  • step 1430 can include obtaining the tailoring subject information in each shot segment (video segment) by using a machine learning model.
  • the machine learning model can identify the cropped subject in each of the shot segments, and the machine learning model can also obtain the cropped subject information while recognizing the cropped subject.
  • the cropped subject information can represent some information related to the cropped subject, and the cropped subject information is used to at least reflect the position of the cropped subject.
• at a minimum, the cropping subject information includes the location information and name information of the cropping subject.
  • the crop subject information may include position information, size information, name information, category information, etc. of the crop subject.
• the position information of the cropped subject can be understood as information about the subject's position within the picture of the target video; for example, it can be the coordinate information of a reference point.
  • the size information of the cropped subject may include actual size information of the cropped subject and information on the proportion of the cropped subject to the size of the target video frame.
  • the category information of the crop subject can be understood as the category of the crop subject.
  • the category information of the crop subject includes information about whether the subject of the crop is classified as a person or an object.
• Further, the category information of the crop subject can also include whether the subject is a skin care product, a toiletry, a car, etc. For example, if the crop subject is shampoo, the name information of the crop subject may be "shampoo", and the category information of the crop subject may be "toiletries".
  • the machine learning model may be a generative model, a decision model, or a deep learning model in machine learning.
• it may be a deep learning model using algorithms such as the YOLO series, Faster R-CNN, or EfficientDet.
  • the machine learning model can detect the set objects that need attention in each video frame. Objects that need attention can include creatures (humans, animals, etc.), commodities (cars, decorations, cosmetics, etc.), backgrounds (mountains, roads, bridges, houses, etc.), and so on. Further, for the target video, the object to be paid attention to can be further set, for example, it can be set as a human face or a product. Multiple shot fragments can be input into the machine learning model, and the machine learning model can output data such as name information and position information of the cropped subject in each shot fragment.
• the machine learning model can be trained based on a large number of labeled training samples. Specifically, the labeled training samples are input into the machine learning model, and training is performed by a common method (for example, gradient descent) to update the relevant parameters of the machine learning model. The training samples may be shot fragments and the cropped subjects included in the shot fragments, and can be obtained by calling data in the memory or the database.
• the label of a sample may indicate whether the object in the shot segment is the cropping subject: if yes, it is marked "1", otherwise "0". The labeling method may be manual marking, automatic marking by machine, or other methods, which is not limited in this embodiment.
  • the cropping theme information of the target video can also be obtained.
• The cropping theme information may be the aforementioned specific theme of the target video. For example, it can be keyword information in the title or introduction of the to-be-processed target video, tag information of the target video, user-defined information, or information already stored in a database.
• Step 1440 Crop the picture of each video segment contained in the target video according to the picture size preset by the video size condition and the cropping subject information.
  • step 1440 may include cropping the frame of the shot fragment according to the preset frame size and cropping subject information.
  • the picture size preset by the video size condition can be understood as the target size for cropping the picture of the target video.
  • the preset screen size may include the target aspect ratio of the screen, and may also include the target width and/or target height of the screen.
• Specifically, the aspect ratio and size of the cropping frame of each video frame are set according to the preset picture size, and each video frame is cropped based on its cropping frame: the picture outside the cropping frame is cropped away and the picture inside the cropping frame is retained.
• The user can manually input the preset picture size into the system according to the display and playback size of the target playback terminal of the target video. The system can also automatically obtain the optimal playback size of the target playback terminal for the target video; that optimal size can be stored in the device where the target video is to be played.
  • the cropping frame can be understood as the cropping boundary determined according to the target screen size of the screen cropping.
  • the cropping frame can be rectangular, parallelogram, circular, etc.
• step 1440 may further include the following steps: determining the size and initial position of the cropping frames of several video frames in the video segment according to the cropping subject information and the preset picture size; processing the initial positions of the cropping frames of the several video frames to determine the final positions corresponding to the cropping frames; and cropping the picture of each video frame of the video clip according to the final position of its cropping frame, so as to keep the picture inside the cropping frame.
  • the final positions of the cropping boxes of several video frames are determined according to the initial positions of the cropping boxes of several video frames contained in the video clip.
  • the initial position of the cropping frame can be understood as the position of the cropping frame preliminarily determined based on the cropping subject information and the preset screen size, and the final position of the cropping frame can be understood as the new position determined after processing the initial position information.
  • the initial position information may include the initial coordinate information of the reference point, and the final position information may include the final coordinate information of the reference point.
• In some embodiments, when determining the size and initial position of the cropping frames of several video frames in the video clip according to the cropping subject information and the preset picture size, the relevance between each cropping subject and the cropping theme information can first be determined according to the cropping theme information and the cropping subject information, and then the initial position and size of the cropping frame determined according to that relevance, the cropping subject information, and the preset picture size.
• For details, please refer to the relevant description in the part of FIG. 16.
• In some embodiments, when the size and initial position of the cropping frames of several video frames are determined according to the cropping subject information and the preset picture size, the aspect ratio of the cropping frame can first be determined according to the preset picture size, the initial position and size of the cropping frame then determined based on the position and size of the cropping subject and that aspect ratio, and the cropping frame finally scaled equally according to the preset picture size. For example, if the preset picture size is 800×800, the aspect ratio of the cropping frame is set to 1:1; if the preset picture size is 720×540, the aspect ratio of the cropping frame is set to 4:3.
  • the initial position and size of the cropping frame can be determined based on the overlapping area of the cropping frame with the same aspect ratio and the area where the crop subject is located in the picture of the video frame.
  • the cropping frame and the screen within the cropping frame are reduced or enlarged in proportion to width and height, so that the size of the cropping frame meets the preset screen size, thereby preventing black borders from appearing in the cropped screen.
  • for example, if the picture size of the video frame is 1024×768 and the preset picture size is 960×960, a 768×768 cropping frame can first be determined, the video frame cropped according to that cropping frame, and the cropped picture then enlarged at the same scale to 960×960, as illustrated in the sketch below.
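  • as a hedged illustration of this cropping-and-scaling step (a minimal sketch in Python; the function name and the integer arithmetic are our assumptions, not part of the original disclosure):

        from math import gcd

        def crop_and_scale(frame_w, frame_h, preset_w, preset_h):
            """Largest crop box matching the preset aspect ratio, plus the
            uniform scale factor needed to reach the preset picture size."""
            g = gcd(preset_w, preset_h)
            ar_w, ar_h = preset_w // g, preset_h // g      # e.g. 960x960 -> 1:1
            k = min(frame_w // ar_w, frame_h // ar_h)      # largest box that fits the frame
            crop_w, crop_h = ar_w * k, ar_h * k            # 1024x768 -> 768x768
            return crop_w, crop_h, preset_w / crop_w       # scale factor, here 1.25

        print(crop_and_scale(1024, 768, 960, 960))         # (768, 768, 1.25)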
  • determining the final position corresponding to the crop box of several video frames in the video clip may specifically include: selecting several video frames from all the video frames contained in the video clip, and judging whether the distance between the reference points (such as the center points) of the cropping frames of each pair of video frames separated by a preset number of frames is less than a preset distance; if the number of pairs of cropping frames whose distance is less than the preset distance is greater than a preset number of pairs, the position of the crop subject in the video clip can be understood to be relatively static.
  • in that case, the average position of the reference points of the crop boxes of all video frames contained in the video clip can be obtained, and the position of the crop box of each video frame adjusted based on the average position; if the number of pairs of crop frames whose distance is less than the preset distance is less than the preset number of pairs, the position of the crop subject in the video clip can be understood to be dynamically changing.
  • in that case, a smooth trajectory line can be determined based on the positions of the reference points of the crop boxes of all video frames contained in the video clip, and the position of the crop box of each video frame adjusted based on the trajectory line (for example, so that the reference point of the crop box of each video frame is located on the trajectory line).
  • the preset number of frames may be 2 frames, 3 frames, 5 frames, or the like. In other embodiments, a pair of video frames separated by a preset number of frames may also be a pair of adjacent video frames.
  • the reference point in this specification can be the center point of the cropping frame, the top left corner of a rectangular frame, the bottom right corner of a rectangular frame, and so on; the reference point is preferably the center point of the cropping frame, so as to reduce changes in the relative positions of the cropping frame and each crop subject within it when the cropping frame is moved. The static-versus-dynamic judgment is sketched below.
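  • a minimal sketch of the static-versus-dynamic judgment described above (Python; the relative-coordinate convention and all threshold values are illustrative assumptions):

        import numpy as np

        def subject_is_static(centers, frame_gap=3, dist_thresh=0.02, pair_thresh=5):
            """centers: (N, 2) array of crop-box reference points in relative
            coordinates; counts pairs `frame_gap` frames apart whose reference
            points moved less than `dist_thresh`."""
            pts = np.asarray(centers, dtype=float)
            deltas = np.linalg.norm(pts[frame_gap:] - pts[:-frame_gap], axis=1)
            return int((deltas < dist_thresh).sum()) > pair_thresh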
  • adjusting the position of the cropping frame of each video frame in the video clip may specifically include the following steps: smoothing the initial coordinate information of the reference points of the cropping frames of several video frames of the video clip over time; determining the final coordinate information of the reference points according to the result of the smoothing; and determining the position of each reference point based on the final coordinate information.
  • the initial coordinate information of the reference points of the crop frames of several video frames of the video clip is smoothed over time, which may be a linear regression processing of the coordinate values of the reference points.
  • for details of the linear regression processing, please refer to the relevant description of FIG. 15.
  • FIG. 15 is a schematic diagram of smoothing processing according to some embodiments of the present application.
  • smoothing the initial coordinate information of the reference points includes performing linear regression processing to obtain the linear regression equation and its slope.
  • specifically, linear regression can be performed on the initial coordinate information of the reference point of each cropping frame over time to obtain the linear regression equation, the fitted straight line segment (see FIG. 15), and the slope of the linear regression equation; based on the fitted straight line segment and the slope, the final coordinate information of the reference point of each cropping frame is obtained.
  • the absolute value of the slope can be compared with a slope threshold.
  • if the absolute value of the slope is less than the slope threshold, the position of the cropped subject in the video segment is considered to be relatively static, and the position of the midpoint of the fitted straight line segment is taken as the final position of the reference point of the cropping frame of each video frame; if the absolute value of the slope is greater than the slope threshold, the position of the cropped subject in the video segment is considered to be dynamically changing, and the position of each time point on the fitted straight line segment is taken as the final position of the reference point of the cropping frame of each video frame at the corresponding time point.
  • the slope threshold can be set to 0.1, 0.05, or 0.01, etc.; those skilled in the art can set the slope threshold according to the actual situation of the target video, for example, setting it higher (such as 0.1) when slow subject drift should still be treated as static.
  • as an example, linear regression is performed on a video segment consisting of 12 video frames.
  • in this example, a horizontal-screen video is converted to a vertical-screen video, so the vertical coordinate of the center of the cropping frame can be fixed at the center position 0.5, and only the horizontal coordinate needs to be smoothed.
  • the specific process of the linear regression processing is as follows: the 12 video frames correspond to 12 time points 1, 2, 3, ..., 12, and the initial relative abscissas of the reference points of the cropping frames of the video frames are 0.91, 0.87, 0.83, 0.74, 0.68, 0.61, 0.55, 0.51, 0.43, 0.39, 0.37, 0.34.
  • 12 data points are thus obtained, whose coordinates are: (1,0.91), (2,0.87), (3,0.83), (4,0.74), (5,0.68), (6,0.61), (7,0.55), (8,0.51), (9,0.43), (10,0.39), (11,0.37), (12,0.34).
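  • reproducing this example as a sketch (Python; the slope threshold value is illustrative):

        import numpy as np

        t = np.arange(1, 13)                               # time points 1..12
        x = np.array([0.91, 0.87, 0.83, 0.74, 0.68, 0.61,
                      0.55, 0.51, 0.43, 0.39, 0.37, 0.34]) # initial abscissas

        slope, intercept = np.polyfit(t, x, 1)             # fitted line x = slope*t + intercept
        print(round(slope, 3))                             # about -0.056

        slope_threshold = 0.01
        if abs(slope) <= slope_threshold:                  # relatively static subject
            final_x = np.full_like(x, slope * t.mean() + intercept)  # midpoint of fitted segment
        else:                                              # dynamically changing subject
            final_x = slope * t + intercept                # points on the fitted segment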
  • alternatively, the initial coordinate information of the reference points of the crop boxes of the multiple video frames included in each video segment is smoothed over time by a polynomial regression processing of the coordinate values. Specifically, a polynomial regression can be performed on the coordinate values of the reference point of each cropping frame over time to obtain a fitting curve. Then, the position of each time point on the fitting curve can be taken as the final position of the reference point of the crop frame of each video frame at the corresponding time point.
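  • continuing the sketch above, the polynomial variant only changes the fit degree (degree 3 here is an arbitrary choice, not specified by the text):

        coeffs = np.polyfit(t, x, deg=3)                   # polynomial regression over time
        final_x_poly = np.polyval(coeffs, t)               # positions on the fitting curve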
  • FIG. 16 is a flowchart of determining the size and position of the crop box of each video frame according to some embodiments of the present application.
  • one or more steps in the process 1600 may be stored as instructions in a storage device (for example, the database 140) and called and/or executed by a processing device (for example, the processing device 112 in the server 110).
  • Step 1610 Determine the correlation between one or more cropped subjects in the cropped subject information and the subject information according to the subject information and the cropped subject information of the target video.
  • the subject information may include a specific subject of the target video.
  • the correlation between a cropped subject and the subject information can be used to indicate the degree of relevance between the two; the higher the correlation, the more closely the cropped subject is related to the subject information.
  • for example, the correlation between "steering wheel cover" and "car interior" is greater than the correlation between "car door" and "car interior", and the correlation between "car door" and "car interior" is greater than the correlation between "hand cream" and "car interior".
  • the correlation between a cropped subject and the subject information can be obtained based on the explanatory texts of the two.
  • an explanatory text can be a textual description of the cropped subject or of the subject information.
  • for example, the explanatory text of "car interior" is: "car interior mainly refers to the car products used in the interior modification of a car, involving all aspects of the car interior; for example, car steering wheel covers, car seat cushions, car floor mats, car perfumes, car pendants, interior decorations, storage boxes, etc. are all car interior products."
  • the explanatory text of "steering wheel cover" is: "a steering wheel cover refers to a cover fitted on the steering wheel; the steering wheel cover is very decorative."
  • the explanatory texts of the cropped subjects and the subject information can be pre-stored in the system, or can be obtained from the Internet in real time based on the cropped subject's name and the subject information.
  • the representation vector of an explanatory text can be obtained based on a text embedding model such as word2vec, and the correlation between the cropped subject and the subject information can be obtained based on the distance between the representation vectors, as sketched below.
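  • one possible concrete form of this step (a sketch assuming a pretrained word2vec file loadable by gensim; the file name, the whitespace tokenizer, and the choice of cosine similarity are all our assumptions):

        import numpy as np
        from gensim.models import KeyedVectors

        wv = KeyedVectors.load_word2vec_format("word2vec.bin", binary=True)  # hypothetical file

        def embed(text):
            """Average the vectors of in-vocabulary tokens as a simple text embedding."""
            vecs = [wv[w] for w in text.lower().split() if w in wv]
            return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

        def correlation(subject_text, topic_text):
            """Cosine similarity of the two explanatory-text embeddings."""
            a, b = embed(subject_text), embed(topic_text)
            denom = np.linalg.norm(a) * np.linalg.norm(b)
            return float(a @ b / denom) if denom else 0.0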
  • the crop subjects can also be confirmed through candidate crop subjects.
  • a candidate crop subject is similar to the aforementioned candidate subject and can be input by the user or determined from each video clip through machine learning.
  • the method of determining the crop subjects can refer to the method of determining a specific subject based on the candidate subjects in the foregoing process, that is, using a machine learning model to obtain the candidate crop subjects in each of the video clips, and then selecting the one or more specific crop subjects from the candidate crop subjects according to the subject information of the target video.
  • Step 1620 Determine multiple candidate cropping frames corresponding to at least one video frame according to the preset picture size and the specific cropping subject.
  • at least one candidate cropping frame can be set in each video frame according to the preset picture size and the specific crop subjects. In a video frame that does not contain any crop subject, only one candidate cropping frame may be set, centered by default. In a video frame containing at least one crop subject, multiple candidate cropping frames may be set; the positions and/or sizes of the reference points of the multiple candidate cropping frames differ, while their aspect ratios are the same.
  • Step 1630 Score at least one candidate cropping frame according to the cropping subject information and the correlation degree.
  • for each candidate cropping frame, each crop subject it contains can first be scored based on the correlation between that crop subject and the subject of the target video; the score of each crop subject is determined, and the score of the candidate cropping frame is then calculated.
  • specifically, the correlation between each cropped subject and the video theme can be used as a weight and multiplied by the score of the corresponding cropped subject, and the products summed to obtain the score of each candidate cropping frame.
  • the score of each cropped subject may be the ratio of the area occupied by the cropped subject to the total area of the video frame.
  • for example, the subject of a video is "toilet care products".
  • a candidate cropping box in a certain frame of the video contains three complete cropped subjects: shampoo 1, shampoo 2, and face 1.
  • the correlations between shampoo 1, shampoo 2, and face 1 and "toilet care products" are 0.86, 0.86, and 0.45, respectively.
  • the crop subject scores of shampoo 1, shampoo 2, and face 1 are 0.35, 0.1, and 0.1, respectively, so the score of the candidate cropping box is 0.86×0.35+0.86×0.1+0.45×0.1=0.432, as computed in the sketch below.
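  • the weighted scoring of a candidate cropping box reduces to a sum of products, as in this sketch (Python; the data reproduce the example above):

        def score_crop_box(subjects):
            """subjects: (correlation_to_theme, area_ratio) pairs for the crop
            subjects fully contained in a candidate cropping box."""
            return sum(corr * area for corr, area in subjects)

        # shampoo 1, shampoo 2, face 1 from the example:
        print(score_crop_box([(0.86, 0.35), (0.86, 0.10), (0.45, 0.10)]))  # ~0.432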
  • Step 1640 Determine the size and position of the crop frame of at least one video frame based on the scoring result of the candidate crop frame.
  • the method for determining the size and position of the cropping frames of the video frames may take the candidate cropping frame with the highest score in the video segment as the standard.
  • in that case, the position of the reference point of that candidate cropping frame is taken as the final position of the reference points of the cropping frames of all video frames in the video clip, and its size is taken as the size of the cropping frames of all video frames in the video clip.
  • the method for determining the size and position of the crop frame of a video frame may also be to select the candidate crop frames ranked in the top Y by score in each video frame and perform a calculation over those Y candidate crop frames.
  • specifically, the average position of their reference points is taken as the position of the crop frame of the video frame, and the size of the highest-scoring candidate crop frame is taken as the size of the crop frame of the video frame.
  • the value of Y can be selected as 3, 4, 5, or 8, etc., and those skilled in the art can determine the value of Y according to the number of candidate cropping frames in each video frame.
  • in this way, the size and position of the crop frame are determined based on the cropped subject information and the correlation between the subject information of the target video and the crop subjects.
  • as a result, the cropped target video loses as little main information (information related to the subject information of the target video) as possible.
  • the method 1400 for cropping a target video frame may further include: step 1450: splicing the cropped shot fragments into a new target video in a predetermined order.
  • the predetermined sequence may be the sequence determined by the sequence information of the video configuration information, or may be a new splicing sequence set by the user.
  • each initial video and/or initial image can be cropped to be the same size.
  • this application may generate multiple target videos for delivery, and optimize the video generation algorithm of this application according to the feedback results of different videos.
  • the aforementioned multiple target videos may be generated based on different audiences, and correspondingly, the target videos are delivered to a specific audience group, where the audience refers to the target video delivery group.
  • the specific audience group may be a group of people of a specific age, gender, and behavior characteristics.
  • the specific age may refer to a younger audience (for example, if the proportion of users aged 15-40 on the delivery platform is 80%, the platform's users are considered to be younger), a middle-aged audience, an aging audience, and so on.
  • the gender can be characterized by a male to female ratio (for example, a male to female ratio of 1:3).
  • the behavior characteristics may include browsing habits, shopping preferences, and so on. For example, what kind of videos the users of the platform prefer to browse.
  • the delivery duration may be short (for example, less than one week) or long (for example, more than one week), etc.
  • the launch period can be a peak period (the 618 promotion period, the Double Eleven period), an idle period (a non-promotion period), etc.
  • the platform placement position can relate to, for example, whether the video is placed as a homepage recommendation.
  • the characteristics of the delivery platform can include the platform type (online platform (APP) or offline platform (airport, subway, etc.)), the APP type (video playback, music playback, learning, etc.), platform traffic characteristics (such as large traffic), and so on.
  • FIG. 17 is a flowchart of generating a target video according to a target video audience group shown in some embodiments of the present application.
  • one or more steps in the process 1700 may be stored in a storage device (for example, the database 140) in the form of instructions and called and/or executed by a processing device (for example, the processing device 112 in the server 110).
  • Step 1710 Obtain the audience acceptance of multiple video clips.
  • the video clip may have obvious relevance to the video audience.
  • a video clip containing Ultraman may be popular with a child audience, while a middle-aged audience may not be interested.
  • Audience acceptance is an indicator that describes the relevance of video clips and video audiences.
  • the audience acceptance of a video clip can be obtained by inputting the video clip or the elements of the video clip (for example, each tag ID and tag value) into a trained machine learning model.
  • the machine learning model can be determined according to the delivery effect of each video clip under different audiences, and the relevant description of the delivery effect can refer to the subsequent step 1740.
  • Step 1720 For a specific audience group, determine candidate fragments whose audience acceptance is higher than the threshold from a plurality of video fragments according to corresponding demographic conditions, so as to generate a target video.
  • audience acceptance may be specifically expressed as a specific tag of the video clip and its corresponding tag value.
  • for example, the tag ID of the audience acceptance of female audiences may be #61, and the tag ID of the audience acceptance of male audiences may be #62.
  • the corresponding tag value is the specific audience acceptance.
  • the audience acceptance can be quantitatively described.
  • for example, the tag value corresponding to the audience acceptance can be any real number within [-1,1], where a positive value means like, a negative value means dislike, and 0 indicates no interest; the greater the absolute value of the tag value, the higher the degree.
  • the threshold in step 1720 may be a value in the range of tag values that represents a preference, such as 0.5.
  • the audience acceptance can be qualitatively described.
  • for example, the tag value corresponding to the audience acceptance can be -3, -2, -1, 0, 1, 2, 3.
  • such values can be expressed with a four-bit binary tag, where the first two bits of the tag value indicate dislike, the last two bits indicate like, and a tag value of 0 indicates no interest.
  • in this case, the threshold may be a value indicating a preference, for example, the first two bits of the tag value being 00 and the last two bits being greater than 01; a toy encoding is sketched below.
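  • a toy encoding of the two acceptance schemes just described (Python; the function names and the exact bit layout are our reading of the text, not a normative format):

        def accepts_quantitative(tag_value, threshold=0.5):
            """Tag value in [-1, 1]; a positive value means like."""
            return tag_value > threshold

        def accepts_qualitative(tag_bits):
            """Four-bit tag: first two bits = dislike, last two bits = like;
            acceptance here means no dislike and a like level above 01."""
            return tag_bits[:2] == "00" and int(tag_bits[2:], 2) > 0b01

        print(accepts_quantitative(0.7), accepts_qualitative("0001"))  # True False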
  • the first preset condition in the process 800 may include that the audience acceptance in step 1720 is higher than the threshold.
  • the implementation of step 1720 can refer to the related description of step 820 in the process 800.
  • Step 1730 Obtain the delivery effect feedback of the target video, and adjust at least one of demographic conditions or audience acceptance according to the delivery effect feedback.
  • the delivery effect may be determined by at least one of related indicators such as the click-through rate, the completion rate, the number of replays, or the number of viewers of the target video. For example, a higher completion rate of a target video can be taken to mean that the audience likes the video, and a higher number of replays can indicate a higher degree of liking.
  • the delivery effect of each video segment of the target video can also be determined. For example, it can be determined based on the ratio of each video segment's switch-away amount (the number of users who stop playback of the video at this segment and switch to the next video) to its playback volume; this ratio reflects how much users like the video clip.
  • the delivery effect of a video clip may also be determined according to the parts users skip or replay after delivery, where a skipped video clip indicates a poor playback effect.
  • the delivery effect may be input to the machine learning model in step 1710 as feedback to realize the re-determination of demographic characteristics or audience acceptance.
  • the corresponding tag value of the corresponding video segment can be directly modified according to the delivery effect of each video segment.
  • the delivery effect of the target video can be estimated based on the aforementioned determined delivery effect and the machine learning model, and the target video with the delivery effect higher than a preset value can be delivered.
  • the estimated effect data of the target video may be determined based on the element effect parameters of at least one advertisement element of the advertisement creative.
  • advertisement elements can be understood as the component units of an advertisement creative, which may specifically include main shot elements (the aforementioned specific objects, subjects, cropped subjects, etc.), decorative elements (background pictures, models, copywriting, trademarks, product pictures, and/or promotional logos, etc.), and element presentation methods (animation actions, AE templates, etc.).
  • the data related to the delivery effect of the advertisement creative may also include click-through rate, exposure rate, conversion rate, return on investment, and the like.
  • Click-through rate can be understood as the click volume of an online advertisement creative, that is, the actual number of clicks on the advertisement.
  • Exposure rate can be understood as the number of times an ad creative is displayed per unit time. For example, suppose a certain network media has 3000 views per day. If an advertisement exclusively occupies a certain advertising space, the daily exposure of the advertisement will be 3000; if the advertising space displays 3 advertisements in turn, the daily exposure of each advertisement will be 3000/3 = 1000.
  • Conversion rate can be understood as the ratio of the number of times an advertisement is clicked and converted to further effects (such as purchase and payment orders) to the number of advertisement clicks.
  • the return on investment can be understood as the ratio of the return on advertising to the cost of input.
  • the delivery effect data of the placed advertising creative may include the effect data of a preset time period, such as one day, one week, one month, or one quarter.
  • the placement effect data of the advertisement creative that has been placed may also include the placement effect change trend of the advertisement creative over time, season, and platform audience characteristics.
  • FIG. 18 is a flowchart of confirming the estimated effect data of the target video according to some embodiments of the present application.
  • one or more steps in the process 1800 may be stored as instructions in a storage device (for example, the database 140) and called and/or executed by a processing device (for example, the processing device 112 in the server 110).
  • Step 1810 Obtain an advertisement element effect prediction model.
  • the advertising element effect prediction model may be a model that can score each advertising element.
  • the advertising element effect prediction model can be a trained machine learning model: several placed advertisement creatives containing an advertising element, together with their delivery effect data, are input as features to the trained advertising element effect prediction model, which outputs the score of the advertising element.
  • the advertising element effect estimation model can be determined based on the aforementioned specific audience group and the specific theme of the target video.
  • for example, the specific audience group may be women, and the specific theme of the target video may correspond to a mouthwash advertisement.
  • the corresponding advertising element effect prediction model is then one for cleaning and/or washing-and-care daily necessities targeted at women.
  • Step 1820 Input the advertisement element marked with at least one element tag into the advertisement element effect prediction model, and determine the element effect parameter of the advertisement element.
  • the at least one element tag includes the relationship between the advertisement element and the creative advertisement, where the element effect parameter of the advertisement element may refer to the contribution amount of the advertisement element in a certain period of time.
  • the element tags may be equivalent to the various elements in the foregoing step 230, where the relationship between the advertisement element and the advertisement creative may be the relationship between a specific object, subject, or cropped subject in the video clip and a specific subject of the target video, which may be represented by a correlation.
  • the relationship between the advertisement element and the creative advertisement may also include the relationship between the advertisement element and a specific occasion (such as a subway station, a railway station, a giant screen in a city center), and a specific time (such as Double Eleven, Valentine's Day).
  • step 1820 may first determine a number of placed advertisement creatives containing the advertisement element; based on the delivery effect data of the advertisements that have been placed, the delivery effect data of the several placed advertising creatives containing the advertisement element is determined; the server then takes the average, median, cumulative sum, or weighted cumulative sum of the delivery effect data of those creatives as the element effect parameter of the advertisement element.
  • for example, the placed advertisement creatives containing the advertisement element a are M1, M2, M3, and M4.
  • the average click-through rates of M1, M2, M3, and M4 are 1000, 2000, 500, and 3500, respectively, so the element effect parameter of element a, taken as the average, is (1000+2000+500+3500)/4 = 1750, as computed below.
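  • the aggregation in this example is simple descriptive statistics, for instance (Python; the function name is illustrative):

        from statistics import mean, median

        def element_effect(effect_data, mode="mean"):
            """Aggregate the delivery-effect data of creatives containing an element."""
            return mean(effect_data) if mode == "mean" else median(effect_data)

        print(element_effect([1000, 2000, 500, 3500]))     # 1750 for element a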
  • the element effect parameter of the advertisement element may include data obtained by numerical statistical calculation through the placement effect data of the advertisement creative that contains the advertisement element, which can intuitively reflect the contribution of the advertisement element over a period of time.
  • the element effect parameter of each advertisement element may be determined through a placement experiment.
  • orthogonal experimental design can be used to calculate the smallest set of advertising creatives that covers the most advertising elements, so that effect data for the most advertising elements can be obtained by placing the fewest creatives. Further, advertising elements can first be classified, with the requirement that all advertising elements of certain specified categories (such as models or copywriting) must be delivered; the orthogonal experiment algorithm then calculates the minimum advertising creative set covering all advertising elements of the specified categories.
  • Step 1830 Based on the element effect parameter of the at least one advertisement element, determine an advertisement element that meets expectations among the at least one advertisement element, where the element effect parameter of an advertisement element that meets expectations is greater than a parameter threshold.
  • Step 1840 Determine the proportion of advertisement elements that meet the expectations in at least one advertisement element in the creative advertisement.
  • the proportion of advertisement elements that meet expectations in the at least one advertisement element of the advertisement creative may be the proportion that each advertisement element meeting expectations occupies in the combination of video clips making up the target video.
  • Step 1850 Determine the estimated effect data of the target video based on the ratio.
  • the estimated effect data of the target video can be obtained by estimation based on the delivery effect data of each advertising element in the target video and the delivery effect data of the several advertisement creatives containing those advertising elements.
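  • one simple way to fold the proportions into an estimate, offered only as a toy sketch (the share-weighted aggregation is our assumption; the text does not fix a formula):

        def estimate_video_effect(elements, parameter_threshold):
            """elements: (effect_parameter, share_in_clip_combination) pairs.
            Share-weighted average over the elements meeting expectations."""
            meeting = [(p, s) for p, s in elements if p > parameter_threshold]
            total_share = sum(s for _, s in meeting)
            return sum(p * s for p, s in meeting) / total_share if total_share else 0.0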
  • FIG. 19 is an exemplary flowchart of a model training process according to some embodiments of the present application.
  • one or more steps in the process 1900 may be stored as instructions in a storage device (for example, the database 140) and called and/or executed by a processing device (for example, the processing device 112 in the server 110).
  • the process 1900 may be used to train the first initial model.
  • Step 1910 Obtain the first training set.
  • the first training set refers to the training sample set used to train the first initial model.
  • the first training set includes multiple video pairs, where each video pair includes the image features corresponding to a first sample video, the image features corresponding to a second sample video, and the label value corresponding to the two sample videos. The image features corresponding to the first sample video and the second sample video can be obtained through a feature extraction process.
  • the tag value in the video pair reflects the degree of similarity between the first sample video and the second sample video.
  • the label value in the sample set can be manually annotated, or the video pair can be automatically annotated by the corresponding machine learning model.
  • the degree of similarity of each video pair can be obtained from the trained classifier model.
  • the method for obtaining the first training set may be from an image collector such as a camera, a video camera, a smart phone, or the terminal device 130. In some embodiments, the method of obtaining the first training set may be to read directly from a storage system that stores a large number of pictures. In some embodiments, the first training set may also be obtained in any other manner, which is not limited in this embodiment.
  • the first initial model can be understood as an untrained neural network model or a neural network model whose training has not been completed.
  • the first initial model may be or include the trained feature extraction model and/or the initial model corresponding to the discriminant model described in the process 1100.
  • Each layer of the initial model can be set with initial parameters, and the parameters can be adjusted continuously during the training process until the training is completed.
  • Step 1920 Based on the first training set, train a first initial model through multiple iterations to generate a trained first neural network model.
  • the first neural network model may be or include the trained feature extraction model and/or the discriminant model described in the process 1100. Each iteration further includes the following steps.
  • Step 1921 Use the updated first feature extraction model to process the image feature corresponding to the first sample video in the video pair to obtain the corresponding first segment feature.
  • Step 1922 Use the updated second feature extraction model to process the image feature corresponding to the second sample video in the same video pair to obtain the second segment feature.
  • Step 1923 Use the updated discriminant model to process the first segment feature and the second segment feature to generate a discrimination result, and the discrimination result is used to reflect the degree of similarity between the first segment feature and the second segment feature.
  • Step 1924 Determine whether to perform the next iteration or determine the trained first neural network model based on the discrimination result and the label value.
  • a loss function can be constructed based on the discriminant result and the sample label, and the model parameters can be updated based on the loss function backpropagation.
  • the training sample label data can be expressed as y 1 , the discrimination result can be expressed as ŷ 1 , and the calculated loss function value is expressed as Loss 1 .
  • different loss functions can be selected according to the type of the model, such as the mean square error loss function or the cross entropy loss function as the loss function, which is not limited in this specification.
  • a gradient backpropagation algorithm can be used to update the model parameters.
  • the backpropagation algorithm compares the prediction results of a specific training sample with the label data, and determines the update range of each weight of the model.
  • the backpropagation algorithm is used to determine the change of the loss function relative to each weight (also called the gradient or error derivative), which can be recorded as ∂Loss/∂w.
  • the gradient backpropagation algorithm can pass the value of the loss function through the output layer, pass it back to the hidden layer and the input layer layer by layer, and determine the correction value (or gradient) of the model parameter of each layer in turn.
  • the correction value (or gradient) of the model parameters of each layer includes multiple matrix elements (such as gradient elements), which correspond to the model parameters one-to-one; each gradient element reflects the correction direction (increase or decrease) of the parameter and the correction amount.
  • after the gradient has been propagated back through the discriminant model, it is further transferred to the first segment feature extraction model and the second segment feature extraction model, which return the model parameters layer by layer to complete one round of iterative update.
  • the joint training of the first segment feature extraction model, the second segment feature extraction model, and the discriminant model adopts a unified loss function for training, and the training efficiency is higher.
  • based on the discrimination result and the label value, it is possible to determine whether to perform the next iteration or to finalize the trained first neural network model.
  • the criterion for the judgment may be whether the number of iterations has reached a preset number, whether the updated model meets a preset performance index threshold, or whether an instruction to terminate training has been received, etc. If it is determined that the next iteration is required, the next iteration is performed based on the model updated in the current iteration; in other words, the updated model obtained in the current iteration serves as the initial model for the next iteration. If the next iteration is not necessary, the updated model obtained in the current iteration is used as the final trained model.
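  • a compact sketch of one such training round (PyTorch; the GRU extractors, layer sizes, and the MSE loss are illustrative stand-ins for the unspecified architectures):

        import torch
        import torch.nn as nn

        feat_1 = nn.GRU(input_size=128, hidden_size=64, batch_first=True)  # first extractor
        feat_2 = nn.GRU(input_size=128, hidden_size=64, batch_first=True)  # second extractor
        disc = nn.Sequential(nn.Linear(128, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
        params = [*feat_1.parameters(), *feat_2.parameters(), *disc.parameters()]
        opt = torch.optim.Adam(params, lr=1e-3)
        loss_fn = nn.MSELoss()                      # one unified loss for all three parts

        def train_step(img_feats_1, img_feats_2, label):
            """img_feats_*: (batch, frames, 128) image features; label: similarity in [0, 1]."""
            _, h1 = feat_1(img_feats_1)             # first segment feature
            _, h2 = feat_2(img_feats_2)             # second segment feature
            pred = disc(torch.cat([h1[-1], h2[-1]], dim=1)).squeeze(1)
            loss = loss_fn(pred, label)             # Loss_1 in the text
            opt.zero_grad()
            loss.backward()                         # gradients flow back through all three models
            opt.step()
            return loss.item()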
  • FIGS. 20A-20E are schematic diagrams of video synthesis systems according to some embodiments of the present application.
  • the multimedia system 2000 may include an acquisition module 2010, a configuration module 2020, and a generation module 2030.
  • the obtaining module 2010 may be used to obtain multiple video clips.
  • the configuration module 2020 may be used to obtain video configuration information.
  • the generating module 2030 may be configured to generate a target video based on the at least part of the video clip and the video configuration information.
  • the generating module 2030 may also be referred to as a target video generating module.
  • step 210 may be implemented by the acquisition module 2010, step 220 may be implemented by the configuration module 2020, and step 230 may be implemented by the generation module 2030.
  • the acquisition module 2010 may further include a media acquisition module 2011, a segmentation module 2013, and a material processing module 2015.
  • the material processing module 2015 also includes a video processing module 2015a and an image processing module 2015b.
  • the configuration module 2020 may further include an identification module 2021, where the identification module 2021 may also be referred to as a subject acquisition module.
  • the generating module 2030 may also include a screening module 2031, a combination module 2033, and a video synthesis module 2035.
  • the multimedia system 2000 may further include a post-processing module 2040, where the post-processing module 2040 may include a cropping module 2041 and an effect estimation module 2043.
  • the media acquisition module 2011 may be used to acquire the initial video or initial image to implement steps 310, 610, and 810 related to the initial video or initial image.
  • the media acquisition module 2011 may also be used to acquire initial audio to implement step 1310.
  • the segmentation module 2013 may be used to segment the video file according to the lens frame, so as to implement steps 320, steps 420 to 440, step 1330 and other steps related to segmentation of the lens frame.
  • the segmentation module 2013 may also be used to determine the clipping point of the audio file, so as to implement step 1320.
  • the material processing module 2015 can be used to generate video clips, and can also be used to process segmented video materials, such as rendering, beautifying, and applying video templates; it can also be used to combine different types of materials, for example, merging the audio file with the video file in step 1340.
  • the material processing module 2015 may specifically include a video processing module 2015a for processing video materials corresponding to the initial video and a picture processing module 2015b for processing image materials corresponding to the initial image.
  • the configuration module 2020 may further include an identification module 2021 for identifying a subject.
  • the identification module 2021 can be combined with other modules, and the subject it identifies changes to the corresponding content accordingly; for example, when combined with the cropping module 2041, the corresponding subject is a cropped subject.
  • the screening module 2031 may be specifically configured to screen video files according to conditions. For example, in step 820, candidate video clips may be screened from clips according to a first preset condition. The screening module 2031 can also determine the initial image, the initial video, and the video segment related to the target video according to whether the subject is included.
  • the combining module 2033 may be specifically configured to generate a segment combination according to the candidate video segments, and the combining module 2033 may also be configured to determine a set of segments used to generate the target video according to a second preset condition.
  • the video synthesis module 2035 may specifically generate the target video according to the segment set.
  • the post-processing module 2040 may be used to reprocess the target video after the target video is generated, for example, to target the target video to a specific audience.
  • the cropping module 2041 may be used to modify the size of the video file, for example, modify the size of the target video according to the size of the playback medium.
  • the effect estimation module 2043 is used to estimate the playback effect of the target video.
  • the media acquisition module 2011 can also be regarded as the acquisition module 2010. It is understandable that the above modules can be combined according to actual needs to implement different methods. For example, as shown in FIG. 20C, the media acquisition module 2011, the subject acquisition module (i.e., the recognition module 2021), the video processing module 2015a, the image processing module 2015b, and the target video generation module (i.e., the generation module 2030) can be combined to generate video using video material and image material.
  • for another example, as shown in FIG. 20D, the acquisition module 2010, the segmentation module 2013, the screening module 2031, the combining module 2033, and the video synthesis module 2035 can be combined to realize the splitting and reorganization of video files.
  • for another example, the acquisition module 2010, the segmentation module 2013, the recognition module 2021, and the cropping module 2041 can be combined to achieve precise cropping of a specific video file.
  • a computer storage medium may contain a propagated data signal containing a computer program code, for example on a baseband or as part of a carrier wave.
  • the propagated signal may have multiple manifestations, including electromagnetic forms, optical forms, etc., or suitable combinations.
  • the computer storage medium may be any computer-readable medium other than a computer-readable storage medium, and the medium may be connected to an instruction execution system, apparatus, or device to realize communication, propagation, or transmission of the program for use.
  • the program code located on the computer storage medium can be transmitted through any suitable medium, including radio, cable, fiber optic cable, RF, or similar medium, or any combination of the above medium.
  • the computer program code required for the operation of each part of this specification can be written in any one or more programming languages, including object-oriented programming languages such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB.NET, and Python, conventional procedural programming languages such as C, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP, and ABAP, dynamic programming languages such as Python, Ruby, and Groovy, or other programming languages.
  • the program code can run entirely on the user's computer, or as an independent software package on the user's computer, or partly on the user's computer and partly on a remote computer, or entirely on the remote computer or processing equipment.
  • the remote computer can be connected to the user's computer through any form of network, such as a local area network (LAN) or a wide area network (WAN), or connected to an external computer (for example, via the Internet), or used in a cloud computing environment, or as a service such as software as a service (SaaS).
  • in some embodiments, numbers describing quantities of components and attributes are used. It should be understood that such numbers used in the description of the embodiments are modified in some examples by the terms "about", "approximately", or "substantially". Unless otherwise stated, "about", "approximately", or "substantially" indicates that the number is allowed to vary by ±20%.
  • accordingly, in some embodiments, the numerical parameters used in the description and claims are approximate values, and the approximate values can change according to the required characteristics of individual embodiments. In some embodiments, the numerical parameters should take into account the prescribed significant digits and adopt a general digit-retention method. Although the numerical ranges and parameters used to confirm the breadth of ranges in some embodiments of this specification are approximate values, in specific embodiments such numerical values are set as accurately as feasible.

Abstract

A system and a method for generating a video. Said method comprises acquiring a plurality of video segments (210). Said method further comprises acquiring video configuration information (220), the video configuration information comprising one or more configuration features of at least some video segments among the plurality of video segments, and the configuration features including at least one of a content feature or an arrangement feature. Said method further comprises generating a target video on the basis of at least some video segments and the video configuration information (230).

Description

System and method for generating video

Cross Reference

This application claims priority to Chinese patent application 202010578632.1 filed on June 23, 2020, Chinese patent application 202010741962.8 filed on July 29, 2020, Chinese patent application 202010738213.X filed on July 28, 2020, and Chinese patent application 202110503297.3 filed on May 10, 2021, the entire contents of which are incorporated herein by reference.

Technical Field

This application relates to video processing, and in particular, to a system and method for generating video.

Background

With the development of multimedia technology, video has become one of the information media people encounter most often. However, many problems remain in the processes of video cropping, shot combination, and terminal playback. For example, automatic shot combination by a system suffers from problems such as monotonous content, high repetition, and low subject focus. For another example, when the same video is played on different playback terminals, target objects (for example, people or products) may be covered or displayed incompletely. Therefore, there is an urgent need for a method and system for generating video that meets diverse video production and playback needs.

Summary of the Invention
One aspect of the present application provides a system for generating video. The system includes at least one storage medium storing a set of instructions and at least one processor configured to communicate with the at least one storage medium, wherein when the set of instructions is executed, the at least one processor is caused to perform one or more operations, the operations including: obtaining multiple video clips; obtaining video configuration information, where the video configuration information includes one or more configuration features of at least some of the multiple video clips, and the configuration features include at least one of content features or arrangement features; and generating a target video based on the at least some video clips and the video configuration information.
Another aspect of the present application provides a method for generating video. The method is executed by a processing device including at least one memory and at least one processor, and the method includes: obtaining multiple video clips; obtaining video configuration information, where the video configuration information includes one or more configuration features of at least some of the multiple video clips, and the configuration features include at least one of content features or arrangement features; and generating a target video based on the at least some video clips and the video configuration information.
Another aspect of the present application provides a non-transitory computer-readable medium. The non-transitory computer-readable medium includes at least one set of instructions for determining an optimal strategy, wherein when executed by one or more processors of a computing device, the at least one set of instructions causes the computing device to execute a method, the method including: obtaining multiple video clips; obtaining video configuration information, where the video configuration information includes one or more configuration features of at least some of the multiple video clips, and the configuration features include at least one of content features or arrangement features; and generating a target video based on the at least some video clips and the video configuration information.
In some embodiments, the obtaining multiple video clips includes: obtaining at least one of an initial image or an initial video; and performing editing processing on the initial image or the initial video to obtain the multiple video clips.
In some embodiments, the performing editing processing on the initial image or the initial video to obtain the multiple video clips includes: obtaining the features of each pair of adjacent images or video frames in the initial image or the initial video; determining the similarity of each pair of adjacent images or video frames; identifying segment boundaries based on the similarity of each pair of adjacent images or video frames; and dividing the initial image or the initial video based on the segment boundaries to obtain the multiple video segments.
In some embodiments, each video segment of the multiple video segments is a shot segment.
In some embodiments, the performing editing processing on the initial image or the initial video to obtain the multiple video clips includes: determining subject information of the initial image or the initial video, the subject information including at least a subject and a subject position; and performing editing processing on the initial image or the initial video based on the subject information to obtain the multiple video clips.
In some embodiments, the determining the subject information in the initial image or the initial video includes: obtaining a subject information determination model; and determining the subject information by inputting the initial image or the initial video into the subject information determination model.
In some embodiments, the performing editing processing on the initial image or the initial video based on the subject information includes: recognizing the outer contour of the subject in the initial image or the initial video based on the subject information; and cropping or scaling the initial image or the initial video according to the outer contour of the subject.
在一些实施例中,所述视频配置信息包括第一预设条件和第二预设条件。In some embodiments, the video configuration information includes a first preset condition and a second preset condition.
在一些实施例中,所述基于所述至少部分视频片段和所述视频配置信息,生成一个目标视频包括:基于所述第一预设条件从所述多个视频片段中获取一个或多个候选视频片段。对所述一个或多个候选视频片段进行分组以确定至少一个片段集合。基于所述至少一个片段集合中的每个片段集合,生成一个目标视频。In some embodiments, the generating a target video based on the at least part of the video clips and the video configuration information includes: obtaining one or more candidates from the multiple video clips based on the first preset condition Video clips. The one or more candidate video segments are grouped to determine at least one segment set. Based on each segment set in the at least one segment set, a target video is generated.
在一些实施例中,所述第一预设条件与多个要素中的至少一个相关,所述多个要素包括目标视频包含特定对象,目标视频包含特定主题,目标视频的总时长,目标视频所包含的镜头画面数量,目标视频所包含特定的镜头画面,目标视频中特定主题的重叠数量,或目标视频中特定主题的聚焦时间。In some embodiments, the first preset condition is related to at least one of a plurality of elements, the plurality of elements including that the target video contains a specific object, the target video contains a specific subject, the total duration of the target video, and the target video The number of shots included, the specific shots contained in the target video, the number of overlaps of a specific subject in the target video, or the focusing time of a specific subject in the target video.
在一些实施例中,所述第一预设条件包括所述至少一个要素的值大于相应的阈值。In some embodiments, the first preset condition includes that the value of the at least one element is greater than a corresponding threshold.
在一些实施例中,所述第一预设条件还包括所述多个要素中两个或以上特定要素之间的要素约束条件。In some embodiments, the first preset condition further includes element constraint conditions between two or more specific elements in the plurality of elements.
在一些实施例中,所述第一预设条件包括目标视频中镜头画面的绑定条件,所述绑定条件反映至少两个特定镜头画面在目标视频中的关联关系,所述基于第一预设条件从所述多个视频片段中获取一个或多个候选视频片段包括:从所述多个视频片段中确定包含指定镜头画面的视频片段。基于所述绑定条件将包含指定镜头画面的视频片段组合,以作为一个候选视频片段。In some embodiments, the first preset condition includes a binding condition of a shot frame in the target video, and the binding condition reflects the association relationship of at least two specific shot frames in the target video, and the binding condition is based on the first preset condition. It is assumed that obtaining one or more candidate video clips from the plurality of video clips includes: determining a video clip containing a specified shot picture from the plurality of video clips. Based on the binding condition, the video clips including the specified shots are combined to serve as a candidate video clip.
在一些实施例中,所述至少一个片段集合包括两个及以上片段集合,所述两个及以上片段集合满足所述第二预设条件,所述第二预设条件与所述两个及以上片段集合之间的候选视频片段的组合差异度相关。In some embodiments, the at least one fragment set includes two or more fragment sets, and the two or more fragment sets satisfy the second preset condition, and the second preset condition is consistent with the two and The combination difference degree of the candidate video clips between the above clip sets is related.
在一些实施例中,所述对所述一个或多个候选视频片段进行分组以确定所述至少一个片段集合包括:确定所述两个及以上片段集合中的每个片段集合与其他片段集合之间的候选视频片段的组合差异度。将与其他片段集合的组合差异度高于预设阈值的片段集合作为所述至少一个片段集合。In some embodiments, the grouping the one or more candidate video clips to determine the at least one clip set includes: determining the difference between each of the two or more clip sets and other clip sets. The degree of difference in the combination of candidate video clips between. A segment set whose combination difference with other segment sets is higher than a preset threshold is used as the at least one segment set.
在一些实施例中,所述确定所述两个及以上片段集合中的每个片段集合与其他片段集合之间的候选视频片段的组合差异度包括:为所述一个或多个候选视频片段中的每一个赋予一个标识字符。基于所述一个或多个候选视频片段的标识字符,确定对应于所述片段集合与其他片段集合的字符串。将所述片段集合与其他片段集合对应的字符串的编辑距离确定为所述片段集合与其他片段集合之间的候选视频片段的组合差异度。In some embodiments, the determining the degree of difference in combination of candidate video segments between each of the two or more segment sets and other segment sets includes: Each of the is assigned an identifying character. Based on the identification characters of the one or more candidate video segments, a character string corresponding to the segment set and other segment sets is determined. The edit distance of the character strings corresponding to the segment set and the other segment sets is determined as the combination difference degree of the candidate video segments between the segment set and the other segment sets.
在一些实施例中,所述确定所述两个及以上片段集合中的每个片段集合与其他片段集合之间的候选视频片段的组合差异度包括:基于训练好的特征提取模型以及所述两个及以上片段集合中的候选视频片段,生成 每个候选视频片段对应的片段特征。基于所述片段特征生成每个片段集合对应的集合特征向量。基于训练好的判别模型以及所述每个片段集合对应的集合特征向量确定每个片段集合与其他片段集合之间的相似程度。基于所述相似程度确定每个片段集合与其他片段集合之间的组合差异度。In some embodiments, the determining the combination difference of candidate video segments between each of the two or more segment sets and other segment sets includes: based on a trained feature extraction model and the two Generate a segment feature corresponding to each candidate video segment from the candidate video segments in the set of more than one segment. Based on the segment features, a set feature vector corresponding to each segment set is generated. The degree of similarity between each segment set and other segment sets is determined based on the trained discriminant model and the set feature vector corresponding to each segment set. Based on the degree of similarity, the degree of combined difference between each segment set and other segment sets is determined.
在一些实施例中,所述基于训练好的判别模型以及所述每个片段集合对应的集合特征向量确定每个片段集合与其他片段集合之间的相似程度包括:基于聚类算法对所述集合特征向量进行聚类,获得多个集合聚类簇。In some embodiments, the determining the degree of similarity between each fragment set and other fragment sets based on the trained discriminant model and the set feature vector corresponding to each fragment set includes: comparing the set based on a clustering algorithm The feature vector is clustered, and multiple clusters are obtained.
In some embodiments, the feature extraction model is a sequence-based machine learning model, and the generating of the segment feature corresponding to each candidate video segment based on the trained feature extraction model and the candidate video segments in the two or more segment sets includes: obtaining the multiple video frames contained in each candidate video segment; determining one or more image features corresponding to each video frame; and processing, with the trained feature extraction model, the image features of the multiple video frames and the interrelationships between the image features of the multiple video frames to determine the segment feature corresponding to the candidate video segment.
In some embodiments, the image features corresponding to a video frame include at least one of shape information of an object in the video frame, positional relationship information between multiple objects in the video frame, color information of an object in the video frame, the degree of completeness of an object in the video frame, or the brightness of the video frame.
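One possible reading of the sequence-based feature extraction model above is a recurrent network that consumes per-frame image features; the following PyTorch sketch is illustrative only, and its dimensions and module names are assumptions rather than the application's implementation:

```python
import torch
import torch.nn as nn

class SegmentFeatureExtractor(nn.Module):
    """Sequence model turning per-frame image features into one segment feature."""

    def __init__(self, frame_feat_dim=256, segment_feat_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(frame_feat_dim, segment_feat_dim, batch_first=True)

    def forward(self, frame_feats):
        # frame_feats: (batch, num_frames, frame_feat_dim); the recurrence
        # models the interrelationship between features of successive frames.
        _, (h_n, _) = self.lstm(frame_feats)
        return h_n[-1]  # (batch, segment_feat_dim): one vector per segment

# Hypothetical usage: 8 candidate segments, 30 frames each, 256-dim
# per-frame features (shape, color, brightness, etc., already encoded).
frames = torch.randn(8, 30, 256)
segment_features = SegmentFeatureExtractor()(frames)
print(segment_features.shape)  # torch.Size([8, 128])
```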
In some embodiments, the generating of a target video based on the at least part of the video segments and the video configuration information includes: generating multiple candidate segment sets based on the multiple video segments, the multiple candidate segment sets satisfying a second preset condition; selecting at least one segment set from the multiple candidate segment sets based on a first preset condition; and generating a target video based on each segment set of the at least one segment set.
In some embodiments, the video configuration information further includes sequence information, and the generating of a target video based on each segment set of the at least one segment set includes: sorting and combining, based on the sequence information, the candidate video segments in each segment set to generate a target video.
In some embodiments, the video configuration information further includes beautification parameters, and the beautification parameters include at least one of filter parameters, animation parameters, or layout parameters.
In some embodiments, the method further includes: obtaining, based on the video configuration information, a text layer, a background layer, or a decoration layer, together with loading parameters; and determining, according to the loading parameters, the layout of the text layer, the background layer, and the decoration layer in the target video.
In some embodiments, the method further includes: normalizing the multiple video segments.
In some embodiments, the method further includes: obtaining initial audio; marking the initial audio based on its rhythm to obtain at least one audio segmentation point; determining at least one video segmentation point of the target video based at least in part on the video configuration information; matching the at least one audio segmentation point with the at least one video segmentation point; and synthesizing the segmented audio with the target video based on the matching result.
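A minimal sketch of the rhythm marking and matching described above, assuming the librosa library for beat tracking; the file name, the video segmentation points, and the nearest-point matching rule are hypothetical:

```python
import librosa
import numpy as np

# Hypothetical inputs: path to the initial audio and the video segmentation
# points (in seconds) derived from the video configuration information.
y, sr = librosa.load("initial_audio.wav")
_, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
audio_points = librosa.frames_to_time(beat_frames, sr=sr)  # rhythm marks

video_points = np.array([3.0, 6.5, 10.0])  # e.g. boundaries between clips

# Match each video segmentation point to the nearest rhythm mark; the audio
# would then be cut at these times and synthesized with the target video.
matched = [audio_points[np.argmin(np.abs(audio_points - t))] for t in video_points]
print(matched)
```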
In some embodiments, the method further includes: post-processing the target video to satisfy at least one video output condition, the at least one video output condition being related to the playback medium of the target video.
In some embodiments, the at least one video output condition includes a video size condition, and the post-processing of the target video includes: cropping the picture of the target video according to the video size condition.
In some embodiments, the cropping of the picture of the target video according to the video size condition includes: obtaining cropping subject information of each video segment contained in the target video, the cropping subject information reflecting the specific cropping subject of the video segment and the position information of the specific cropping subject; and cropping the pictures of each video segment contained in the target video according to the preset picture size corresponding to the video size condition and the cropping subject information.
In some embodiments, the cropping of the pictures of each video segment contained in the target video according to the preset picture size corresponding to the video size condition and the cropping subject information includes: for each video segment contained in the target video, determining the size and initial position of a crop box for at least one video frame of the video segment according to the cropping subject information and the preset picture size; processing the initial position of the crop box of the at least one video frame to determine the final position corresponding to the crop box of the at least one video frame; and cropping the picture of each video frame contained in the video segment according to the final position of the crop box, so as to retain the picture inside the crop box.
In some embodiments, the processing of the initial position of the crop box of the at least one video frame to determine the final position corresponding to the crop box of the at least one video frame includes: smoothing, over time, the initial coordinate information of the reference point of the crop box of the at least one video frame of the video segment; determining the final coordinate information of the reference point according to the result of the smoothing; and determining the position of the reference point based on the final coordinate information.
In some embodiments, the smoothing of the initial coordinate information of the reference point includes: performing linear regression on the initial coordinates of the reference point to obtain a linear regression equation and its slope.
In some embodiments, the determining of the final coordinate information of the reference point according to the result of the smoothing includes: comparing the absolute value of the slope with a slope threshold; in response to the absolute value of the slope being less than the slope threshold, taking the position of the midpoint of the trend line of the linear regression equation as the final position of the reference point of the crop box for every video frame; and in response to the absolute value of the slope being greater than or equal to the slope threshold, taking the position corresponding to each video frame's time point on the trend line of the linear regression equation as the final position of the reference point of the crop box of that video frame.
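The smoothing of the crop-box reference point could, for example, be sketched as follows; the slope threshold, frame rate, and noise model are hypothetical, and np.polyfit stands in for whatever linear-regression routine an implementation actually uses:

```python
import numpy as np

def smooth_reference_points(times, xs, slope_threshold=0.5):
    """Fit x = k*t + b to the reference-point coordinates of one clip and
    return the final x coordinate for each frame."""
    k, b = np.polyfit(times, xs, deg=1)  # least-squares linear regression
    if abs(k) < slope_threshold:
        # Subject is essentially static: pin every frame's crop box to the
        # midpoint of the trend line.
        mid = k * (times[0] + times[-1]) / 2 + b
        return np.full_like(xs, mid, dtype=float)
    # Subject moves: place each frame's crop box on the trend line at its
    # own time point, which removes per-frame jitter.
    return k * np.asarray(times) + b

times = np.arange(10) / 24.0                    # hypothetical frame timestamps
xs = 5 * times + np.random.normal(0, 0.2, 10)   # noisy detected positions
print(smooth_reference_points(times, xs))
```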
In some embodiments, the determining of the size and position of the crop box of at least one video frame in the video segment according to the cropping subject information and the preset picture size includes: determining, according to the theme information of the target video and the cropping subject information, the degree of relevance between one or more specific cropping subjects in the cropping subject information and the theme information; determining, according to the preset picture size and the specific cropping subjects, at least one candidate crop box corresponding to the at least one video frame; scoring the at least one candidate crop box according to the cropping subject information and the degree of relevance; and determining the size and position of the crop box of the at least one video frame based on the scoring results of the candidate crop boxes.
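As an illustrative sketch of scoring candidate crop boxes against theme relevance; the data layout, the overlap-based scoring rule, and all values are hypothetical:

```python
def score_crop_box(box, subjects):
    """Score one candidate crop box.

    box: (x0, y0, x1, y1); subjects: list of dicts with a bounding 'box'
    and a 'relevance' to the video theme (all hypothetical fields).
    """
    def overlap(a, b):
        w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
        h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
        return w * h

    score = 0.0
    for s in subjects:
        sb = s["box"]
        area = (sb[2] - sb[0]) * (sb[3] - sb[1])
        if area > 0:
            # Reward boxes that keep theme-relevant subjects fully inside.
            score += s["relevance"] * overlap(box, sb) / area
    return score

subjects = [{"box": (100, 50, 300, 400), "relevance": 0.9},
            {"box": (500, 60, 620, 300), "relevance": 0.2}]
candidates = [(0, 0, 540, 960), (200, 0, 740, 960)]
best = max(candidates, key=lambda b: score_crop_box(b, subjects))
print(best)  # the higher-scoring box fixes the crop size and position
```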
In some embodiments, the obtaining of the cropping subject information of each video segment contained in the target video includes: obtaining candidate cropping subjects in each video segment using a machine learning model; and selecting the one or more specific cropping subjects from the candidate cropping subjects according to the theme information of the target video.
In some embodiments, the method further includes: delivering the target video to a specific audience group.
In some embodiments, the specific audience group meets specific demographic conditions, and the generating of a target video based on the at least part of the video segments and the video configuration information includes: obtaining the audience acceptance of the multiple video segments; and, for the specific audience group, determining from the multiple video segments, according to the corresponding demographic conditions, candidate segments whose audience acceptance is higher than a threshold, for use in generating the target video.
In some embodiments, the method further includes: obtaining delivery effect feedback of the target video, and adjusting at least one of the demographic conditions or the audience acceptance according to the delivery effect feedback.
In some embodiments, the delivery effect feedback is related to at least one of the completion rate, the number of replays, or the number of viewers of the target video.
In some embodiments, the target video includes a creative advertisement, and the method further includes determining estimated effect data of the target video based on an element effect parameter of at least one advertisement element of the creative advertisement.
In some embodiments, the determining of the estimated effect data of the target video based on the element effect parameter of at least one advertisement element of the creative advertisement includes: obtaining an advertisement element effect prediction model; inputting the advertisement element, marked with at least one element tag, into the advertisement element effect prediction model to determine the element effect parameter of the advertisement element, the at least one element tag including the relationship between the advertisement element and the creative advertisement; determining, based on the element effect parameter of the at least one advertisement element, advertisement elements that meet expectations among the at least one advertisement element, the element effect parameter of an advertisement element that meets expectations being greater than a parameter threshold; determining the proportion of the advertisement elements that meet expectations among the at least one advertisement element of the creative advertisement; and determining the estimated effect data of the target video based on the proportion.
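The final two steps, computing the proportion of advertisement elements that meet expectations and mapping it to estimated effect data, might be sketched as follows; the threshold, the scores, and the use of the raw proportion as the estimate are hypothetical simplifications:

```python
def estimate_creative_effect(element_params, threshold=0.6):
    """element_params: predicted effect parameter per ad element (hypothetical
    output of the element-effect model). Returns the share of elements that
    meet expectations, used here as the creative's estimated effect data."""
    good = [p for p in element_params if p > threshold]
    return len(good) / len(element_params)

# Hypothetical scores for five tagged advertisement elements of one creative.
print(estimate_creative_effect([0.8, 0.7, 0.4, 0.9, 0.5]))  # 0.6
```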
Another aspect of this specification provides a system for generating a video. The system includes: an acquisition module configured to obtain multiple video segments; a configuration module configured to obtain video configuration information, the video configuration information including one or more configuration features of at least part of the multiple video segments, the configuration features including at least one of content features or arrangement features; and a generation module configured to generate a target video based on the at least part of the video segments and the video configuration information.
Another aspect of this specification provides a computer-readable storage medium storing computer instructions; after a computer reads the computer instructions in the storage medium, the computer executes the method described above.
Additional features will be set forth in part in the description that follows, and in part will become apparent to those skilled in the art upon examination of the following description and the accompanying drawings, or may be learned by the production or operation of the examples. The features of the present application may be realized and attained by practicing or using the various aspects of the methods, means, and combinations set forth in the detailed examples discussed below.
Description of the Drawings
This specification will be further described by way of exemplary embodiments, which are described in detail with reference to the accompanying drawings. These embodiments are not restrictive; in these embodiments, the same reference numerals denote the same structures, wherein:
Fig. 1 is a schematic diagram of a scene of a system for generating videos according to some embodiments of this specification;
Fig. 2 is an exemplary flowchart of generating a video according to some embodiments of the present application;
Fig. 3 is an exemplary flowchart of a method for determining video segments according to some embodiments of the present application;
Fig. 4 is an exemplary flowchart of a method for editing an initial image or initial video according to some embodiments of the present application;
Fig. 5 is an exemplary flowchart of a method for editing an initial image or initial video according to other embodiments of the present application;
Fig. 6 is an exemplary flowchart of a method for generating a video according to other embodiments of the present application;
Fig. 7 is an exemplary flowchart of a method for generating a video according to other embodiments of the present application;
Fig. 8 is an exemplary flowchart of a method for generating a target video according to some embodiments of the present application;
Fig. 9 is an exemplary flowchart of a method for determining segment sets according to some embodiments of the present application;
Fig. 10 is an exemplary flowchart of a method for determining the combination difference degree according to some embodiments of the present application;
Fig. 11 is an exemplary flowchart of another method for determining the combination difference degree according to some embodiments of the present application;
Fig. 12 is an exemplary flowchart of a method for generating a video according to other embodiments of the present application;
Fig. 13 is an exemplary flowchart of adding audio material according to some embodiments of the present application;
Fig. 14 is a diagram of an application scenario of a picture cropping method according to some embodiments of the present application;
Fig. 15 is a schematic diagram of a smoothing method according to some embodiments of the present application;
Fig. 16 is a flowchart of a method for determining the size and position of the crop box of each video frame according to some embodiments of the present application;
Fig. 17 is a flowchart of a method for generating a target video based on an audience according to some embodiments of the present application;
Fig. 18 is a flowchart of confirming the estimated effect data of a target video according to some embodiments of the present application;
Fig. 19 is an exemplary flowchart of a model training process according to some embodiments of the present application; and
Figs. 20A-20E are schematic diagrams of a video synthesis system according to some embodiments of the present application.
Detailed Description
To describe the technical solutions of the embodiments of this specification more clearly, the following briefly introduces the accompanying drawings used in the description of the embodiments. Obviously, the drawings in the following description are only some examples or embodiments of this specification; those of ordinary skill in the art can also apply this specification to other similar scenarios based on these drawings without creative effort. Unless obvious from the context or otherwise stated, the same reference numerals in the figures denote the same structure or operation.
It should be understood that "system", "device", "unit", and/or "module" as used herein is a way to distinguish different components, elements, parts, portions, or assemblies at different levels. However, if other words can achieve the same purpose, those words may be replaced by other expressions.
As used in this specification and the claims, unless the context clearly indicates otherwise, the words "a", "an", and/or "the" do not specifically refer to the singular and may also include the plural. In general, the terms "include" and "comprise" only indicate that the clearly identified steps and elements are included; these steps and elements do not constitute an exclusive list, and the method or device may also include other steps or elements.
Although this specification makes various references to certain modules or units in the systems according to its embodiments, any number of different modules or units may be used and run on a client and/or a server. The modules are merely illustrative, and different aspects of the systems and methods may use different modules.
Flowcharts are used in this specification to illustrate the operations performed by the systems according to its embodiments. It should be understood that the preceding or following operations are not necessarily performed exactly in order. Instead, the steps may be processed in reverse order or simultaneously. Other operations may also be added to these processes, or one or more operations may be removed from them.
Industries such as the Internet and content creation (e.g., media, advertising) need to generate large numbers of videos of various kinds in their daily work. Manually screening and cropping assorted materials and then splicing and rendering them with software is inefficient and labor-intensive. As multimedia materials multiply and a single video contains more and more video elements, the screening and processing become increasingly difficult, further reducing efficiency.
Some embodiments of the present application propose a multimedia system. The multimedia system can obtain multiple video segments and video configuration information. The video configuration information may be generated based on script information and/or a video template, and may be used to determine one or more configuration features of at least part of the multiple video segments. The configuration features include at least one of content features or arrangement features. The content features may include the video segment or the finally generated video (also called the target video) containing a specific subject (object), a specific theme, specific shot pictures, specific audio, and the like. The arrangement features may include the size of the video segment, the layout of the target object in the video segment, the duration of the video segment, the specific position of the video segment in the target video, the visual effect features of the video segment, and the like. The multimedia system may generate a target video based on the at least part of the video segments and the video configuration information. The multimedia system can perform the processing automatically and generate target videos, which is efficient and saves labor costs.
Fig. 1 is a schematic diagram of a scene of a multimedia system according to some embodiments of the present application.
The multimedia system 100 can be used for media, advertising, the Internet, and the like, and can quickly generate targeted videos for delivery. The multimedia system 100 may include a server 110, a network 120, a terminal device 130, a database 140, and other data sources 150.
The server 110 and the terminal device 130 may be connected through the network 120 or connected directly; the database 140 may be connected to the server 110 through the network 120, connected directly to the server 110, or located inside the server 110. The database 140 and the other data sources 150 may be connected to the network 120 to communicate with one or more components of the multimedia system 100. One or more components of the multimedia system 100 may access data or instructions stored in the terminal device 130, the database 140, and the other data sources 150 through the network 120.
The components of the multimedia system 100 may be integrated in the same device, in which case the above communication relationships may be realized through the device's internal bus. Alternatively, at least some components of the multimedia system 100 may be integrated in the same device, and the devices may be connected through their communication ports so that the components of the multimedia system 100 are communicatively connected, thereby realizing the above communication relationships.
The server 110 may be used to manage resources and to process data and/or information from at least one component of the system or from an external data source (e.g., a cloud data center). In some embodiments, the server 110 may be a single server or a server group. The server group may be centralized or distributed (e.g., the server 110 may be a distributed system), may be dedicated, or may provide services jointly with other devices or systems. In some embodiments, the server 110 may be regional or remote. In some embodiments, the server 110 may be implemented on a cloud platform or provided virtually. Merely by way of example, the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an internal cloud, a multi-tier cloud, or the like, or any combination thereof.
In some embodiments, the server 110 may include a processing device 112. The processing device 112 may process data and/or information obtained from other devices or system components, and may execute program instructions based on these data, information, and/or processing results to perform one or more of the functions described in this application. In some embodiments, the processing device 112 may include one or more sub-processing devices (e.g., single-core or multi-core processing devices). Merely by way of example, the processing device 112 may include a central processing unit (CPU), an application-specific integrated circuit (ASIC), an application-specific instruction-set processor (ASIP), a graphics processing unit (GPU), a physics processing unit (PPU), a digital signal processor (DSP), a field-programmable gate array (FPGA), a programmable logic device (PLD), a controller, a microcontroller unit, a reduced instruction set computer (RISC), a microprocessor, or the like, or any combination thereof.
The network 120 may connect the components of the system and/or connect the system with external resources. The network 120 enables communication between the components, and between the system and other parts outside it, facilitating the exchange of data and/or information. In some embodiments, the network 120 may be any one or more of a wired network or a wireless network. For example, the network 120 may include a cable network, a fiber-optic network, a telecommunications network, the Internet, a local area network (LAN), a wide area network (WAN), a wireless local area network (WLAN), a metropolitan area network (MAN), a public switched telephone network (PSTN), a Bluetooth network, a ZigBee network, near field communication (NFC), an in-device bus, in-device wiring, a cable connection, or the like, or any combination thereof. The network connection between the parts may use one of the above ways or several of them. In some embodiments, the network may have various topologies, such as point-to-point, shared, or centralized, or a combination of multiple topologies. In some embodiments, the network 120 may include one or more network access points. For example, the network 120 may include wired or wireless network access points, such as base stations and/or network switching points 120-1, 120-2, ..., through which one or more components of the system may connect to the network 120 to exchange data and/or information.
The terminal device 130 refers to one or more terminal devices or software used for data query and/or multimedia display. In some embodiments, the terminal device 130 may be used by one or more users, including users who use the service directly as well as other related users. In some embodiments, the terminal device 130 may be one of a mobile device 130-1, a tablet computer 130-2, a laptop computer 130-3, or another device with input and/or output functions, or any combination thereof.
In some embodiments, the terminal device 130 may also include a user terminal that can be used to input and/or obtain data or information. In some embodiments, a user may generate or obtain an original video or original image through the user terminal. For example, the user may use the camera of the user terminal to record footage or take photos and store them as the original video or original image, or may download the original video from video software through the user terminal. In some embodiments, the user may input constraints on the target video (e.g., video configuration information) through the user terminal. In some embodiments, the user may obtain or browse the synthesized target video through the user terminal.
The database 140 may be used to store data and/or instructions. The database 140 may be implemented in a single central server, in multiple servers connected by communication links, or in multiple personal devices. In some embodiments, the database 140 may include mass storage, removable storage, volatile read-write memory (e.g., random access memory, RAM), read-only memory (ROM), or the like, or any combination thereof. Exemplarily, the mass storage may include magnetic disks, optical disks, solid-state drives, and the like. In some embodiments, the database 140 may be implemented on a cloud platform.
The other data sources 150 may be one or more sources used to provide other information for the system. The other data sources 150 may be one or more devices, such as camera devices that directly capture initial images or initial videos; one or more application program interfaces; one or more database query interfaces; one or more protocol-based information acquisition interfaces; other ways of obtaining information; or a combination of two or more of the above. The information provided by an information source may already exist when the information is extracted, may be generated on the fly when the information is extracted, or may be a combination of the above. In some embodiments, the other data sources 150 may be used to provide the system with multimedia information such as pictures, videos, and music.
Fig. 2 is an exemplary flowchart of generating a video according to some embodiments of the present application. In some embodiments, one or more steps of the process 200 may be stored as instructions in a storage device (e.g., the database 140) and invoked and/or executed by a processing device (e.g., the processing device 112 in the server 110). As shown in Fig. 2, the process 200 may include:
Step 210: obtain multiple video segments.
A video segment may refer to a piece of video composed of video frames. Each video segment may be a subsequence of the image sequence constituting a video. A video segment may be a short video of, for example, 3, 4, or 5 seconds. A video frame can be understood as one of the frame images obtained by decomposing a continuous image stream at fixed time intervals. For example, the time interval between frames may illustratively be set to 1/24 s (i.e., 24 frames are obtained per second).
In some embodiments, a video segment may be or include one or more shot segments. A shot segment may refer to a continuous sequence of video frames between two editing points. For example, a shot segment may be the sum of the footage captured without interruption from the moment a camera starts until it stops. Illustratively, if the first picture in a video file is a seaside, the picture then switches to a girl drinking yogurt, and then switches to the girl surfing at sea, then the girl drinking yogurt is one shot segment, the preceding seaside footage is another shot segment, and the following footage of the girl surfing is yet another shot segment. For clearer understanding, the embodiments of this specification are described by taking a video segment containing one shot picture as an example.
In some embodiments, the database 140 and/or the other data sources 150 may store the multiple video segments, and step 210 may be implemented by obtaining the multiple video segments directly from the database 140 and/or the other data sources 150.
In some embodiments, the database 140 and/or the other data sources 150 may store unprocessed initial data. The initial data may include initial images (also called images to be processed) and/or initial videos (also called videos to be processed). Step 210 may obtain the multiple video segments by processing the initial data (e.g., determining editing points and editing). Illustratively, video segments may be generated based on the shot segments contained in a video file (e.g., an initial video and/or initial image); for example, if a video file contains five shot segments, five video segments may be generated. In some embodiments, the video file may be split manually or by machine to generate multiple video segments. For example, a user may edit manually based on the number of shot pictures in the video file, or a trained machine learning model may split the video file into multiple video segments according to preset conditions (e.g., the number of shot segments, duration, etc.). In some alternative embodiments, the processing device 112 may also obtain multiple video segments intercepted from the video file based on a time window; this specification does not limit the means of splitting video segments. For methods of determining the multiple video segments, reference may be made to Figs. 3-7 and their related descriptions.
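By way of illustration, machine splitting at shot boundaries could be sketched with OpenCV as below; the histogram-correlation criterion and threshold are assumptions for this example, not the application's prescribed method:

```python
import cv2

def split_at_shot_boundaries(path, threshold=0.6):
    """Return frame indices where the shot appears to change, based on a
    drop in histogram correlation between consecutive frames."""
    cap = cv2.VideoCapture(path)
    cuts, prev_hist, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [50, 60], [0, 180, 0, 256])
        cv2.normalize(hist, hist)
        if prev_hist is not None:
            if cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL) < threshold:
                cuts.append(idx)  # low correlation -> likely an editing point
        prev_hist, idx = hist, idx + 1
    cap.release()
    return cuts

print(split_at_shot_boundaries("initial_video.mp4"))  # hypothetical file
```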
Step 220: obtain video configuration information.
The video configuration information refers to information related to the configuration of the finally generated video (also called the target video) and of the video segments constituting the target video. The video configuration information may reflect requirements on the content or form of the target video and of the video segments constituting it. The video segments constituting the target video may be at least part of the multiple video segments. In some embodiments, the video configuration information may include one or more configuration features of each video segment constituting the target video (i.e., the at least part of the video segments).
In some embodiments, the configuration features may include content features and/or arrangement features. The content features are related to the content of the at least part of the video segments. For example, the content features may include the video segment or the target video containing a specific subject (object), a specific theme, specific shot pictures, specific audio, and the like. The arrangement features are related to the presentation form of the at least part of the video segments. For example, the arrangement features may include the size of the video segment, the duration of the video segment, the specific position of the video segment in the target video, the visual effect features of the video segment, and the like.
The specific subject may also be called a specific object. For example, the specific subject may be a specific object contained in each video segment, or a specific object, among multiple objects, that is related to the specific theme of the target video. Exemplarily, the specific subject may be one or more of a product (electronics, daily necessities, decorations, etc.), a living thing (a person, an animal, etc.), a sign (e.g., a trademark, a regional mark, etc.), or a landscape (a mountain, a house, etc.).
The specific theme may be the main content of the video segment. The theme may be determined from keyword information in the title or synopsis of the video segment, tag information, user-defined information, or information already stored in a database. The specific theme may consist of specific content and a video type; for example, in a perfume advertisement, perfume is the specific content and advertisement is the video type. The specific content may also include specific activities (e.g., a live stream, a gala, etc.) and specific dates (e.g., Valentine's Day, Children's Day, Double Eleven, etc.). In some embodiments, the specific theme may be user-defined or selected by the user from a list. For example, the user directly enters the specific theme of the target video as "advertisement for car interiors". As another example, the user may first select the target video to be an advertising video and then select "toiletries" -> "shampoo" from a product category list. In some embodiments, the specific theme may also be identified and determined by a model. For example, when a user uploads videos (including video segments, initial videos, initial images, and other video materials used to generate the target video) without specifying or selecting a specific theme, all objects in the image of each video frame may be obtained based on image recognition technologies such as object detection, and the object that appears most frequently or carries the greatest weight is taken as the default specific theme of the advertising video. For example, if the image of a car tire occupies the largest area in the pictures of the video frames, the default theme of the target video may be set to "car" or "car tire".
The specific shot picture refers to a video frame or sequence of video frames containing a specific picture. For example, specific shot pictures may include a child drinking milk, a model applying cosmetics, people playing volleyball on a beach, or someone browsing a mobile shopping page during the Double Eleven shopping festival. In some embodiments, the specific shot picture may be related to a specific subject or a specific theme. For example, when the specific theme is tooth protection, the promotional video may include a shot of the specific action "brushing teeth". As another example, when the specific subject is perfume, the advertising video may include a shot of the specific action "spraying perfume".
The specific audio may contain a specific sound. For example, the specific audio may contain dialogue, monologue, theme music, background music, or other specific sounds (e.g., wind, rain, birdsong, braking, impacts, etc.). The theme music and/or background music may be of different types, such as soothing, brisk, focused, or aggressive.
The video segment size may be the width and height of the video frames in the video segment. For example, the size of the video frames in a video segment may be 512×384, 1024×768, 1024×1024, 1920×1080, 2560×2560, etc. In some embodiments, the video segment size may also be the aspect ratio of the video frames in the video segment, for example, 9:16, 1:1, 5:4, 4:3, or 16:9.
The duration of a video segment refers to the length of time required to play the video segment, for example, 3 seconds, 5 seconds, 30 seconds, or 2 minutes.
The specific position of a video segment in the target video may refer to the specific range the video segment occupies in the target video. In some embodiments, the specific range may be expressed in time; for example, a video segment may be located between 2 min 15 s and 2 min 30 s of the target video. In some embodiments, the specific range may also be expressed as a range of video frames; for example, a video segment may occupy frames 1 to 50 of the target video. In some embodiments, the specific range may also be expressed relative to other video segments; for example, a video segment may be located between the third and fifth video segments of the target video.
The visual effect features of a video segment may be used to describe operations performed on the video segment that relate to its visual effect. The operations may include beautification, normalization, template decoration, and the like. Beautification may include filters, animations, and other operations that enhance the video's appearance.
In some embodiments, the video configuration information may be determined by the user, according to system default settings, or the like. In some embodiments, the database 140 and/or the other data sources 150 may store the video configuration information, and step 220 may be implemented by obtaining the video configuration information directly from the database 140 and/or the other data sources 150; for example, after the specific theme/subject of the target video is determined, the corresponding video configuration information is obtained from the database 140 and/or the other data sources 150 according to that specific theme/subject.
In some embodiments, the video configuration information may be determined based on script information and/or a video template. For example, the multimedia system 100 determines the video configuration information by parsing the script information and/or the video template. In some embodiments, the script information and/or video template may be pre-stored in the database 140 and/or the other data sources 150, and the corresponding script information and/or video template may be automatically obtained or determined from the database 140 and/or the other data sources 150 after the user inputs information related to the target video. For example, after the user inputs a picture of a specific perfume, the system 100 may automatically recognize the specific theme "perfume advertisement" of the target video and call the script information and/or video template related to perfume advertisements from the database 140 and/or the other data sources 150.
The script information refers to information that determines, for each video segment and/or video frame in a video (e.g., the target video), the picture content (e.g., a specific subject, a specific background, a specific action, etc., or a combination thereof), the picture duration, the shot scale (e.g., panorama, close shot, medium shot, close-up, etc.), shot transitions, appearance times, and the like. Script information can define the content and/or arrangement of video segments. For example, for a target video of the advertisement type, the script information may specify that the target video includes the following video segments: (1) an audience-interaction segment, (2) a usage-experience segment, (3) a product-selling-point segment, (4) a usage-method segment, (5) an efficacy segment, and (6) a call-to-action segment. By parsing the script information, the requirements on each element (e.g., the target video containing a specific object, the target video containing a specific theme, the total duration of the target video, the number of shot pictures contained in the target video, the specific shot pictures contained in the target video, the number of overlaps of a specific theme in the target video, or the focus time of a specific theme in the target video) can be determined.
In some embodiments, the script information may also include the order in which the shots are arranged. For example, in the script information of an advertisement-type target video, the included video segments may be arranged in a preset order, such as the order (1) to (6) above. The target video may be generated based on this order.
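As an illustrative sketch only, script information of this kind could be represented as the following data structure; the field names, slot contents, and durations are hypothetical:

```python
# Hypothetical script information for an advertisement-type target video:
# each entry fixes a slot's content requirement, duration bound, and order.
script_info = [
    {"order": 1, "content": "audience interaction",  "max_seconds": 5},
    {"order": 2, "content": "usage experience",      "max_seconds": 8},
    {"order": 3, "content": "product selling point", "max_seconds": 8},
    {"order": 4, "content": "usage method",          "max_seconds": 10},
    {"order": 5, "content": "efficacy",              "max_seconds": 8},
    {"order": 6, "content": "call to action",        "max_seconds": 5},
]

# Parsing the script yields configuration features: required shot count,
# a total duration bound, and the sequence in which segments are assembled.
total = sum(slot["max_seconds"] for slot in script_info)
print(len(script_info), "slots, at most", total, "seconds")
```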
The script information may be of different types. In some embodiments, the script information may be general-purpose script information, applicable to different subjects (e.g., products) or themes. For example, general-purpose script information may in turn comprise audience interaction, usage experience, product selling points, usage method, efficacy, and a call to action; or a specific theme, product selling points, usage method, usage experience, efficacy, and product cost performance; or applicable scenarios/audiences, product selling points, efficacy, product cost performance, and a call to action, among other sequences of video segments or shots.
In some embodiments, the script information may be related to a subject (e.g., a product) or theme. For example, for beauty products, the corresponding script information may in turn include a specific theme, efficacy, appearance design, product ingredients, and a call to action; or audience interaction, product/brand recommendation, applicable scenarios/audiences, efficacy 1, efficacy 2, product ingredients, and audience interaction; or target-audience pain point 1, target-audience pain point 2, product/brand recommendation, product ingredients, efficacy, a specific theme, usage experience, and a specific theme, among other sequences of video segments or shots. As another example, for food, the corresponding script information may in turn include eating experience, food attributes, cooking method/process, product ingredients, and eating experience; or product recommendation, cooking method/process, eating experience, brand promotion, and product utility; or brand promotion, product recommendation, packaging design, cooking method/process, eating experience, and a call to action. As another example, for mother-and-baby products, the corresponding script information may in turn include a specific theme, appearance introduction, efficacy, product attribute 1, production process, product attribute 2, a call to action, and appearance introduction; or product cost performance, product/brand recommendation, applicable scenarios/audiences, product texture, and product/brand recommendation; or audience interaction, product attributes, efficacy, and product/brand recommendation, among other sequences of video segments or shots.
The video template may refer to the formal packaging of a video. The video template may include an opening, an ending, a watermark, subtitles, a title, borders, filters, and the like. The opening/ending refers to audio-visual material added at the beginning and end of the video, respectively, to create atmosphere, build momentum, attract attention, and present the work's name, the creator, product information, and the like. A watermark refers to a pattern attached to the video that reflects company or product information or a personalized design. Subtitles refer to non-image content, such as dialogue or product/theme introductions, displayed in the video as text. A title is a short statement indicating the content of the video. A border refers to one or more patterns of a specific shape (e.g., strips) surrounding the video page. A filter refers to an operation used to achieve a special display effect on the image.
In some embodiments, the video template may be a template asset in Adobe After Effects (AE), software commonly used in the field of video production, which will not be elaborated here.
In some embodiments, the video template may be related to a subject (e.g., a product, a model, etc.), a theme (e.g., public welfare, entertainment, education, lifestyle, shopping, etc.), a video effect (e.g., a promotional/advertising effect), and the like. In some embodiments, the database 140 and/or the other data sources 150 may be preset with multiple video templates corresponding to different subjects, themes, video effects, etc.; after the specific theme, specific subject, and video effect of the target video are determined, the corresponding video template is called from the database 140 and/or the other data sources 150. In some embodiments, the user determines the video template by presetting its opening, ending, watermark, subtitles, title, borders, and/or filters based on different subjects, themes, and/or video effects.
In some embodiments, the video configuration information may include at least one preset condition (e.g., a first preset condition, a second preset condition, etc.) related to the content of the video segments contained in the target video. The first preset condition may be related to at least one of multiple elements of the content features among the configuration features. The multiple elements may include the target video containing a specific object, the target video containing a specific theme, the total duration of the target video, the number of shot pictures contained in the target video, the specific shot pictures contained in the target video, the number of overlaps of a specific theme in the target video, or the focus time of a specific theme in the target video, etc. The first preset condition may screen the multiple video segments by constraining at least one of the multiple elements. For example, the first preset condition may require that the last shot of the target video be a product display shot, so that a video segment containing a product display shot is selected from the multiple video segments as the last video segment of the target video. For details of the first preset condition, reference may be made to the process 800 and its related description.
The second preset condition may be related to a difference feature of the content features among the configuration features. For example, the second preset condition may be related to the degree of difference between segment sets. A segment set refers to one of one or more sets formed by grouping video segments that satisfy particular conditions (e.g., video segments satisfying the first preset condition, also called candidate video segments). The second preset condition may screen the segment sets by constraining the degree of difference between them, and the target video may be generated based on the screened segment sets. For details of the second preset condition, reference may be made to the process 900 and its related description.
In some embodiments, the video configuration information includes sequence information. The sequence information may be related to the arrangement features among the configuration features; that is, the sequence information may determine the arrangement of the video segments in the target video. For example, when generating the target video, the candidate video segments in a segment set may be sorted based on the sequence information to generate the target video. For details of the sequence information, reference may be made to the process 800 and its related description.
In some embodiments, the video configuration information includes beautification parameters. The beautification parameters may be related to the arrangement features among the configuration features; the target video is beautified through the beautification parameters to obtain a better visual effect. For example, the beautification parameters may include at least one of filter parameters, animation parameters, or layout parameters. The beautification parameters may be used to beautify at least part of the multiple video segments. In some embodiments, the beautification parameters may also be applied directly to the target video, the initial images, the initial videos, and the like.
Step 230: generate a target video based on at least some of the video segments and the video configuration information.
The multimedia system 100 may determine at least one segment set from the multiple video segments based on the video configuration information to generate the target video. For example, the multimedia system 100 may obtain one or more candidate video segments from the multiple video segments based on the first preset condition, group the one or more candidate video segments to determine at least one segment set satisfying the second preset condition, and, for each of the at least one segment set, generate a target video according to the corresponding sequence information. As another example, the multimedia system 100 may generate multiple candidate segment sets based on the multiple video segments, the multiple candidate segment sets satisfying the second preset condition; screen out at least one segment set from the multiple candidate segment sets based on the first preset condition; and, for each of the at least one target segment set, generate a target video according to the corresponding sequence information. The type of the target video may include, but is not limited to, an advertising video, a promotional video, a video web log (vlog), a short entertainment video, or the like. For details of generating the target video based on the first preset condition and the second preset condition, refer to process 800 or 1200 and their related descriptions.
The multimedia system 100 may beautify at least some of the multiple video segments or the target video based on the beautification parameters to obtain a better visual effect. In some embodiments, the multimedia system 100 may also perform conventional video processing, such as cropping, scaling, and template-based editing, on at least some of the multiple video segments or the target video.
In some embodiments, the multimedia system 100 may further post-process the target video to satisfy at least one video output condition. The at least one video output condition is related to the playback medium of the target video. For example, the at least one video output condition may be a video size condition, which may be determined based on the size of the video playback medium. Post-processing the target video may include cropping the frames of the target video according to the video size condition. For details of the post-processing of the target video, refer to FIGS. 14-16 and their related descriptions.
FIG. 3 is an exemplary flowchart of a method for determining video segments from an initial image or an initial video according to some embodiments of the present application. In some embodiments, one or more steps of process 300 may be stored as instructions in a storage device (e.g., the database 140) and invoked and/or executed by a processing device (e.g., the processing device 112 in the server 110). As shown in FIG. 3, process 300 may include:
Step 310: obtain at least one of an initial image or an initial video.
The initial video may refer to a moving picture composed of a series of video frames. For example, the initial video may include video files in various formats such as MPEG, AVI, ASF, MOV, 3GP, WMV, DivX, XviD, RM, RMVB, and FLV/F4V. In some embodiments, the initial video may also contain an audio file (audio track) corresponding to the moving picture. In some embodiments, the initial video may include promotional videos, personally recorded videos, audiovisual footage, online videos, advertisement clips, product demos, movies, and short films or movies containing relevant products or models.
The initial image may refer to a still picture. For example, the initial image may include picture files in various formats such as bmp, jpg, png, tif, gif, pcx, tga, exif, fpx, svg, psd, cdr, pcd, dxf, ai, and raw. In some embodiments, the initial image may include photos taken by a camera, advertising images, product renderings, posters, and the like.
In some embodiments, the initial image and the initial video may be obtained by a camera or a video/image processing device and stored in the database 140 and/or other data sources 150; specifically, step 310 may be implemented by retrieving the corresponding initial image and/or initial video from the database 140 and/or other data sources 150.
In some embodiments, the initial video and the initial image may be public network material resources, such as image and video resources in various open databases, and step 310 may be implemented by obtaining public materials. In some embodiments, the multimedia system 100 may also obtain the initial video and the initial image in other direct or indirect ways. For example, the multimedia system 100 may directly obtain a video file or an image file uploaded by a user, or obtain a video file or an image file based on a link input by the user.
Step 320: edit the initial image or the initial video to obtain multiple video segments.
In some embodiments, step 320 may be implemented by determining segment boundaries of the initial video or the initial image, and segmenting or grouping the initial video or the initial image based on the segment boundaries to determine multiple shot segments, which are then used as the multiple video segments. For example, if the initial video is a cooking video, the cooking steps, the plating steps, and the tasting steps in the initial video may be split into different shot segments, and each shot segment may be used as one of the multiple video segments. For details of determining video segments based on the initial video, refer to FIG. 4 of the present application and its related description.
In some embodiments, to improve the accuracy of the configuration feature description of the video segments, each video segment may be one shot segment. The initial video or the initial image often contains multiple shot segments; therefore, the initial image or the initial video needs to be split. It can be understood that a video segment may also contain multiple shot segments. In some embodiments, for a video segment containing multiple shot segments, the initial video or the initial image may be split directly according to the number of shot segments. In some embodiments, a video segment may first be split into multiple video segments each containing only one shot segment, and then multiple video segments may be spliced into one video segment according to constraint conditions of the video segments (e.g., binding conditions).
In some embodiments, an initial video may include only one shot segment; the entire initial video is then treated as one shot segment and output as one video segment. In some embodiments, an initial video may be formed by splicing multiple shot segments, and the one or more consecutive video frames at the junction of two adjacent shot segments may be called a segment boundary (also called shot boundary frames). In some embodiments, the initial video may be divided in units of shot segments to obtain the video segments. When dividing the initial video into multiple shot segments, the division may be performed at the segment boundaries; that is, the segment boundaries are used as cutting points to split the initial video into multiple video segments.
FIG. 4 is an exemplary flowchart of a method for editing an initial image or an initial video according to some embodiments of the present application, which specifically involves splitting the initial image or the initial video. In some embodiments, one or more steps of process 400 may be stored as instructions in a storage device (e.g., the database 140) and invoked and/or executed by a processing device (e.g., the processing device 112 in the server 110). As shown in FIG. 4, process 400 may specifically include the following steps:
Step 410: obtain the features of each pair of adjacent images or video frames in the initial image or the initial video.
In some embodiments, an image embedding model may be used to obtain the feature information of each pair of adjacent images or video frames (i.e., pairs among the multiple video frames) in the initial image or the initial video (e.g., an advertising video). In some embodiments, the processing method for the initial image is similar to that for the initial video; for ease of understanding, the following description takes the initial video as an example, which does not limit the scope of the present application. The multimedia system 100 may input the initial video into the image embedding model. The image embedding model may extract the image of each video frame that constitutes the initial video, extract the features of the image of each video frame, and generate a vector corresponding to the image of each video frame. In some embodiments, images of video frames that have already been extracted may be input into the image embedding model, and the image embedding model may output a vector for the image of each video frame accordingly.
In other embodiments, the feature information of the video frames may be obtained based on a mobilenet model (e.g., a mobilenetV2 model) pre-trained on the imagenet picture library. The mobilenetV2 model can extract the image features of each video frame accurately and quickly. For example, each video frame may be input into the mobilenetV2 model, which may output a normalized 1280-dimensional vector for each video frame. In some embodiments, other machine learning models with similar functions, such as the GoogLeNet model, the VGG model, and the ResNet model, may also be used to obtain the feature information of the video frames, which is not limited in the present application. By using a machine learning model to extract the features of the video frames, the shot boundary frames can be determined more accurately, so that the shot segments are divided accurately, which makes subsequent frame cropping of the initial image or the initial video easier to operate and avoids cropping out the main information of the initial image or the initial video.
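By way of illustration only, the following Python sketch shows per-frame feature extraction with an ImageNet-pretrained MobileNetV2 as described above. The use of OpenCV for frame decoding, torchvision for the model, and all helper names are assumptions of this sketch, not requirements of the embodiments.

```python
# Illustrative sketch: one normalized 1280-dim feature vector per video frame,
# using an ImageNet-pretrained MobileNetV2 (torchvision) and OpenCV decoding.
import cv2
import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms as T

model = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.IMAGENET1K_V1)
model.eval()

preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def frame_features(video_path: str) -> np.ndarray:
    """Return a (num_frames, 1280) array of L2-normalized frame features."""
    cap = cv2.VideoCapture(video_path)
    feats = []
    with torch.no_grad():
        while True:
            ok, bgr = cap.read()
            if not ok:
                break
            rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)
            x = preprocess(rgb).unsqueeze(0)      # (1, 3, 224, 224)
            f = model.features(x)                 # (1, 1280, 7, 7)
            f = f.mean(dim=(2, 3)).squeeze(0)     # global average pool -> (1280,)
            feats.append((f / f.norm()).numpy())  # normalize: dot product = cosine
    cap.release()
    return np.stack(feats)
```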
Step 420: determine the similarity of each pair of adjacent images or video frames.
In some embodiments, this may be implemented by calculating, according to the feature information of the video frames, the similarity between each video frame and a video frame preselected from the multiple video frames.
In some embodiments, the inner product of the feature vectors of two video frames may be used as the similarity between the two video frames. In some embodiments, calculating the similarity between adjacent images (e.g., between a video frame and a video frame preselected from the multiple video frames) may be calculating the similarity between each video frame and its directly preceding and/or succeeding video frame, or calculating the similarity between each video frame and the video frame a preset number of interval frames before and/or after it.
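Assuming the normalized per-frame features from the sketch above, the similarity of this step reduces to a single dot product; the helper name below is illustrative:

```python
def pair_similarity(feats, i, j):
    """Similarity of frames i and j as the inner product of their
    (already L2-normalized) feature vectors, i.e. cosine similarity."""
    return float(feats[i] @ feats[j])
```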
Step 430: identify segment boundaries based on the similarity of each pair of adjacent images or video frames.
Exemplarily, the segment boundaries may include hard-cut boundary frames or soft-cut boundary frames.
In some embodiments, identifying segment boundaries may include determining the hard-cut boundary frames of shot segments. If no transition effect is used between two adjacent shot segments, the adjacent video frames of the two shot segments jump directly, and these two adjacent video frames can be understood as hard-cut boundary frames. In the process of determining hard-cut boundary frames, the similarity between each video frame and its preceding and/or succeeding adjacent video frame may be calculated; if the similarity between two adjacent video frames is lower than a similarity threshold, the two adjacent video frames are determined to be hard-cut boundary frames.
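A minimal reading of hard-cut detection under the features above might look as follows, reusing pair_similarity from the sketch; the threshold value is an assumption for illustration:

```python
def hard_cut_boundaries(feats, threshold=0.8):
    """Indices i such that frames i and i+1 form a hard-cut boundary pair
    (similarity of directly adjacent frames below the threshold)."""
    return [i for i in range(len(feats) - 1)
            if pair_similarity(feats, i, i + 1) < threshold]
```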
In some embodiments, identifying segment boundaries may include determining the soft-cut boundary frames of shot segments. If a transition effect is used between two adjacent shot segments, the adjacent video frames of the two shot segments do not jump directly, and the several video frames used for the transition between the two shot segments can be understood as soft-cut boundary frames. The soft-cut boundary frames may be determined by the following method:
First, candidate segmentation regions may be determined by calculating the similarity between each video frame and the video frame a preset number of interval frames before and/or after it. In the process of determining candidate segmentation regions, the preset number of interval frames may be set to 2, 3, 5, or the like. If the similarity between two video frames is calculated to be less than a preset threshold, the video frames between these two video frames are taken as a candidate segmentation region, and the two video frames serve as the boundary frames of the candidate segmentation region. For example, if the preset number of interval frames is 2, the similarity between frame 10 and frame 14 may be calculated; if the similarity is less than the similarity threshold, frames 12 and 13 are taken as a candidate segmentation region, and frames 10 and 14 are the boundary frames of the candidate segmentation region. Then, the candidate segmentation regions may be further fused, that is, overlapping candidate segmentation regions are merged together. For example, if frames 12 and 13 form a candidate segmentation region and frames 13 and 14 also form a candidate segmentation region, frames 12, 13, and 14 are merged into one candidate segmentation region.
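The candidate-region step can be sketched as follows, continuing to reuse pair_similarity. The exact indexing of the region between the compared pair is one illustrative reading of the example above (whose frame counting admits more than one interpretation), and the gap and threshold values are assumptions:

```python
def candidate_regions(feats, gap=2, threshold=0.8):
    """Candidate soft-cut regions: for each frame pair (i, i+gap) whose
    similarity is below the threshold, the frames strictly between them form
    a candidate region; overlapping or touching regions are then merged
    (mirroring the frame 12/13/14 merging example above)."""
    regions = []
    for i in range(len(feats) - gap):
        if pair_similarity(feats, i, i + gap) < threshold:
            start, end = i + 1, i + gap - 1
            if regions and start <= regions[-1][1] + 1:
                regions[-1][1] = max(regions[-1][1], end)
            else:
                regions.append([start, end])
    return [tuple(r) for r in regions]
```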
Since the preceding steps may pick up some video frames that belong to the same shot segment but whose pictures change drastically, the candidate segmentation regions may be further screened afterwards. In the process of screening the candidate segmentation regions, the candidate segmentation regions may be screened based on the similarity S1 within the candidate segmentation region and the similarity S2 outside the candidate segmentation region.
The method of calculating the similarity S1 within a segmentation region may be: calculating the similarity between a boundary frame of the candidate segmentation region and a video frame that is located within the candidate segmentation region and separated from that boundary frame by the preset number of interval frames, so as to obtain the similarity S1 within the candidate segmentation region. For example, if the candidate segmentation region consists of frames 12 and 13 and the preset number of interval frames is 2, the similarity between frame 11 and frame 13 and the similarity between frame 12 and frame 14 are calculated, and the minimum of the two similarities is taken as the similarity S1 within the segmentation region.
The method of calculating the similarity S2 outside a candidate segmentation region may be: calculating the similarity between a video frame at the front of the candidate segmentation region and the video frame the preset number of interval frames before it, and calculating the similarity between a video frame at the rear of the candidate segmentation region and the video frame the preset number of interval frames after it, so as to obtain the similarity S2 outside the candidate segmentation region. For example, if the candidate segmentation region consists of frames 12 and 13 and the preset number of interval frames is 2, the similarity between frame 10 and frame 12 and the similarity between frame 13 and frame 15 are calculated, and the minimum of the two similarities is taken as the similarity S2 outside the segmentation region. If S2 exceeds S1 by more than a threshold, the candidate segmentation region is confirmed as a final segmentation region, and the shot segments are divided based on the final segmentation region.
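Continuing the sketch, the S1/S2 screening of a candidate region might be implemented as follows, with the frame pairs chosen to mirror the 11/13 and 12/14 (S1) and 10/12 and 13/15 (S2) example; the margin threshold and the clamping at the clip edges are assumptions:

```python
def confirm_region(feats, region, gap=2, margin=0.15):
    """Keep a candidate region (start, end) only if the similarity outside it
    (S2) exceeds the similarity spanning it (S1) by more than the margin."""
    start, end = region
    n = len(feats)
    s1 = min(pair_similarity(feats, max(start - 1, 0), min(start + gap - 1, n - 1)),
             pair_similarity(feats, max(end - gap + 1, 0), min(end + 1, n - 1)))
    s2 = min(pair_similarity(feats, max(start - gap, 0), start),
             pair_similarity(feats, end, min(end + gap, n - 1)))
    return (s2 - s1) > margin
```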
Step 440: divide the initial image or the initial video based on the segment boundaries to obtain multiple video segments.
The initial video may be split by using the segment boundaries as cutting points, so that each split video segment contains one shot segment.
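Given boundary indices (e.g., from the hard-cut sketch above, where index i marks the last frame before a cut), the split itself is a simple partition; this helper is illustrative:

```python
def split_at_boundaries(num_frames, boundaries):
    """Turn cut indices (last frame of each shot) into (start, end) frame
    ranges, one range per resulting video segment."""
    shots, start = [], 0
    for b in sorted(boundaries):
        shots.append((start, b))
        start = b + 1
    if start < num_frames:
        shots.append((start, num_frames - 1))
    return shots
```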
It should be noted that the initial image may also be processed based on the method shown in FIG. 4, so as to obtain images determined to be non-repetitive in the initial image, which serve as corresponding video segments directly or after other editing processing.
In some embodiments, to facilitate processing video files based on a specific object, the present application may also determine, according to the specific theme of the target video, a specific subject (also called an object) related to the specific theme. The specific subject here can be understood as an object or main thing contained in a video frame/shot segment that is related to the specific theme (which may also be expressed as the target video theme, the cropping theme, etc.), and may include living beings (people, animals, etc.), commodities (cars, daily necessities, decorations, cosmetics, etc.), backgrounds (mountains, roads, bridges, houses, etc.), and the like. For example, when the specific theme of the target video is advertising, the specific object may include one or a combination of people, items, logos, etc. Specifically, the person may be an event/product spokesperson, the item may be the corresponding product, and the logo may be a product trademark, a regional mark, or the like.
It can be understood that in different processes of the present application, the specific object may be expressed under different names. For example, in the aforementioned process 200, the specific object may be expressed as the specific subject; in the subsequent process 500, the specific object may be expressed as the subject; in the subsequent process 1400, the specific object may be expressed as the cropping subject.
In some embodiments, the process of determining the specific subject according to the specific theme may be implemented by determining the specific subject from candidate subjects based on the specific theme. Specifically, the specific subject may be selected automatically based on the degree of association between the multiple candidate subjects contained in the video segments and the specific theme. For example, the candidate subjects may be ranked by their degree of association with the specific theme, and the top X candidate subjects may be selected, where X may be set to 1, 2, 4, or the like. The top X candidate subjects are determined as the specific subjects of the video segment. For the degree of association between a candidate subject and a specific theme, refer to the description in process 1600 of the correlation between one or more cropping subjects in the cropping subject information and the theme information.
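A hedged sketch of this top-X selection; the relevance scores are made up for illustration (cf. the lipstick example described below):

```python
def pick_specific_subjects(relevance, x=2):
    """relevance: {candidate_subject: association with the specific theme}.
    Rank candidates and keep the top-X as the segment's specific subjects."""
    ranked = sorted(relevance.items(), key=lambda kv: kv[1], reverse=True)
    return [subject for subject, _ in ranked[:x]]

# Illustrative scores only:
print(pick_specific_subjects(
    {"mouth": 0.92, "lipstick": 0.97, "nose": 0.35, "trees": 0.05}, x=2))
# -> ['lipstick', 'mouth']
```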
In some embodiments, the candidate subjects may be subjects preset in the database 140. Candidate subjects may be set specifically for a particular type of theme. For example, for cosmetics advertising videos, the candidate subjects may be commodities and human faces (including facial features such as eyes, nose, and mouth).
In some embodiments, the candidate subjects may also be the subjects in each video frame/shot segment, and the process of obtaining the candidate subjects may be implemented by determining the candidate subjects of each video segment through a machine learning model. For example, the machine learning model determines the main content in the video frames of each video segment and takes the main content as candidate subjects; then the correlation between the candidate subjects and the specific theme is determined according to semantic analysis, and the specific subjects are determined accordingly. This method can make the selected specific subjects highly relevant to the specific theme.
In some embodiments, the specific subjects of a video segment may be determined according to the specific theme. For example, if the theme information of the target video is lipstick, a video segment may contain one or more subjects, and the subjects of each video segment identified by the machine learning model serve as candidate subjects, which may include the nose of a face, the mouth of a face, the eyes of a face, a commodity (lipstick), trees, roads, and houses. Given that the specific theme is lipstick, the candidate subjects highly correlated with lipstick, such as the mouth of the face and the commodity (lipstick), can be further selected as the specific subjects of the corresponding video segment.
In some embodiments, the target video may include multiple video themes. For example, to highlight the effect of a mouthwash, a tooth-protection promotional video may be spliced before the mouthwash advertising video. Correspondingly, different video themes may contain different specific subjects. For each video file, the overlap count of specific themes may be determined according to the specific subjects contained in the video segment and the correspondence between specific subjects and specific themes (different video themes). For example, in the aforementioned mouthwash advertising video spliced with the tooth-protection promotion, shots related to effect demonstration may relate to two specific themes at the same time; for instance, the effect demonstration shots may include demonstrations of not brushing, brushing with dental tools, and using mouthwash, highlighting the cleaning effect of the mouthwash while promoting tooth protection. In addition, with multiple video themes, the focus time of a specific theme in the target video may be determined by counting the focus time corresponding to that specific theme (i.e., the time during which that specific theme is the focus or main content of the video segments).
FIG. 5 is an exemplary flowchart of a method for editing an initial image or an initial video according to some embodiments of the present application. In some embodiments, one or more steps of process 500 may be stored as instructions in a storage device (e.g., the database 140) and invoked and/or executed by a processing device (e.g., the processing device 112 in the server 110).
Step 510: determine the subject information of the initial image or the initial video, the subject information including at least the subject and the subject position.
The initial image or the initial video usually includes one or more subjects used to highlight the theme. In some embodiments, the subject may specifically be the object most relevant to the target video theme among the objects appearing in the video segment. In some embodiments, the subject may also be the object occupying the largest proportion of the frame. Exemplarily, the subject may be one or more of a product (electronic products, daily necessities, decorations, etc.), a living being (a person, an animal, etc.), a landscape (a mountain, a house, etc.), or the like. For ease of description, this part is described with a single subject, the subject being a model.
In some embodiments, the subject may be imported manually. For example, the user may select the subject from the database 140 or the terminal device 130. Continuing with the model as the subject as an example: when the user wants to generate a target video related to the model, the model can serve as the subject (specific theme) of the target video, and the corresponding video segments, initial videos, and initial images should contain this subject. After the user selects the model in the database 140, the processor further processes the initial image or the initial video to determine the position of the model in each video frame. Selecting the subject may be implemented by uploading a specific image, manually selecting a specific image in a video frame, semantic recognition, or similar methods. For example, after the user inputs the text "model", the multimedia system 100 can automatically identify the "model image" in each initial video and initial image as the subject through semantic recognition.
In some embodiments, the subject information includes at least the presence or absence of a subject and the subject position of the corresponding subject. The subject information may also include the subject's color information, size information, name information, category information, facial recognition data, or the like. The subject position can be understood as information about where the subject is located in the picture of the image and/or video, for example, the coordinates of a reference point. The size information of the subject may include the actual size of the subject and the proportion of the subject relative to the size of the frame of the advertising video. The category information of the subject can be understood as the classification of the subject. For example, the category information of a subject includes information on whether the subject is classified as a product or a model, or may be further refined into a certain category of products; for instance, the category information covering multiple subjects that are mobile phones may be "mobile devices".
In some embodiments, the subject information may be characterized by tags (e.g., a tag ID and a tag value). Correspondingly, tags may be added to the video frames in the initial video, and a tag may represent the name of a subject included in the image or video. For example, if a poster includes product A, product B, and model A, tags for product A, product B, and model A may be added to the poster (e.g., by adding or modifying the tag IDs corresponding to product A, product B, and model A and setting the corresponding tag values to 1).
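One possible in-memory layout for such tags, purely as an illustration of the poster example (the tag IDs are assumed names, not part of the disclosure):

```python
# Tag values of 1 mark subjects present in the poster; 0 marks absence.
poster_tags = {
    "product_A": 1,
    "product_B": 1,
    "model_A": 1,
    "product_C": 0,
}
```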
In some embodiments, in the method of determining the initial image or the initial video (e.g., step 310), each image or video to be selected in the database 140 may hold its own tags. When the user selects product A, product B, or model A in the database 140 as the subject (e.g., after determining the specific theme of the target video, selecting the specific object corresponding to that theme), the database 140 can automatically associate this selection with the aforementioned poster that includes product A, product B, and model A, and extract the poster as the initial image. When extracting the initial video, the part of the video content containing such video frames can be processed further directly (e.g., the video frames of each video are analyzed by the subject information determination model), thereby obtaining tags containing the subject and the subject position.
In some embodiments, the method of screening video frames may also be implemented by a machine learning algorithm, that is, a machine learning model identifies whether each video segment contains a specific object. For example, the theme of the target video may be a tooth-protection promotion, and the corresponding specific objects may be "teeth", "doctor", "dental tools", and other specific things related to the theme (tooth protection); based on the determined specific objects, a machine learning algorithm can determine whether each video segment contains a specific object.
In some embodiments, the initial image or the initial video may be processed by a subject information determination model (e.g., a machine learning model) to obtain the subject and the position of the subject.
In some embodiments, for the initial video, the subject information determination model may be a generative model, a discriminative model, or a deep learning model in machine learning; for example, it may be a deep learning model using an algorithm such as the yolo series of algorithms, the faster R-CNN algorithm, or the EfficientDet algorithm. The subject information determination model may perform subject information determination alone, or may be combined with other processes/steps (e.g., process 400) to determine the subject and the position of the subject. Correspondingly, the subject information determination model may be trained separately or together with other steps. For example, when a deep learning model is used for editing processing (e.g., process 400), manually annotated object positions and categories may be used as training samples to train the model, so that the model can accurately annotate the subject in the initial video. In some embodiments, an image embedding model may further be used as the subject information determination model to extract the images of the video frames constituting the initial video and extract the image features of the video frames, so as to determine the subject of the initial video.
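As one hedged example of such a detector, the sketch below runs torchvision's pretrained Faster R-CNN (one of the model families named above) on a single frame; the score threshold and helper names are assumptions of this illustration:

```python
import torch
from torchvision.models.detection import (
    fasterrcnn_resnet50_fpn, FasterRCNN_ResNet50_FPN_Weights)

weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
detector = fasterrcnn_resnet50_fpn(weights=weights)
detector.eval()

def detect_subjects(frame, score_threshold=0.6):
    """frame: (3, H, W) float tensor in [0, 1].
    Returns (box, category name, score) triples for confident detections."""
    with torch.no_grad():
        out = detector([frame])[0]
    names = weights.meta["categories"]
    return [(box.tolist(), names[label], float(score))
            for box, label, score in zip(out["boxes"], out["labels"], out["scores"])
            if float(score) >= score_threshold]
```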
In some embodiments, for the initial image, the initial image may be processed by the subject information determination model to obtain the position of the subject. Specifically, the image embedding model may continue to be used to determine the position of the subject. It can be understood that a single video frame can be regarded as a picture, so an image embedding model capable of processing multiple video frames can also process the initial image. In some embodiments, the image embedding models for initial image processing and initial video processing may be trained separately or together. In addition, in other specific implementations, the position of the subject in the initial image may also be determined by the deep learning model used for the initial video, for example, a deep learning model using an algorithm such as the yolo series of algorithms, the R-CNN algorithm, or the EfficientDet algorithm.
The subject information determination model may perform the operation of determining subject information alone; the subject information determination model may also be combined with other operations, so that the subject information is determined during the execution of those other operations.
In some embodiments, the initial image or the initial video may be input directly into the subject information determination model, so that the subject information determination model annotates the corresponding subject and subject position in the corresponding video frames, thereby determining the subject information of the initial image or the initial video.
In some embodiments, a long initial video may yield one or more video segments after being edited as shown in process 400. To improve the relevance of the video segments to the target video, the subject information determination model may be combined with the relevant operations of process 400, so that video segments with a specific subject can be retained during the splitting or editing of process 400; for example, the retained subject may be a specific object related to the specific theme of the target video. In some embodiments, the subject information determination model may be used to annotate the subject before cropping, so as to ensure that the cropped video includes the subject.
In some embodiments, the subject information determination model may be combined with the relevant operations of step 310. When the image embedding model is used to determine the subject information, the image embedding model extracts the image features of the subject, for example, the image features of the specific object determined by the specific theme of the target video in step 220, such as the image features of a "model". Based on the image features of the video frames and the image features of the subject, a series of video frames containing the subject are determined in the database 140, and the shot segment composed of this series of video frames is the initial video or initial image containing the subject.
Step 520: edit the initial image or the initial video based on the subject information to obtain multiple video segments.
In some embodiments, the editing processing can avoid the region where the subject is located according to the determined position of the subject, thereby generating video segments that meet the requirements.
In some embodiments, to improve the processing accuracy, the influence of the editing processing on the subject can be avoided according to the outer contour of the subject. After the subject position is determined, the outer contour of the subject is determined based on the subject position, so as to distinguish the subject from the background part of the initial image or the initial video. It should be noted that, in some other embodiments, the subject information may also include color information, size information, etc.; obviously, based on the color information and size information in addition to the subject position, the outer contour of the subject can be determined more quickly and efficiently.
In some embodiments, the outer contour of the subject may be determined by the size of the subject. For example, the smallest rectangular bounding box containing the subject may be determined according to the size of the subject, and this smallest rectangular bounding box may be used as the outer contour of the subject.
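A minimal sketch of the smallest-bounding-rectangle contour, assuming a binary subject mask is available (e.g., from a detection or matting step); the helper name is illustrative:

```python
import numpy as np

def min_bounding_rect(subject_mask: np.ndarray):
    """subject_mask: 2-D array, nonzero where the subject's pixels are.
    Returns (x, y, w, h): the smallest axis-aligned rectangle containing
    the subject, used here as the subject's outer contour."""
    ys, xs = np.nonzero(subject_mask)
    x, y = int(xs.min()), int(ys.min())
    return x, y, int(xs.max()) - x + 1, int(ys.max()) - y + 1
```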
In some embodiments, the outer contour of the subject may be determined by the edge of the subject, where the edge refers to the junction between the subject and the image background in the image. For example, after the position of the subject is determined, an image recognition algorithm (e.g., an edge detection algorithm) is used to determine the edge of the subject, and the edge of the subject is taken as its outer contour. In some embodiments, the region obtained by applying preset processing to the edge of the subject may also be used as the outer contour of the subject; for example, the preset processing may include one or a combination of smoothing, region scaling, and the like.
In some embodiments, the initial image or the initial video is cropped or scaled according to the outer contour of the subject to obtain video segments that meet the requirements. A video segment that meets the requirements may be a video segment in which the editing processing performed on the initial image or the initial video does not affect the subject. This can be implemented by cropping the initial image or the initial video while avoiding the outer contour of the subject, and by scaling the initial image or the initial video while keeping the aspect ratio within the outer contour of the subject. It can be understood that the editing processing may also include any editing method mentioned in the present application; for example, during beautification, the image outside the outer contour of the subject can be blurred while the outer contour of the subject is avoided, so as to highlight the subject.
In some embodiments, cropping the initial image or the initial video while avoiding the outer contour of the subject can be implemented by matting. Specifically, once the outer contour of the subject in the initial image or the initial video has been identified, a matting algorithm can be used to avoid the outer contour of the subject and separate the subject from the initial image or the initial video. Processing methods for the separated subject include, but are not limited to, locking it or placing it on a new layer; after the subject is locked or placed on a new layer, the background part can be processed further.
It should be noted that, in some embodiments, the matting algorithm may be a deep-learning-based matting algorithm, such as Learning Based Digital Matting or K-nearest-neighbor matting (KNN matting). In some other embodiments, the matting algorithm may also be at least one of Cluster-Based Sampling matting (CBS) and Iterative Transductive Learning for alpha matting (ITL).
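The following is not one of the matting algorithms named above, but a simpler OpenCV GrabCut stand-in that illustrates separating the subject layer from the background given the subject's outer-contour rectangle; the iteration count and helper name are assumptions:

```python
import cv2
import numpy as np

def separate_subject(image_bgr: np.ndarray, rect):
    """rect: (x, y, w, h) outer contour of the subject.
    Returns (subject_layer, background_layer) for further processing."""
    mask = np.zeros(image_bgr.shape[:2], np.uint8)
    bgd = np.zeros((1, 65), np.float64)
    fgd = np.zeros((1, 65), np.float64)
    cv2.grabCut(image_bgr, mask, rect, bgd, fgd, 5, cv2.GC_INIT_WITH_RECT)
    fg = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0).astype(np.uint8)
    return image_bgr * fg[:, :, None], image_bgr * (1 - fg)[:, :, None]
```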
In some embodiments, scaling the initial image or the initial video while keeping the aspect ratio within the outer contour of the subject can be implemented by separating the subject from the background. Specifically, to prevent the subject from being deformed or distorted during scaling, the subject and the background part are scaled separately, and the aspect ratio within the outer contour of the subject is kept during scaling. Merely by way of example, the initial image may be a poster with a pixel size of 800×600, and the subject may be a mobile phone with a pixel size of 150×330 within the poster (the subject 350 has an aspect ratio of 5:11). When the size of the target video or video segment is 1200×800, the initial image needs to be scaled to 1200×800. If the subject is scaled directly, the resulting size is 225×440, an aspect ratio of about 5:9.8; the subject is clearly deformed, and deformation of the subject in the target video or video segments may adversely affect the effect of the video and the customers' perception of the product. In some embodiments, the method of keeping the aspect ratio within the outer contour of the subject may be to obtain the scale factors of the initial image or the initial video in the width direction and in the height direction of the target video size (or video segment size) respectively. Continuing the above example, the initial image is scaled by a factor of 1.5 in the width direction and about 1.33 in the height direction; to ensure that the subject is not deformed, a single factor of 1.5 or 1.33 can be chosen for both directions. It should be noted that, in some other embodiments, the outer contour of the subject may not be rectangular; the above scaling method is equally applicable in that case, and the image processing is similar to the video processing, which will not be repeated here.
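The poster example above can be restated numerically; the helper below computes the per-direction factors and one uniform factor for the subject (choosing the larger factor here is an illustrative assumption, since the text allows either):

```python
def uniform_subject_scale(src_w, src_h, dst_w, dst_h):
    """Per-direction scale factors implied by the target size, plus a single
    uniform factor applied to the subject to preserve its aspect ratio."""
    sx, sy = dst_w / src_w, dst_h / src_h
    return sx, sy, max(sx, sy)

sx, sy, s = uniform_subject_scale(800, 600, 1200, 800)
# sx = 1.5, sy ~= 1.33; scaling the 150x330 phone by the single factor s
# keeps its 5:11 aspect ratio, unlike the non-uniform (sx, sy) pair.
```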
In some embodiments, since the size ratio of the background in the initial image or the initial video may not match the size of the target video or video segment, direct scaling may change the ratio. When the ratio needs to be kept consistent, the background part may be cropped first and then scaled after cropping.
FIG. 6 is an exemplary flowchart of generating a video according to some embodiments of the present application. In some embodiments, one or more steps of process 600 may be stored as instructions in a storage device (e.g., the database 140) and invoked and/or executed by a processing device (e.g., the processing device 112 in the server 110). As shown in FIG. 6, process 600 specifically includes the following steps:
Step 610: obtain at least one of an initial video and an initial image.
In some embodiments, the initial video and the initial image may also be called the to-be-processed video and the to-be-processed image, respectively. Step 610 may be implemented with reference to step 310 in process 300, or with reference to step 510 in process 500.
Step 620: obtain the subject of the target video in the initial video.
In some embodiments, the subject of the target video here can be understood as each specific object corresponding to the specific theme of the target video. For example, for the aforementioned target video with "tooth-protection promotion" as its theme, the subjects of the target video (the specific objects corresponding to the specific theme) may be "teeth", "doctor", "dental tools", and other specific things related to the theme (tooth protection).
Step 630: crop, scale, and/or edit the initial video based on the preset size of the target video and the subject, to obtain video materials that all include the subject.
In some embodiments, the video material may be a video segment determined from the initial video; video materials that all include the subject refer to video segments determined based on the initial video that all contain the subject (specific object) of the target video.
In some embodiments, the target video may be preset with a preset size as required. When the size of the initial video does not match the size of the target video, or their size ratios do not match, the initial video may be cropped, scaled, and/or edited. Merely by way of example, suppose the target video size is FHD (Full High Definition, 1920×1080). When the initial video does not match the target video in size but has the same ratio (e.g., both 16:9), the initial video can be scaled to obtain a video with the same 1920×1080 size as the target video. When the ratio of the initial video does not match that of the target video (e.g., the initial video is 1:1), if the initial video size is 1024×1024, the cropping target size obtained from the target video's aspect ratio is 1024×576; that is, the initial video is first cropped frame by frame, and the cropped 1024×576 video is then enlarged proportionally to 1920×1080. It should be noted that, in some embodiments, when the initial video size is larger than the target video size, e.g., 2560×2560, the initial video may be cropped directly to the target video size of 1920×1080, or, following the above steps, first cropped to 2560×1440 and then scaled proportionally. Since a video frame can be regarded as a picture, for the way of cropping the video frame by frame in this step, refer to the image cropping processing shown in FIG. 7 of the present application. In some embodiments, the way of cropping the video frame by frame in step 630 may also refer to the video cropping method shown in FIG. 14 of the application.
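The crop-then-scale arithmetic of this example can be sketched as follows. The helper computes sizes only; where the crop window is actually placed (e.g., around the subject) is a separate choice, and the helper name is illustrative:

```python
def crop_then_scale_size(src_w, src_h, dst_w, dst_h):
    """Largest crop of the source matching the target aspect ratio, and the
    uniform factor that then scales the crop to the target size."""
    target_ratio = dst_w / dst_h
    if src_w / src_h > target_ratio:                    # source too wide
        crop_w, crop_h = round(src_h * target_ratio), src_h
    else:                                               # source too tall
        crop_w, crop_h = src_w, round(src_w / target_ratio)
    return (crop_w, crop_h), dst_w / crop_w

# The 1:1 example above: 1024x1024 aimed at FHD (1920x1080, 16:9)
print(crop_then_scale_size(1024, 1024, 1920, 1080))
# -> ((1024, 576), 1.875): crop to 1024x576, then scale by 1.875
```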
In some embodiments, when the initial video size matches the target video size, or matches it after cropping and scaling, a long initial video (e.g., longer than 15 or 20 seconds) may be edited to avoid the problem of a single video material (video segment) lasting too long. Usually one video material (video segment) corresponds to one scene (shot segment), and playing footage of the same scene for a long time may bore the viewer, so each video material is shortened to highlight the key points. In some implementations, the duration of the video material may also be adjusted by interpolation or sampling. It should be noted that if the initial video needs both cropping/scaling and editing, it may first be edited to obtain a video with the subject and then cropped and scaled, or first cropped and scaled to obtain videos of a consistent size and then edited; the present application does not limit this.
Step 640: crop and/or scale the initial image based on the preset size of the target video to obtain image material including the subject.
In some embodiments, the image material may be determined from the initial images. "Image material all including the subject" refers to images, determined based on the initial images, that each contain the subject (specific object) of the target video.
In some embodiments, so that the images also satisfy the size requirement of the target video, the image files among the initial images whose size does not match the target video size are cropped or scaled. Continuing with the example where the target video size is FHD, cropping and/or scaling yields images of size 1920×1080 that include the subject.
In some embodiments, at least one of an initial image and an initial video is acquired in step 610. When both an initial image and an initial video are acquired, step 630 and step 640 are both performed, and there is no required order between the two steps. When only an initial video is acquired, step 630 may be performed and step 640 skipped; when only an initial image is acquired, step 630 may be skipped and step 640 performed.
In some embodiments, step 640 may be implemented by the method shown in FIG. 7, where FIG. 7 is an exemplary flowchart of generating a video according to some embodiments of the present application. One or more sub-steps of step 640 may be stored in the form of instructions in a storage device (e.g., the database 140) and called and/or executed by a processing device (e.g., the processing device 112 in the server 110).
Step 642: obtain subject information of the target video in the initial image.
In some embodiments, the initial image may be processed by a subject information determination model (e.g., a machine learning model) to obtain the subject information. In some embodiments, for the initial image, the subject information determination model may be a generative model, a discriminative model, or a deep learning model in machine learning, for example, a deep learning model using an algorithm of the YOLO series, the Faster R-CNN algorithm, or the EfficientDet algorithm.
The subject information is determined by inputting the initial image or the initial video into the subject information determination model. In some embodiments, the initial image or the initial video may be input directly into the subject information determination model, so that the model marks the corresponding subject and subject position in the corresponding video frame, thereby determining the subject information of the initial image or the initial video. In some embodiments, step 642 may refer to the related description of step 510 in the process 500.
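Purely as an illustrative sketch (the third-party package, the weights file, and the input file name below are assumptions, not part of this application), a YOLO-series detector could return the subject label and bounding box roughly as follows:

```python
# Illustrative only: assumes the third-party "ultralytics" package and a
# pretrained "yolov8n.pt" weights file are available.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")             # stand-in for the subject information determination model
results = model("initial_image.jpg")   # hypothetical input file

for box in results[0].boxes:
    cls_id = int(box.cls)                     # class index of the detected subject
    label = results[0].names[cls_id]          # human-readable subject label
    x1, y1, x2, y2 = box.xyxy[0].tolist()     # subject position as a bounding box
    print(label, (x1, y1, x2, y2), float(box.conf))
```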
Step 644: identify the outer contour of the subject based on the subject information.
After the subject information (including the subject and the position of the subject) is determined, the outer contour of the subject is determined based on the subject position, so as to distinguish the subject from the background portion of the initial image. It should be noted that in some other embodiments, the subject information may further include color information, size information, etc.; clearly, with the color information and size information in addition to the subject position, the outer contour of the subject can be determined more quickly and efficiently.
In some embodiments, the outer contour of the subject may be determined by the size of the subject. For example, the minimum bounding rectangle containing the subject may be determined according to the size of the subject, and that rectangle may be used as the outer contour of the subject.
In some embodiments, the outer contour of the subject may be determined by the edge of the subject, where the edge refers to the boundary between the subject and the image background. For example, after the position of the subject is determined, an image recognition algorithm (e.g., an edge detection algorithm) may be used to determine the edge of the subject, and the edge is taken as the subject's outer contour. In some embodiments, a region obtained by applying preset processing to the edge of the subject may also be used as the outer contour; for example, the preset processing may include one or more combinations of smoothing, region scaling, and the like.
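As a minimal sketch of the edge-based approach (OpenCV is assumed here merely for illustration, and the detector's bounding box is hard-coded; this application does not prescribe a particular library), one could extract the edge of the subject inside the detected region and take its bounding rectangle:

```python
# Illustrative only: assumes OpenCV (cv2) is available.
import cv2

img = cv2.imread("initial_image.jpg")           # hypothetical input
x1, y1, x2, y2 = 100, 80, 400, 520              # subject position from the detector (assumed)
roi = img[y1:y2, x1:x2]

gray = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, 50, 150)                # edge detection within the subject region
contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

largest = max(contours, key=cv2.contourArea)    # treat the largest contour as the subject edge
bx, by, bw, bh = cv2.boundingRect(largest)      # upright bounding rectangle = outer contour
print("outer contour:", (x1 + bx, y1 + by, bw, bh))
```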
Step 646: crop the initial image while avoiding the outer contour of the subject.
The initial image may be cropped while avoiding the outer contour of the subject. In some embodiments, this may be achieved by matting. Specifically, after the outer contour of the subject in the initial image is identified, a matting algorithm may be used to avoid the outer contour and separate the subject from the initial image. Processing methods for the separated subject include, but are not limited to, locking it or placing it on a new layer; once the subject is locked or placed on a new layer, the background portion can be processed further.
It should be noted that, in some embodiments, the matting algorithm may be a learning-based matting algorithm, such as Learning Based Digital Matting or KNN matting. In some other embodiments, the matting algorithm may also be at least one of Cluster-Based Sampling matting (CBS) and Iterative Transductive Learning for alpha matting (ITL). In some embodiments, step 646 may refer to the related description of step 520 in the process 500.
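As an illustrative stand-in for the matting algorithms named above (none of which is implemented here), OpenCV's GrabCut can demonstrate the same subject/background separation given the outer-contour rectangle; this is a sketch under that assumption, not the claimed matting step:

```python
# Illustrative only: GrabCut as a simple stand-in for the matting step.
import cv2
import numpy as np

img = cv2.imread("initial_image.jpg")      # hypothetical input
rect = (100, 80, 300, 440)                 # (x, y, w, h) from the outer contour (assumed)

mask = np.zeros(img.shape[:2], np.uint8)
bg_model = np.zeros((1, 65), np.float64)
fg_model = np.zeros((1, 65), np.float64)
cv2.grabCut(img, mask, rect, bg_model, fg_model, 5, cv2.GC_INIT_WITH_RECT)

# Pixels marked as sure/probable foreground form the separated subject "layer".
fg_mask = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0).astype("uint8")
subject = img * fg_mask[:, :, None]        # subject kept; background zeroed for further processing
```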
Step 648: scale the image to be processed while maintaining the aspect ratio within the outer contour of the subject.
In some embodiments, scaling the initial image while maintaining the aspect ratio within the outer contour of the subject may be achieved by separating the subject from the background. Specifically, to prevent the subject from being deformed or distorted during scaling, the subject and the background portion are scaled separately, and the aspect ratio within the outer contour of the subject is preserved during scaling. Merely by way of example, the initial image may be a poster with a pixel size of 800×600, and the subject may be a mobile phone within the poster with a pixel size of 150×330 (subject aspect ratio 5:11). When the size of the target video or video segment is 1200×800, the initial image needs to be scaled to 1200×800. If the subject were scaled directly with the image, its scaled size would be 225×440, i.e., an aspect ratio of about 5:9.8; the subject is clearly deformed, and a deformed subject in the target video or a video segment may adversely affect the effect of the video and customers' perception of the product. In some embodiments, the method of maintaining the aspect ratio within the outer contour of the subject may be to separately obtain the scaling ratios of the initial image in the width direction and in the length direction of the target video size (or video segment size). For example, continuing the above example, the initial image is scaled by 1.5 times in the width direction and by about 1.33 times in the length direction; to ensure that the subject is not deformed, a single factor, e.g., 1.5 or about 1.33, may be chosen for both the length and width directions of the subject. It should be noted that, in some other embodiments, the outer contour of the subject may not be rectangular; the above scaling method still applies in that case. The processing of images is similar to the processing of videos and is not repeated here. In some embodiments, step 648 may refer to the related description of step 520 in the process 500.
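The per-axis arithmetic above can be sketched as follows (a minimal illustration; the helper function and its choice of the smaller factor are assumptions): the background may be stretched anisotropically, while the subject gets one uniform factor.

```python
def subject_safe_scale(img_w, img_h, tgt_w, tgt_h, sub_w, sub_h):
    """Background is stretched per axis; the subject keeps its aspect ratio
    by using a single uniform factor for both of its directions."""
    sx, sy = tgt_w / img_w, tgt_h / img_h   # per-axis factors for the background
    s = min(sx, sy)                          # one uniform factor for the subject
    return (sx, sy), (round(sub_w * s), round(sub_h * s))

# Poster example from the text: 800x600 -> 1200x800, subject 150x330 (5:11).
(bg_sx, bg_sy), (sw, sh) = subject_safe_scale(800, 600, 1200, 800, 150, 330)
print(bg_sx, bg_sy)   # 1.5, 1.333... (anisotropic background stretch)
print(sw, sh)         # 200 x 440 -- still 5:11, so the phone is not deformed
```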
Step 650: splice the video segments based on the video configuration information to generate the target video.
The video segments in step 650 may include video material each including the subject and/or image material including the subject.
In some embodiments, the video configuration information may be determined based on script information and/or a video template. The video template may include an overall video template for the target video, and may also include segment video templates for the individual video segments that make up the target video. The video template may include at least a time parameter. In some embodiments, the time parameter reflects at least the length of the target video or of a video segment (the total duration of the target video). In some embodiments, the preceding steps have already processed the initial images and/or initial videos into video segments (including image material and/or video material) consistent with the target video size; splicing may therefore consist of playing the video segments in order, either randomly or according to a predetermined rule, based on the time parameter. Merely by way of example, the predetermined rule may be alternating image material and video material, concentrating the image material in the middle of the target video, or playing the segments in a preset order corresponding to each video segment. For example, for the aforementioned mouthwash advertising video, the video segment containing the subject (specific object) "model" is played first, then the segment containing the subject "teeth", and finally the segment containing the subject "mouthwash product". It should be noted that since pictures have no time attribute, the display time of a single picture (e.g., 3, 5, or 10 seconds) may be defined during splicing, switching to the next material once the display time is reached. In some embodiments, the duration of a video segment may differ from the time parameter in the segment video template; one or more of the following methods may be combined to adjust the duration: cutting part of the segment, merging it with other segments, sampling its frames for playback, or interpolating between frames.
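A minimal splicing sketch is shown below (the third-party moviepy package, the file names, and the 3-second image display time are all assumptions for illustration; the segments are assumed to have already been reconciled to the target video size):

```python
# Illustrative only: assumes the third-party "moviepy" package and these media files.
from moviepy.editor import ImageClip, VideoFileClip, concatenate_videoclips

segments = [
    VideoFileClip("model.mp4"),                    # subject: "model"
    VideoFileClip("teeth.mp4"),                    # subject: "teeth"
    ImageClip("mouthwash.png").set_duration(3),    # pictures need an explicit display time
]
target = concatenate_videoclips(segments)          # splice in the preset order
target.write_videofile("target_video.mp4", fps=25)
```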
In some embodiments, the video template may also be used in steps such as obtaining the initial videos or initial images and generating video segments. Correspondingly, the object to which the video template is applied in this step may be the target video, a video segment, an initial image, or an initial video. For example, for the initial images and/or initial videos, a video template corresponding to the subject of the target video may be selected and applied. Specifically, for an initial image of a specific product (e.g., mouthwash), the background of the image may be replaced with a display background, and text (e.g., a product introduction, price information, a purchase link) and dynamic effects (e.g., an arrow pointing to the purchase link) may be added to the image, thereby generating a product display video segment.
In some embodiments, the video template may also include beautification parameters, which are used to beautify the target video for a better effect. In some other embodiments, the beautification parameters may not be included in the video template, but may instead be obtained in a step before video rendering, before determining the video segments, and/or before generating the target video. Correspondingly, the beautification methods described below may also be applied to the target video, the video segments, the initial images, and the initial videos.
In some embodiments, specifically, the beautification parameters may include at least one of filter parameters, animation parameters, and layout parameters. A filter parameter may add a global effect filter (e.g., black-and-white, retro, vivid) to the target video. An animation parameter may add animation effects between video segments when the target video is spliced from multiple segments, making the target video more natural. As for layout parameters, since the subject position differs among the segments, in some embodiments the subject position may be marked in the material (e.g., the subject is in the upper-left, upper-right, lower-left, or lower-right of the whole image/video), and the layout parameters combine and arrange this subject position information to make the target video smoother and the subject more prominent. In some other embodiments, the beautification parameters may also include removing or adding a watermark, etc.
In some embodiments, before splicing or when determining the video segments, at least one of a text layer, a background layer, or a decoration layer, together with corresponding loading parameters, may also be obtained based on the video configuration information. The layout of the text layer, background layer, and decoration layer in the target video is determined according to the loading parameters, and during splicing and rendering the text layer, decoration layer, and/or background layer are embedded into the video segments or the target video according to that layout. In some embodiments, the text layer may be subtitles or an additional text introduction. In addition, image material sometimes has a transparent background, in which case a background layer may be needed. It should be understood that the text layer and the background layer are both added according to the actual situation of the target video. In some embodiments, the text layer, decoration layer, and background layer may be included in the video template.
In some embodiments, the initial images and initial videos may come from different sources and may differ considerably in color; the generated video segments may then also differ considerably, so normalization is needed. In some embodiments, the normalization may be performed when the video segments are determined, i.e., the initial images and/or initial videos are normalized to generate video segments with a uniform style. In some embodiments, the normalization may be performed during splicing and rendering, i.e., the video segments are normalized to generate a target video with a uniform style. Since a video frame can be regarded as an image, image normalization refers to the process of applying a series of standard processing transformations to an image to convert it into a fixed standard form, the resulting standard image being called a normalized image. Merely by way of example, in some embodiments the grayscale or gamma values of the video segments, initial images, and/or initial videos may be normalized. Specifically, the image histogram of an image or video frame may first be obtained, the histogram is at least equalized, and the grayscale or gamma value of the image or video frame is adjusted based on the at-least-equalized histogram, thereby achieving image normalization. In some other embodiments, the normalization may also be one or more of scale normalization and rotation normalization based on the subject of the target shot; in addition, normalization may also be applied to the brightness, hue, saturation, etc. of the video segments, initial images, and/or initial videos.
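As a minimal sketch of the grayscale/gamma normalization described above (OpenCV is assumed, and the gamma value is an arbitrary illustration), each frame's luminance histogram can be equalized and a gamma adjustment applied:

```python
# Illustrative only: histogram equalization plus a gamma adjustment per frame.
import cv2
import numpy as np

def normalize_frame(frame_bgr, gamma=1.2):
    # Equalize the luminance histogram so frames from different sources match better.
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb)
    ycrcb[:, :, 0] = cv2.equalizeHist(ycrcb[:, :, 0])
    out = cv2.cvtColor(ycrcb, cv2.COLOR_YCrCb2BGR)
    # Gamma adjustment via a lookup table over the 256 intensity levels.
    lut = np.array([255 * (i / 255) ** (1 / gamma) for i in range(256)], dtype="uint8")
    return cv2.LUT(out, lut)
```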
FIG. 8 is an exemplary flowchart of generating a target video according to some embodiments of the present application. In some embodiments, one or more steps in the process 800 may be stored in the form of instructions in a storage device (e.g., the database 140) and called and/or executed by a processing device (e.g., the processing device 112 in the server 110).
Step 810: obtain initial images or an initial video, and generate multiple video segments based on the initial images or the initial video.
In some embodiments, only the initial video may be obtained; the initial video may also be referred to as a video file. For step 810, reference may be made to the related description of step 310 in the process 300, which is not repeated here.
In some embodiments, step 810 may also be omitted from the process 800; for example, the multiple video segments may be obtained directly.
Step 820: obtain one or more candidate video segments from the multiple video segments based on a first preset condition.
The first preset condition may be used to reflect requirements on the content and/or duration of the target video. For example, the first preset condition may include requirements on multiple elements. The multiple elements include, but are not limited to: the target video containing a specific object (F); the target video containing a specific theme (S); the duration of a video segment (tp); the total duration of the target video (ta); the target video containing a specific shot picture (P); the number of shot pictures contained in the target video (Pn); the overlap count of a specific theme in the target video (Fn); and the focus time of a specific theme in the target video (St). The shot pictures counted by Pn may be the same or different. Just as multiple sentences form a paragraph or an article, a combination of multiple shot pictures can form a new video that expresses more detailed content. The focus time (St) of a specific theme refers to the duration occupied by content about that theme in a video segment or the target video. The overlap count (Fn) of a specific theme refers to the number of times content about that theme appears in a video segment or the target video. Fn and St can be related to how prominently the specific theme is highlighted: the larger the overlap count (Fn) and the longer the focus time (St), the more prominent the theme. By constraining Fn and St, better publicity/promotion of the content expressed by the specific theme can be achieved. In some embodiments, the first preset condition may be specified by a user, or determined automatically by the multimedia system 100 (e.g., the processing device 112) based on the video configuration information, the promotional effect the target video needs to produce, and the like.
The first preset condition may be a constraint on at least one of the above elements. The constraints may include qualitative constraints (e.g., whether a specific object (F), a specific theme (S), or a specific shot picture (P) is included) or quantitative constraints (e.g., the total duration of the target video (ta), the number of shot pictures contained in the target video (Pn), the overlap count of a specific theme in the target video (Fn), the focus time of a specific theme in the target video (St)). In some embodiments, screening of video segments may be implemented by constraining the values corresponding to the elements (e.g., comparing them with preset thresholds). For example, when the target video contains the specific object (F), the corresponding value is set to 1; conversely, when it does not, the corresponding value is 0. For a qualitative constraint, the first preset condition may be that the value of the corresponding element is greater than 0. For a quantitative constraint, the first preset condition may be that the value of the corresponding element meets a corresponding threshold (e.g., the video duration is less than 30 seconds, the theme focus time exceeds 15 seconds), so as to screen out video segments that satisfy the needs (e.g., match the interests or requirements of a specific group).
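A toy sketch of such element constraints over clip metadata might read as follows (the metadata schema, thresholds, and values below are invented for illustration; the durations echo the 30-second example given later):

```python
# Illustrative only: the metadata schema and thresholds below are assumptions.
clips = [
    {"id": 1, "duration": 15, "objects": {"model"},     "theme_focus": 10},
    {"id": 2, "duration": 10, "objects": {"teeth"},     "theme_focus": 6},
    {"id": 3, "duration": 5,  "objects": {"mouthwash"}, "theme_focus": 3},
    {"id": 4, "duration": 40, "objects": {"model"},     "theme_focus": 2},
]

def meets_first_condition(clip, required_object=None, max_tp=30):
    qualitative = required_object is None or required_object in clip["objects"]  # F
    quantitative = clip["duration"] <= max_tp                                    # tp
    return qualitative and quantitative

candidates = [c for c in clips if meets_first_condition(c)]   # drops clip 4 (40 s > 30 s)

# Combination-level constraints: ta (total duration) and St (theme focus time).
total = sum(c["duration"] for c in candidates)       # 15 + 10 + 5 = 30 -> satisfies ta = 30 s
focus = sum(c["theme_focus"] for c in candidates)    # 10 + 6 + 3 = 19 -> satisfies St >= 15 s
print(total, focus)
```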
In some embodiments, video segments satisfying the first preset condition may be screened out of the multiple video segments and determined to be candidate video segments. For example, the first preset condition may include constraints on the total duration of the target video (ta) and the number of shot pictures it contains (Pn); that is, candidate video segments may be determined according to the total duration and the number of shot pictures. Illustratively, if the first preset condition is that the target video must contain 3 shot pictures with a total duration of 40 seconds, then 3 video segments each containing a different shot picture and totaling 40 seconds may be screened out, e.g., a 15-second video segment 1, a 15-second video segment 2, and a 10-second video segment 3, as candidate video segments. By constraining ta and Pn, a predetermined degree of exposure can be given to specific shot pictures within a fixed total duration of the target video.
For another example, the first preset condition may include a constraint on a specific shot picture (P) of the video segments, i.e., candidate video segments may be determined according to the shot pictures they contain. Illustratively, if the first preset condition is that the target video must contain a "sea surfing" shot picture, one or more video segments whose shot picture is "sea surfing" may be screened as candidate video segments. By constraining the specific shot picture (P), the target video can be made to contain specific content to satisfy users' interests or needs.
For another example, the first preset condition may simultaneously include constraints on the target video containing a specific object (F), the total duration of the target video (ta), and the number of shot pictures it contains (Pn); that is, candidate video segments may be determined according to the objects contained in the segments and the video duration. Illustratively, suppose the first preset condition is that the target video must contain 3 shot pictures and a logo of "region XX", and that its total duration must not exceed 70 seconds. A video segment 4 containing the "region XX" logo may first be screened as one candidate; then, according to the duration of segment 4 (e.g., 20 seconds), two further video segments may be screened such that the total video duration does not exceed 70 seconds, e.g., a 30-second video segment 5 and a 20-second video segment 6, as candidate video segments. By constraining F, ta, and Pn, a predetermined degree of exposure can be given to the specific shot pictures containing the specific object within a fixed total duration of the target video, so as to highlight that object.
For another example, the first preset condition may include constraints on the overlap count of a specific theme in the target video (Fn) and on the focus time of the specific theme in the target video (St); that is, candidate video segments may be determined according to the number of occurrences of the theme and the focus duration. Illustratively, if the first preset condition is that the target video must contain 2 identical or different specific themes and that the focus time of the specific theme must exceed 1 minute, then either a single video segment containing that number of specific themes with a focus duration exceeding 1 minute, or multiple video segments containing that number of specific themes whose cumulative focus duration exceeds 1 minute, may be screened as candidate video segments. By constraining Fn and St, the specific theme can be highlighted in the target video, which is suitable for advertising or promotion aimed at that theme.
For yet another example, the first preset condition may include constraints on the total duration of the target video (ta), the number of shot pictures it contains (Pn), and the focus time of a specific theme in it (St); that is, candidate video segments may be determined according to the video duration, the number of shot pictures, and the theme focus time. Illustratively, suppose the first preset condition is that the target video has a total duration of 30 seconds, must contain 3 shot pictures, and the focus time of the specific theme is no less than 15 seconds. Three shot pictures may first be screened such that their content on the specific theme totals no less than 15 seconds (e.g., 16, 18, or 20 seconds) and their total duration is 30 seconds; for example, a 15-second video segment 1 (containing 10 seconds of theme-specific content), a 10-second video segment 2 (containing 6 seconds), and a 5-second video segment 3 (containing 3 seconds) may be screened as candidate video segments. By constraining ta, Pn, and St, a predetermined degree of focus can be given to the specific shot pictures of the specific theme within a fixed total duration of the target video, so as to further highlight that theme.
In some embodiments, the process of determining candidate video segments based on the first preset condition to generate the target video can be expressed as a model: f(a, b, c, …, n) → y, where f may be the first preset condition, a, b, c, …, n are the multiple video segments, and y is the target video. By adjusting the relationship between the values corresponding to the elements in the first preset condition and the corresponding thresholds, candidate video segments can be determined, and the target video generated based on them. In some embodiments, the process of generating the target video based on the video configuration information may likewise be expressed as a similar model. The model may be executed by the processing device 112 to automatically generate the target video from the multiple video segments. In some embodiments, f in the model may be learned and trained with a machine learning model to determine a first preset condition or video configuration information that satisfies a specific requirement (e.g., a playback effect).
In some embodiments, there may be one or more candidate video segments satisfying any one requirement of the first preset condition. For example, there may be 3 candidate video segments satisfying a specified shot-picture requirement, 5 satisfying a specific-object requirement, and so on. Correspondingly, the candidate video segments satisfying the first preset condition may form one or more groups.
In some embodiments, the first preset condition may include constraints on the individual shot segments (video segments) of the target video. Correspondingly, according to the constraint for a specific shot segment, the video segments satisfying that constraint may be assigned to that shot segment's group of candidate video segments, so as to obtain a candidate video segment group for each shot segment.
It should be understood that the above is only an example. In some embodiments, the first preset condition is related to at least one of the multiple elements and may be characterized as a constraint on one or more of them (a constraint on a requirement may also be understood as a requirement on the target video). When constraints on multiple elements are involved, the screening of video segments satisfying each requirement may be performed in any reasonable order, which is not limited in this specification.
In some embodiments, the processing device 112 may determine the candidate video segments using a machine learning model. For example, a trained machine learning model may be used to determine whether each of the multiple video segments contains a specific object, and the segments containing the specific object are determined as candidate video segments. In some embodiments, the input of the trained machine learning model may be a video segment, and the output may be the objects contained in the segment, or a binary classification result of whether the segment contains the specific object, which is not limited in this specification. In some alternative embodiments, the candidate video segments may be determined in other feasible ways, which is not limited in this specification.
In some embodiments, the first preset condition further includes element constraint conditions involving two or more specific elements among the multiple elements. The two or more specific elements may be different elements. In some embodiments, an element constraint condition may involve the priority of the two or more specific elements. For example, based on the ease of judgment, a specific object (F) may have a higher priority than a specific theme (S); based on the theme-highlighting effect, the theme focus time (St) may have a higher priority than the theme overlap count (Fn). In some embodiments, an element constraint condition may involve the order in which different elements are considered. For example, when the total duration of the target video (ta), the number of shot pictures (Pn), and the focus time of a specific theme (St) are all present, the number of shot pictures (Pn) is considered first, the theme focus time (St) second, and finally the total duration (ta) can be adjusted by fast or slow playback. In some embodiments, an element constraint condition may also involve whether different elements are mutually compatible. For example, a total video duration (ta) of 15 seconds is incompatible with a theme focus time (St) of 20 seconds.
In some embodiments, the first preset condition may include a binding condition on shot pictures in the target video. The binding condition may reflect an association between at least two specified shot pictures in the target video; for example, the binding condition (which may also be understood as an association) may be that specified shot picture a and specified shot picture b must both appear in the target video, or that in the target video specified shot picture a must appear before specified shot picture b, etc. In some embodiments, the processing device 112 may determine, from the multiple video segments, the segments containing the specified shot pictures, and combine them based on the binding condition into a single candidate video segment. For example, if the binding condition is that shot picture a must appear before shot picture b in the target video, shot pictures a and b may be combined with a placed before b, or their order may be annotated, so as to form one candidate video segment. Combining shot pictures that satisfy the binding condition into one candidate video segment helps treat them as a whole in subsequent processing, improving the efficiency of video synthesis. In some embodiments, shot pictures satisfying the binding condition may instead not be combined into one candidate video segment, but exist separately in consecutive or non-consecutive (e.g., spaced) candidate video segments.
Step 830: group the one or more candidate video segments to determine at least one segment set.
In some embodiments, the process 800 may be used to generate multiple (e.g., a target number of) target videos simultaneously. The grouping may be understood as combining the candidate video segments; correspondingly, step 830 may include combining the candidate video segments to generate a target number of segment sets.
Each segment set in the at least one segment set is a set combined from one or more candidate video segments that simultaneously satisfies the first preset condition and the other conditions in the video configuration information. In some embodiments, the other conditions in the video configuration information may be a second preset condition. The second preset condition is related to the content difference between segment sets/video segments. For example, the second preset condition may specifically include a constraint on the combination difference degree of the segment sets, i.e., the difference degree between the candidate-video combinations of the segment sets in the at least one segment set is greater than a preset threshold. For the evaluation of the second preset condition, reference may be made to FIG. 9 below and its related description.
In some embodiments, the other conditions in the video configuration information may also include, but are not limited to, requirements on one or more combinations of the target video's frame, subtitles, hue, saturation, background music, and the like. For example, the at least one segment set may include segment set 1 combining the aforementioned video segments 1, 2, and 3; segment set 2 combining video segments 4, 5, and 6; segment set 3 combining video segments 1, 2, 3, and 4; and so on.
Step 840: generate one target video based on each segment set in the at least one segment set.
In some embodiments, when the process 800 is used to generate multiple target videos simultaneously, step 840 may include, for each segment set, synthesizing one target video based on that segment set.
For each segment set in the at least one segment set (which may also be a target number of segment sets), one target video may be synthesized based on that set; correspondingly, the at least one segment set can be synthesized into a corresponding number of target videos. In some embodiments, the video segments may be sorted and combined according to sequence information included in the video configuration information to generate the target video. In some embodiments, for video segments without an order requirement, the target video may be synthesized based on the continuity of the shot pictures of the segments in the set, or randomly. For example, a segment showing daytime content may be placed before a segment showing nighttime content. In some embodiments, the target video may be synthesized based on the promotional copy of the product or information to be promoted. For example, for a video promoting garbage sorting, the promotional copy may first show the possible consequences of not sorting, then the benefits of sorting, and finally the sorting method; the target video can then be synthesized according to the content of each segment in the set, following the order in which the copy presents the content.
In some embodiments, when a target number of target videos are synthesized, they may be delivered in batches or simultaneously. In some embodiments, the first preset condition, the second preset condition, and/or the other conditions may be adjusted based on the promotional effect of each target video. The promotional effect may be obtained from user feedback, play counts, ratings, feedback results (e.g., product sales, garbage sorting results), etc., which is not limited in this specification; for details, reference may be made to the related descriptions of FIGS. 17 and 18 of this application.
In some embodiments, the second preset condition may include that the difference degree between any two segment sets in the at least one segment set is greater than a preset difference threshold. In some embodiments, the difference degree between any two segment sets may also be referred to as the candidate-video-segment combination difference degree between the two sets.
FIG. 9 is an exemplary flowchart of determining segment sets according to some embodiments of the present application, specifically involving a method of screening out at least one segment set based on the second preset condition. In some embodiments, one or more steps in the process 900 may be stored in the form of instructions in a storage device (e.g., the database 140) and called and/or executed by a processing device (e.g., the processing device 112 in the server 110).
Step 910: determine the candidate-video-segment combination difference degree between each of two or more segment sets and the other segment sets.
In some embodiments, the multimedia system 100 (e.g., the processing device 112) may randomly select candidate video segments satisfying multiple groups of first preset conditions and randomly combine them into two or more segment sets. In some embodiments, for the multiple candidate video segments determined according to the first preset condition, one or more candidates may be randomly selected from each group of candidate segments corresponding to the same element to form one segment set; by repeating the random selection step, two or more segment sets can be formed. Segments with qualitative restrictions may be randomly selected first, followed by the other segments. For example, suppose that according to the video configuration information the last shot should be a product shot, while the other shots have no order requirement but do have content requirements. When generating the two or more segment sets, a product-shot video segment may first be randomly selected from the candidate segments as the last shot; then, according to the content requirements of the other shots, corresponding segments are randomly selected from the corresponding candidate segments and sorted (e.g., randomly). Here, the candidate shots may be grouped according to the content requirements of the other shots, with one segment determined from each group, thereby obtaining one combination of candidate video segments.
In step 910, the combination difference degree between each of the two or more segment sets and the other sets may be determined by comparing the combinations of the sets, e.g., by comparing the difference rate between the combinations, i.e., the ratio of video segments that differ between the sets. In some embodiments, step 910 may also be implemented by assigning values to different video segments or by using a machine learning algorithm; for details, reference may be made to the related descriptions of FIGS. 10 and 11 below.
In some embodiments, the difference degree between any two segment sets may include the difference in their candidate-video-segment combinations and/or the combined difference of the candidate segments together with other conditions such as frames, subtitles, and hue. For example, segment set 1 may include video segment 1, video segment 2, frame 1, and subtitle 1, while segment set 5 may include video segment 1, video segment 2, frame 2, and subtitle 2; the difference between segment set 1 and segment set 5 then lies in the frames and subtitles, so the difference rate of the candidate-segment combinations alone is 0%, while the difference rate taking the frames and subtitles into account is 50%. The difference rate may thus be used to characterize the difference degree.
Step 920: take the segment sets whose combination difference degree with the other segment sets is higher than a preset threshold as the at least one segment set.
The segment sets may be screened according to the combination difference degrees among the aforementioned two or more segment sets; for example, the screening condition may be that the combination difference degree between any two of the screened-out segment sets is higher than a preset threshold (e.g., a difference rate above 50%). By screening out multiple segment sets satisfying the second preset condition, the target videos generated from the at least one segment set can present different content and thus achieve different promotional effects, while also satisfying video platforms' requirements on delivered videos.
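A greedy sketch of this screening might read as follows (the difference-rate definition and the 50% threshold mirror the example above; the segment IDs and the greedy strategy are assumptions for illustration):

```python
# Illustrative only: greedy screening of segment sets by pairwise difference rate.
def difference_rate(set_a, set_b):
    """Ratio of elements (segment IDs) not shared between two sets."""
    union = set_a | set_b
    return len(set_a ^ set_b) / len(union) if union else 0.0

candidate_sets = [{1, 2, 3}, {4, 5, 6}, {1, 2, 3, 4}, {1, 2, 4}]
kept = []
for s in candidate_sets:
    if all(difference_rate(s, k) > 0.5 for k in kept):   # preset threshold: 50%
        kept.append(s)
print(kept)   # [{1, 2, 3}, {4, 5, 6}] -- the other two overlap too much with {1, 2, 3}
```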
In some embodiments, the screening order of the first preset condition and the second preset condition may be changed. For example, multiple candidate segment sets may first be generated based on the multiple video segments so that they satisfy the second preset condition; then, at least one segment set is screened from the multiple candidate segment sets based on the first preset condition; and a target video is generated based on each segment set in the at least one target segment set. For details, reference may be made to FIG. 12 and its description.
The beneficial effects that the process 900 may bring include, but are not limited to: (1) determining a target number of segment sets based on the differences between sets makes it possible to determine multiple target videos with different content presentation effects, improving the diversity of the generated target videos; (2) the entire target video generation requires no manual operation, improving the efficiency of video synthesis. It should be noted that different embodiments may produce different beneficial effects; in different embodiments, the beneficial effects may be any one or a combination of the above, or any other obtainable beneficial effects.
FIG. 10 is an exemplary flowchart of determining the combination difference degree of segment sets according to some embodiments of the present application. In some embodiments, one or more steps in the process 1000 may be stored in the form of instructions in a storage device (e.g., the database 140) and called and/or executed by a processing device (e.g., the processing device 112 in the server 110).
Step 1010: assign an identification character to each of the one or more candidate video segments.
In some embodiments, a different identification character may be assigned to each candidate video segment. For each segment set in the at least one segment set (or the target number of segment sets), the identification characters may be determined according to the number of candidate video segments; for example, when the number of candidates is small, an English letter may be assigned as the identification character, and when the number is large, an ASCII code may be assigned. In particular, specific candidate video segments with special requirements, or that must appear in the target video, may be given no identification character. For example, the last shot of the target video must be the product display video segment; if there is only one such segment, it may be given no identification character.
Step 1020: determine, based on the identification characters of the one or more candidate video segments, the character strings corresponding to the segment set and the other segment sets.
For each segment set, the character string of the set is the collection of the identification characters of the candidate video segments in the set. For example, if the identification character of video segment 1 is A, that of segment 2 is B, that of segment 3 is C, and that of segment 4 is D, the string corresponding to segment set 1 is ABC. In some embodiments, a segment set may also carry an order requirement; for example, segment set 2 may be a combination with the same content as segment set 1 but in a different order, and its corresponding string may be CAB.
Step 1030: determine the edit distance between the strings corresponding to the segment set and the other segment sets as the candidate-video-segment combination difference degree between the segment set and the other segment sets.
The edit distance can reflect the number of differing characters between two strings. The smaller the edit distance, the fewer the differing characters, and the smaller the difference between the two segment sets. For example, if segment set 1 corresponds to the string ABC and segment set 3 to the string ABCD, the edit distance between them is 1, i.e., the difference degree is 1. For strings with an order requirement, characters appearing in a different order may also be counted toward the edit distance; moreover, to avoid double-counting content, an order difference may be counted once for the whole string. For example, the edit distance between segment set 1 and segment set 2 in step 1020 may be 1.
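A minimal sketch of the string comparison (the standard Levenshtein dynamic program; the example strings come from the text, and this sketch does not implement the whole-string order rule mentioned above):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

print(edit_distance("ABC", "ABCD"))   # 1 -- segment set 1 vs. segment set 3
```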
In some embodiments, the number of segment sets may be any positive integer, for example 1, 3, 5, 8, or 10. In some embodiments, the candidate video segments may be randomly combined into N candidate segment sets, and a target number of segment sets may be selected from the N candidate segment sets based on the second preset condition. By selecting a target number of segment sets that satisfy the second preset condition, a target number of target videos with different content presentation effects can be generated from those segment sets, thereby achieving different promotional effects.
Fig. 11 is an exemplary flowchart for determining the combination difference degree of segment sets according to some embodiments of the present application. In some embodiments, one or more steps of the process 1100 may be stored as instructions in a storage device (e.g., the database 140) and invoked and/or executed by a processing device (e.g., the processing device 112 in the server 110).
Step 1110: Generate a segment feature for each candidate video segment based on a trained feature extraction model and the candidate video segments in the two or more segment sets.
The feature extraction model may also be referred to as a shot feature extraction model. In some embodiments, the candidate video segments in the two or more segment sets may be processed by the trained feature extraction model to generate the corresponding segment features. Specifically, this includes obtaining the multiple video frames contained in each candidate video segment, determining one or more image features for each video frame, and then processing, with the trained feature extraction model, the image features of the multiple video frames and the relationships among them to determine the segment feature of the candidate video segment. Here, feature extraction refers to processing raw information and extracting feature data; it improves the representation of the raw information so as to facilitate subsequent tasks.
The image features of a video frame may include at least one of: the shape information of an object (e.g., a subject) in the frame, the positional relationships among multiple objects in the frame, the color information of an object in the frame, the completeness of an object in the frame, or the brightness of the frame.
In some embodiments, the feature extraction process may use statistical methods (e.g., principal component analysis), dimensionality reduction techniques (e.g., linear discriminant analysis), feature normalization, data bucketing, and the like. Taking the brightness of a video frame as an example, brightness values within 0-80 may be mapped to [1,0,0], values within 80-160 to [0,1,0], and values above 160 to [0,0,1].
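A minimal sketch of this one-hot brightness bucketing (the thresholds follow the example above, with the last bucket read as above 160; the helper name is hypothetical):

```python
def bucket_brightness(value: float) -> list:
    """One-hot encode a brightness value into the three buckets above."""
    if value <= 80:
        return [1, 0, 0]
    elif value <= 160:
        return [0, 1, 0]
    return [0, 0, 1]

assert bucket_brightness(100) == [0, 1, 0]
```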
However, the obtained image features are diverse, and some of them are hard to measure with a fixed function or an explicit rule. Feature extraction may therefore also rely on machine learning (e.g., a feature extraction model), which learns automatically from the collected information to form a predictive model and thus achieves higher accuracy. The feature extraction model may be a generative model, a discriminative model, or a deep learning model, for example a deep learning model based on the YOLO family of algorithms, the Faster R-CNN algorithm, or the EfficientDet algorithm. The machine learning model can detect the preset objects of interest in each video frame. Objects of interest may include living things (people, animals, etc.), commodities (cars, decorations, cosmetics, etc.), backgrounds (mountains, roads, bridges, houses, etc.), and so on. Further, for a video, the objects of interest may be set more specifically, for example as persons or products. Multiple candidate video segments can be input into the machine learning model, which outputs image features such as the position information and brightness of the objects in each candidate video segment.
It should be noted that those skilled in the art may freely substitute the feature extraction model used; this specification places no limitation on it. For example, the feature extraction model may be a GoogLeNet model, a VGG model, a ResNet model, etc. Extracting video frame features with a machine learning model allows the image features to be determined more accurately.
The trained feature extraction model may be a sequence-based machine learning model that converts a variable-length input into a fixed-length vector representation. Understandably, because different shots have different durations, the numbers of their corresponding video frames also differ; after processing by the trained feature extraction model, each shot can be represented by a fixed-length vector, which facilitates subsequent processing.
For example, the sequence-based machine learning model may be a deep neural network (DNN) model, a recurrent neural network model such as an LSTM, a bi-directional long short-term memory (Bi-directional LSTM) model, a gated recurrent unit (GRU) model, or a combination thereof; this specification places no limitation here. Specifically, the image features of the obtained video frames (e.g., features 1, 2, 3, ..., n) and their relationships (e.g., sequential order and/or temporal order) are input into the feature extraction model, which outputs the sequence of encoded hidden states at each time step (e.g., h_1 to h_n), where h_n contains all the information of the shot over that period. In this way, the feature extraction model converts the multiple image features within a period of time (e.g., the candidate video segment corresponding to a shot) into a fixed-length vector representation h_n (i.e., the shot feature). For the training process of the feature extraction model, refer to the corresponding description of Fig. 19, which is not repeated here.
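As an illustrative sketch only (the original discloses no implementation), such a sequence encoder can be built from an off-the-shelf LSTM that consumes per-frame feature vectors and returns the last hidden state h_n as the shot feature; the class name and the dimensions are assumptions:

```python
import torch
import torch.nn as nn

class ShotEncoder(nn.Module):
    """Encode a variable-length sequence of per-frame features
    into a fixed-length shot feature h_n."""
    def __init__(self, frame_dim: int = 64, hidden_dim: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(frame_dim, hidden_dim, batch_first=True)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, frame_dim); num_frames varies per shot
        _, (h_n, _) = self.lstm(frames)
        return h_n[-1]  # (batch, hidden_dim): a fixed-length vector

encoder = ShotEncoder()
short_shot = torch.randn(1, 30, 64)  # a 30-frame shot
long_shot = torch.randn(1, 75, 64)   # a 75-frame shot
# Shots of different lengths map to vectors of the same length.
assert encoder(short_shot).shape == encoder(long_shot).shape == (1, 128)
```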
Understandably, the above steps and method can be applied to each of the multiple candidate video segments to obtain the segment feature of each one. Here, assume the candidate video segments are 1, 2, 3, ..., m and the corresponding segment features are R_{c,1}, R_{c,2}, ..., R_{c,i}, ..., R_{c,m}; the following description keeps this notation.
Step 1120: Generate a set feature vector for each segment set based on the segment features.
The set feature vector of each segment set may be generated based on the acquisition order of the candidate video segments in the set and their segment features. For example, vector splicing or concatenation may be used to obtain the set feature vector. Taking segment set c as an example, its set feature vector is R_c = [R_{c,1}, R_{c,2}, ..., R_{c,i}, ..., R_{c,m}]^T, where the superscript T denotes matrix transposition.
Step 1130: Determine the degree of similarity between each segment set and the other segment sets based on a trained discriminant model and the set feature vector of each segment set.
The multimedia system 100 (e.g., the processing device 112) may use the trained discriminant model to judge the similarity of the set feature vectors of any two segment sets, thereby determining the degree of similarity between each segment set and the other segment sets.
Assume there are k segment sets a, b, c, ..., with corresponding set feature vectors R_a, R_b, R_c, ..., R_k. When performing this step, one of the k segment sets may be selected (e.g., segment set c), and the similarity between its set feature vector R_c and the set feature vectors of the other (k-1) segment sets may be computed, yielding all the similarity comparison results, which are taken as the degrees of similarity.
In some embodiments, step 1130 may also use a vector similarity coefficient to judge the degree of similarity between two segment sets. A similarity coefficient measures, via a formula, how similar samples are: the smaller its value, the less similar the individuals and the greater their difference. When the similarity coefficient between two segment sets is large, the two segment sets can be judged to be highly similar. In some embodiments, usable similarity coefficients include, but are not limited to, the simple matching coefficient, the Jaccard similarity coefficient, cosine similarity, adjusted cosine similarity, and the Pearson correlation coefficient.
Step 1140: Determine the combination difference degree between each segment set and the other segment sets based on the degree of similarity.
As noted for step 1130, the smaller the value of the similarity coefficient, the less similar the individuals and the greater their difference; when the similarity coefficient between two segment sets is large, the difference between them can be judged to be small. In some embodiments, when the similarity coefficient is a real number within [0,1], the combination difference degree may be its reciprocal, its negative, 1 minus the similarity coefficient, etc.
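A minimal sketch of steps 1130-1140, assuming cosine similarity as the similarity coefficient and 1 minus the similarity as the difference degree (one of the conversions listed above); the vectors are made-up examples:

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two set feature vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def combination_difference(u: np.ndarray, v: np.ndarray) -> float:
    """Difference degree as 1 - similarity, for similarity in [0, 1]."""
    return 1.0 - cosine_similarity(u, v)

# Set feature vectors formed by concatenating per-segment features.
r_c = np.concatenate([np.ones(4), np.zeros(4)])  # segment set c
r_d = np.concatenate([np.ones(4), np.ones(4)])   # segment set d
print(combination_difference(r_c, r_d))          # about 0.29
```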
In some embodiments, the at least one segment set may also be generated based on the set feature vectors. Specifically, a clustering algorithm may be applied to the multiple set feature vectors to obtain multiple clusters, and the at least one segment set may be generated from the clustering result.
Specifically, assume the number of video segments required for the at least one segment set (i.e., the preset value of the number of shots in the target video) is P, and the number of clusters actually obtained is Q. If the required number P is less than or equal to the actually obtained number of clusters Q, P clusters are selected and one candidate video is chosen from each of them; if the required number P is greater than Q, several candidate videos far from the cluster center may be chosen from each cluster, and P candidate video segments may be randomly drawn from them to form the segment set.
In some embodiments, a density-based clustering algorithm (e.g., the DBSCAN density clustering algorithm) may be used to obtain the multiple clusters. Specifically, the preset value of the required segment sets, i.e., the number of segment sets (P), is determined; further, the neighborhood parameters (ε, MinPts) of the clustering are determined, where ε corresponds to the radius of a cluster in the vector space and MinPts corresponds to the minimum number of samples needed to form a cluster, yielding Q clusters. The neighborhood parameters (ε, MinPts) are adjusted repeatedly and the video feature vectors re-clustered until the resulting number of clusters Q is greater than or equal to the preset value P. At that point, a number of clusters equal to the preset value P is randomly selected, and the segment sets are determined from the selected clusters.
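A sketch of this parameter-adjustment loop using scikit-learn's DBSCAN (the shrink factor, starting parameters, and synthetic data are assumptions, not values from the original):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

def cluster_until_enough(vectors, p, eps=2.0, min_pts=3):
    """Shrink eps until DBSCAN yields at least p clusters; return labels."""
    while eps > 1e-3:
        labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(vectors)
        q = len(set(labels)) - (1 if -1 in labels else 0)  # -1 marks noise
        if q >= p:
            return labels
        eps *= 0.8  # tighten the neighborhood radius and re-cluster
    raise ValueError("could not reach the preset number of clusters P")

vectors, _ = make_blobs(n_samples=60, centers=5, random_state=0)
labels = cluster_until_enough(vectors, p=4)
```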
In some embodiments, the process 1100 may also be used to judge the similarity of multiple generated target videos (e.g., the multiple target videos of the process 800 or 1200) and to recommend target videos based on that similarity.
Fig. 12 is an exemplary flowchart of generating a video according to some embodiments of the present application. In some embodiments, one or more steps of the process 1200 may be stored as instructions in a storage device (e.g., the database 140) and invoked and/or executed by a processing device (e.g., the processing device 112 in the server 110).
In the process 1200, steps 1210 and 1240 are the same as or similar to steps 810 and 840 of the process 800; see Fig. 8 and its related description, which is not repeated here. Steps 1220 and 1230 of the process 1200 differ from the process 800.
Step 1220: Randomly generate multiple candidate segment sets based on the multiple video segments.
The multimedia system (e.g., the processing device 112) may randomly combine multiple video segments to generate candidate segment sets. In some embodiments, the processing device 112 may randomly combine some of the video segments obtained in step 810 to generate M (M being greater than or equal to the target number) candidate segment sets. In some embodiments, the processing device 112 may combine all the video segments obtained in step 810 and either select M (M being greater than or equal to the target number) of the combinations as candidate segment sets or take all the combinations as candidate segment sets. A candidate segment set may include one or more video segments.
In some embodiments, the multiple candidate segment sets satisfy the second preset condition. The second preset condition includes that the combination difference degree of video segments between any two of the candidate segment sets is greater than a preset difference threshold. The preset difference threshold may be any positive integer, for example 1 or 2. For more on the second preset condition, see Figs. 9 and 10 and their related descriptions, which are not repeated here.
Step 1230: Select at least one segment set from the multiple candidate segment sets based on the first preset condition.
The first preset condition may include, but is not limited to, one or a combination of: the total duration of the target video, the number of shots contained in the target video, the target video containing a specified shot, and the target video containing a specific object. For more on the first preset condition, see Fig. 8 and its related description, which is not repeated here. In some embodiments, the processing device may determine the at least one (e.g., the target number of) segment set(s) by evaluating each candidate segment set as a whole; for example, candidate segment sets whose total segment duration and/or number of shots satisfy the first preset condition may be selected as one or more of the target number of segment sets, as shown in the sketch below. In some embodiments, the processing device 112 may determine the target number of segment sets based on the content of the video segments in each candidate segment set. For example, a trained machine learning model may be used to determine whether the video segments in a candidate segment set contain a specific object, and candidate segment sets containing the specific object may be selected based on the result. The input of the trained machine learning model may be a candidate segment set or the video segments in a candidate segment set; correspondingly, the output may be whether the candidate segment set contains the specific object, whether a video segment contains the specific object, etc., which is not limited in this specification.
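An illustrative sketch of the duration and shot-count screening just described (the data layout, field names, and thresholds are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class Clip:
    duration: float  # seconds

def satisfies_first_condition(candidate, max_total=30.0, max_shots=5):
    """Keep a candidate segment set only if its total duration and
    shot count stay within the target video's constraints."""
    total = sum(c.duration for c in candidate)
    return total <= max_total and len(candidate) <= max_shots

candidates = [[Clip(8), Clip(12)], [Clip(20), Clip(15)], [Clip(5)] * 6]
selected = [c for c in candidates if satisfies_first_condition(c)]
assert len(selected) == 1  # only the 20 s, two-shot set passes
```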
Through the processes 800 and 1200, a target video meeting the requirements can be synthesized by one or a combination of operations such as splitting, screening, combining, cropping, and beautifying the initial videos or initial images. In some embodiments, the server 110 (e.g., the processing device 112) may obtain multiple video segments by splitting the initial videos or initial images; select one or more candidate video segments from the multiple video segments based on constraints such as video duration and video content; obtain, by combining the candidate video segments, at least one segment set or a target number of segment sets whose mutual differences satisfy a preset threshold; and generate a target number of target videos based on the segment sets. In some embodiments, the server 110 may randomly combine the video segments obtained by splitting into multiple candidate segment sets whose mutual differences satisfy a preset threshold, select the target number of segment sets from the candidate segment sets based on constraints such as video duration and video content, and generate the target number of target videos from the target number of segment sets. In some embodiments, the video synthesis method and/or system provided in the embodiments of this specification can be used to synthesize promotional videos; for example, based on pre-shot video files promoting a product, a culture, a public welfare cause, or the like, the video files may be processed by splitting, screening, beautifying, synthesizing, and other operations to generate diverse promotional videos.
In some embodiments, the target video usually carries background music or theme music. As music used to set the atmosphere, background or theme music inserted into a video can strengthen emotional expression and give the audience an immersive feeling. Meanwhile, background or theme music has temporal attributes, so elements such as its duration and rhythm can serve as the time parameters in some embodiments of the present application.
Fig. 13 is an exemplary flowchart of adding audio material according to some embodiments of the present application. In some embodiments, one or more steps of the process 1300 may be stored as instructions in a storage device (e.g., the database 140) and invoked and/or executed by a processing device (e.g., the processing device 112 in the server 110).
Step 1310: Obtain initial audio.
In some embodiments, the initial audio (also referred to as the music to be processed) may be imported by the user or selected by the user from the database 140. The initial audio may be of different types, such as warm, soothing, brisk, focused, impassioned, angry, or frightening. The multimedia system 100 (e.g., the processing device 112) may select different types of initial audio based on the subject, theme, video effect, or video configuration information of the target video. For example, warm and soothing initial audio may be chosen for a public welfare promotional video. In some embodiments, if the target video is long, multiple pieces of initial audio may be selected and joined end to end. In other embodiments, if the target video is short, only part of the initial audio (e.g., the chorus) may be used.
Step 1320: Mark the initial audio based on its rhythm to obtain at least one audio segmentation point.
In some embodiments, marking based on rhythm may follow the overall structure of the song, e.g., marking the intro, verse, and chorus, or it may divide the song more finely, e.g., marking segmentation points by drum beats or by the beat. In some embodiments, the marking granularity of the initial audio may be determined by the number of initial images and/or initial videos. Merely by way of example, suppose the amount of image and video material is moderate: after the initial audio is marked by drum beats, some segmentation points cannot be matched to material, so the initial audio may first be marked into intro, verse, and chorus, and then the chorus may be marked by drum beats, yielding a suitable number of segmentation points.
In some embodiments, marking the initial audio based on rhythm may be implemented with software (e.g., Adobe Audition, FL Studio) or plug-ins (e.g., an audio wave plugin based on Vue.js). In some embodiments, automatic marking of the initial audio may be implemented with an audio rhythm analysis algorithm based on signal analysis. It should be noted that audio marking can be handled in many ways, which are not limited in this embodiment.
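One possible realization of signal-analysis-based beat marking, sketched with the open-source librosa library (the original does not name a specific algorithm or library, so this choice is an assumption):

```python
import librosa

def audio_cut_points(path: str) -> list:
    """Return beat times (in seconds) usable as audio segmentation points."""
    y, sr = librosa.load(path)  # decode the audio file
    tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
    return librosa.frames_to_time(beat_frames, sr=sr).tolist()

# e.g. cut_points = audio_cut_points("initial_audio.mp3")
```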
Step 1330: Determine at least one video segmentation point of the target video based at least in part on the video configuration information.
In some embodiments, the video configuration information may be used to determine the video segmentation points of the target video. Different themes, different shots, splicing methods, and the like may be determined from the video configuration information, and the at least one video segmentation point of the target video may be determined accordingly; the splicing time point of each junction can serve as a video segmentation point. For example, for a tire advertisement it may be determined from the video configuration information that the target video involves shots including racing, tire changing, an award ceremony, and a tire display, and the at least one video segmentation point of the target video may be determined based on these different shots as optional cut points.
In some embodiments, a video segment may or may not be added at a given optional cut point; whether material is added depends on the number of optional cut points and the time interval between two optional cut points. Merely by way of example, if no material is added at an optional cut point, the duration of the preceding or following material may be appropriately extended. Since the optional cut points are associated with the rhythm, adding material at them makes the material easy to arrange while providing a good rhythmic pattern and improving the effect of the target video. In some other embodiments, an optional cut point may also serve as the starting point or ending point of the target video.
Step 1340: Match the at least one audio segmentation point with the at least one video segmentation point.
In some embodiments, video segments may be matched to optional cut points according to the interval between two optional cut points. Merely by way of example, suppose the initial audio has a segmentation point at 30 s and the nearest subsequent segmentation point is at 45 s; a video lasting about 15 s may then be inserted at the 30 s segmentation point. In some embodiments, the interval between two cut points may be only a few seconds; a threshold may then be set, and when the interval between two cut points is smaller than that threshold (e.g., 3 s or 5 s), static material or short material may be inserted.
In some embodiments, the video segments vary in length, and some may fail to match any optional cut point because of timing. In such cases the video may be split or its playback speed changed. For example, a 15 s video segment may be split into a 10 s piece and a 5 s piece, and the split pieces matched to optional cut points. As another example, if a video segment lasts 22 s and the interval between two optional cut points is 20 s, the segment may be played back at a higher speed to shorten its duration to 20 s before it is inserted at the optional cut point. It should be noted that, in some embodiments, to guarantee the effect of the target video, a threshold (e.g., ±5% or ±10%) may be set for the speed change of a video segment, and segments whose required speed change exceeds the threshold may instead be handled by splicing.
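A sketch of this speed-change decision (the ±10% threshold follows one of the examples above; the function and its labels are hypothetical):

```python
def fit_clip_to_gap(clip_len: float, gap: float,
                    max_speed_change: float = 0.10) -> str:
    """Decide how to fit a clip into the gap between two cut points."""
    change = abs(clip_len - gap) / clip_len
    if change <= max_speed_change:
        return f"retime by {gap / clip_len:.2f}x"  # small change: vary speed
    return "split/splice instead"                  # large change: re-cut

print(fit_clip_to_gap(22.0, 20.0))  # retime by 0.91x (a 9% change)
print(fit_clip_to_gap(15.0, 10.0))  # split/splice instead (a 33% change)
```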
In some embodiments, a video segment may include an original audio track (e.g., original sound such as background sound or a monologue). According to actual needs, the original track may be removed from the video segment, or it may be retained and played simultaneously in the target video; this application places no limitation on it.
In some embodiments, the process 1300 may be completed by an audio matching model: by inputting the target video into the audio matching model, audio can be added to the target video. The audio matching model can complete the operations of steps 1310-1340 of the process 1300. The audio matching model may be a machine learning model, for example a neural network model, a deep learning model, etc.
In some embodiments, the generated target videos may be delivered on different playback media. Exemplary playback media may include the landscape video interface of a video website, the portrait video interface of a mobile short-video application, a large outdoor electronic screen, and the like. To facilitate delivery of the target video on different playback media, the method provided in this application may further post-process the target video after it is generated. Specifically, the target video is post-processed to satisfy at least one video output condition, the at least one video output condition being related to the playback medium of the target video. The at least one video output condition may include a video size condition, and the post-processing may include cropping the pictures of the target video according to the video size condition.
Fig. 14 is a schematic diagram of target video post-processing (picture cropping) according to some embodiments of the present application.
Post-processing (picture cropping) of the target video may set a corresponding cropping frame based on the video sizes of the target video's different delivery channels. In the prior art, the center of the cropping frame coincides with the center of the picture being cropped, and the picture is then cropped along the cropping frame. Such an approach may cause the main information of the target video to be lost after cropping (e.g., the picture of the advertised product is cropped away). As shown in Fig. 14, the target video post-processing system of the present application converts the size of the video picture by executing the target video cropping method. By cropping the video picture based on the cropping subject information and a preset picture size, the system makes the main information of the cropped target video unlikely to be lost (e.g., the picture of the advertised product is retained).
It can be understood that the to-be-processed video 14 in Fig. 14 may be the target video, or one or more of an initial video, an initial image, and a video segment. In some embodiments, the process 1400 may be used on its own, and the to-be-processed video 14 may be any video that needs processing (resizing).
In some embodiments, the picture cropping system for the to-be-processed video 14 may be integrated into the video generation system 100, and the post-processing of the target video may be implemented by the processing device 112 (also written as the processing terminal).
The processing device 112 may be used to convert the picture size of videos in various application scenarios. For example, the processing device 112 may convert the picture size of a target video originally delivered on an outdoor electronic screen so that it is suitable for delivery on a subway TV screen. As another example, the processing device 112 may adjust the picture size of a video shot with a mobile phone or camera to the preferred size for playback on a video website. As yet another example, the processing device 112 may convert a landscape video into a portrait video.
In a typical application scenario where a landscape video (e.g., with a picture aspect ratio of 16:9) needs to be converted into a portrait video (e.g., with a picture aspect ratio of 9:16), the processing device 112 may obtain the to-be-processed video 14 (the landscape video) and split it into multiple video segments 16 based on the model 12; the processing device 112 may identify the subject 15 in the to-be-processed video 14; the processing device 112 may then configure a cropping frame 13 for the to-be-processed video 14 according to the subject 15 and the preset picture size of the portrait video, crop the pictures 17 of the multiple video segments 16 according to the cropping frame 13, and re-splice the cropped video segments 16 into a complete video to obtain the portrait video.
The model 12 may be included in the processing device 112. Based on the model 12, the processing device 112 obtains the cropping subject 15 and/or the video segments 16. For example, the model 12 may be a machine learning model that identifies the cropping subject 15 in the video segments 16; the cropping subject 15 may be the aforementioned specific object and/or subject, such as a person, a car, or cosmetics.
The model 12 may be stored in the processing device 112, and when the related functions of the model 12 are needed, the processing device 112 invokes the model 12. The model 12 may refer to a collection of methods executed on the processing device 112. These methods may involve a large number of parameters. When the model 12 is used, its parameters may be preset or dynamically adjusted. Some parameters may be obtained through training, and some may be obtained during execution. For specific descriptions of the models involved in this specification, see the relevant parts of this specification.
The processing device 112 may post-process the target video by setting the cropping frame 13. The cropping frame 13 can be understood as a cropping boundary determined according to the target size into which the to-be-processed video is to be converted. When cropping the video picture based on the cropping frame 13, the picture inside the cropping frame 13 may be retained and the picture outside it deleted, so that the target video can be cropped to the size corresponding to each output requirement.
In some embodiments, the processing device 112 may be the playback device used to play the target video to be cropped; the playback device may therefore obtain the target video, crop its pictures according to the device's own playback size, and automatically play the cropped target video. In other embodiments, the processing device 112 may be a smart device capable of performing the picture cropping operation on the target video (e.g., a computer, a mobile phone, a smart wearable device), and the smart device may send the cropped target video to the playback device used to play it.
The target video picture cropping process shown in Fig. 14 may be executed by the processing device 112. For example, the target video picture cropping method may be stored in a storage apparatus (e.g., the storage device or memory of the processing terminal) in the form of programs or instructions, and the process 1400 of the method may include the following steps:
Step 1410: Obtain the target video whose pictures are to be cropped.
In some embodiments, the target video may serve as an advertising video. An advertising video can be understood as video content that uses flexible creativity to lock onto its associated audience, so as to convey information to that audience, market products, and so on. In some embodiments, the target video may be presented to the audience via television, outdoor advertising screens, or web pages or pop-ups on electronic devices (e.g., mobile phones, computers, smart wearable devices). Picture cropping can be understood as cropping the pictures of a video to a preset picture size. In some embodiments, cropping a picture may involve setting a cropping frame based on the main information in the picture and the preset picture size (which can be understood as the target picture size), and cropping the picture along that frame. The main information in a picture may include scenes, people, products, and the like. Step 1410 may directly obtain the target video generated in the preceding steps. In some embodiments, the process 1400 may run independently, i.e., videos from other sources may be obtained as the to-be-processed video 14.
Step 1420: Determine one or more shot segments based on the target video.
In some embodiments, step 1420 may refer to the related description of the aforementioned process 400.
Step 1430: Obtain the cropping subject information of each video segment contained in the target video, the cropping subject information reflecting the specific cropping subject of the video segment and the position information of that subject.
In some embodiments, the cropping subject may also be one of the aforementioned specific object and subject, and the obtaining method may refer to the corresponding description.
In some embodiments, the cropping subject information may be determined by machine learning; correspondingly, step 1430 may include obtaining the cropping subject information in each shot segment (video segment) with a machine learning model. The machine learning model can identify the cropping subject in each shot segment and obtain the cropping subject information at the same time. The cropping subject information represents information related to the cropping subject and is used at least to reflect the subject's position. In some embodiments, the cropping subject information need only include the position information and the name information of the cropping subject. In other embodiments, it may include the position information, size information, name information, category information, etc. of the cropping subject. The position information can be understood as information about where the subject is located in the picture of the target video, for example the coordinates of a reference point. The size information may include the actual size of the cropping subject and the proportion of the target video's picture it occupies. The category information can be understood as the classification of the cropping subject; for example, it may include whether the subject is a person or an object, and it may further indicate whether the subject is a skin care product, a toiletry, a car, etc. Merely by way of example, when the cropping subject is a shampoo, its name information may be 'shampoo' and its category information may be 'toiletries'.
In some embodiments, the machine learning model may be a generative model, a discriminative model, or a deep learning model, for example a deep learning model based on the YOLO family of algorithms, the Faster R-CNN algorithm, or the EfficientDet algorithm. The machine learning model can detect the preset objects of interest in each video frame. Objects of interest may include living things (people, animals, etc.), commodities (cars, decorations, cosmetics, etc.), backgrounds (mountains, roads, bridges, houses, etc.), and so on. Further, for the target video, the objects of interest may be set more specifically, for example as human faces or products. Multiple shot segments can be input into the machine learning model, which outputs data such as the name information and position information of the cropping subjects in each shot segment.
In some embodiments, the machine learning model may be trained on a large number of labeled training samples. Specifically, the labeled training samples are input into the machine learning model, and training proceeds by common methods (e.g., gradient descent) to update the relevant parameters of the model. In some embodiments, the training samples may be shot segments together with the cropping subjects they contain; they may be obtained by reading data from the memory and the database. In some embodiments, the label of a sample may indicate whether an object in the shot segment is a cropping subject: if so, it is marked '1', otherwise '0'. The labels may be obtained by manual annotation, automatic machine annotation, or other means, which are not limited in this embodiment.
In some embodiments, the cropping theme information of the target video may also be obtained. In some embodiments, the cropping theme information may be the aforementioned specific theme of the target video. In some alternative embodiments, the cropping theme information may be keyword information from the title or synopsis of the target video to be processed, the tag information of the target video, user-defined information, or information already stored in a database.
Step 1440: Crop the pictures of the video segments contained in the target video according to the picture size preset by the video size condition and the cropping subject information.
In some embodiments, step 1440 may include cropping the pictures of the shot segments according to the preset picture size and the cropping subject information.
The picture size preset by the video size condition can be understood as the target size to which the pictures of the target video are cropped. The preset picture size may include the target aspect ratio of the picture, and may also include the target width and/or target height. In some embodiments, during picture cropping, the aspect ratio and the specific size of the cropping frame of each video frame are set according to the preset picture size; based on each video frame's cropping frame, the picture outside the frame is cropped away and the picture inside it retained. The user may manually input the preset picture size into the system according to the display size of the target playback terminal, or the system may automatically obtain the optimal display size of the target playback terminal; the data of that optimal size may be stored in the device on which the target video is to be played.
The cropping frame can be understood as the cropping boundary determined by the target picture size of the cropping. The cropping frame may be a rectangle, a parallelogram, a circle, etc.
In some embodiments, to prevent large jitter in the pictures of each video segment, step 1440 may further include the following steps: determining the sizes and initial positions of the cropping frames of several video frames in the video segment according to the cropping subject information and the preset picture size; processing the initial positions of the cropping frames of the several video frames to determine their final positions; and cropping the picture of each video frame of the video segment according to the final positions of the cropping frames, retaining the picture inside each frame. In this embodiment, the final positions of the cropping frames are determined from their initial positions; while ensuring that the cropping subject lies inside the cropping frame, this also reduces the positional differences between the cropping frames of adjacent video frames, preventing the sudden jumps or jitter that excessive positional differences between adjacent cropping frames would cause. The initial position of a cropping frame can be understood as the position preliminarily determined from the cropping subject information and the preset picture size, and the final position as the new position determined after processing the initial position information. In some embodiments, the initial position information may include the initial coordinates of a reference point, and the final position information may include the final coordinates of the reference point.
In some embodiments, when the sizes and initial positions of the cropping frames of several video frames in a video segment are determined according to the cropping subject information and the preset picture size, the relevance of each cropping subject to the theme information may first be determined from the theme information and the cropping subject information, and the initial position and size of the cropping frame may then be determined from the relevance, the cropping subject information, and the preset picture size. For the specific implementation of this embodiment, refer to the relevant description of Fig. 16.
In other embodiments, taking a rectangular cropping frame as an example, when the sizes and initial positions of the cropping frames of several video frames are determined according to the cropping subject information and the preset picture size, the aspect ratio of the cropping frame may first be determined from the preset picture size; the initial position and size of the cropping frame may then be determined based on the position and size of the cropping subject and the aspect ratio; and the cropping frame may then be scaled proportionally according to the preset picture size. For example, if the preset picture size is 800×800, the aspect ratio of the cropping frame is set to 1:1; if the preset picture size is 720×540, the aspect ratio is set to 4:3. After the aspect ratio is set, multiple cropping frames with the same aspect ratio but different sizes may be generated, and the position and size of the cropping frame may be determined from the previously identified cropping subjects and the aspect ratio, so that every cropping subject lies within the cropping frame; the cropping frame and the picture within it are then scaled proportionally to the preset picture size. Specifically, the initial position and size of the cropping frame may be determined based on the area where a cropping frame of the given aspect ratio overlaps the region of the picture occupied by the cropping subject. In addition, the cropping frame and the picture within it are proportionally reduced or enlarged in width and height so that the size of the cropping frame matches the preset picture size, thereby preventing black borders in the cropped picture. Merely by way of example, if the picture size of a video frame is 1024×768 and the preset picture size is 960×960, a 768×768 cropping frame may first be determined; after cropping along it, the picture of the video frame is proportionally enlarged to 960×960.
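A minimal sketch of this aspect-ratio-driven crop computation, assuming a rectangular frame centered on the subject as far as the picture borders allow (the function and its signature are hypothetical):

```python
def crop_box(frame_w, frame_h, subject_cx, subject_cy, target_w, target_h):
    """Largest crop box with the target aspect ratio that fits the frame,
    centered on the subject and clamped to the frame borders."""
    ar = target_w / target_h
    w = min(frame_w, frame_h * ar)  # largest box of that ratio in the frame
    h = w / ar
    x = min(max(subject_cx - w / 2, 0), frame_w - w)  # clamp horizontally
    y = min(max(subject_cy - h / 2, 0), frame_h - h)  # clamp vertically
    return x, y, w, h  # then scale the cropped picture to the target size

# The 1024x768 frame cropped for a 960x960 (1:1) output: a 768x768 box.
print(crop_box(1024, 768, 512, 384, 960, 960))  # (128.0, 0.0, 768.0, 768.0)
```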
在一些实施例中,确定视频片段中若干视频帧的裁剪框对应的最终位置具体可以包括:从视频片段所包含的所有视频帧中挑选出若干个视频帧,判断每对(每两个)间隔预设帧数的视频帧的裁剪框的参考点(如中心点)之间的距离是否小于预设距离;如果小于预设距离的裁剪框的对数大于预设对数,则可以理解为该视频片段中裁剪主体的位置是相对静止的,此时可以求出视频片段所包含的所有视频帧的裁剪框的参考点的平均位置,并基于该平均位置调节各个视频帧的裁剪框的位置;如果小于预设距离的裁剪框的对数小于预设对数,则可以理解为该视频片段中裁剪主体的位置是动态变化的,此时可以根据视频片段所包含的所有视频帧的裁剪框的参考点的位置确定出平滑的轨迹线,并基于该轨迹线调节各个视频帧的裁剪框的位置(例如,使得各个视频帧的裁剪框的参考点都位于该轨迹线上)。在一些实施例中,预设帧数可以是2帧、3帧或5帧等。在另一些实施例中,间隔预设帧数的一对视频帧也可以是相邻的一对视频帧。需要说明的是,本说明书中参考点可以是裁剪框的中心点、矩形的左上顶角点、矩形的右下角等。参考点优选为裁剪框的中心点,以在移动位置裁剪框的位置时减小裁剪框与该裁剪框中各个裁剪主体的相对位置的变化。In some embodiments, determining the final position corresponding to the crop box of several video frames in the video clip may specifically include: selecting several video frames from all the video frames contained in the video clip, and judging the interval of each pair (every two) Whether the distance between the reference points (such as the center point) of the cropping frame of the preset number of video frames is less than the preset distance; if the logarithm of the cropping frame less than the preset distance is greater than the preset logarithm, it can be understood as this The position of the crop subject in the video clip is relatively static. At this time, the average position of the reference point of the crop box of all video frames contained in the video clip can be obtained, and the position of the crop box of each video frame can be adjusted based on the average position; If the logarithm of the crop frame that is less than the preset distance is less than the preset logarithm, it can be understood that the position of the crop subject in the video clip is dynamically changing. At this time, it can be based on the size of the crop frame of all video frames contained in the video clip. The position of the reference point determines a smooth trajectory line, and the position of the crop box of each video frame is adjusted based on the trajectory line (for example, so that the reference point of the crop box of each video frame is located on the trajectory line). In some embodiments, the preset number of frames may be 2 frames, 3 frames, 5 frames, or the like. In other embodiments, a pair of video frames separated by a preset number of frames may also be a pair of adjacent video frames. It should be noted that the reference point in this specification can be the center point of the cropping frame, the top left corner point of the rectangle, the bottom right corner of the rectangle, and so on. The reference point is preferably the center point of the cropping frame, so as to reduce changes in the relative positions of the cropping frame and each crop subject in the cropping frame when the position of the cropping frame is moved.
In other embodiments, adjusting the position of the cropping frame of each video frame in a video segment may specifically include the following steps: smoothing, over time, the initial coordinate information of the reference points of the cropping frames of several video frames of the segment; determining the final coordinate information of the reference points according to the result of the smoothing; and determining the positions of the reference points based on the final coordinate information. In some embodiments, smoothing the initial coordinate information of the reference points over time may be performed by applying linear regression to the reference-point coordinate values. For the specific method and further details of the linear regression, see the description of FIG. 15.
FIG. 15 is a schematic diagram of the smoothing process according to some embodiments of the present application. As shown in FIG. 15, in some embodiments, smoothing the initial coordinate information of the reference points includes performing linear regression to obtain a linear regression equation and its slope. Specifically, linear regression over time may be applied to the initial coordinate information of the reference point of each cropping frame to obtain the regression equation, a fitted line segment (see FIG. 15), and the slope of the equation; the final coordinate information of each reference point is then obtained from the fitted line segment and the slope. Specifically, the absolute value of the slope may be compared with a slope threshold. If the absolute value of the slope is less than the threshold, the position of the crop subject in the segment is considered relatively static, and the midpoint of the fitted line segment is taken as the final position of the reference point of every frame's cropping frame. If the absolute value of the slope is greater than the threshold, the position of the crop subject is considered to be changing dynamically, and the position on the fitted line segment at each time point is taken as the final position of the reference point of the cropping frame of the corresponding video frame. The slope threshold may be set to 0.1, 0.05, 0.01, or the like; those skilled in the art can set it according to the actual characteristics of the target video. For example, for a target video whose theme is cars, where the camera is more likely to be moving, the threshold may be set higher, e.g., 0.1.
Merely by way of example, consider applying linear regression to a video segment consisting of 12 video frames. In this example a landscape video is converted into a portrait video, so the vertical coordinate of the cropping-frame center can be fixed at the central position 0.5, and only the horizontal coordinate needs smoothing. The linear regression proceeds as follows. For the 12 video frames corresponding to time points 1, 2, 3, ..., 12, the initial relative horizontal coordinates of the reference points of the cropping frames are, in order: 0.91, 0.87, 0.83, 0.74, 0.68, 0.61, 0.55, 0.51, 0.43, 0.39, 0.37, 0.34. As shown in FIG. 15, these time points and horizontal coordinates yield 12 data points: (1, 0.91), (2, 0.87), (3, 0.83), (4, 0.74), (5, 0.68), (6, 0.61), (7, 0.55), (8, 0.51), (9, 0.43), (10, 0.39), (11, 0.37), (12, 0.34).
A linear fit over these 12 data points yields the approximate linear equation x = -0.06t + 0.97, with a slope of about -0.06. Since the absolute value of the slope exceeds 0.01, the shot is regarded as tracking a moving subject. Substituting t = 1, 2, ..., 12 into the equation gives the horizontal coordinates of the final positions of the cropping frames in the respective video frames: 0.91, 0.85, 0.79, 0.73, 0.67, 0.61, 0.55, 0.49, 0.43, 0.37, 0.31, 0.25.
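The worked example can be reproduced with an ordinary least-squares fit; the snippet below, assuming NumPy, recovers a slope of roughly -0.06 and fitted positions close to the rounded sequence listed above. Swapping deg=1 for a higher degree yields the polynomial-regression variant described next.

```python
import numpy as np

t = np.arange(1, 13)  # time points 1..12
x = np.array([0.91, 0.87, 0.83, 0.74, 0.68, 0.61,
              0.55, 0.51, 0.43, 0.39, 0.37, 0.34])

slope, intercept = np.polyfit(t, x, deg=1)  # least-squares line
# slope ≈ -0.056 (≈ -0.06 after rounding), intercept ≈ 0.97
assert abs(slope) > 0.01  # exceeds the slope threshold: a tracking shot

x_final = np.polyval([slope, intercept], t)  # smoothed final abscissas
```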
In other embodiments, smoothing the initial coordinate information of the reference points of the cropping frames of the multiple video frames of each segment over time may be performed by applying polynomial regression to the coordinate values. Specifically, polynomial regression over time may be applied to the reference-point coordinates of the cropping frames to obtain a fitted curve; the position on the curve at each time point is then taken as the final position of the reference point of the cropping frame of the corresponding video frame.
FIG. 16 is a flowchart for determining the size and position of the cropping frame of each video frame according to some embodiments of the present application. In some embodiments, one or more steps of the process 1600 may be stored as instructions in a storage device (e.g., the database 140) and invoked and/or executed by a processing device (e.g., the processing device 112 in the server 110).
Step 1610: determine, according to the theme information of the target video and the crop subject information, the relevance between each of one or more crop subjects in the crop subject information and the theme information.
The theme information may include a specific theme of the target video. In some embodiments, the relevance between a crop subject and the theme information can be used to indicate how closely the two are associated: the stronger the association, the higher the relevance. Merely by way of example, the relevance between "steering wheel cover" and "car interior" is greater than the relevance between "car door" and "car interior", and the relevance between "car door" and "car interior" is greater than the relevance between "hand cream" and "car interior".
In some embodiments, the relevance between a crop subject and the theme information may be obtained based on explanatory texts of the two. An explanatory text is a textual description of the crop subject or of the theme information. For example, the explanatory text of "car interior" might read: "Car interior mainly refers to the automotive products used in interior refitting, covering all aspects of the inside of a car; steering wheel covers, seat cushions, floor mats, car perfumes, pendants, interior ornaments, storage boxes, and so on are all car interior products." As another example, the explanatory text of "steering wheel cover" might read: "A steering wheel cover is a sleeve fitted over the steering wheel; it is highly decorative." The explanatory texts of crop subjects and theme information may be pre-stored in the system, or retrieved from the Internet in real time based on the names of the crop subjects and the theme information.
In some embodiments, representation vectors of the explanatory texts may be obtained using a text embedding model such as word2vec, and the relevance between the crop subject and the theme information obtained based on the distance between the representation vectors: the smaller the distance, the higher the relevance. For example, the relevance between the crop subject "steering wheel cover" and the theme information "car interior" can be obtained by computing the vector distance between their explanatory texts.
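As an illustration of this relevance computation, here is a hedged Python sketch assuming gensim word2vec vectors: text vectors are formed by averaging word vectors (one simple choice among many), and cosine similarity serves as the inverse of vector distance. The tokenization and the model path are placeholders.

```python
import numpy as np
from gensim.models import KeyedVectors  # assumed embedding backend

def text_vector(wv: KeyedVectors, text: str) -> np.ndarray:
    """Average the word vectors of the tokens in an explanatory text
    (whitespace tokenization is a simplification)."""
    vecs = [wv[w] for w in text.split() if w in wv]
    return np.mean(vecs, axis=0)

def relevance(wv: KeyedVectors, subject_text: str, theme_text: str) -> float:
    """Cosine similarity of the two text vectors: smaller vector
    distance (higher similarity) means higher relevance."""
    a, b = text_vector(wv, subject_text), text_vector(wv, theme_text)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# wv = KeyedVectors.load("word2vec.kv")  # pretrained vectors (path assumed)
# relevance(wv, steering_wheel_cover_text, car_interior_text)
```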
In some embodiments, crop subjects may also be confirmed via candidate crop subjects. Candidate crop subjects are analogous to the candidate subjects described above: they may be input by a user, or determined from the video segments by machine learning. The method of determining crop subjects from candidate crop subjects can refer to the method, described in the foregoing process, of determining specific subjects based on candidate subjects; that is, a machine learning model is used to obtain the candidate crop subjects in each video segment, and the one or more specific crop subjects are then selected from the candidate crop subjects according to the theme information of the target video.
Step 1620: determine, according to the preset picture size and the specific crop subjects, multiple candidate cropping frames corresponding to at least one video frame.
In some embodiments, at least one candidate cropping frame may be set in each video frame according to the preset picture size and the specific crop subjects. In a video frame that contains no crop subject, a single candidate cropping frame may be set, centered by default. In a video frame containing at least one crop subject, multiple candidate cropping frames may be set; their reference-point positions and/or sizes differ, while their aspect ratios are the same.
Step 1630: score the at least one candidate cropping frame according to the crop subject information and the relevance.
In some embodiments, based on the relevance between each crop subject within a candidate cropping frame and the theme of the target video, each crop subject may first be scored, and the score of the candidate cropping frame then computed. Specifically, the relevance between each crop subject and the video theme may be used as a weight, multiplied by the score of the corresponding crop subject, and the products summed to obtain the score of each candidate cropping frame. In some embodiments, the score of each crop subject may be the ratio of the area occupied by that subject to the total area of the video frame. Merely by way of example, suppose the theme of a video is "personal care products", and a candidate cropping frame in one frame of the video fully contains the crop subjects shampoo 1, shampoo 2, and face 1. The relevances of shampoo 1, shampoo 2, and face 1 to "personal care products" are 0.86, 0.86, and 0.45, respectively, and their crop subject scores are 0.35, 0.1, and 0.1, respectively. The score of the candidate cropping frame is then 0.86×0.35 + 0.86×0.1 + 0.45×0.1 = 0.432.
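The weighted-sum scoring reduces to a few lines; this sketch simply reproduces the shampoo example, with the relevances and area-ratio scores taken from the text.

```python
def box_score(subjects):
    """subjects: (relevance_to_theme, area_ratio_score) per crop
    subject fully contained in the candidate box."""
    return sum(rel * area for rel, area in subjects)

# Shampoo example: relevances 0.86, 0.86, 0.45; area-ratio scores 0.35, 0.1, 0.1
score = box_score([(0.86, 0.35), (0.86, 0.10), (0.45, 0.10)])
assert round(score, 3) == 0.432
```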
Step 1640: determine the size and position of the cropping frame of the at least one video frame based on the scoring results of the candidate cropping frames.
In some embodiments, the size and position of the cropping frames may be determined from the candidate scores by taking the highest-scoring candidate cropping frame in the video segment as the reference: the position of its reference point is used as the final reference-point position for the cropping frames of all video frames in the segment, and its size is used as the cropping-frame size for all video frames in the segment. In other embodiments, the size and position may instead be determined by selecting, for each video frame, the Y highest-scoring candidate cropping frames, computing the average position of their reference points, using that average position as the position of the frame's cropping frame, and using the size of the highest-scoring candidate as the size of the frame's cropping frame. Y may be 3, 4, 5, 8, or the like; those skilled in the art can choose Y according to the number of candidate cropping frames in each video frame.
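A sketch of the second strategy (top-Y averaging), assuming each candidate is represented as a (score, reference point, size) tuple; the data layout is an assumption for illustration.

```python
import numpy as np

def frame_crop_box(candidates, top_y=3):
    """candidates: list of (score, (cx, cy), (w, h)) per candidate box.
    Average the reference points of the Y best-scoring candidates and
    keep the size of the single best one."""
    best = sorted(candidates, key=lambda c: c[0], reverse=True)[:top_y]
    position = tuple(np.mean([c[1] for c in best], axis=0))
    size = best[0][2]  # size of the highest-scoring candidate
    return position, size
```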
In the embodiment shown in FIG. 16, the size and position of the cropping frame are determined based on the crop subject information and the relevance between the crop subjects and the theme information of the target video, so that the crop subjects are preserved in the cropped picture and the cropped target video loses as little principal information (information related to the theme of the target video) as possible.
In some embodiments, the method 1400 for cropping the pictures of a target video may further include step 1450: splicing the cropped shot segments into a new target video in a predetermined order. The predetermined order may be the order determined by the sequence information in the video configuration information, or a new splicing order set by the user.
In some embodiments, the above cropping of the target video may also be applied to other cropping-related parts of this application; for example, in the normalization of video segments, the initial videos and/or initial images may each be cropped to the same size.
In some embodiments, this application may generate multiple target videos for delivery and optimize the video generation algorithm of this application according to the feedback results of the different videos.
In some embodiments, the multiple target videos may be generated for different audiences and, correspondingly, delivered to specific audience groups, where the audience refers to the group to which a target video is delivered. Specifically, a specific audience group may be a group of a particular age, gender, or with particular behavioral characteristics. The particular age may refer to a younger audience (for example, if 80% of the users of a delivery platform are aged 15-40, the user base is considered young), a middle-aged audience, an older audience, and so on. Gender may be characterized by a male-to-female ratio (e.g., 1:3). Behavioral characteristics may include browsing habits, shopping preferences, and the like, for example, which kinds of videos the users of the delivery platform prefer to browse. The delivery duration may be short (e.g., less than one week) or long (e.g., more than one week). The delivery period may be a peak period (the 618 promotion, the Double Eleven period), an idle period (a non-promotion period), and so on. The platform placement may concern, for example, whether the video is recommended on the home page. The characteristics of the delivery platform may include its type (an online platform (an app) or an offline platform (airports, subways, etc.); the kind of app (video playback, music playback, learning, etc.)) and its traffic characteristics (e.g., high traffic).
FIG. 17 is a flowchart of generating a target video according to the audience group of the target video, as shown in some embodiments of the present application. In some embodiments, one or more steps of the process 1700 may be stored as instructions in a storage device (e.g., the database 140) and invoked and/or executed by a processing device (e.g., the processing device 112 in the server 110).
Step 1710: obtain the audience acceptance of the multiple video segments.
In some embodiments, a video segment may have a clear affinity with a particular audience; for example, a segment featuring Ultraman may be popular with children while being of little interest to middle-aged viewers. Audience acceptance is an indicator describing the affinity between a video segment and a video audience. The audience acceptance of a video segment can be obtained by inputting the segment, or the elements of the segment (e.g., its tag IDs and tag values), into a trained machine learning model. The machine learning model may be determined according to the delivery effect of each video segment for different audiences; for a description of delivery effect, see step 1730 below.
Step 1720: for a specific audience group, determine, from the multiple video segments according to the corresponding audience-characteristic conditions, candidate segments whose audience acceptance is above a threshold, for use in generating the target video.
In some embodiments, audience acceptance may take the concrete form of a specific tag of the video segment and its corresponding tag value. For example, the tag ID for acceptance by female audiences may be #61 and the tag ID for acceptance by male audiences may be #62, with the corresponding tag value holding the actual acceptance.
In some embodiments, audience acceptance may be described quantitatively. For example, the tag value corresponding to audience acceptance may be any real number in [-1, 1], where a positive value indicates liking, a negative value indicates dislike, 0 indicates no interest, and the larger the absolute value, the stronger the attitude. Correspondingly, the threshold in step 1720 may be a value within the tag-value range that indicates liking, such as 0.5.
In some embodiments, audience acceptance may be described qualitatively; for example, the tag value corresponding to audience acceptance may be -3, -2, -1, 0, 1, 2, or 3. In practice this can be expressed as a four-bit binary tag, where the first two bits of the tag value encode dislike, the last two bits encode liking, and a tag value of 0 indicates no interest. Correspondingly, the threshold may be a value indicating liking, for example, the first two bits of the tag value being 00 and the last two bits being greater than 01.
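One possible reading of this four-bit encoding, sketched in Python; the exact bit layout is an assumption inferred from the description (high two bits carry the dislike level, low two bits the like level).

```python
def encode_acceptance(level: int) -> int:
    """level in -3..3; 0 means no interest. High two bits carry the
    dislike level, low two bits the like level."""
    if level == 0:
        return 0b0000
    return (abs(level) << 2) if level < 0 else level

def decode_acceptance(bits: int) -> int:
    dislike, like = bits >> 2, bits & 0b11
    return -dislike if dislike else like

assert encode_acceptance(-3) == 0b1100 and decode_acceptance(0b1100) == -3
# "Last two bits greater than 01" as a liking threshold: like level >= 2
likes_enough = lambda bits: (bits >> 2) == 0 and (bits & 0b11) > 0b01
```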
In some embodiments, the first preset condition described in the process 800 may include the audience acceptance of step 1720 being above the threshold; correspondingly, the implementation of step 1720 can refer to the description of step 820 in the process 800.
Step 1730: obtain delivery-effect feedback for the target video, and adjust at least one of the audience-characteristic conditions or the audience acceptance according to the feedback.
In some embodiments, the delivery effect may be determined from at least one of the click-through rate, completion rate, number of replays, number of viewers, and other related indicators of the target video. For example, a high completion rate suggests that the audience likes the video, and a high number of replays indicates a stronger degree of liking.
In some embodiments, the delivery effect of each video segment of the target video can be determined. For example, how much users like a segment can be determined from the ratio of the segment's switch-away count (the number of users who stop playback of this video at that segment and switch to the next video) to its play count: the lower the ratio, the more users like the segment. As another example, the delivery effect of a segment can be determined from the parts skipped by users who finished or replayed the video, where a skipped segment performs poorly.
In some embodiments, after the delivery effect is obtained, it may be fed back into the machine learning model of step 1710 to re-determine the audience-characteristic conditions or the audience acceptance. In some embodiments, the corresponding tag values of the video segments may be modified directly according to the delivery effect of each segment.
In some embodiments, the delivery effect of a target video may be estimated based on the previously determined delivery effects and the machine learning model, and target videos whose estimated delivery effect exceeds a preset value may be delivered. Specifically, when the type of the target video is a creative advertisement, the estimated effect data of the target video may be determined based on the element effect parameters of at least one advertising element of the creative advertisement.
An advertising element can be understood as a building block of an advertising creative, and may specifically include the main elements of a shot (the aforementioned specific objects, subjects, crop subjects, etc.), decorative elements (background images, models, copy, trademarks, product images, and/or promotional logos, etc.), and element presentation modes (animation actions, AE templates, etc.).
In some embodiments, in addition to the delivery effect of step 1730, the delivery-effect data of an advertising creative may also include the click-through rate, exposure, conversion rate, return on investment, and the like. The click-through rate can be understood as the click-through performance of an online advertising creative, i.e., the actual number of clicks the advertisement receives. Exposure can be understood as the number of times the creative is displayed per unit time; for example, if an online medium has 3000 views per day and an advertisement exclusively occupies a slot, its daily exposure is 3000, whereas if the slot rotates three advertisements, each advertisement's daily exposure is 3000/3. The conversion rate can be understood as the ratio of the number of conversions following a click (e.g., purchases, paid orders) to the number of clicks. The return on investment can be understood as the ratio of the return on the advertising to its cost. The delivery-effect data of placed advertising creatives may cover a preset period, such as one day, one week, one month, or one quarter. In some embodiments, the delivery-effect data may also include how the creative's performance trends with time, season, and the characteristics of the platform's audience.
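For concreteness, the ratio-style indicators can be computed as below. This is a generic sketch of the definitions in the text (conversion rate as conversions over clicks, ROI as return over cost, exposure split across a rotating slot), not a prescribed implementation.

```python
def conversion_rate(conversions: int, clicks: int) -> float:
    # e.g. paid orders attributed to the ad, divided by ad clicks
    return conversions / clicks

def return_on_investment(ad_return: float, ad_cost: float) -> float:
    return ad_return / ad_cost

# Rotation example from the text: a slot with 3000 daily views showing
# three ads in turn gives each ad 3000 / 3 = 1000 daily exposures.
per_ad_daily_exposure = 3000 / 3
```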
FIG. 18 is a flowchart of determining the estimated effect data of a target video according to some embodiments of the present application. In some embodiments, one or more steps of the process 1800 may be stored as instructions in a storage device (e.g., the database 140) and invoked and/or executed by a processing device (e.g., the processing device 112 in the server 110).
Step 1810: obtain an advertising-element effect prediction model.
In some embodiments, the advertising-element effect prediction model may be a model capable of scoring each advertising element. It may be a trained machine learning model: several placed advertising creatives containing the element, together with the delivery-effect data of those placed creatives, are input into the model as features, and the model outputs a score for the advertising element.
In some embodiments, the advertising-element effect prediction model may be chosen according to the aforementioned specific audience group and the specific theme of the target video. For example, if the specific audience group is women and the specific theme of the target video is a mouthwash advertisement, the corresponding model is an advertising-element effect prediction model for cleaning and/or personal care daily products aimed at women.
Step 1820: input the advertising elements, each marked with at least one element tag, into the advertising-element effect prediction model, and determine the element effect parameters of the advertising elements.
The at least one element tag includes the relationship between the advertising element and the creative advertisement, where the element effect parameter of an advertising element may refer to the element's contribution over a certain period.
In some embodiments, the element tags may correspond to the elements of step 230 described above, where the relationship between an advertising element and the creative advertisement may be the relationship between a specific object, subject, or crop subject in a video segment and the specific theme of the target video, which may be expressed as a relevance. In some embodiments, the relationship between the advertising element and the creative advertisement may further include the element's relationship with specific venues (e.g., subway stations, railway stations, giant downtown screens) and specific times (e.g., Double Eleven, Valentine's Day).
In some embodiments, step 1820 may first determine several placed advertising creatives that contain the advertising element. Based on the delivery-effect data of the placed advertisements, the delivery-effect data of the several placed creatives containing the element are determined; the server then takes the mean, median, cumulative sum, or weighted cumulative sum of those data as the element effect parameter of the advertising element. For example, suppose the placed creatives containing advertising element a are M1, M2, M3, and M4, whose average click counts are 1000, 2000, 500, and 3500, respectively. Taking the cumulative sum of the average click counts of M1, M2, M3, and M4 as the element effect parameter of element a gives 7000 clicks. The element effect parameter may thus include data obtained by statistical computation over the delivery-effect data of the placed creatives containing the element, and intuitively reflects the element's contribution over a period.
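The aggregation itself is straightforward; this sketch implements the four variants named above (mean, median, cumulative sum, weighted cumulative sum) and checks the 7000-click example. The function name and interface are illustrative.

```python
import statistics

def element_effect(values, mode="sum", weights=None):
    """Aggregate the delivery-effect data of all placed creatives that
    contain the element; the text allows mean, median, cumulative sum,
    or weighted cumulative sum."""
    if mode == "mean":
        return statistics.mean(values)
    if mode == "median":
        return statistics.median(values)
    if mode == "weighted":
        return sum(w * v for w, v in zip(weights, values))
    return sum(values)

# Element a appears in M1-M4 with average clicks 1000, 2000, 500, 3500:
assert element_effect([1000, 2000, 500, 3500]) == 7000  # cumulative sum
```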
In some embodiments, the element effect parameters of the advertising elements may be determined through delivery experiments. Since some advertising elements are reused across creatives, an orthogonal experimental design can be used to compute the smallest set of creatives that covers the largest number of advertising elements, so that the most element data can be obtained by delivering the fewest creatives. Further, the advertising elements may first be classified, with the constraint that all elements of certain specified categories (e.g., models or copy) must be delivered; the orthogonal design then computes the smallest creative set containing all advertising elements of the specified categories.
Step 1830: based on the element effect parameters of the at least one advertising element, determine which of the advertising elements meet expectations, where an element meets expectations if its element effect parameter is greater than a parameter threshold.
Step 1840: determine the proportion of advertising elements meeting expectations among the at least one advertising element of the creative advertisement.
In some embodiments, the proportion of advertising elements meeting expectations among the at least one advertising element of the creative advertisement may be the proportion of the target video occupied by each expected advertising element of the segment combination.
Step 1850: determine the estimated effect data of the target video based on the proportion.
In some embodiments, the target video may be evaluated based on the delivery-effect data of each advertising element in the target video and the delivery-effect data of the several placed creatives containing those elements, yielding the estimated effect data of the target video.
FIG. 19 is an exemplary flowchart of a model training process according to some embodiments of the present application. In some embodiments, one or more steps of the process 1900 may be stored as instructions in a storage device (e.g., the database 140) and invoked and/or executed by a processing device (e.g., the processing device 112 in the server 110). In some embodiments, the process 1900 may be used to train the first initial model.
Step 1910: obtain a first training set.
The first training set is the set of training samples used to train the first initial model. It includes multiple video pairs, each containing the image features corresponding to a first sample video, the image features corresponding to a second sample video, and the label value corresponding to the two sample videos. The image features of the first and second sample videos may be obtained by feature extraction.
The label value of a video pair reflects the degree of similarity between the first sample video and the second sample video. The label values in the sample set may be annotated manually, or the video pairs may be annotated automatically by a corresponding machine learning model; for example, the similarity of each video pair may be obtained from a trained classifier model.
In some embodiments, the first training set may be obtained from an image collector such as a camera or a smartphone, or from the terminal device 130. In some embodiments, it may be read directly from a storage system holding a large number of images. In some embodiments, the first training set may also be obtained in any other manner, which is not limited in this embodiment.
The first initial model can be understood as an untrained or not-yet-fully-trained neural network model. It may be or include the initial model corresponding to the trained feature extraction model and/or discriminant model described in the process 1100. Each layer of the initial model may be set with initial parameters, which are adjusted continually during training until training is complete.
Step 1920: based on the first training set, train the first initial model through multiple rounds of iteration to generate a trained first neural network model.
The first neural network model may be or include the trained feature extraction model and/or discriminant model described in the process 1100. Each round of iteration further includes the following steps.
Step 1921: process the image features corresponding to the first sample video of a video pair with the updated first feature extraction model to obtain the corresponding first segment feature.
Step 1922: process the image features corresponding to the second sample video of the same video pair with the updated second feature extraction model to obtain the second segment feature.
Step 1923: process the first segment feature and the second segment feature with the updated discriminant model to generate a discrimination result, where the discrimination result reflects the degree of similarity between the first segment feature and the second segment feature.
Step 1924: based on the discrimination result and the label value, decide whether to perform the next round of iteration or to finalize the trained first neural network model.
After the discriminant model obtains the discrimination result through forward propagation, a loss function can be constructed based on the discrimination result and the sample label, and the model parameters updated by backpropagating the loss. In some embodiments, the training-sample label may be denoted y_1, the discrimination result denoted ŷ_1, and the computed loss value denoted Loss_1. In some embodiments, different loss functions may be chosen according to the type of model, e.g., a mean-squared-error loss or a cross-entropy loss, which this specification does not limit. Illustratively, with the mean-squared-error form, Loss_1 = (y_1 - ŷ_1)^2.
In some embodiments, a gradient backpropagation algorithm may be used to update the model parameters. The backpropagation algorithm compares the prediction for a given training sample with its label data to determine the update magnitude for each weight of the model. That is, backpropagation determines how the loss function changes with respect to each weight (also called the gradient or error derivative), written ∂Loss_1/∂w. Further, the gradient backpropagation algorithm propagates the value of the loss function from the output layer back through the hidden layers to the input layer, layer by layer, determining in turn the correction values (or gradients) of the model parameters of each layer. The correction values of each layer's parameters comprise multiple matrix elements (e.g., gradient elements) in one-to-one correspondence with the parameters; each gradient element reflects a parameter's correction direction (increase or decrease) and correction amount. In one or more embodiments of this specification, after the discriminant model finishes backpropagating its gradients, it propagates them onward to the first segment-feature extraction model and the second segment-feature extraction model, layer by layer, to complete one round of iterative update. Compared with training each model separately, jointly training the first segment-feature extraction model, the second segment-feature extraction model, and the discriminant model with a unified loss function is more efficient.
In some embodiments, whether to perform the next round of iteration or to finalize the trained first neural network model can be judged based on the discrimination result and the label value. The criteria may be whether the number of iterations has reached a preset count, whether the updated model meets a preset performance-index threshold, or whether an instruction to terminate training has been received. If the next iteration is needed, it proceeds from the model updated in the current iteration; in other words, the model updated in the current iteration serves as the initial model of the next round. If no further iteration is needed, the model updated in the current iteration is taken as the final trained model.
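The joint update can be sketched as follows, assuming PyTorch and placeholder single-layer networks; the architectures, the 512/128-dimensional features, and the choice of binary cross-entropy (one of the two losses named above) are assumptions for illustration, not a fixed design of this application.

```python
import torch
import torch.nn as nn

# Placeholder single-layer networks; real architectures are not fixed here
extractor_a = nn.Sequential(nn.Linear(512, 128), nn.ReLU())   # 1st segment-feature model
extractor_b = nn.Sequential(nn.Linear(512, 128), nn.ReLU())   # 2nd segment-feature model
discriminator = nn.Sequential(nn.Linear(256, 1), nn.Sigmoid())

params = (list(extractor_a.parameters()) + list(extractor_b.parameters())
          + list(discriminator.parameters()))
optimizer = torch.optim.Adam(params, lr=1e-3)
loss_fn = nn.BCELoss()  # a cross-entropy loss; MSELoss is the other option named

def train_step(feat_a, feat_b, label):
    """One iteration: forward through both extractors and the
    discriminator, then backpropagate one unified loss through all
    three models. `label` is the pair's similarity in [0, 1]."""
    s1 = extractor_a(feat_a)                      # first segment feature
    s2 = extractor_b(feat_b)                      # second segment feature
    y_hat = discriminator(torch.cat([s1, s2], dim=-1)).squeeze(-1)  # ŷ_1
    loss = loss_fn(y_hat, label)                  # Loss_1(y_1, ŷ_1)
    optimizer.zero_grad()
    loss.backward()   # gradients flow from the discriminator back into
    optimizer.step()  # both feature extractors, layer by layer
    return loss.item()
```

In each iteration the single Loss_1 backpropagates through the discriminator and both extractors, matching the unified-loss joint training described above.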
FIGS. 20A-20E are schematic diagrams of a video synthesis system according to some embodiments of the present application.
As shown in FIG. 20A, the multimedia system 2000 may include an acquisition module 2010, a configuration module 2020, and a generation module 2030.
The acquisition module 2010 may be used to obtain multiple video segments.
The configuration module 2020 may be used to obtain video configuration information.
The generation module 2030 may be used to generate a target video based on the at least some of the video segments and the video configuration information. In some embodiments, the generation module 2030 may also be referred to as a target-video generation module.
In some embodiments, step 210 may be implemented by the acquisition module 2010, step 220 by the configuration module 2020, and step 230 by the generation module 2030.
As shown in FIG. 20B, the acquisition module 2010 may further include a media acquisition module 2011, a segmentation module 2013, and a material processing module 2015, where the material processing module 2015 further includes a video processing module 2015a and a picture processing module 2015b. The configuration module 2020 may further include a recognition module 2021, which may also be referred to as a subject acquisition module. The generation module 2030 may further include a screening module 2031, a combination module 2033, and a video synthesis module 2035. In some embodiments, the multimedia system 2000 may further include a post-processing module 2040, which may include a cropping module 2041 and an effect estimation module 2043.
The media acquisition module 2011 may be used to obtain initial videos or initial images so as to implement steps 310, 610, 810, and other steps related to initial videos or initial images. The media acquisition module 2011 may also be used to obtain initial audio to implement step 1310.
The segmentation module 2013 may be used to segment video files by shot, so as to implement step 320, steps 420-440, step 1330, and other steps related to shot segmentation. The segmentation module 2013 may also be used to determine the edit points of audio files, so as to implement step 1320.
The material processing module 2015 may be used to generate video segments, to process segmented video material (e.g., rendering, beautification, applying video templates), and to combine materials of different types, for example, merging an audio file with a video file in step 1340. The material processing module 2015 may specifically include the video processing module 2015a for processing video material corresponding to the initial videos and the picture processing module 2015b for processing image material corresponding to the initial images.
The configuration module 2020 may further include the recognition module 2021 for recognizing subjects. The recognition module 2021 may be combined with other modules, with the subject concerned changed accordingly; for example, when combined with the cropping module 2041, the corresponding subject may be a crop subject.
The screening module 2031 may specifically be used to screen video files according to conditions; for example, in step 820 candidate video segments may be screened out of the segments according to a first preset condition. The screening module 2031 may also determine the initial images, initial videos, and video segments related to the target video according to whether they contain the subject.
The combination module 2033 may specifically be used to generate segment combinations from the candidate video segments, and may also be used to determine, according to a second preset condition, the segment set used to generate the target video.
The video synthesis module 2035 may specifically generate the target video from the segment set.
The post-processing module 2040 may be used to further process the target video after it is generated, for example, to deliver the target video to a specific audience.
The cropping module 2041 may be used to modify the size of a video file, for example, modifying the size of the target video according to the size of the playback medium.
The effect estimation module 2043 is used to estimate the playback performance of the target video.
This application does not restrict the hierarchical or containment relationships among the modules; for example, the media acquisition module 2011 may also be regarded as the acquisition module 2010. It can be understood that the above modules may be combined as needed to implement different methods. For example, as shown in FIG. 20C, the media acquisition module 2011, the subject acquisition module (i.e., the recognition module 2021), the video processing module 2015a, the picture processing module 2015b, and the target-video generation module (i.e., the generation module 2030) may be combined to generate a video from video material and image material. As another example, as shown in FIG. 20D, the acquisition module 2010, the splitting module (i.e., the segmentation module 2013), the screening module 2031, the combination module 2033, and the video synthesis module 2035 may be combined to split and recombine video files. As yet another example, as shown in FIG. 20E, the acquisition module 2010, the segmentation module 2013, the recognition module 2021, and the cropping module 2041 may be combined to achieve precise cropping of a specific video file.
For more detailed descriptions of the functions implemented by the modules, see elsewhere in this specification; they are not repeated here. It should be noted that the above description of the video generation system and its modules is only for convenience of description and does not limit this specification to the scope of the cited embodiments.
The basic concepts have been described above. Obviously, for those skilled in the art, the above detailed disclosure is only an example and does not constitute a limitation of this specification. Although not explicitly stated here, those skilled in the art may make various modifications, improvements, and corrections to this specification. Such modifications, improvements, and corrections are suggested by this specification, and thus still fall within the spirit and scope of the exemplary embodiments of this specification.
Meanwhile, this specification uses specific words to describe its embodiments. Terms such as "one embodiment", "an embodiment", and/or "some embodiments" refer to a feature, structure, or characteristic related to at least one embodiment of this specification. Therefore, it should be emphasized and noted that "an embodiment", "one embodiment", or "an alternative embodiment" mentioned two or more times in different places in this specification does not necessarily refer to the same embodiment. In addition, certain features, structures, or characteristics of one or more embodiments of this specification may be combined as appropriate.
In addition, those skilled in the art will understand that aspects of this specification may be illustrated and described in several patentable categories or situations, including any new and useful process, machine, product, or composition of matter, or any new and useful improvement thereof. Accordingly, aspects of this specification may be executed entirely by hardware, entirely by software (including firmware, resident software, microcode, etc.), or by a combination of hardware and software. The above hardware or software may all be referred to as a "data block", "module", "engine", "unit", "component", or "system". In addition, aspects of this specification may be embodied as a computer product located on one or more computer-readable media, the product including computer-readable program code.
A computer storage medium may contain a propagated data signal containing computer program code, for example on baseband or as part of a carrier wave. The propagated signal may take various forms, including electromagnetic forms, optical forms, etc., or a suitable combination. The computer storage medium may be any computer-readable medium other than a computer-readable storage medium that, by being connected to an instruction execution system, apparatus, or device, can communicate, propagate, or transmit a program for use. Program code on a computer storage medium may be transmitted via any suitable medium, including radio, cable, fiber-optic cable, RF, or similar media, or any combination of the above.
The computer program code required for the operation of the various parts of this specification may be written in any one or more programming languages, including object-oriented programming languages such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB.NET, and Python; conventional procedural programming languages such as C, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP, and ABAP; dynamic programming languages such as Python, Ruby, and Groovy; or other programming languages. The program code may run entirely on the user's computer, as an independent software package on the user's computer, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or processing device. In the latter case, the remote computer may be connected to the user's computer via any form of network, such as a local area network (LAN) or a wide area network (WAN), or to an external computer (e.g., via the Internet), or in a cloud computing environment, or used as a service such as software as a service (SaaS).
此外,除非权利要求中明确说明,本说明书所述处理元素和序列的顺序、数字字母的使用、或其他名称的使用,并非用于限定本说明书流程和方法的顺序。尽管上述披露中通过各种示例讨论了一些目前认为有用的发明实施例,但应当理解的是,该类细节仅起到说明的目的,附加的权利要求并不仅限于披露的实施例,相反,权利要求旨在覆盖所有符合本说明书实施例实质和范围的修正和等价组合。例如,虽然以上所描述的系统组件可以通过硬件设备实现,但是也可以只通过软件的解决方案得以实现,如在现有的处理设备或移动设备上安装所描述的系统。In addition, unless explicitly stated in the claims, the order of processing elements and sequences, the use of numbers and letters, or the use of other names in this specification are not used to limit the order of the processes and methods in this specification. Although the foregoing disclosure uses various examples to discuss some embodiments of the invention that are currently considered useful, it should be understood that such details are only for illustrative purposes, and the appended claims are not limited to the disclosed embodiments. On the contrary, the rights The requirements are intended to cover all modifications and equivalent combinations that conform to the essence and scope of the embodiments of this specification. For example, although the system components described above can be implemented by hardware devices, they can also be implemented only by software solutions, such as installing the described system on existing processing devices or mobile devices.
同理,应当注意的是,为了简化本说明书披露的表述,从而帮助对一个或多个发明实施例的理解,前文对本说明书实施例的描述中,有时会将多种特征归并至一个实施例、附图或对其的描述中。但是,这种披露方 法并不意味着本说明书对象所需要的特征比权利要求中提及的特征多。实际上,实施例的特征要少于上述披露的单个实施例的全部特征。For the same reason, it should be noted that, in order to simplify the expressions disclosed in this specification and help the understanding of one or more embodiments of the invention, in the foregoing description of the embodiments of this specification, multiple features are sometimes combined into one embodiment. In the drawings or its description. However, this method of disclosure does not mean that the subject of this specification requires more features than those mentioned in the claims. In fact, the features of the embodiment are less than all the features of the single embodiment disclosed above.
一些实施例中使用了描述成分、属性数量的数字,应当理解的是,此类用于实施例描述的数字,在一些示例中使用了修饰词“大约”、“近似”或“大体上”来修饰。除非另外说明,“大约”、“近似”或“大体上”表明所述数字允许有±20%的变化。相应地,在一些实施例中,说明书和权利要求中使用的数值参数均为近似值,该近似值根据个别实施例所需特点可以发生改变。在一些实施例中,数值参数应考虑规定的有效数位并采用一般位数保留的方法。尽管本说明书一些实施例中用于确认其范围广度的数值域和参数为近似值,在具体实施例中,此类数值的设定在可行范围内尽可能精确。In some embodiments, numbers describing the number of ingredients and attributes are used. It should be understood that such numbers used in the description of the embodiments use the modifiers "approximately", "approximately" or "substantially" in some examples. Retouch. Unless otherwise stated, "approximately", "approximately" or "substantially" indicates that the number is allowed to vary by ±20%. Correspondingly, in some embodiments, the numerical parameters used in the description and claims are approximate values, and the approximate values can be changed according to the required characteristics of individual embodiments. In some embodiments, the numerical parameter should consider the prescribed effective digits and adopt the method of general digit retention. Although the numerical ranges and parameters used to confirm the breadth of the ranges in some embodiments of this specification are approximate values, in specific embodiments, the setting of such numerical values is as accurate as possible within the feasible range.
针对本说明书引用的每个专利、专利申请、专利申请公开物和其他材料,如文章、书籍、说明书、出版物、文档等,特此将其全部内容并入本说明书作为参考。与本说明书内容不一致或产生冲突的申请历史文件除外,对本说明书权利要求最广范围有限制的文件(当前或之后附加于本说明书中的)也除外。需要说明的是,如果本说明书附属材料中的描述、定义、和/或术语的使用与本说明书所述内容有不一致或冲突的地方,以本说明书的描述、定义和/或术语的使用为准。For each patent, patent application, patent application publication and other materials cited in this specification, such as articles, books, specifications, publications, documents, etc., the entire contents are hereby incorporated into this specification as a reference. Except the application history documents that are inconsistent or conflict with the content of this specification, and the documents that restrict the broadest scope of the claims of this specification (currently or later attached to this specification) are also excluded. It should be noted that if there is any inconsistency or conflict between the description, definition, and/or use of terms in the auxiliary materials of this manual and the content of this manual, the description, definition and/or use of terms in this manual shall prevail. .
最后,应当理解的是,本说明书中所述实施例仅用以说明本说明书实施例的原则。其他的变形也可能属于本说明书的范围。因此,作为示例而非限制,本说明书实施例的替代配置可视为与本说明书的教导一致。相应地,本说明书的实施例不仅限于本说明书明确介绍和描述的实施例。Finally, it should be understood that the embodiments described in this specification are only used to illustrate the principles of the embodiments of this specification. Other variations may also fall within the scope of this specification. Therefore, as an example and not a limitation, the alternative configuration of the embodiment of the present specification can be regarded as consistent with the teaching of the present specification. Accordingly, the embodiments of this specification are not limited to the embodiments explicitly introduced and described in this specification.

Claims (83)

  1. A system for generating a video, comprising:
    at least one storage medium storing a set of instructions; and
    at least one processor configured to communicate with the at least one storage medium, wherein when executing the set of instructions, the at least one processor is configured to perform one or more operations, the operations including:
    obtaining a plurality of video segments;
    obtaining video configuration information, the video configuration information including one or more configuration features of at least some video segments of the plurality of video segments, the configuration features including at least one of a content feature or an arrangement feature; and
    generating a target video based on the at least some video segments and the video configuration information.
  2. The system of claim 1, wherein the obtaining a plurality of video segments includes:
    obtaining at least one of an initial image or an initial video; and
    editing the initial image or the initial video to obtain the plurality of video segments.
  3. The system of claim 2, wherein the editing the initial image or the initial video to obtain the plurality of video segments includes:
    obtaining features of each pair of adjacent images or video frames in the initial image or the initial video;
    determining a similarity of each pair of adjacent images or video frames;
    identifying segment boundaries based on the similarity of each pair of adjacent images or video frames; and
    segmenting the initial image or the initial video based on the segment boundaries to obtain the plurality of video segments.
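The frame-similarity segmentation recited in claim 3 can be pictured with a short sketch. The following Python is a minimal, non-authoritative illustration rather than the claimed method itself: the HSV-histogram correlation measure, the 0.6 threshold, and the function name are assumptions introduced only for illustration.

```python
import cv2

def find_segment_boundaries(video_path, threshold=0.6):
    """Mark a segment boundary wherever adjacent-frame similarity drops.

    Histogram correlation is one possible similarity measure; the claim
    does not prescribe a specific one.
    """
    cap = cv2.VideoCapture(video_path)
    boundaries, prev_hist, index = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [50, 60], [0, 180, 0, 256])
        cv2.normalize(hist, hist)
        if prev_hist is not None:
            similarity = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL)
            if similarity < threshold:  # low similarity suggests a shot change
                boundaries.append(index)
        prev_hist, index = hist, index + 1
    cap.release()
    return boundaries
```

Frames between consecutive boundaries would then form the individual video segments.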
  4. The system of claim 1, wherein each video segment of the plurality of video segments is a shot segment.
  5. The system of claim 2, wherein the editing the initial image or the initial video to obtain the plurality of video segments includes:
    determining subject information of the initial image or the initial video, the subject information including at least a subject and a subject position; and
    editing the initial image or the initial video based on the subject information to obtain the plurality of video segments.
  6. The system of claim 5, wherein the determining the subject information of the initial image or the initial video includes:
    obtaining a subject information determination model; and
    determining the subject information by inputting the initial image or the initial video into the subject information determination model.
  7. The system of claim 5 or claim 6, wherein the editing the initial image or the initial video based on the subject information includes:
    identifying, based on the subject information, an outer contour of the subject in the initial image or the initial video; and
    cropping or scaling the initial image or the initial video according to the outer contour of the subject.
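One way to picture the contour-based cropping of claim 7 is the sketch below, which assumes a binary subject mask is already available (for example, from the subject information determination model of claim 6). The mask source, the padding value, and the function name are illustrative assumptions, not the claimed procedure.

```python
import cv2

def crop_to_subject(frame, subject_mask, padding=10):
    """Crop a frame to the bounding box of the subject's outer contour.

    `subject_mask` is assumed to be a binary (0/255) mask of the subject.
    Uses the OpenCV 4.x findContours signature.
    """
    contours, _ = cv2.findContours(subject_mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return frame  # no subject found; leave the frame unchanged
    largest = max(contours, key=cv2.contourArea)  # outer contour of the subject
    x, y, w, h = cv2.boundingRect(largest)
    top, left = max(y - padding, 0), max(x - padding, 0)
    bottom = min(y + h + padding, frame.shape[0])
    right = min(x + w + padding, frame.shape[1])
    return frame[top:bottom, left:right]
```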
  8. The system of claim 1, wherein the video configuration information includes a first preset condition and a second preset condition.
  9. The system of claim 8, wherein the generating a target video based on the at least some video segments and the video configuration information includes:
    obtaining one or more candidate video segments from the plurality of video segments based on the first preset condition;
    grouping the one or more candidate video segments to determine at least one segment set; and
    generating a target video based on each segment set of the at least one segment set.
  10. The system of claim 9, wherein the first preset condition relates to at least one of a plurality of elements, the plurality of elements including the target video containing a specific object, the target video containing a specific subject, a total duration of the target video, a count of shot frames contained in the target video, specific shot frames contained in the target video, a count of overlaps of a specific subject in the target video, or a focusing time of a specific subject in the target video.
  11. The system of claim 10, wherein the first preset condition includes that a value of the at least one element is greater than a corresponding threshold.
  12. The system of claim 10 or claim 11, wherein the first preset condition further includes an element constraint between two or more specific elements of the plurality of elements.
  13. The system of any one of claims 10-12, wherein the first preset condition includes a binding condition of shot frames in the target video, the binding condition reflecting an association between at least two specific shot frames in the target video, and the obtaining one or more candidate video segments from the plurality of video segments based on the first preset condition includes:
    determining, from the plurality of video segments, video segments containing specified shot frames; and
    combining the video segments containing the specified shot frames based on the binding condition to serve as one candidate video segment.
  14. The system of claim 9, wherein the at least one segment set includes two or more segment sets, the two or more segment sets satisfying the second preset condition, the second preset condition relating to a combination difference degree of candidate video segments between the two or more segment sets.
  15. The system of claim 14, wherein the grouping the one or more candidate video segments to determine the at least one segment set includes:
    determining a combination difference degree of candidate video segments between each segment set of the two or more segment sets and other segment sets; and
    designating, as the at least one segment set, a segment set whose combination difference degree with other segment sets is higher than a preset threshold.
  16. The system of claim 15, wherein the determining the combination difference degree of candidate video segments between each segment set of the two or more segment sets and other segment sets includes:
    assigning an identification character to each of the one or more candidate video segments;
    determining, based on the identification characters of the one or more candidate video segments, character strings corresponding to the segment set and other segment sets; and
    determining an edit distance between the character strings corresponding to the segment set and other segment sets as the combination difference degree of candidate video segments between the segment set and other segment sets.
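The edit-distance comparison of claim 16 reduces to the standard Levenshtein distance once each candidate segment is mapped to a single identification character. The sketch below is a plain dynamic-programming implementation under that assumption; the function name is illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein distance between two strings of segment IDs."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

# Example: segment sets encoded as "ABC" and "ACD" (one character per
# candidate segment) differ by an edit distance of 2, which would serve
# as their combination difference degree.
assert edit_distance("ABC", "ACD") == 2
```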
  17. The system of claim 15, wherein the determining the combination difference degree of candidate video segments between each segment set of the two or more segment sets and other segment sets includes:
    generating a segment feature corresponding to each candidate video segment based on a trained feature extraction model and the candidate video segments in the two or more segment sets;
    generating a set feature vector corresponding to each segment set based on the segment features;
    determining a degree of similarity between each segment set and other segment sets based on a trained discriminant model and the set feature vector corresponding to each segment set; and
    determining the combination difference degree between each segment set and other segment sets based on the degree of similarity.
  18. The system of claim 17, wherein the determining the degree of similarity between each segment set and other segment sets based on the trained discriminant model and the set feature vector corresponding to each segment set includes:
    clustering the set feature vectors based on a clustering algorithm to obtain a plurality of set clusters.
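A minimal sketch of the clustering step in claim 18 follows. K-means and the cluster count are illustrative choices, since the claim only requires some clustering algorithm over the set feature vectors.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_set_vectors(set_vectors: np.ndarray, n_clusters: int = 5) -> np.ndarray:
    """Cluster set feature vectors into set clusters.

    Sets landing in the same cluster are treated as similar, and sets in
    different clusters as more different. `set_vectors` has shape
    (num_sets, feature_dim).
    """
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    return km.fit_predict(set_vectors)  # element i = cluster index of set i
```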
  19. The system of claim 17, wherein the feature extraction model is a sequence-based machine learning model, and the generating the segment feature corresponding to each candidate video segment based on the trained feature extraction model and the candidate video segments in the two or more segment sets includes:
    obtaining a plurality of video frames included in each candidate video segment;
    determining one or more image features corresponding to each video frame; and
    determining the segment feature corresponding to the candidate video segment by processing, based on the trained feature extraction model, the image features in the plurality of video frames and interrelationships among the image features in the plurality of video frames.
  20. The system of claim 19, wherein the image features corresponding to a video frame include at least one of shape information of an object in the video frame, positional relationship information among a plurality of objects in the video frame, color information of an object in the video frame, a degree of completeness of an object in the video frame, or a brightness in the video frame.
  21. The system of claim 8, wherein the generating a target video based on the at least some video segments and the video configuration information includes:
    generating a plurality of candidate segment sets based on the plurality of video segments, the plurality of candidate segment sets satisfying the second preset condition;
    selecting at least one segment set from the plurality of candidate segment sets based on the first preset condition; and
    generating a target video based on each segment set of the at least one segment set.
  22. The system of claim 9 or claim 21, wherein the video configuration information further includes sequence information, and the generating a target video based on each segment set of the at least one segment set includes:
    sorting and combining, based on the sequence information, the candidate video segments in each segment set to generate a target video.
  23. The system of claim 1, wherein the video configuration information further includes beautification parameters, the beautification parameters including at least one of a filter parameter, an animation parameter, or a layout parameter.
  24. The system of claim 1, wherein the operations further include:
    obtaining, based on the video configuration information, a text layer, a background layer, or a decoration layer, and loading parameters; and
    determining a layout of the text layer, the background layer, and the decoration layer in the target video according to the loading parameters.
  25. The system of claim 1, wherein the operations further include:
    normalizing the plurality of video segments.
  26. The system of claim 1, wherein the operations further include:
    obtaining initial audio;
    marking the initial audio based on rhythm to obtain at least one audio cut point;
    determining at least one video cut point of the target video based at least in part on the video configuration information;
    matching the at least one audio cut point with the at least one video cut point; and
    synthesizing the cut audio with the target video based on a matching result.
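The rhythm-based marking and matching of claim 26 could look like the sketch below, which uses beat tracking from the librosa library as one possible rhythm detector. The library choice and the nearest-beat matching rule are assumptions, not the claimed procedure.

```python
import librosa
import numpy as np

def audio_cut_points(audio_path: str) -> np.ndarray:
    """Return beat times (in seconds) usable as candidate audio cut points."""
    y, sr = librosa.load(audio_path)
    _, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
    return librosa.frames_to_time(beat_frames, sr=sr)

def match_cut_points(audio_points: np.ndarray, video_points: list) -> list:
    """Match each video cut point to its nearest audio cut point."""
    return [float(audio_points[np.argmin(np.abs(audio_points - t))])
            for t in video_points]
```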
  27. The system of claim 1, wherein the operations further include:
    post-processing the target video to satisfy at least one video output condition, the at least one video output condition relating to a playback medium of the target video.
  28. The system of claim 27, wherein the at least one video output condition includes a video size condition, and the post-processing the target video includes:
    cropping pictures of the target video according to the video size condition.
  29. The system of claim 28, wherein the cropping pictures of the target video according to the video size condition includes:
    obtaining cropping subject information of each video segment included in the target video, the cropping subject information reflecting a specific cropping subject of the video segment and position information of the specific cropping subject; and
    cropping pictures of each video segment included in the target video according to a preset picture size corresponding to the video size condition and the cropping subject information.
  30. The system of claim 29, wherein the cropping pictures of each video segment included in the target video according to the preset picture size corresponding to the video size condition and the cropping subject information includes:
    for each video segment included in the target video,
    determining a size and an initial position of a crop box of at least one video frame in the video segment according to the cropping subject information and the preset picture size;
    processing the initial position of the crop box of the at least one video frame to determine a final position corresponding to the crop box of the at least one video frame; and
    cropping the picture of each video frame included in the video segment according to the final position of the crop box to retain the picture within the crop box.
  31. The system of claim 30, wherein the processing the initial position of the crop box of the at least one video frame to determine the final position corresponding to the crop box of the at least one video frame includes:
    smoothing over time initial coordinate information of a reference point of the crop box of the at least one video frame of the video segment;
    determining final coordinate information of the reference point according to a result of the smoothing; and
    determining a position of the reference point based on the final coordinate information.
  32. The system of claim 31, wherein the smoothing the initial coordinate information of the reference point includes:
    performing linear regression on the initial coordinates of the reference point to obtain a linear regression equation and a slope thereof.
  33. The system of claim 32, wherein the determining the final coordinate information of the reference point according to the result of the smoothing includes:
    comparing an absolute value of the slope with a slope threshold;
    in response to the absolute value of the slope being less than the slope threshold,
    taking a position of a midpoint of a trend line of the linear regression equation as the final position of the reference point of the crop box of each video frame; and
    in response to the absolute value of the slope being greater than or equal to the slope threshold,
    taking a position corresponding to a time point of each video frame on the trend line of the linear regression equation as the final position of the reference point of the crop box of that video frame.
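Claims 31-33 describe smoothing crop-box reference coordinates by linear regression with a slope test. The sketch below illustrates one coordinate axis under that description; the slope threshold value and the function name are assumptions.

```python
import numpy as np

def smooth_reference_coords(times, coords, slope_threshold=0.5):
    """Smooth per-frame crop-box reference coordinates via linear regression.

    A slope below the threshold is treated as a static subject, so every
    frame receives the trend line's midpoint; otherwise each frame follows
    the trend line at its own time point.
    """
    times = np.asarray(times, dtype=float)
    coords = np.asarray(coords, dtype=float)
    slope, intercept = np.polyfit(times, coords, deg=1)  # trend line fit
    if abs(slope) < slope_threshold:
        mid_t = (times[0] + times[-1]) / 2.0
        return np.full_like(coords, slope * mid_t + intercept)
    return slope * times + intercept
```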
  34. The system of claim 30, wherein the determining the size and the initial position of the crop box of at least one video frame in the video segment according to the cropping subject information and the preset picture size includes:
    determining, according to theme information of the target video and the cropping subject information, a correlation between one or more specific cropping subjects in the cropping subject information and the theme information;
    determining at least one candidate crop box corresponding to the at least one video frame according to the preset picture size and the specific cropping subjects;
    scoring the at least one candidate crop box according to the cropping subject information and the correlation; and
    determining the size and the position of the crop box of the at least one video frame based on a scoring result of the candidate crop boxes.
  35. The system of claim 34, wherein the obtaining the cropping subject information of each video segment included in the target video includes:
    obtaining candidate cropping subjects in each video segment using a machine learning model; and
    selecting the one or more specific cropping subjects from the candidate cropping subjects according to the theme information of the target video.
  36. The system of claim 1, wherein the operations further include:
    delivering the target video to a specific audience group.
  37. The system of claim 36, wherein the specific audience group meets a specific demographic condition, and the generating a target video based on the at least some video segments and the video configuration information includes:
    obtaining audience acceptance of the plurality of video segments; and
    for the specific audience group, determining, from the plurality of video segments according to the corresponding demographic condition, candidate segments whose audience acceptance is higher than a threshold, for generating the target video.
  38. The system of claim 37, wherein the operations further include:
    obtaining delivery effect feedback of the target video, and adjusting at least one of the demographic condition or the audience acceptance according to the delivery effect feedback.
  39. The system of claim 38, wherein the delivery effect feedback relates to at least one of a completion rate, a replay count, or a viewer count of the target video.
  40. The system of any one of claims 36-39, wherein the target video includes a creative advertisement, and the operations further include:
    determining estimated effect data of the target video based on an element effect parameter of at least one advertisement element of the creative advertisement.
  41. The system of claim 40, wherein the determining the estimated effect data of the target video based on the element effect parameter of at least one advertisement element of the creative advertisement includes:
    obtaining an advertisement element effect prediction model;
    inputting the advertisement element marked with at least one element tag into the advertisement element effect prediction model to determine the element effect parameter of the advertisement element, the at least one element tag including a relationship between the advertisement element and the creative advertisement;
    determining, in the at least one advertisement element based on the element effect parameter of the at least one advertisement element, advertisement elements that meet expectations, the element effect parameter of an advertisement element that meets expectations being greater than a parameter threshold;
    determining a proportion of the advertisement elements that meet expectations to the at least one advertisement element in the creative advertisement; and
    determining the estimated effect data of the target video based on the proportion.
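The proportion step of claim 41 is simple arithmetic once per-element effect parameters exist. The sketch below assumes scores in [0, 1] and uses the proportion itself as the estimated effect data, which is one possible mapping, not the one the claim fixes; the threshold and function name are likewise assumptions.

```python
def estimated_effect(element_scores, param_threshold=0.7):
    """Estimate creative-level effect from per-element effect parameters.

    Elements scoring above the threshold "meet expectations"; their
    proportion among all elements serves as the estimated effect data.
    """
    if not element_scores:
        return 0.0
    qualified = [s for s in element_scores if s > param_threshold]
    return len(qualified) / len(element_scores)

# Example: 3 of 4 elements exceed the threshold, so the estimate is 0.75.
print(estimated_effect([0.9, 0.8, 0.75, 0.4]))
```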
  42. A method for generating a video, the method being implemented on a processing device including at least one memory and at least one processor, the method comprising:
    obtaining a plurality of video segments;
    obtaining video configuration information, the video configuration information including one or more configuration features of at least some video segments of the plurality of video segments, the configuration features including at least one of a content feature or an arrangement feature; and
    generating a target video based on the at least some video segments and the video configuration information.
  43. The method of claim 42, wherein the obtaining a plurality of video segments includes:
    obtaining at least one of an initial image or an initial video; and
    editing the initial image or the initial video to obtain the plurality of video segments.
  44. The method of claim 43, wherein the editing the initial image or the initial video to obtain the plurality of video segments includes:
    obtaining features of each pair of adjacent images or video frames in the initial image or the initial video;
    determining a similarity of each pair of adjacent images or video frames;
    identifying segment boundaries based on the similarity of each pair of adjacent images or video frames; and
    segmenting the initial image or the initial video based on the segment boundaries to obtain the plurality of video segments.
  45. The method of claim 42, wherein each video segment of the plurality of video segments is a shot segment.
  46. The method of claim 43, wherein the editing the initial image or the initial video to obtain the plurality of video segments includes:
    determining subject information of the initial image or the initial video, the subject information including at least a subject and a subject position; and
    editing the initial image or the initial video based on the subject information to obtain the plurality of video segments.
  47. The method of claim 46, wherein the determining the subject information of the initial image or the initial video includes:
    obtaining a subject information determination model; and
    determining the subject information by inputting the initial image or the initial video into the subject information determination model.
  48. The method of claim 46 or claim 47, wherein the editing the initial image or the initial video based on the subject information includes:
    identifying, based on the subject information, an outer contour of the subject in the initial image or the initial video; and
    cropping or scaling the initial image or the initial video according to the outer contour of the subject.
  49. The method of claim 42, wherein the video configuration information includes a first preset condition and a second preset condition.
  50. The method of claim 49, wherein the generating a target video based on the at least some video segments and the video configuration information includes:
    obtaining one or more candidate video segments from the plurality of video segments based on the first preset condition;
    grouping the one or more candidate video segments to determine at least one segment set; and
    generating a target video based on each segment set of the at least one segment set.
  51. The method of claim 50, wherein the first preset condition relates to at least one of a plurality of elements, the plurality of elements including the target video containing a specific object, the target video containing a specific subject, a total duration of the target video, a count of shot frames contained in the target video, specific shot frames contained in the target video, a count of overlaps of a specific subject in the target video, or a focusing time of a specific subject in the target video.
  52. The method of claim 51, wherein the first preset condition includes that a value of the at least one element is greater than a corresponding threshold.
  53. The method of claim 51 or claim 52, wherein the first preset condition further includes an element constraint between two or more specific elements of the plurality of elements.
  54. The method of any one of claims 51-53, wherein the first preset condition includes a binding condition of shot frames in the target video, the binding condition reflecting an association between at least two specific shot frames in the target video, and the obtaining one or more candidate video segments from the plurality of video segments based on the first preset condition includes:
    determining, from the plurality of video segments, video segments containing specified shot frames; and
    combining the video segments containing the specified shot frames based on the binding condition to serve as one candidate video segment.
  55. The method of claim 50, wherein the at least one segment set includes two or more segment sets, the two or more segment sets satisfying the second preset condition, the second preset condition relating to a combination difference degree of candidate video segments between the two or more segment sets.
  56. The method of claim 55, wherein the grouping the one or more candidate video segments to determine the at least one segment set includes:
    determining a combination difference degree of candidate video segments between each segment set of the two or more segment sets and other segment sets; and
    designating, as the at least one segment set, a segment set whose combination difference degree with other segment sets is higher than a preset threshold.
  57. The method of claim 56, wherein the determining the combination difference degree of candidate video segments between each segment set of the two or more segment sets and other segment sets includes:
    assigning an identification character to each of the one or more candidate video segments;
    determining, based on the identification characters of the one or more candidate video segments, character strings corresponding to the segment set and other segment sets; and
    determining an edit distance between the character strings corresponding to the segment set and other segment sets as the combination difference degree of candidate video segments between the segment set and other segment sets.
  58. The method of claim 56, wherein the determining the combination difference degree of candidate video segments between each segment set of the two or more segment sets and other segment sets includes:
    generating a segment feature corresponding to each candidate video segment based on a trained feature extraction model and the candidate video segments in the two or more segment sets;
    generating a set feature vector corresponding to each segment set based on the segment features;
    determining a degree of similarity between each segment set and other segment sets based on a trained discriminant model and the set feature vector corresponding to each segment set; and
    determining the combination difference degree between each segment set and other segment sets based on the degree of similarity.
  59. The method of claim 58, wherein the determining the degree of similarity between each segment set and other segment sets based on the trained discriminant model and the set feature vector corresponding to each segment set includes:
    clustering the set feature vectors based on a clustering algorithm to obtain a plurality of set clusters.
  60. The method of claim 58, wherein the feature extraction model is a sequence-based machine learning model, and the generating the segment feature corresponding to each candidate video segment based on the trained feature extraction model and the candidate video segments in the two or more segment sets includes:
    obtaining a plurality of video frames included in each candidate video segment;
    determining one or more image features corresponding to each video frame; and
    determining the segment feature corresponding to the candidate video segment by processing, based on the trained feature extraction model, the image features in the plurality of video frames and interrelationships among the image features in the plurality of video frames.
  61. The method of claim 60, wherein the image features corresponding to a video frame include at least one of shape information of an object in the video frame, positional relationship information among a plurality of objects in the video frame, color information of an object in the video frame, a degree of completeness of an object in the video frame, or a brightness in the video frame.
  62. The method of claim 49, wherein the generating a target video based on the at least some video segments and the video configuration information includes:
    generating a plurality of candidate segment sets based on the plurality of video segments, the plurality of candidate segment sets satisfying the second preset condition;
    selecting at least one segment set from the plurality of candidate segment sets based on the first preset condition; and
    generating a target video based on each segment set of the at least one segment set.
  63. The method of claim 50 or claim 62, wherein the video configuration information further includes sequence information, and the generating a target video based on each segment set of the at least one segment set includes:
    sorting and combining, based on the sequence information, the candidate video segments in each segment set to generate a target video.
  64. The method of claim 42, wherein the video configuration information further includes beautification parameters, the beautification parameters including at least one of a filter parameter, an animation parameter, or a layout parameter.
  65. The method of claim 42, further comprising:
    obtaining, based on the video configuration information, a text layer, a background layer, or a decoration layer, and loading parameters; and
    determining a layout of the text layer, the background layer, and the decoration layer in the target video according to the loading parameters.
  66. The method of claim 42, further comprising:
    normalizing the plurality of video segments.
  67. The method of claim 42, further comprising:
    obtaining initial audio;
    marking the initial audio based on rhythm to obtain at least one audio cut point;
    determining at least one video cut point of the target video based at least in part on the video configuration information;
    matching the at least one audio cut point with the at least one video cut point; and
    synthesizing the cut audio with the target video based on a matching result.
  68. The method of claim 42, further comprising:
    post-processing the target video to satisfy at least one video output condition, the at least one video output condition relating to a playback medium of the target video.
  69. The method of claim 68, wherein the at least one video output condition includes a video size condition, and the post-processing the target video includes:
    cropping pictures of the target video according to the video size condition.
  70. The method of claim 69, wherein the cropping pictures of the target video according to the video size condition includes:
    obtaining cropping subject information of each video segment included in the target video, the cropping subject information reflecting a specific cropping subject of the video segment and position information of the specific cropping subject; and
    cropping pictures of each video segment included in the target video according to a preset picture size corresponding to the video size condition and the cropping subject information.
  71. The method of claim 70, wherein the cropping pictures of each video segment included in the target video according to the preset picture size corresponding to the video size condition and the cropping subject information includes:
    for each video segment included in the target video,
    determining a size and an initial position of a crop box of at least one video frame in the video segment according to the cropping subject information and the preset picture size;
    processing the initial position of the crop box of the at least one video frame to determine a final position corresponding to the crop box of the at least one video frame; and
    cropping the picture of each video frame included in the video segment according to the final position of the crop box to retain the picture within the crop box.
  72. The method of claim 71, wherein the processing the initial position of the crop box of the at least one video frame to determine the final position corresponding to the crop box of the at least one video frame includes:
    smoothing over time initial coordinate information of a reference point of the crop box of the at least one video frame of the video segment;
    determining final coordinate information of the reference point according to a result of the smoothing; and
    determining a position of the reference point based on the final coordinate information.
  73. The method of claim 72, wherein the smoothing the initial coordinate information of the reference point includes:
    performing linear regression on the initial coordinates of the reference point to obtain a linear regression equation and a slope thereof.
  74. The method of claim 73, wherein the determining the final coordinate information of the reference point according to the result of the smoothing includes:
    comparing an absolute value of the slope with a slope threshold;
    in response to the absolute value of the slope being less than the slope threshold,
    taking a position of a midpoint of a trend line of the linear regression equation as the final position of the reference point of the crop box of each video frame; and
    in response to the absolute value of the slope being greater than or equal to the slope threshold,
    taking a position corresponding to a time point of each video frame on the trend line of the linear regression equation as the final position of the reference point of the crop box of that video frame.
  75. The method of claim 71, wherein the determining the size and the initial position of the crop box of at least one video frame in the video segment according to the cropping subject information and the preset picture size includes:
    determining, according to theme information of the target video and the cropping subject information, a correlation between one or more specific cropping subjects in the cropping subject information and the theme information;
    determining at least one candidate crop box corresponding to the at least one video frame according to the preset picture size and the specific cropping subjects;
    scoring the at least one candidate crop box according to the cropping subject information and the correlation; and
    determining the size and the position of the crop box of the at least one video frame based on a scoring result of the candidate crop boxes.
  76. The method of claim 75, wherein the obtaining the cropping subject information of each video segment included in the target video includes:
    obtaining candidate cropping subjects in each video segment using a machine learning model; and
    selecting the one or more specific cropping subjects from the candidate cropping subjects according to the theme information of the target video.
  77. The method of claim 42, further comprising:
    delivering the target video to a specific audience group.
  78. The method of claim 77, wherein the specific audience group meets a specific demographic condition, and the generating a target video based on the at least some video segments and the video configuration information includes:
    obtaining audience acceptance of the plurality of video segments; and
    for the specific audience group, determining, from the plurality of video segments according to the corresponding demographic condition, candidate segments whose audience acceptance is higher than a threshold, for generating the target video.
  79. The method of claim 78, further comprising:
    obtaining delivery effect feedback of the target video, and adjusting at least one of the demographic condition or the audience acceptance according to the delivery effect feedback.
  80. The method of claim 79, wherein the delivery effect feedback relates to at least one of a completion rate, a replay count, or a viewer count of the target video.
  81. The method of any one of claims 77-80, wherein the target video includes a creative advertisement, the method further comprising:
    determining estimated effect data of the target video based on an element effect parameter of at least one advertisement element of the creative advertisement.
  82. The method of claim 81, wherein the determining the estimated effect data of the target video based on the element effect parameter of at least one advertisement element of the creative advertisement includes:
    obtaining an advertisement element effect prediction model;
    inputting the advertisement element marked with at least one element tag into the advertisement element effect prediction model to determine the element effect parameter of the advertisement element, the at least one element tag including a relationship between the advertisement element and the creative advertisement;
    determining, in the at least one advertisement element based on the element effect parameter of the at least one advertisement element, advertisement elements that meet expectations, the element effect parameter of an advertisement element that meets expectations being greater than a parameter threshold;
    determining a proportion of the advertisement elements that meet expectations to the at least one advertisement element in the creative advertisement; and
    determining the estimated effect data of the target video based on the proportion.
  83. A non-transitory computer-readable medium including at least one set of instructions, wherein when executed by one or more processors of a computing device, the at least one set of instructions causes the computing device to perform a method for generating a video, the method comprising:
    obtaining a plurality of video segments;
    obtaining video configuration information, the video configuration information including one or more configuration features of at least some video segments of the plurality of video segments, the configuration features including at least one of a content feature or an arrangement feature; and
    generating a target video based on the at least some video segments and the video configuration information.
PCT/CN2021/101816 2020-06-23 2021-06-23 System and method for generating video WO2021259322A1 (en)

Applications Claiming Priority (8)

Application Number Priority Date Filing Date Title
CN202010578632.1 2020-06-23
CN202010578632.1A CN111815645B (en) 2020-06-23 2020-06-23 Method and system for cutting advertisement video picture
CN202010738213.X 2020-07-28
CN202010738213.XA CN111918146B (en) 2020-07-28 2020-07-28 Video synthesis method and system
CN202010741962.8 2020-07-29
CN202010741962.8A CN111739128B (en) 2020-07-29 2020-07-29 Target video generation method and system
CN202110503297.3A CN112989116B (en) 2021-05-10 2021-05-10 Video recommendation method, system and device
CN202110503297.3 2021-05-10

Publications (1)

Publication Number Publication Date
WO2021259322A1 true WO2021259322A1 (en) 2021-12-30

Family ID: 79282010

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/101816 WO2021259322A1 (en) 2020-06-23 2021-06-23 System and method for generating video

Country Status (1)

Country Link
WO (1) WO2021259322A1 (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110392281A (en) * 2018-04-20 2019-10-29 腾讯科技(深圳)有限公司 Image synthesizing method, device, computer equipment and storage medium
CN110913271A (en) * 2019-11-29 2020-03-24 Oppo广东移动通信有限公司 Video processing method, mobile terminal and non-volatile computer-readable storage medium
CN111815645A (en) * 2020-06-23 2020-10-23 广州筷子信息科技有限公司 Method and system for cutting advertisement video picture
CN111918146A (en) * 2020-07-28 2020-11-10 广州筷子信息科技有限公司 Video synthesis method and system
CN111739128A (en) * 2020-07-29 2020-10-02 广州筷子信息科技有限公司 Target video generation method and system
CN112989116A (en) * 2021-05-10 2021-06-18 广州筷子信息科技有限公司 Video recommendation method, system and device

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114520931A (en) * 2021-12-31 2022-05-20 脸萌有限公司 Video generation method and device, electronic equipment and readable storage medium
CN114520931B (en) * 2021-12-31 2024-01-23 脸萌有限公司 Video generation method, device, electronic equipment and readable storage medium
CN114466145A (en) * 2022-01-30 2022-05-10 北京字跳网络技术有限公司 Video processing method, device, equipment and storage medium
CN114466145B (en) * 2022-01-30 2024-04-12 北京字跳网络技术有限公司 Video processing method, device, equipment and storage medium
CN114449346A (en) * 2022-02-14 2022-05-06 腾讯科技(深圳)有限公司 Video processing method, device, equipment and storage medium
CN114449346B (en) * 2022-02-14 2023-08-15 腾讯科技(深圳)有限公司 Video processing method, device, equipment and storage medium
CN114615513B (en) * 2022-03-08 2023-10-20 北京字跳网络技术有限公司 Video data generation method and device, electronic equipment and storage medium
CN114615513A (en) * 2022-03-08 2022-06-10 北京字跳网络技术有限公司 Video data generation method and device, electronic equipment and storage medium
US20230412887A1 (en) * 2022-05-21 2023-12-21 Vmware, Inc. Personalized informational user experiences using visual content
CN115134646A (en) * 2022-08-25 2022-09-30 荣耀终端有限公司 Video editing method and electronic equipment
CN116307218A (en) * 2023-03-27 2023-06-23 松原市邹佳网络科技有限公司 Meta-universe experience user behavior prediction method and system based on artificial intelligence
CN116634233A (en) * 2023-04-12 2023-08-22 北京优贝卡科技有限公司 Media editing method, device, equipment and storage medium
CN116634233B (en) * 2023-04-12 2024-02-09 北京七彩行云数字技术有限公司 Media editing method, device, equipment and storage medium
CN116567350A (en) * 2023-05-19 2023-08-08 上海国威互娱文化科技有限公司 Panoramic video data processing method and system
CN116567350B (en) * 2023-05-19 2024-04-19 上海国威互娱文化科技有限公司 Panoramic video data processing method and system
CN116612060B (en) * 2023-07-19 2023-09-22 腾讯科技(深圳)有限公司 Video information processing method, device and storage medium
CN116612060A (en) * 2023-07-19 2023-08-18 腾讯科技(深圳)有限公司 Video information processing method, device and storage medium
CN117557689A (en) * 2024-01-11 2024-02-13 腾讯科技(深圳)有限公司 Image processing method, device, electronic equipment and storage medium
CN117557689B (en) * 2024-01-11 2024-03-29 腾讯科技(深圳)有限公司 Image processing method, device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
WO2021259322A1 (en) System and method for generating video
US9570107B2 (en) System and method for semi-automatic video editing
US10679063B2 (en) Recognizing salient video events through learning-based multimodal analysis of visual features and audio-based analytics
US20180330152A1 (en) Method for identifying, ordering, and presenting images according to expressions
US9554111B2 (en) System and method for semi-automatic video editing
Amato et al. AI in the media and creative industries
JP5507386B2 (en) Generating video content from image sets
US8948515B2 (en) Method and system for classifying one or more images
KR101348521B1 (en) Personalizing a video
WO2017190639A1 (en) Media information display method, client and server
US8126763B2 (en) Automatic generation of trailers containing product placements
KR102119868B1 (en) System and method for producting promotional media contents
US20080172293A1 (en) Optimization framework for association of advertisements with sequential media
JP2006510113A (en) Method for generating customized photo album pages and prints based on person and gender profiles
CN111739128A (en) Target video generation method and system
US20210117471A1 (en) Method and system for automatically generating a video from an online product representation
JP2020005309A (en) Moving image editing server and program
Zhang et al. A comprehensive survey on computational aesthetic evaluation of visual art images: Metrics and challenges
US11942116B1 (en) Method and system for generating synthetic video advertisements
JP6730757B2 (en) Server and program, video distribution system
WO2023045635A1 (en) Multimedia file subtitle processing method and apparatus, electronic device, computer-readable storage medium, and computer program product
Colombo et al. Retrieval of commercials by semantic content: The semiotic perspective
JP6730760B2 (en) Server and program, video distribution system
JP2019220098A (en) Moving image editing server and program
CN112235516B (en) Video generation method, device, server and storage medium

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 21829313

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the EP bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 23.05.2023 DATED 23.05.2023)

122 Ep: PCT application non-entry in European phase

Ref document number: 21829313

Country of ref document: EP

Kind code of ref document: A1