WO2021259322A1 - System and method for generating video


Info

Publication number
WO2021259322A1
Authority
WO
WIPO (PCT)
Prior art keywords
video, segment, frame, subject, initial
Application number
PCT/CN2021/101816
Other languages
French (fr)
Chinese (zh)
Inventor
陈万锋
李韶辉
谢统玲
吴庆宁
殷焦元
Original Assignee
广州筷子信息科技有限公司
Priority claimed from CN202010578632.1A (CN111815645B)
Priority claimed from CN202010738213.XA (CN111918146B)
Priority claimed from CN202010741962.8A (CN111739128B)
Priority claimed from CN202110503297.3A (CN112989116B)
Application filed by 广州筷子信息科技有限公司
Publication of WO2021259322A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 Information retrieval of video data
    • G06F 16/73 Querying
    • G06F 16/735 Filtering based on additional data, e.g. user or group profiles
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/11 Region-based segmentation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; operations thereof
    • H04N 21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; client middleware
    • H04N 21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N 21/80 Generation or processing of content or additional data by content creator independently of the distribution process; content per se
    • H04N 21/83 Generation or processing of protective or descriptive data associated with content; content structuring
    • H04N 21/845 Structuring of content, e.g. decomposing content into time segments

Definitions

  • This application relates to video processing, and in particular to a system and method for generating video.
  • One aspect of the present application provides a system for generating a video. The system includes at least one storage medium storing a set of instructions and at least one processor configured to communicate with the at least one storage medium, wherein when the set of instructions is executed, the at least one processor is directed to perform one or more operations, the operations including: obtaining multiple video clips.
  • The operations further include: obtaining video configuration information, where the video configuration information includes one or more configuration features of at least some of the multiple video clips, the configuration features including at least one of a content feature or an arrangement feature; and generating a target video based on the at least part of the video clips and the video configuration information.
  • Another aspect of the present application provides a method of generating a video. The method is executed by a processing device including at least one memory and at least one processor, and includes: acquiring a plurality of video clips; obtaining video configuration information, where the video configuration information includes one or more configuration features of at least some of the multiple video clips, the configuration features including at least one of a content feature or an arrangement feature; and generating a target video based on the at least part of the video clips and the video configuration information.
  • Another aspect of the present application provides a non-transitory computer-readable medium including at least one set of instructions, wherein when executed by one or more processors of a computing device, the at least one set of instructions causes the computing device to execute a method including: acquiring a plurality of video clips; obtaining video configuration information; and generating a target video.
  • In some embodiments, acquiring the multiple video clips includes: acquiring at least one of an initial image or an initial video; and performing editing processing on the initial image or initial video to obtain the multiple video clips.
  • In some embodiments, performing editing processing on the initial image or the initial video to obtain the plurality of video clips includes: acquiring features of each pair of adjacent images or video frames in the initial image or the initial video; determining the similarity of each pair of adjacent images or video frames; identifying segment boundaries based on the similarities; and dividing the initial image or the initial video at the segment boundaries to obtain the multiple video segments.
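  • As a minimal sketch of this similarity-based boundary detection, assuming OpenCV is available and taking grayscale-histogram correlation as the similarity measure (the patent does not fix a particular feature or metric), adjacent frames whose similarity falls below a threshold mark a segment boundary:

```python
import cv2

def split_into_shots(video_path, threshold=0.7):
    """Split a video into shot segments by comparing adjacent-frame histograms."""
    cap = cv2.VideoCapture(video_path)
    boundaries, prev_hist, index = [0], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        hist = cv2.calcHist([gray], [0], None, [64], [0, 256])
        hist = cv2.normalize(hist, hist).flatten()
        if prev_hist is not None:
            # Low correlation between adjacent frames marks a segment boundary.
            if cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL) < threshold:
                boundaries.append(index)
        prev_hist, index = hist, index + 1
    cap.release()
    boundaries.append(index)
    # Each (start_frame, end_frame) pair delimits one shot segment.
    return list(zip(boundaries[:-1], boundaries[1:]))
```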
  • each video segment of the plurality of video segments is a shot segment.
  • In some embodiments, editing the initial image or the initial video to obtain the multiple video clips includes: determining subject information of the initial image or the initial video, the subject information including at least a subject and a subject position; and performing editing processing on the initial image or initial video based on the subject information to obtain the multiple video clips.
  • In some embodiments, determining the subject information in the initial image or the initial video includes: obtaining a subject information determination model; and determining the subject information by inputting the initial image or the initial video into the subject information determination model.
  • In some embodiments, editing the initial image or the initial video based on the subject information includes: recognizing the outer contour of the subject in the initial image or the initial video based on the subject information; and cropping or zooming the initial image or initial video according to the outer contour of the subject.
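  • A sketch of such subject-aware cropping, assuming a detection model that returns a bounding box around the subject's outer contour; `detect_subject` below is a hypothetical stand-in, since the patent leaves the model unspecified:

```python
def crop_to_subject(frame, detect_subject, margin=0.1):
    """Crop an image (H x W x C array) around the subject's bounding box."""
    h, w = frame.shape[:2]
    x0, y0, x1, y1 = detect_subject(frame)  # hypothetical model call
    # Expand the box by a margin so the subject's outer contour is not clipped.
    dx, dy = int((x1 - x0) * margin), int((y1 - y0) * margin)
    x0, y0 = max(0, x0 - dx), max(0, y0 - dy)
    x1, y1 = min(w, x1 + dx), min(h, y1 + dy)
    return frame[y0:y1, x0:x1]
```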
  • the video configuration information includes a first preset condition and a second preset condition.
  • In some embodiments, generating a target video based on the at least part of the video clips and the video configuration information includes: obtaining one or more candidate video clips from the multiple video clips based on the first preset condition; grouping the one or more candidate video clips to determine at least one segment set; and generating a target video based on each segment set in the at least one segment set.
  • In some embodiments, the first preset condition is related to at least one of a plurality of elements, the plurality of elements including: whether the target video contains a specific object; whether the target video contains a specific subject; the total duration of the target video; the number of shots included in the target video; the specific shot pictures contained in the target video; the number of repetitions of a specific subject in the target video; or the focusing time of a specific subject in the target video.
  • In some embodiments, the first preset condition includes that the value of the at least one element is greater than a corresponding threshold.
  • In some embodiments, the first preset condition further includes element constraint conditions between two or more specific elements of the plurality of elements.
  • In some embodiments, the first preset condition includes a binding condition of shot pictures in the target video, where the binding condition reflects an association relationship of at least two specific shot pictures in the target video.
  • In some embodiments, obtaining one or more candidate video clips from the plurality of video clips based on the first preset condition includes: determining, from the plurality of video clips, the video clips containing the specified shot pictures; and combining, based on the binding condition, the video clips containing the specified shot pictures to serve as one candidate video clip.
  • In some embodiments, the at least one segment set includes two or more segment sets that satisfy the second preset condition, and the second preset condition is related to the degree of combination difference of the candidate video clips between the two or more segment sets.
  • In some embodiments, grouping the one or more candidate video clips to determine the at least one segment set includes: determining the degree of combination difference of candidate video clips between each of the two or more segment sets and the other segment sets; and taking a segment set whose combination difference from the other segment sets is higher than a preset threshold as the at least one segment set.
  • In some embodiments, determining the degree of combination difference of candidate video segments between each of the two or more segment sets and the other segment sets includes: assigning an identifying character to each of the one or more candidate video segments; determining, based on the identifying characters, a character string corresponding to each segment set and the other segment sets; and taking the edit distance between the character strings corresponding to a segment set and another segment set as the degree of combination difference of the candidate video segments between the two sets.
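  • A minimal sketch of this string-based measure, assuming each candidate clip has already been mapped to one character; the Levenshtein distance below is a standard formulation, not code from the patent:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings of clip-identifying characters."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,            # deletion
                                     dp[j - 1] + 1,        # insertion
                                     prev + (ca != cb))    # substitution
    return dp[-1]

# Two segment sets encoded as clip strings; the distance is their
# combination difference (here 2: "B" and "C" are swapped).
print(edit_distance("ABCD", "ACBD"))
```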
  • In some embodiments, determining the degree of combination difference of candidate video segments between each of the two or more segment sets and the other segment sets includes: generating, based on a trained feature extraction model, a segment feature corresponding to each candidate video segment in the two or more segment sets; generating, based on the segment features, a set feature vector corresponding to each segment set; determining the degree of similarity between each segment set and the other segment sets based on a trained discriminant model and the set feature vectors; and determining, based on the degree of similarity, the degree of combination difference between each segment set and the other segment sets.
  • In some embodiments, determining the degree of similarity between each segment set and the other segment sets based on the trained discriminant model and the set feature vector corresponding to each segment set includes: clustering the set feature vectors based on a clustering algorithm to obtain multiple clusters.
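  • A sketch of the clustering step, assuming scikit-learn's KMeans as the clustering algorithm and randomly generated stand-in vectors in place of the model-produced set features; the patent leaves both the algorithm and the aggregation unspecified:

```python
import numpy as np
from sklearn.cluster import KMeans

# One aggregated feature vector per candidate segment set, e.g. the mean of
# the per-clip features produced by the extraction model (stand-in data here).
set_features = np.random.rand(20, 128)  # 20 segment sets, 128-dim vectors

labels = KMeans(n_clusters=5, n_init=10).fit_predict(set_features)

# Sets sharing a cluster are treated as similar; keeping at most one set per
# cluster leaves a collection with high pairwise combination difference.
selected = {label: idx for idx, label in enumerate(labels)}.values()
print(sorted(selected))
```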
  • In some embodiments, the feature extraction model is a sequence-based machine learning model.
  • In some embodiments, generating the segment feature corresponding to each candidate video segment based on the trained feature extraction model and the candidate video segments in the two or more segment sets includes: obtaining the multiple video frames contained in each candidate video segment; determining one or more image features corresponding to each video frame; and processing, based on the trained feature extraction model, the image features of the multiple video frames and the relationships between them to determine the segment feature corresponding to the candidate video segment.
  • In some embodiments, the image features corresponding to a video frame include at least one of: the shape information of an object in the video frame; the positional relationship information between multiple objects in the video frame; the color information of an object in the video frame; the integrity of an object in the video frame; or the brightness of the video frame.
  • In some embodiments, generating a target video based on the at least part of the video segments and the video configuration information includes: generating a plurality of candidate segment sets based on the plurality of video segments, the plurality of candidate segment sets satisfying the second preset condition; selecting at least one target segment set from the plurality of candidate segment sets based on the first preset condition; and generating a target video based on each segment set in the at least one target segment set.
  • In some embodiments, the video configuration information further includes sequence information, and generating a target video based on each segment set in the at least one segment set includes: sorting and combining the candidate video clips in each segment set based on the sequence information to generate a target video.
  • In some embodiments, the video configuration information further includes beautification parameters, and the beautification parameters include at least one of filter parameters, animation parameters, or layout parameters.
  • In some embodiments, the method further includes: obtaining a text layer, a background layer, or a decoration layer together with loading parameters based on the video configuration information; and determining the layout of the text layer, the background layer, and the decoration layer in the target video according to the loading parameters.
  • the method further includes: normalizing the plurality of video clips.
  • In some embodiments, the method further includes: obtaining initial audio; marking the initial audio based on its rhythm to obtain at least one audio segmentation point; determining at least one video segmentation point of the target video based at least in part on the video configuration information; matching the at least one audio segmentation point with the at least one video segmentation point; and synthesizing the segmented audio with the target video based on the matching result.
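  • A sketch of rhythm-based marking and point matching, assuming librosa's beat tracker supplies the rhythm points and nearest-neighbor snapping serves as the matching rule; both choices and the file name are assumptions, since the patent prescribes neither:

```python
import librosa

# Mark rhythm points in the initial audio; beat times serve as the
# candidate audio segmentation points.
y, sr = librosa.load("initial_audio.wav")  # hypothetical input file
tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
audio_points = librosa.frames_to_time(beat_frames, sr=sr)

def match_points(video_points, audio_points):
    """Snap each video segmentation point to its nearest audio beat."""
    return [min(audio_points, key=lambda t: abs(t - v)) for v in video_points]

# Video segmentation points (seconds) derived from the video configuration.
print(match_points([3.0, 7.5, 12.0], list(audio_points)))
```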
  • In some embodiments, the method further includes: performing post-processing on the target video so that it satisfies at least one video output condition, the at least one video output condition being related to a playback medium of the target video.
  • In some embodiments, the at least one video output condition includes a video size condition, and post-processing the target video includes: cropping the frames of the target video according to the video size condition.
  • In some embodiments, cropping the frames of the target video according to the video size condition includes: obtaining cropping subject information of each video segment included in the target video, where the cropping subject information reflects a specific cropping subject and the position information of the specific cropping subject; and cropping the frames of each video segment included in the target video according to the preset frame size corresponding to the video size condition and the cropping subject information.
  • In some embodiments, cropping the frames of each video segment included in the target video according to the preset frame size corresponding to the video size condition and the cropping subject information includes: for each video segment included in the target video, determining the size and initial position of a cropping frame for at least one video frame in the video segment according to the cropping subject information and the preset frame size; processing the initial position of the cropping frame of the at least one video frame to determine the final position of the cropping frame; and cropping each video frame of the video segment according to the final position of the cropping frame, so as to preserve the picture inside the cropping frame.
  • In some embodiments, processing the initial position of the cropping frame of the at least one video frame and determining the final position of the cropping frame includes: smoothing, over time, the initial coordinate information of a reference point of the cropping frame of each video frame; determining the final coordinate information of the reference point according to the result of the smoothing; and determining the position of the reference point based on the final coordinate information.
  • In some embodiments, smoothing the initial coordinate information of the reference point includes: performing linear regression on the initial coordinates of the reference point to obtain a linear regression equation and its slope.
  • In some embodiments, determining the final coordinate information of the reference point according to the result of the smoothing includes: comparing the absolute value of the slope with a slope threshold; in response to the absolute value of the slope being less than the slope threshold, taking the position of the midpoint of the trend line of the linear regression equation as the final position of the reference point of the cropping frame of each video frame; and in response to the absolute value of the slope being greater than or equal to the slope threshold, taking, for each video frame, the position on the trend line corresponding to that frame's time point as the final position of the reference point of that frame's cropping frame.
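  • A minimal sketch of this smoothing rule for one coordinate, assuming NumPy's polyfit as the linear regression; the slope-threshold value is illustrative, not from the patent:

```python
import numpy as np

def smooth_crop_positions(times, xs, slope_threshold=0.5):
    """Smooth one coordinate of the crop-frame reference point over time."""
    slope, intercept = np.polyfit(times, xs, deg=1)  # linear regression
    trend = slope * np.asarray(times, dtype=float) + intercept
    if abs(slope) < slope_threshold:
        # Nearly static subject: pin every frame to the trend-line midpoint.
        return np.full_like(trend, trend[len(trend) // 2])
    # Moving subject: follow the trend line at each frame's time point.
    return trend

print(smooth_crop_positions([0, 1, 2, 3], [100, 102, 98, 101]))
```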
  • In some embodiments, determining the size and position of the cropping frame of at least one video frame in the video clip according to the cropping subject information and the preset frame size includes: determining, according to the subject information of the target video and the cropping subject information, the degree of correlation between one or more specific cropping subjects in the cropping subject information and the subject information; determining, according to the preset frame size and the specific cropping subjects, at least one candidate cropping frame corresponding to the at least one video frame; scoring the at least one candidate cropping frame according to the cropping subject information and the degree of correlation; and determining the size and position of the cropping frame of the at least one video frame based on the scoring results.
  • In some embodiments, obtaining the cropping subject information of each video segment included in the target video includes: obtaining candidate cropping subjects in each video segment using a machine learning model; and selecting the one or more specific cropping subjects from the candidate cropping subjects according to the subject information of the target video.
  • In some embodiments, the method further includes: delivering the target video to a specific audience group, where the specific audience group meets specific demographic conditions.
  • In some embodiments, generating a target video based on the at least part of the video clips and the video configuration information includes: obtaining the audience acceptance of the plurality of video clips; and determining, from the plurality of video segments and according to the corresponding demographic conditions, candidate segments whose audience acceptance is higher than a threshold, so as to generate the target video.
  • In some embodiments, the method further includes: obtaining delivery effect feedback of the target video, and adjusting at least one of the demographic conditions or the audience acceptance according to the delivery effect feedback, where the delivery effect feedback is related to at least one of the completion rate, the number of replays, or the number of viewers of the target video.
  • In some embodiments, the target video includes a creative advertisement, and the method further includes determining the estimated effect data of the target video based on an element effect parameter of at least one advertisement element of the creative advertisement.
  • In some embodiments, determining the estimated effect data of the target video based on the element effect parameter of at least one advertisement element of the creative advertisement includes: obtaining an advertisement element effect prediction model; inputting the advertisement element marked with at least one element tag into the advertisement element effect prediction model to determine the element effect parameter of the advertisement element, the at least one element tag including the relationship between the advertisement element and the creative advertisement; determining, based on the element effect parameters, the advertisement elements that meet expectations among the at least one advertisement element, where the element effect parameter of an advertisement element that meets expectations is greater than a parameter threshold; determining the proportion of the advertisement elements that meet expectations among the at least one advertisement element in the creative advertisement; and determining the estimated effect data of the target video based on the proportion.
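  • A sketch of the final proportion step; the element names, parameter values, and threshold below are illustrative stand-ins for the prediction model's outputs:

```python
def estimated_effect(element_params, param_threshold=0.6):
    """Share of ad elements whose effect parameter meets expectations."""
    meets = [p for p in element_params.values() if p > param_threshold]
    # The proportion is used as the estimated effect data of the target video.
    return len(meets) / len(element_params)

# Hypothetical per-element effect parameters from the prediction model.
print(estimated_effect({"logo": 0.8, "slogan": 0.4, "product_shot": 0.9}))
```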
  • Another aspect of this specification provides a system for generating a video. The system includes: an acquisition module for acquiring a plurality of video clips; a configuration module for obtaining video configuration information, the video configuration information including one or more configuration features of at least some of the multiple video clips, and the configuration features including at least one of a content feature or an arrangement feature; and a generating module for generating a target video based on the at least part of the video clips and the video configuration information.
  • Another aspect of this specification provides a computer-readable storage medium storing computer instructions; after a computer reads the computer instructions in the storage medium, the computer executes the aforementioned method.
  • Fig. 1 is a schematic diagram of a scene of a system for generating a video according to some embodiments of the present application;
  • Fig. 2 is an exemplary flowchart of generating a video according to some embodiments of the present application;
  • Fig. 3 is an exemplary flowchart of a method for determining a video segment according to some embodiments of the present application;
  • Fig. 4 is an exemplary flowchart of a method for editing an initial image or an initial video according to some embodiments of the present application;
  • Fig. 5 is an exemplary flowchart of a method for editing an initial image or an initial video according to other embodiments of the present application;
  • Fig. 6 is an exemplary flowchart of a method for generating a video according to other embodiments of the present application;
  • Fig. 7 is an exemplary flowchart of a method for generating a video according to other embodiments of the present application;
  • Fig. 8 is an exemplary flowchart of a method for generating a target video according to some embodiments of the present application;
  • Fig. 9 is an exemplary flowchart of a method for determining a segment set according to some embodiments of the present application;
  • Fig. 10 is an exemplary flowchart of a method for determining a degree of combination difference according to some embodiments of the present application;
  • Fig. 11 is an exemplary flowchart of another method for determining the degree of combination difference according to some embodiments of the present application;
  • Fig. 12 is an exemplary flowchart of a method for generating a video according to other embodiments of the present application;
  • Fig. 13 is an exemplary flowchart of adding audio material according to some embodiments of the present application;
  • Fig. 14 is an application scene diagram of a frame cropping method according to some embodiments of the present application;
  • Fig. 15 is a schematic diagram of a smoothing method according to some embodiments of the present application;
  • Fig. 16 is a flowchart of a method for determining the size and position of the cropping frame of each video frame according to some embodiments of the present application;
  • Fig. 17 is a flowchart of a method for generating a target video based on an audience according to some embodiments of the present application;
  • Fig. 18 is a flowchart of determining the estimated effect data of the target video according to some embodiments of the present application;
  • Fig. 19 is an exemplary flowchart of a model training process according to some embodiments of the present application;
  • Figs. 20A to 20E are schematic diagrams of a video synthesis system according to some embodiments of the present application.
  • It should be understood that terms such as "system", "device", "unit", and/or "module" used herein are a way of distinguishing different components, elements, parts, or assemblies of different levels; if other words can achieve the same purpose, these words can be replaced by other expressions.
  • a multimedia system can obtain multiple video clips and video configuration information.
  • the video configuration information may be generated based on script information and/or video templates.
  • the video configuration information may be used to determine one or more configuration features of at least some of the multiple video clips.
  • the configuration feature includes at least one of content feature or arrangement feature.
  • In some embodiments, the content features may include that a video clip or the finally generated video (also known as the target video) contains a specific subject (object), a specific theme, a specific shot picture, specific audio, and so on.
  • the arrangement feature may include the size of the video segment, the layout of the target object in the video segment, the duration of the video segment, the specific position of the video segment in the target video, the visual effect characteristics of the video segment, and the like.
  • In some embodiments, the multimedia system may generate a target video based on the at least part of the video segments and the video configuration information. The multimedia system can automatically process material and generate target videos, achieving high efficiency and saving labor costs.
  • Fig. 1 is a schematic diagram of a scene of a multimedia system according to some embodiments of the present application.
  • In some embodiments, the multimedia system 100 can be used in media, advertising, the Internet, etc., and can quickly generate targeted videos for delivery.
  • the multimedia system 100 may include a server 110, a network 120, a terminal device 130, a database 140, and other data sources 150.
  • the server 110 and the terminal device 130 may be connected through the network 120 or directly; the database 140 may be connected to the server 110 through the network 120, or may be directly connected to the server 110 or located inside the server 110.
  • the database 140 and other data sources 150 can be connected to the network 120 to communicate with one or more components of the multimedia system 100.
  • One or more components of the multimedia system 100 can access data or instructions stored in the terminal device 130, the database 140, and other data sources 150 through the network 120.
  • In some embodiments, the various components of the multimedia system 100 can be integrated in the same device, with the above-mentioned communication relationships realized through the device's internal bus. Alternatively, at least some of the components can reside in separate devices connected through each device's communication port, so that the various components of the multimedia system 100 remain communicatively connected and the aforementioned communication relationships are realized.
  • the server 110 may be used to manage resources and process data and/or information from at least one component of the system or an external data source (for example, a cloud data center).
  • the server 110 may be a single server or a group of servers.
  • In some embodiments, the server group may be centralized or distributed (for example, the server 110 may be a distributed system); it may be dedicated, or services may be provided by other devices or systems at the same time.
  • the server 110 may be regional or remote.
  • the server 110 may be implemented on a cloud platform or provided in a virtual manner.
  • the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an internal cloud, a multi-layer cloud, etc., or any combination thereof.
  • the server 110 may include a processing device 112.
  • the processing device 112 may process data and/or information obtained from other devices or system components.
  • the processor may execute program instructions based on these data, information, and/or processing results to perform one or more functions described in this application.
  • In some embodiments, the processing device 112 may include one or more sub-processing devices (for example, a single-core processing device or a multi-core processing device).
  • In some embodiments, the processing device 112 may include a central processing unit (CPU), an application-specific integrated circuit (ASIC), an application-specific instruction-set processor (ASIP), a graphics processing unit (GPU), a physics processing unit (PPU), a digital signal processor (DSP), a field-programmable gate array (FPGA), a programmable logic device (PLD), a controller, a microcontroller unit, a reduced instruction set computer (RISC), a microprocessor, etc., or any combination of the above.
  • the network 120 may connect various components of the system and/or connect the system and external resource parts.
  • the network 120 enables communication between various components and with other parts outside the system, and facilitates the exchange of data and/or information.
  • the network 120 may be any one or more of a wired network or a wireless network.
  • In some embodiments, the network 120 may include a cable network, a fiber-optic network, a telecommunications network, the Internet, a local area network (LAN), a wide area network (WAN), a wireless local area network (WLAN), a metropolitan area network (MAN), a public switched telephone network (PSTN), a Bluetooth network, a ZigBee network, near field communication (NFC), a device bus, a device line, a cable connection, etc., or any combination thereof.
  • The network connection between the various parts may use any one of the above-mentioned ways, or several of them.
  • In some embodiments, the network may use various topological structures such as point-to-point, shared, and centralized, or a combination of multiple topologies.
  • the network 120 may include one or more network access points.
  • In some embodiments, the network 120 may include wired or wireless network access points, such as base stations and/or network exchange points 120-1, 120-2, ..., through which one or more components of the system can connect to the network 120 to exchange data and/or information.
  • the terminal device 130 refers to one or more terminal devices or software used for data query and/or multimedia display.
  • In some embodiments, the terminal device 130 may be used by one or more users, including users who directly use the service as well as other related users.
  • the terminal device 130 may be one or any combination of the mobile device 130-1, the tablet computer 130-2, the laptop computer 130-3 and other devices with input and/or output functions.
  • the terminal device 130 may also include a user terminal that can be used to input and/or obtain data or information.
  • In some embodiments, the user may generate or obtain the original video or original image through the user terminal.
  • For example, the user can use the camera of the user terminal to record video or take pictures and store them as the original video or original image, or download the original video from video software through the user terminal.
  • the user may input the constraint condition of the target video (for example, video configuration information) through the user terminal.
  • the user can obtain or browse the synthesized target video through the user terminal.
  • the database 140 may be used to store data and/or instructions.
  • In some embodiments, the database 140 may be implemented on a single central server, on multiple servers connected through communication links, or on multiple personal devices.
  • the database 140 may include mass memory, removable memory, volatile read-write memory (for example, random access memory RAM), read-only memory (ROM), etc., or any combination thereof.
  • the mass storage device may include magnetic disks, optical disks, solid-state disks, and the like.
  • the database 140 may be implemented on a cloud platform.
  • Other data sources 150 may be one or more sources used to provide other information for the system.
  • In some embodiments, the other data sources 150 can be one or more devices, such as a camera device that directly obtains the initial image or initial video; one or more application program interfaces; one or more database query interfaces; one or more protocol-based information acquisition interfaces; other means of acquiring information; or a combination of two or more of the above.
  • The information provided by an information source may already exist when the information is extracted, may be generated temporarily at extraction time, or may be a combination of both.
  • other data sources 150 may be used to provide multimedia information such as pictures, videos, and music to the system.
  • Fig. 2 is an exemplary flow chart of generating a video according to some embodiments of the present application.
  • In some embodiments, one or more steps in the process 200 may be stored as instructions in a storage device (for example, the database 140) and called and/or executed by a processing device (for example, the processing device 112 in the server 110).
  • the process 200 may include:
  • Step 210 Obtain multiple video clips.
  • a video segment may refer to a video composed of video frames.
  • Each video segment may be a sub-sequence in the sequence of images constituting the video.
  • a video clip can be a short video of 3 seconds, 4 seconds, or 5 seconds.
  • In some embodiments, a video frame can be understood as one of the images obtained by decomposing a continuous picture according to a time interval.
  • For example, the time interval between frames can be set to 1/24 s (that is, 24 frames of images are obtained within 1 second).
  • a video segment may be or include one or more shot segments.
  • In some embodiments, a shot segment may refer to a continuous picture composed of the video frames between two editing points; that is, a shot segment can be the sum of the pictures taken continuously by a camera from start to stop.
  • For example, if the first picture in a video file is a seaside, the picture then switches to a girl drinking yogurt, and then to a girl surfing on the sea, the girl drinking yogurt constitutes one shot segment, the seaside before it another, and the girl surfing after it a third.
  • In the following, a video segment consisting of one shot segment will be used as an example for description.
  • the database 140 and/or other data sources 150 may store the multiple video clips, and step 210 may be implemented by directly obtaining the multiple video clips from the database 140 and/or other data sources 150.
  • the database 140 and/or other data sources 150 may store unprocessed initial data.
  • the initial data may include an initial image (also may be referred to as a to-be-processed image) and/or an initial video (also may be referred to as a to-be-processed video).
  • Step 210 can obtain the multiple video clips by processing the initial data (for example, determining a cutting point and editing).
  • a video segment may be generated based on a shot segment contained in a video file (for example, an initial video and/or an initial image). For example, if a video file contains 5 shots, 5 video clips can be generated.
  • the video file may be split manually or by machine to generate multiple video clips.
  • For example, the user may manually edit based on the number of shots in the video file, or a trained machine learning model may split the video file into multiple video clips according to preset conditions (such as the number of shots, duration, etc.).
  • In some embodiments, the processing device 112 may also obtain multiple video clips intercepted from the video file based on a time window; this specification does not limit the means of splitting video clips.
  • For the method of determining the multiple video clips, reference may be made to Figs. 3-7 and their related descriptions.
  • Step 220 Obtain video configuration information.
  • In some embodiments, the video configuration information refers to information related to the configuration of the finally generated video (also referred to as the target video) and of each video segment composing the target video.
  • the video configuration information may reflect requirements on the content or form of the target video and the video segments that make up the target video.
  • Each video segment composing the target video may be at least a part of the multiple video segments.
  • the video configuration information may include one or more configuration features of each video segment (that is, the at least part of the video segment) constituting the target video.
  • the configuration features may include content features and/or arrangement features.
  • the content feature is related to the content of the at least part of the video clip.
  • the content feature may include that the video clip or target video contains a specific subject (object), a specific theme, a specific shot, a specific audio, and so on.
  • the arrangement feature is related to the presentation form of the at least part of the video segment.
  • the arrangement feature may include the size of the video segment, the duration of the video segment, the specific position of the video segment in the target video, the visual effect feature of the video segment, and so on.
  • the specific subject may also be referred to as a specific object.
  • the specific subject may be a specific object included in each video clip, or may be a specific object related to a specific theme of the target video among multiple objects.
  • the specific subject may be products (electronic products, daily necessities, decorations, etc.), creatures (humans, animals, etc.), signs (for example, trademarks, regional signs, etc.), or landscapes (mountains, houses, etc.), etc.
  • the specific topic may be the main content of the video clip.
  • the topic can be determined by keyword information in the title or introduction of the video clip, tag information, user-defined information, or information stored in a database.
  • In some embodiments, the specific theme may be composed of specific content and a video type. For example, in the theme "perfume advertisement", perfume is the specific content and advertisement is the video type.
  • In some embodiments, the specific content may also include specific activities (such as live broadcasts, evening parties, etc.), specific dates (such as Valentine's Day, Children's Day, Double Eleven, etc.), and so on.
  • In some embodiments, the specific theme may be user-defined or selected by the user from a list.
  • For example, the user directly enters the specific theme of the target video as "advertising for car interiors".
  • As another example, the user may first select the target video type as advertisement video, and then select "Shampoo" from the product category list.
  • the specific shot picture refers to a video frame or a sequence of video frames containing a specific picture.
  • specific shots can include children drinking milk, models using cosmetics, playing volleyball on the beach, and browsing mobile shopping pages on the Double Eleven Shopping Festival.
  • In some embodiments, the specific shot picture may be related to a specific subject or a specific theme.
  • For example, a promotional video may include shots of the specific action of "brushing teeth", and an advertisement video may include a shot of the specific action of "spraying perfume".
  • the specific audio may include a specific sound.
  • the specific audio may include dialogue, monologue, theme music, background music, or other specific sounds (for example, wind, rain, bird song, brake sound, impact sound, etc.).
  • the theme music and/or background music can be of different types, such as soothing, brisk, focused, aggressive, and so on.
  • the size of the video segment may be the width and height of the video frame in the video segment.
  • the size of the video frame in the video segment is 512 ⁇ 384, 1024 ⁇ 768, 1024 ⁇ 1024, 1920 ⁇ 1080, 2560 ⁇ 2560, etc.
  • the size of the video segment may also be the aspect ratio of the video frame in the video segment.
  • the aspect ratio of the video frame in the video segment is 9:16, 1:1, 5:4, 4:3, 16:9, etc.
  • the duration of the video segment refers to the length of time required to play the video segment. For example, 3 seconds, 5 seconds, 30 seconds, 2 minutes, etc.
  • the specific position of the video segment in the target video may refer to a specific range of the video segment in the target video.
  • the specific range may be expressed in terms of time. For example, a certain video segment can be in the target video from 2 minutes 15 seconds to 2 minutes 30 seconds.
  • the specific range may also be represented by a video frame range. For example, a certain video segment can be in the position of frames 1 to 50 in the target video.
  • the specific range may also be represented by a position relative to other video clips. For example, a certain video segment may be located between the third video segment and the fifth video segment in the target video.
  • the visual effect characteristics of the video segment may be used to describe operations performed on the video segment and related to its visual effect.
  • the operations may include beautification, normalization, template decoration, and so on.
  • the beautification may include operations such as filters and animations to enhance the video effect.
  • the video configuration information may be determined by the user, or determined according to system default settings.
  • In some embodiments, the database 140 and/or other data sources 150 may store the video configuration information, and step 220 may be implemented by obtaining the video configuration information directly from them; for example, after the specific subject or specific theme of the target video is determined, the corresponding video configuration information is obtained from the database 140 and/or other data sources 150 according to that subject or theme.
  • the video configuration information may be determined based on script information and/or video templates.
  • the multimedia system 100 determines the video configuration information by analyzing script information and/or video templates.
  • In some embodiments, the script information and/or video template may be pre-stored in the database 140 and/or other data sources 150, and after the user inputs relevant information of the target video, the corresponding script information and/or video template can be obtained or determined automatically. For example, after the user inputs a picture of a specific perfume, the system 100 can automatically recognize the specific theme "perfume advertisement" of the target video and call the script information and/or video templates related to perfume advertisements from the database 140 and/or other data sources 150.
  • In some embodiments, the script information refers to information determining, for each video segment and/or video frame in a video (for example, the target video), the screen content (for example, a specific subject, a specific background, a specific action, etc., or a combination thereof), the screen duration, the screen scene (for example, panorama, medium shot, close-up, etc.), lens changes, appearance times, and so on.
  • Script information can define the content and/or arrangement of video clips. For example, for a target video of the advertisement type, the script information can determine that the target video includes the following video clips: (1) viewer interaction, (2) use experience, (3) product selling points, (4) usage method, (5) function and effect, and (6) action guidance.
  • In some embodiments, the script information may further specify whether the target video contains a specific object or a specific theme, the total duration of the target video, the number of shots contained in the target video, the specific shot pictures contained in the target video, the number of repetitions of a specific theme in the target video, the focus time of a specific theme in the target video, and so on.
  • In some embodiments, the script information may further include the sequence of the shots. For example, the video clips included in the script information may be arranged in a preset order, such as the order (1) to (6) above, and the target video can be generated based on that sequence.
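  • One way such script information might be represented as configuration data; all field names and durations below are illustrative, not from the patent:

```python
# Illustrative script information for an advertisement-type target video.
script_info = {
    "theme": "perfume advertisement",
    "total_duration_s": 30,
    "clips": [
        {"role": "viewer_interaction",     "order": 1, "max_duration_s": 4},
        {"role": "use_experience",         "order": 2, "max_duration_s": 6},
        {"role": "product_selling_points", "order": 3, "max_duration_s": 6},
        {"role": "usage_method",           "order": 4, "max_duration_s": 5},
        {"role": "function_and_effect",    "order": 5, "max_duration_s": 5},
        {"role": "action_guidance",        "order": 6, "max_duration_s": 4},
    ],
}
```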
  • the script information can be of different types.
  • In some embodiments, the script information may be general-purpose script information, applicable to different subjects (for example, products) or themes.
  • For example, the general-purpose script information may in turn include video clips or shots such as audience interaction, use experience, product selling points, usage method, function and effect, and action guidance; or a specific theme, product selling points, usage method, use experience, function and effect, and product cost performance; or applicable scenes/crowds, product selling points, efficacy, product cost performance, and action guidance.
  • In some embodiments, the script information may be related to the subject (e.g., a product) or theme.
  • For example, for one class of products, the corresponding script information may in turn include specific themes, efficacy, design, product ingredients, and action guidance; or audience interaction, product/brand recommendation, applicable scenes/crowds, efficacy 1, efficacy 2, product ingredients, and audience interaction; or target group pain point 1, target group pain point 2, product/brand recommendation, product ingredients, efficacy, a specific theme, use experience, and a specific theme.
  • For food products, the corresponding script information may in turn include eating experience, food attributes, cooking method/process, product ingredients, and eating experience; or product recommendation, cooking method/process, eating experience, brand promotion, and product utility; or brand promotion, product recommendation, packaging design, cooking method/process, eating experience, and action guidance.
  • For other products, the corresponding script information may include specific themes, appearance introduction, efficacy, product attribute 1, production process, product attribute 2, action guidance, and appearance introduction; or product cost performance, product/brand recommendation, applicable scenes/crowds, product texture, and product/brand recommendation; or video clips or shots such as audience interaction, product attributes, efficacy, and product/brand recommendation.
  • the video template may refer to the form packaging of the video.
  • In some embodiments, the video template may include an opening sequence, an ending sequence, watermarks, subtitles, titles, borders, filters, and so on.
  • The opening/ending sequences refer to audio-visual material added at the beginning and end of the video to create an atmosphere, build momentum, attract attention, and present the title of the work, the production team, and product information.
  • A watermark refers to a pattern attached to the video that reflects company or product information or a personalized design.
  • Subtitles refer to non-visual content such as dialogue and product/topic introductions displayed in the video in the form of text.
  • The title is a short sentence indicating the content of the video.
  • A border refers to one or more patterns of specific shapes (for example, strips) surrounding the video page, and a filter refers to an operation used to achieve a special display effect on the image.
  • In some embodiments, the video template may be a template material in Adobe After Effects (AE), which is commonly used software in the field of video production and will not be detailed here.
  • the video template may be related to the subject (e.g., product, model, etc.), theme (e.g., charity, entertainment, education, life, shopping, etc.), video effect (e.g., promotion/advertising effect), etc.
  • For example, the database 140 and/or other data sources 150 may be preset with multiple video templates corresponding to different subjects, themes, video effects, etc.; after the specific subject, specific theme, and video effect of the target video are determined, the corresponding video template is called from the database 140 and/or other data sources 150.
  • As another example, the user may determine the video template by presetting its opening, ending, watermark, subtitles, title, border, and/or filters based on different subjects, themes, and/or video effects.
  • the video configuration information may include at least one preset condition (for example, a first preset condition, a second preset condition, etc.) related to the content of the video segment included in the target video.
  • the first preset condition may be related to at least one of the multiple elements of the content feature in the configuration feature.
  • In some embodiments, the multiple elements may include: whether the target video contains a specific object; whether the target video contains a specific theme; the total duration of the target video; the number of shots contained in the target video; the specific shot pictures contained in the target video; the number of repetitions of a specific theme in the target video; or the focus time of a specific theme in the target video.
  • In some embodiments, the first preset condition may implement screening of the multiple video clips by constraining at least one of the multiple elements.
  • For example, the first preset condition may require that the last shot of the target video be a product display shot, so that a video clip containing a product display shot is selected from the multiple video clips as the last video clip of the target video; a sketch of this kind of screening follows.
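  • A minimal sketch of this element-based screening; the tag vocabulary and clip structure are hypothetical, since the patent does not specify how clip content is annotated:

```python
def pick_last_clip(clips):
    """Screen clips with a first preset condition: product display shot last."""
    return [c for c in clips if "product_display" in c["tags"]]

clips = [
    {"id": "A", "tags": ["use_experience"]},
    {"id": "B", "tags": ["product_display"]},
]
print([c["id"] for c in pick_last_clip(clips)])  # ['B']
```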
  • For the specific content of the first preset condition, refer to process 800 and its related description.
  • the second preset condition may be related to the difference degree feature of the content feature in the configuration feature.
  • the second preset condition may be related to the difference degree of the segment set.
  • The segment set refers to one or more sets formed by grouping video segments that meet specific conditions (for example, video segments that meet the first preset condition, also called candidate video segments).
  • In some embodiments, the second preset condition may implement the screening of segment sets by restricting the degree of difference between segment sets.
  • the target video may be generated based on a filtered set of segments.
  • For the specific content of the second preset condition, refer to process 900 and its related descriptions.
  • the video configuration information includes sequence information.
  • In some embodiments, the sequence information may be related to the arrangement features of the configuration features; that is, the sequence information can determine the arrangement of each video segment in the target video. For example, when generating the target video, the candidate video clips in a clip collection may be sorted based on the sequence information. For the specific content of the sequence information, refer to process 800 and its related descriptions.
  • the video configuration information includes beautification parameters.
  • In some embodiments, the beautification parameters may be related to the arrangement features of the configuration features; the target video can be beautified using the beautification parameters to obtain better visual effects.
  • the beautification parameters may include at least one of filter parameters, animation parameters, and layout parameters.
  • the beautification parameter may be used to beautify at least part of the video clips in the plurality of video clips.
  • the beautification parameter may also be directly used for the target video, initial image, initial video, and so on.
  • Step 230 Generate a target video based on at least part of the video segment and the video configuration information.
  • In some embodiments, the multimedia system 100 may determine at least one segment set from the multiple video segments based on the video configuration information to generate the target video. For example, the multimedia system 100 may obtain one or more candidate video segments from the multiple video segments based on the first preset condition, group the one or more candidate video segments to determine at least one segment set satisfying the second preset condition, and generate, for each segment set in the at least one segment set, a target video according to the corresponding sequence information.
  • Alternatively, the multimedia system 100 may generate a plurality of candidate segment sets based on the plurality of video segments, the plurality of candidate segment sets satisfying the second preset condition; filter out at least one segment set from the plurality of candidate segment sets based on the first preset condition; and generate, for each segment set in the at least one target segment set, a target video according to the corresponding sequence information.
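  • Putting these orders of operations together, a compact sketch of the overall generation loop; the data model and the three predicate/sequencing arguments are hypothetical stand-ins for the operations described above, not the patent's implementation:

```python
from itertools import permutations

def generate_target_videos(clips, first_ok, sets_differ, sequence_key):
    # 1. Screen candidate clips with the first preset condition.
    candidates = [c for c in clips if first_ok(c)]
    # 2. Group candidates into segment sets (here: all 2-clip orderings).
    kept = []
    for s in permutations(candidates, 2):
        # 3. Keep a set only if it differs enough from already-kept sets
        #    (the second preset condition).
        if all(sets_differ(s, k) for k in kept):
            kept.append(s)
    # 4. Sort each kept set by the sequence information; one video per set.
    return [sorted(s, key=sequence_key) for s in kept]

clips = [{"id": "A", "dur": 4}, {"id": "B", "dur": 6}, {"id": "C", "dur": 9}]
videos = generate_target_videos(
    clips,
    first_ok=lambda c: c["dur"] <= 6,                     # duration constraint
    sets_differ=lambda a, b: {c["id"] for c in a} != {c["id"] for c in b},
    sequence_key=lambda c: c["id"],                       # fixed ordering
)
print([[c["id"] for c in v] for v in videos])
```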
  • the type of the target video may include, but is not limited to, advertising video, promotional video, video web log (vlog), entertainment short video, and the like.
  • For the specific method of generating the target video, reference may be made to process 800 or 1200 and their related descriptions.
  • the multimedia system 100 may beautify at least a part of the video clips or the target video in the plurality of video clips based on the beautification parameters to obtain a better visual effect.
  • the multimedia system 100 may also perform conventional video processing on at least part of the video clips or the target video in the plurality of video clips, for example, cropping, zooming, editing based on templates, and so on.
  • the multimedia system 100 may further perform post-processing on the target video to satisfy at least one video output condition.
  • the at least one video output condition is related to a playback medium of the target video.
  • the at least one video output condition is a video size condition.
  • the video size condition may be determined based on the size of the video playback medium.
  • Post-processing the target video may include cropping a frame of the target video according to the video size condition.
  • Fig. 3 is an exemplary flowchart of a method for determining a video segment according to an initial image or an initial video according to some embodiments of the present application.
  • In some embodiments, one or more steps in the process 300 may be stored as instructions in a storage device (for example, the database 140) and called and/or executed by a processing device (for example, the processing device 112 in the server 110).
  • the process 300 may include:
  • Step 310 Obtain at least one of an initial image or an initial video.
  • the initial video may refer to a dynamic image, and the dynamic image may be composed of a series of video frames.
  • the initial video may include video files in various formats such as MPEG, AVI, ASF, MOV, 3GP, WMV, DivX, XviD, RM, RMVB, FLV/F4V, etc.
  • the initial video may also include audio files (audio tracks) corresponding to the moving images.
  • the initial video may include promotional videos, personal recording videos, audiovisual images, network videos, advertisement clips, product demos, movies, and short films or movies containing related products and models.
• The initial image can refer to a static image; for example, the initial image can include picture files in various formats such as bmp, jpg, png, tif, gif, pcx, tga, exif, fpx, svg, psd, cdr, pcd, dxf, ai, raw, etc.
  • the initial image may include photos taken by the camera, advertising images, product renderings, posters, and the like.
• The initial video and the initial image may be captured by a camera or video/image processing equipment and stored in the database 140 and/or the other data source 150. Specifically, step 310 may be implemented by retrieving the corresponding initial video and/or initial image from the database 140 and/or the other data source 150.
  • the initial video and the initial image may be network public material resources, such as image resources and video resources in various open databases, and step 310 may be implemented by obtaining public materials.
  • the multimedia system 100 may also obtain the initial video and the initial image through other direct or indirect methods. For example, the multimedia system 100 directly obtains a video file or an image file uploaded by a user, or obtains a video file or an image file based on a link input by the user.
  • Step 320 Perform editing processing on the initial image or the initial video to obtain multiple video clips.
• Step 320 may determine the segment boundaries of the initial video or the initial image, segment or group the initial video or initial image based on the segment boundaries to determine multiple shot segments of the initial video or initial image, and then take the shot segments as the multiple video segments. For example, if the initial video is a cooking video, the cooking steps, dish production steps, and tasting steps in the initial video can be divided into different shots, and each shot can then be used as one of the multiple video clips. For details on determining video segments based on the initial video, reference may be made to Fig. 4 of this application and its related description.
  • each video segment may be a shot segment.
  • the initial video or the initial image often includes multiple shots, and therefore, the initial image or the initial video needs to be split. It is understandable that a video clip may also include multiple shots.
• For a video segment containing multiple shot segments, the initial video or the initial image can be split directly according to the number of shot segments.
• Alternatively, the video segment may first be split into multiple video segments each containing only one shot segment, and the multiple video segments can then be spliced into one video segment according to the constraint conditions (for example, binding conditions) of the video segments.
  • an initial video may include only one shot segment, and the entire initial video is treated as a shot segment and output as a video segment.
• An initial video may be formed by splicing multiple shots, and one or more consecutive video frames at the junction of two adjacent shots may be called a segment boundary (also called a shot boundary frame).
• The initial video may be segmented in units of shot segments to obtain each video segment. When splitting the initial video into multiple shot segments, the splitting may be performed at the segment boundaries; that is, the segment boundaries are used as cutting points to split the initial video into multiple video segments.
  • Fig. 4 is an exemplary flowchart of a method for editing an initial image or an initial video according to some embodiments of the present application, which specifically involves splitting the initial image or the initial video.
• One or more steps in the process 400 may be stored as instructions in a storage device (for example, the database 140), and called and/or executed by a processing device (for example, the processing device 112 in the server 110).
  • the process 400 may specifically include the following steps:
  • Step 410 Obtain the characteristics of each pair of adjacent images or video frames in the initial image or the initial video.
• An image embedding model may be used to obtain the feature information of each pair of adjacent images or video frames in the initial image or the initial video (for example, an advertisement video).
  • the processing method adopted for the initial image is similar to that of the initial video.
• The multimedia system 100 can input the initial video into the image embedding model.
• The image embedding model can extract the image of each video frame that constitutes the initial video, extract the features of the images of the video frames, and generate a vector corresponding to the image of each video frame.
• Alternatively, the images of the video frames that have already been extracted may be input into the image embedding model, and the image embedding model outputs the vector corresponding to the image of each video frame.
• The feature information of the video frames can be obtained based on a mobilenet model (such as a mobilenetV2 model) pre-trained on the ImageNet image library.
  • the mobilenetV2 model can extract the image features of each video frame more accurately and quickly.
  • each video frame can be input into the mobilenetV2 model, and the normalized 1280-dimensional vector corresponding to each video frame can be output through the mobilenetV2 model.
• Other machine learning models with similar functions can also be used to obtain the feature information of the video frames, such as the GoogLeNet model, the VGG model, the ResNet model, etc., which is not limited in this application.
• In this way, the shot boundary frames can be determined more accurately, so as to realize accurate segmentation of the shot segments, making the subsequent cropping of the initial image or the initial video easier to operate and avoiding cutting off the main information of the initial image or initial video.
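• As a non-limiting illustration, the following sketch shows how such per-frame feature vectors could be extracted with an ImageNet-pretrained MobileNetV2; the use of torchvision, the 224×224 preprocessing, and pooling the backbone output into a 1280-dimensional vector are assumptions of this sketch, not requirements of the embodiments.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# ImageNet-pretrained MobileNetV2 backbone, as suggested above.
model = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.IMAGENET1K_V1)
model.eval()

# Standard ImageNet preprocessing (an assumption of this sketch).
preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def frame_embedding(frame_rgb):
    """Map one RGB video frame (H x W x 3, uint8) to a normalized 1280-dim vector."""
    x = preprocess(frame_rgb).unsqueeze(0)
    with torch.no_grad():
        feat = model.features(x)  # (1, 1280, 7, 7) feature map
        vec = torch.nn.functional.adaptive_avg_pool2d(feat, 1).flatten()
    return vec / vec.norm()  # L2-normalize, matching the 1280-dim vector above
```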
  • Step 420 Determine the similarity of each pair of adjacent images or video frames.
• This can be achieved by calculating, according to the feature information of the video frames, the similarity between each video frame and a selected video frame among the multiple video frames.
  • the inner product of the feature vectors of two video frames may be used as the similarity between the two video frames.
• Calculating the similarity between each pair of adjacent images may be calculating the similarity between each video frame and its preceding and/or following adjacent video frames; the similarity can also be calculated between each video frame and the video frame a preset number of frames before and/or after it.
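• A minimal sketch of the similarity computation described above, assuming the frame embeddings are already L2-normalized so that the inner product equals cosine similarity:

```python
import numpy as np

def pairwise_similarity(embeddings, gap=1):
    """Inner product between frame i and frame i + gap.

    gap=1 compares adjacent frames; a larger gap supports the
    preset-interval comparison described above.
    """
    E = np.asarray(embeddings)
    return np.sum(E[:-gap] * E[gap:], axis=1)  # sim[i] = <E[i], E[i+gap]>
```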
  • Step 430 Identify the segment boundary based on the similarity of each pair of adjacent images or video frames.
  • the segment boundary may include a hard-cut boundary frame or a soft-cut boundary frame.
• Identifying the segment boundary may include determining the hard-cut boundary frames of the shot segments. If no transition effect is used between two adjacent shot segments, and the two adjacent video frames of the two adjacent shot segments jump directly, the two adjacent video frames can be understood as hard-cut boundary frames. In the process of determining hard-cut boundary frames, the similarity between each video frame and its preceding and/or following adjacent video frame can be calculated; if the similarity between two adjacent video frames is lower than a similarity threshold, the two adjacent video frames are determined to be hard-cut boundary frames.
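• For illustration, hard-cut boundary pairs could then be found by thresholding the adjacent-frame similarities; the threshold value here is illustrative only:

```python
def hard_cut_boundaries(similarities, threshold=0.8):
    """Return indices i where frames i and i + 1 are dissimilar enough
    to be treated as a hard-cut boundary frame pair."""
    return [i for i, s in enumerate(similarities) if s < threshold]
```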
• Identifying the segment boundary may include determining the soft-cut boundary frames of the shot segments. If a transition effect is used between two adjacent shot segments, so that the adjacent video frames of the two adjacent shot segments do not jump directly, the several video frames used for the transition between the two shot segments can be understood as soft-cut boundary frames.
  • the soft cut boundary frame can be determined by the following methods:
  • the candidate segmentation area can be determined by calculating the similarity between each video frame and the video frame with a preset number of frames before and/or after.
• The preset interval frame number can be set to 2 frames, 3 frames, 5 frames, and so on. If the calculated similarity between the two video frames is less than the preset threshold, the video frames between the two video frames are used as a candidate segmentation area, and the two video frames are used as the boundary frames of the candidate segmentation area. For example, if the preset interval frame number is 2 frames, the similarity between the 10th frame and the 14th frame can be calculated; if the similarity is less than the similarity threshold, the 12th and 13th frames are used as a candidate segmentation area, and the 10th and 14th frames are the boundary frames of the candidate segmentation area.
• Overlapping candidate segmentation areas can then be merged together. If the 12th and 13th frames form a candidate segmentation area, and the 13th and 14th frames also form a candidate segmentation area, the 12th, 13th, and 14th frames are merged into one candidate segmentation area.
  • the candidate segmentation regions can be further screened.
  • the candidate segmentation area may be filtered based on the similarity S1 within the candidate segmentation area and the similarity S2 outside the candidate segmentation area.
• The method of calculating the similarity S1 within the candidate segmentation area may be: calculating the similarity between each boundary frame of the candidate segmentation area and the video frame located in the candidate segmentation area and separated from that boundary frame by a preset number of frames, and taking the minimum of the calculated similarities as the similarity S1 within the area.
• The method of calculating the similarity S2 outside the candidate segmentation area may be: calculating the similarity between the video frame at the front of the candidate segmentation area and the video frame a preset number of frames before it, calculating the similarity between the video frame at the back of the candidate segmentation area and the video frame a preset number of frames after it, and taking the minimum of the two similarities as the similarity S2 outside the area. For example, if the candidate segmentation area is the 12th and 13th frames and the preset interval frame number is 2, the similarity between the 10th frame and the 12th frame and the similarity between the 13th frame and the 15th frame are calculated, and the minimum of the two similarities is taken as the similarity S2 outside the area. If the value of S2 is greater than S1 by more than a threshold, the candidate segmentation area is deemed to be a final segmentation area, and the shot segmentation operation is performed based on the final segmentation area.
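• The following sketch shows one plausible reading of the soft-cut procedure above (candidate areas from interval comparison, merging of overlapping areas, and S1/S2 screening); the exact definitions of S1 and S2, the gap, and the margin are simplified assumptions of this sketch:

```python
import numpy as np

def soft_cut_regions(embeddings, gap=2, sim_thresh=0.8, margin=0.1):
    """Candidate soft-cut areas from interval comparison, merged and then
    screened by comparing similarity outside the area (S2) against
    similarity across its boundary (S1)."""
    E = np.asarray(embeddings)
    sim = lambda a, b: float(np.dot(E[a], E[b]))

    # 1. Candidate areas: frames strictly between a dissimilar (i, i+gap) pair.
    candidates = [(i + 1, i + gap - 1)
                  for i in range(len(E) - gap)
                  if sim(i, i + gap) < sim_thresh]

    # 2. Merge overlapping or adjacent candidate areas.
    merged = []
    for lo, hi in candidates:
        if merged and lo <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))

    # 3. Screening: keep areas where the outside similarity S2 clearly
    #    exceeds the boundary similarity S1.
    final = []
    for lo, hi in merged:
        if lo - gap < 0 or hi + gap >= len(E):
            continue
        s1 = min(sim(lo - 1, lo), sim(hi, hi + 1))
        s2 = min(sim(lo - gap, lo - 1), sim(hi + 1, hi + gap))
        if s2 - s1 > margin:
            final.append((lo, hi))
    return final
```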
  • Step 440 Based on the segment boundaries, divide the initial image or the initial video to obtain multiple video segments.
  • the initial video can be split by using the segment boundary as the cutting point, so that each split video segment contains a shot segment.
• The initial images can also be processed based on the method shown in Fig. 4 to obtain the images determined to be non-repetitive among the initial images as corresponding video segments, or subjected to other editing processing to obtain corresponding video segments.
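• A minimal sketch of the splitting in step 440, assuming the segment boundaries have been reduced to cut indices in the frame sequence:

```python
def split_at_boundaries(frames, cut_points):
    """Split a frame sequence into shot segments, using each segment
    boundary index as a cutting point so that every resulting segment
    contains a single shot."""
    segments, start = [], 0
    for cut in sorted(cut_points):
        segments.append(frames[start:cut + 1])
        start = cut + 1
    segments.append(frames[start:])
    return [s for s in segments if s]
```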
  • the present application may also determine a specific subject (also referred to as an object) related to the specific theme according to the specific theme of the target video.
• The specific subject here can be understood as the object or main object contained in the video frames/shot segments related to the specific theme (which may also be expressed as the target video theme, cropping theme, etc.), and can include living things (humans, animals, etc.), merchandise (cars, daily necessities, decorations, cosmetics, etc.), background (mountains, roads, bridges, houses, etc.), and so on. For example, when the specific theme of the target video is advertising, the specific object can include a combination of one or more of people, objects, or logos; specifically, the person can be an event/product spokesperson, the item can be a corresponding product, and the logo can be a product trademark, a regional mark, etc.
• In different descriptions, specific objects may be referred to by different names: a specific object can be expressed as a specific subject, as a subject, or as the cropped subject.
• The process of determining a specific subject based on a specific theme can be achieved by determining the specific subject from candidate subjects based on the specific theme.
• The specific subject can be automatically selected based on the degree of relevance between the multiple candidate subjects included in the video clip and the specific theme. For example, the relevance of each candidate subject to the specific theme is ranked, and the top X candidate subjects are then selected.
  • X can be set to 1, 2, or 4, etc.
  • the top X candidate subjects can be determined as the specific subjects of the video clip.
• For the degree of association between a candidate subject and a specific theme, reference may be made to the description in the process 1600 of the correlation between one or more cropped subjects in the cropped-subject information and the theme information.
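• As an illustration of the top-X selection described above, assuming relevance scores between candidate subjects and the specific theme are already available (e.g., from semantic analysis):

```python
def select_specific_subjects(candidates, relevance, top_x=2):
    """Rank candidate subjects by relevance to the specific theme and keep
    the top X. `relevance` maps each candidate to a score in [0, 1]; how
    the score is produced (e.g., semantic analysis) is left open."""
    ranked = sorted(candidates, key=lambda c: relevance[c], reverse=True)
    return ranked[:top_x]

# e.g. for a "lipstick" theme (illustrative scores):
# select_specific_subjects(
#     ["mouth", "lipstick", "tree", "road"],
#     {"mouth": 0.9, "lipstick": 0.95, "tree": 0.1, "road": 0.05},
#     top_x=2)  # -> ["lipstick", "mouth"]
```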
  • the candidate subject may be a subject set in the database 140 in advance.
  • Candidate subjects can be set specifically for a specific type of subject.
• The candidate subject may be a commodity or a human face (including facial features such as the eyes, nose, and mouth).
  • the candidate subject can also be the subject in each video frame/shot segment.
• The process of obtaining candidate subjects can be achieved by determining the candidate subjects of each video segment through a machine learning model. For example, the machine learning model determines the main content in the video frames of each video clip, the main content is used as a candidate subject, the correlation between the candidate subject and the specific theme is then determined by semantic analysis, and each specific subject is determined accordingly. This method can give the selected specific subjects a higher degree of relevance to the specific theme.
• The specific subjects of a video clip can be determined based on a specific theme. For example, the theme information of the target video is lipstick.
• The video clip may contain one or more subjects.
• The subjects of each video clip identified by the machine learning model are used as candidate subjects, which may include the nose of a human face, the mouth of a human face, the eyes of a human face, a commodity (lipstick), trees, roads, and houses. Based on the specific theme being lipstick, the mouth of the face and the commodity (lipstick), i.e., the candidate subjects with a high correlation with lipstick, can be further selected and become the specific subjects of the corresponding video clips.
  • the target video may include multiple video themes.
• For example, a tooth-protection promotion video may be spliced before the advertisement video of a mouthwash.
  • different video themes may contain different specific subjects.
• The number of overlaps of a specific theme can be determined according to the specific subjects contained in the video clip and the correspondence between the specific subjects and the specific themes (different video themes), for example, the aforementioned mouthwash advertisement with a tooth-protection promotion.
• A shot related to the effect display can be related to two specific themes at the same time.
• For example, the shots related to the effect display can include the effect of not brushing, the effect of brushing with a toothbrush, and the effect of using mouthwash; while promoting tooth protection, they also highlight the cleaning effect of the mouthwash.
• The focus time of a specific theme in the target video can be determined by counting the focus time corresponding to the specific theme (i.e., the duration of the video clips that take the specific theme as their focus or main content).
  • Fig. 5 is an exemplary flowchart of a method for editing an initial image or an initial video according to some embodiments of the present application.
• One or more steps in the process 500 may be stored as instructions in a storage device (for example, the database 140), and called and/or executed by a processing device (for example, the processing device 112 in the server 110).
  • Step 510 Determine the subject information of the initial image or the initial video.
  • the subject information includes at least the subject and the position of the subject.
• The initial image or initial video usually includes one or more subjects to be highlighted.
  • the subject may specifically be the object most relevant to the target video theme among the various objects appearing in the video clip.
• The subject may also be the object occupying the largest proportion of the video frame.
  • the subject may be one or more of products (electronic products, daily necessities, decorations, etc.), creatures (humans, animals, etc.), or landscapes (mountains, houses, etc.), and the like.
• For example, there is one subject, and the subject is a model.
  • the subject may be manually imported.
  • the user may select the subject from the database 140 or the terminal device 130.
  • the model can be used as the main body of the target video (specific theme).
  • the corresponding video clip, initial video, and initial image should contain the subject.
• For example, the user selects an image of the model from the database 140, and the processor further processes the initial image or initial video to determine the position of the model in each video frame.
• The selection of the subject can be achieved by uploading a specific image, manually selecting a specific image in a video frame, semantic recognition, and similar methods. For example, after the user enters the text content "model", the multimedia system 100 can automatically recognize the "model image" in each initial video and initial image as the subject through semantic recognition.
  • the subject information includes at least the presence or absence of the subject and the subject position of the corresponding subject.
  • the subject information can also include the subject's color information, size information, name information, category information, or facial recognition data.
  • the position of the subject is understood as the information about the position of the subject in the picture and/or video, for example, it may be the information of the coordinates of the reference point.
  • the size information of the main body may include information about the actual size of the main body and information about the proportion of the main body in the size of the screen of the advertisement video.
  • the category information of the subject can be understood as the category of the subject.
• The category information of the subject includes the information that the subject is classified as a product or a model, and it can be further refined into a certain category of product information, for example, that the product is a mobile device.
  • the subject information can be characterized by tags (such as tag ID and tag value).
• Tags can be added to the video frames in the initial video, and a tag can represent the name of a subject included in the image or video. If a poster includes product A, product B, and model A, tags for product A, product B, and model A can be added to the poster (for example, the tag IDs corresponding to product A, product B, and model A are added, and the corresponding tag values are set to 1).
  • each image or video to be selected in the database 140 may hold a tag.
• In this way, the database 140 can automatically match the poster associated with the aforementioned product A, product B, and model A, and extract the poster as the initial image.
• Alternatively, the video content may be directly processed further (for example, the video frames of each video are analyzed through the subject information determination model), thereby obtaining tags containing the subject and the subject position.
  • the method of filtering video frames can also be implemented by a machine learning algorithm, that is, a machine learning model is used to identify whether each video segment contains a specific object.
• For example, the theme of the target video can be tooth-protection promotion, and the corresponding specific objects can be “teeth”, “doctor”, “dental appliance”, and other specific objects related to the theme (tooth protection).
• Based on the determined specific objects, the machine learning algorithm can be used to determine whether each video segment contains a specific object.
  • the initial image or the initial video may be processed by a subject information determination model (for example, a machine learning model) to obtain the subject and the location of the subject.
• The subject information determination model can be a generative model, a discriminative model, or a deep learning model in machine learning; for example, it can be a deep learning model using algorithms such as the YOLO series algorithms, the Faster R-CNN algorithm, or the EfficientDet algorithm.
  • the subject information determination model can perform subject information confirmation alone, or can be combined with other processes/steps (such as process 400) to determine the subject and the location of the subject.
  • the subject information determination model can be trained separately or together with other steps.
• For example, when a deep learning model is used for the editing processing (for example, the process 400), manually labeled object positions and categories can be used as training samples to train the model, so that the model can accurately label the subject in the initial video.
• The image embedding model can further be used as the subject information determination model to extract the images that make up each video frame in the initial video and extract the image features of the video frames, so as to determine the subject of the initial video.
  • the initial image may be processed by the subject information determination model to obtain the position of the subject.
  • the image embedding model can continue to be used to determine the position of the subject. It is understandable that a single image in a video frame can be regarded as a picture, and the image embedding model that can process multiple video frames can also process the initial image.
  • the image embedding model for initial image and initial video processing can be trained separately or together.
  • the determination of the subject position in the initial image can also be used in the initial video.
• The deep learning model may be, for example, a deep learning model using algorithms such as the YOLO series algorithms, the R-CNN algorithm, or the EfficientDet algorithm.
  • the subject information determination model can perform the operation of determining subject information alone; the subject information determination model can also be combined with other operations to realize the determination of subject information during the execution of other operations.
  • the initial image or initial video may be directly input to the subject information determination model, so that the subject information determination model marks the corresponding subject and subject position in the corresponding video frame, thereby determining the subject information of the initial image or initial video .
• One or more video clips may be obtained after a long initial video is edited as shown in the process 400.
• The subject information determination model can be combined with the related operations of the process 400, so that video segments with a specific subject are retained during the splitting or editing of the process 400; for example, the retained subject can be a specific object related to a specific theme of the target video.
  • the subject information determination model may be used to label and crop the subject to ensure that the subject is included in the cropped video.
  • the subject information determination model can be combined with the related operations of step 310.
• For example, the image embedding model extracts the image features of the subject (that is, the image features of the specific object determined as the subject, such as the image features of a "model"), a series of video frames containing the subject are determined in the database 140, and the shot segment composed of that series of video frames is the initial video or initial image containing the subject.
  • Step 520 Perform editing processing on the initial image or the initial video based on the subject information to obtain multiple video clips.
  • the editing process can avoid the range of the subject according to the determined position of the subject, so as to generate a video clip that meets the requirements.
• In order to improve processing accuracy, the outer contour of the subject can be used to avoid the influence of the editing processing on the subject; that is, the outer contour distinguishes the subject from the background part of the video.
  • the information of the subject may also include color information and size information, etc. Obviously, the outer contour of the subject can be determined more quickly and efficiently based on the color information and size information on the basis of the position of the subject.
  • the outer contour of the main body can be determined by the size of the main body.
• The smallest rectangular box containing the subject can be determined according to the size of the subject, and the smallest rectangular box can be used as the outer contour of the subject.
• The outer contour of the subject can also be determined by the edge of the subject, where the edge refers to the boundary between the subject and the image background. For example, after the position of the subject is determined, an image recognition algorithm (such as an edge detection algorithm) is used to determine the edge of the subject, and the edge of the subject is used as the outer contour of the subject.
  • the area obtained by preset processing of the edge of the subject may also be used as the outer contour of the subject.
  • the preset processing may include one or more combinations of smoothing, area scaling, and the like.
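• A sketch of the two outer-contour options above using OpenCV; the (x, y, w, h) box format and the Canny thresholds are assumptions of this sketch:

```python
import cv2
import numpy as np

def subject_outline(image_bgr, subject_box):
    """Two outer-contour options from the text: fall back to the smallest
    rectangle containing the subject, or tighten it around detected edges.
    `subject_box` is an (x, y, w, h) box from the subject information model."""
    x, y, w, h = subject_box
    roi = image_bgr[y:y + h, x:x + w]
    edges = cv2.Canny(cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY), 100, 200)
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return subject_box  # no edges found: keep the rectangular contour
    bx, by, bw, bh = cv2.boundingRect(np.vstack(contours))
    return (x + bx, y + by, bw, bh)  # tightest rectangle around the edges
```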
  • the initial image or initial video is cropped or zoomed according to the outer contour of the subject to obtain a video segment that meets the requirements.
• A video clip that meets the requirements may be a video clip obtained by editing the initial image or initial video without affecting the subject. This can be achieved by cropping the initial image or initial video while avoiding the outer contour of the subject, and by scaling the initial image or initial video while maintaining the aspect ratio within the outer contour of the subject.
• The editing processing may also include any of the editing methods mentioned in this application. For example, when performing beautification operations, blurring can be applied only to the image outside the outer contour of the subject, so as to highlight the subject.
  • cutting the initial image or the initial video to avoid the outer contour of the subject can be achieved by matting.
  • the outer contour of the subject in the initial image or initial video has been identified, and the matting algorithm can be used to avoid the outer contour of the subject and separate the subject from the initial image or initial video.
• The processing methods for the separated subject include, but are not limited to, locking it or creating a new layer; after the subject is locked or a new layer is created, further processing can be performed on the background part.
• The matting algorithm may be a matting algorithm based on deep learning, such as learning-based digital matting, K-nearest-neighbor matting (KNN matting), and so on.
• The matting algorithm may also be at least one of cluster-based sampling matting (Cluster-Based Sampling matting, CBS) and iterative transductive learning for alpha matting (Iterative Transductive Learning for alpha matting, ITL).
• Scaling the initial image or initial video while maintaining the aspect ratio within the outer contour of the subject can be achieved by separating the subject from the background. Specifically, in order to avoid deformation and distortion of the subject during the zooming process, the subject and the background part are zoomed separately, and the aspect ratio within the outer contour of the subject is maintained during zooming.
• For example, the initial image can be a poster with a pixel size of 800×600, in which the subject is a mobile phone with a pixel size of 150×330 (the aspect ratio of the subject is 5:11), and the size of the target video or video clip is 1200×800, so the initial image needs to be scaled to 1200×800.
• If the whole image were scaled directly, the scaled size of the subject would be 225×440, i.e., an aspect ratio of 5:9.8, and the subject would be deformed.
• The deformation of the subject in the target video or video clip may adversely affect the effect of the video and the customer's perception of the product.
• The method of maintaining the aspect ratio within the outer contour of the subject may be to obtain the scale ratios of the initial image or initial video to the target video size (or video clip size) in the width direction and the length direction separately; in this example, the initial image is scaled by 1.5 times in the width direction and by about 1.33 times in the length direction, while the aspect ratio within the outer contour of the subject is kept unchanged.
• The outer contour of the subject may not be a rectangle, and the above scaling method is still applicable.
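• The arithmetic above can be summarized in a short sketch; scaling the subject by the smaller of the two background scale factors is one possible way (an assumption of this sketch) of keeping its aspect ratio while the background is stretched:

```python
def scale_keeping_subject(bg_size, subject_size, target_size):
    """Reproduce the arithmetic above: the background is stretched to the
    target size, while the subject is scaled uniformly so its aspect
    ratio is preserved."""
    sx = target_size[0] / bg_size[0]  # 1200 / 800  = 1.5
    sy = target_size[1] / bg_size[1]  # 800  / 600 ~= 1.33
    s = min(sx, sy)                   # uniform factor for the subject
    return (sx, sy), (subject_size[0] * s, subject_size[1] * s)

# (sx, sy), subject = scale_keeping_subject((800, 600), (150, 330), (1200, 800))
# background stretched 1.5x / 1.33x; the subject stays at 5:11
# instead of being deformed to 225x440 (about 5:9.8).
```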
• The image processing method is similar to the video processing method and will not be repeated here.
• When the size ratio of the background of the initial image or the initial video is inconsistent with the size of the target video or video segment, directly scaling may cause the ratio to change; in this case, the background part can be cropped first and then zoomed.
  • Fig. 6 is an exemplary flow chart of generating a video according to some embodiments of the present application.
• One or more steps in the process 600 may be stored as instructions in a storage device (for example, the database 140), and called and/or executed by a processing device (for example, the processing device 112 in the server 110).
  • the process 600 specifically includes the following steps:
  • Step 610 Acquire at least one of an initial video and an initial image.
  • the initial video and the initial image may also be referred to as the to-be-processed video and the to-be-processed image, respectively.
  • Step 610 can be implemented with reference to step 310 in the process 300, and step 610 can also be implemented with reference to step 510 in the process 500.
  • Step 620 Obtain the main body of the target video in the initial video.
• The subject of the target video here can be understood as each specific object corresponding to the specific theme of the target video. For example, for the aforementioned target video with "tooth-protection promotion" as its theme, the subject of the target video (a specific object corresponding to the specific theme) may be a specific object related to the theme (tooth protection) such as "teeth", "doctor", or "dental appliance".
• Step 630 Crop, zoom, and/or edit the initial video based on the preset size of the target video and the subject, to obtain video materials that all include the subject.
• The video material may be a video segment determined from the initial video, and video materials that all include the subject refer to video segments that are determined based on the initial video and all contain the subject (specific object) of the target video.
  • the target video may be preset with a preset size as required.
• When the initial video size does not match the target video size, or the size ratio does not match, the initial video may be cropped, zoomed, and/or edited.
• For example, the target video size is FHD (Full High Definition, 1920×1080).
• When the size ratios match, the initial video can be scaled to obtain a video with the same size as the target video, i.e., a 1920×1080 video.
• Otherwise, a cropping target size of, for example, 1024×768 is determined according to the target video size ratio; that is, the initial video is first cropped frame by frame, and the cropped 1024×768 video is then enlarged to 1920×1080 in equal proportions.
• When the initial video size is larger than the target video size, for example, when the initial video size is 2560×2560, the initial video can be directly cropped to the target video size of 1920×1080, or, following the above steps, first cropped to 2560×1440 and then scaled in equal proportions.
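• A sketch of the frame-by-frame crop-then-scale described above; center cropping is an assumption of this sketch, whereas the embodiments crop around the subject:

```python
import cv2

def crop_then_scale(frame, target_w=1920, target_h=1080):
    """Center-crop a frame to the target aspect ratio, then resize to the
    target size (e.g., FHD)."""
    h, w = frame.shape[:2]
    target_ratio = target_w / target_h
    if w / h > target_ratio:          # frame too wide: crop width
        new_w = int(h * target_ratio)
        x0 = (w - new_w) // 2
        frame = frame[:, x0:x0 + new_w]
    else:                             # frame too tall: crop height
        new_h = int(w / target_ratio)
        y0 = (h - new_h) // 2
        frame = frame[y0:y0 + new_h, :]
    return cv2.resize(frame, (target_w, target_h))
```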
  • the method of cropping the video frame by frame in this step can refer to the image cropping method shown in FIG. 7 of this application.
  • the method of cropping the video frame by frame in step 630 can also refer to the video cropping method shown in FIG. 14 of the application.
• When the initial video size matches the target video size, or matches it after cropping and scaling, an initial video with a longer duration (for example, more than 15 seconds or 20 seconds) can be edited.
• Generally, one video material corresponds to one scene (shot segment), and playing the same scene for a long time may make the viewer feel bored.
  • the duration of the video material can also be adjusted by interpolation or sampling.
• Step 640 Crop and/or scale the initial image based on the preset size of the target video to obtain image materials that all include the subject.
• The image material may be determined from the initial image, and image materials that all include the subject refer to images that are determined based on the initial image and all contain the subject (specific object) of the target video.
• The image files among the initial images whose sizes do not match the target video size are cropped or zoomed; continuing with the example of an FHD target video, cropping and/or zooming yields an image with a size of 1920×1080 that includes the subject.
• Since at least one of the initial image and the initial video is acquired in step 610: when both are acquired, step 630 and step 640 are both performed, and there is no required sequence between the two steps; when only the initial video is acquired, step 630 may be performed without performing step 640; and when only the initial image is acquired, step 630 may be skipped and step 640 performed.
• Step 640 may be implemented by the method shown in Fig. 7, where Fig. 7 is an exemplary flowchart of generating a video according to some embodiments of the present application; one or more sub-steps in step 640 may be stored in a storage device (for example, the database 140) as instructions, and called and/or executed by a processing device (for example, the processing device 112 in the server 110).
  • Step 642 Obtain the subject information of the target video in the initial image.
  • the initial image may be processed by a subject information determination model (for example, a machine learning model) to obtain subject information.
• The subject information determination model can be a generative model, a discriminative model, or a deep learning model in machine learning; for example, it can be a deep learning model using algorithms such as the YOLO series algorithms, the Faster R-CNN algorithm, or the EfficientDet algorithm.
  • the initial image or initial video may be directly input to the subject information determination model, so that the subject information determination model marks the corresponding subject and subject position in the corresponding video frame, thereby determining the subject information of the initial image or initial video .
  • reference may be made to the related description of step 510 in the process 500.
  • Step 644 Identify the outer contour of the subject based on the subject information.
  • the outer contour of the subject is determined based on the subject position, so as to distinguish the subject from the background part in the initial image. It should be noted that in some other embodiments, the information of the subject may also include color information and size information, etc. Obviously, the outer contour of the subject can be determined more quickly and efficiently based on the color information and size information on the basis of the position of the subject.
  • the outer contour of the main body can be determined by the size of the main body.
• The smallest rectangular box containing the subject can be determined according to the size of the subject, and the smallest rectangular box can be used as the outer contour of the subject.
• The outer contour of the subject can also be determined by the edge of the subject, where the edge refers to the boundary between the subject and the image background. For example, after the position of the subject is determined, an image recognition algorithm (such as an edge detection algorithm) is used to determine the edge of the subject, and the edge of the subject is used as the outer contour of the subject.
  • the area obtained by preset processing of the edge of the subject may also be used as the outer contour of the subject.
  • the preset processing may include one or more combinations of smoothing, area scaling, and the like.
  • Step 646 Crop the initial image avoiding the outer contour of the subject.
  • the initial image can be cropped by avoiding the outline of the subject.
  • cropping of the initial image by avoiding the outer contour of the subject can be achieved by matting.
  • the outline of the subject can be avoided by the matting algorithm and the subject can be separated from the initial image.
  • the processing methods for the separated subject include but are not limited to locking or creating a new layer. After the subject is locked or a new layer is created, the background part can be further processed.
• The matting algorithm may be a matting algorithm based on deep learning, such as learning-based digital matting, K-nearest-neighbor matting (KNN matting), and so on.
• The matting algorithm may also be at least one of cluster-based sampling matting (Cluster-Based Sampling matting, CBS) and iterative transductive learning for alpha matting (Iterative Transductive Learning for alpha matting, ITL).
• Step 648 Zoom the image to be processed while maintaining the aspect ratio within the outer contour of the subject.
• Scaling the initial image while maintaining the aspect ratio within the outer contour of the subject can be achieved by separating the subject from the background. Specifically, in order to avoid deformation and distortion of the subject during the zooming process, the subject and the background part are zoomed separately, and the aspect ratio within the outer contour of the subject is maintained during zooming.
• For example, the initial image can be a poster with a pixel size of 800×600, in which the subject is a mobile phone with a pixel size of 150×330 (the aspect ratio of the subject is 5:11), and the size of the target video or video clip is 1200×800, so the initial image needs to be scaled to 1200×800.
• If the whole image were scaled directly, the scaled size of the subject would be 225×440, i.e., an aspect ratio of 5:9.8, and the subject would be deformed.
• The deformation of the subject in the target video or video clip may adversely affect the effect of the video and the customer's perception of the product.
• The method of maintaining the aspect ratio within the outer contour of the subject may be to obtain the scale ratios of the initial image to the target video size (or video segment size) in the width direction and the length direction separately; in this example, the initial image is scaled by 1.5 times in the width direction and by about 1.33 times in the length direction, while the aspect ratio within the outer contour of the subject is kept unchanged.
• For step 648, reference may also be made to step 520 in the process 500.
• Step 650 Splice the video clips based on the video configuration information to generate the target video.
  • the video clip in step 650 may include video material each including the subject and/or image material including the subject.
  • the video configuration information may be determined based on script information and/or video templates.
  • the video template may include the overall video template of the target video, and may also include the fragment video template of each video segment that composes the target video.
  • the video template can include at least a time parameter.
  • the time parameter at least reflects the length of the target video or video segment (the total duration of the target video).
• Processing the initial image and/or the initial video yields video segments (including image material and/or video material) consistent with the target video size; therefore, the splicing can play the video clips in an orderly manner based on the time parameter, either randomly or according to predetermined rules.
  • the predetermined rule may be that image materials and video materials are alternately spliced, or image materials are concentrated in the middle of the target video for playback, etc., or they may be played in a preset order corresponding to each video segment.
• The display time of a single picture (such as 3 seconds, 5 seconds, or 10 seconds) can be defined in the splicing, and the next material is switched to when the display time is reached.
• The duration of a video clip may be different from the time parameter in the clip video template; one or more methods, such as cutting the video clip, merging it with other video clips, sampling the video frames for playback, or interpolating between video frames for playback, can be combined to adjust the duration of the video clip.
  • the video template may also be used for steps such as obtaining an initial video or an initial image, and generating a video clip.
• That is, the object of the video template in this step may be the target video, a video clip, an initial image, or an initial video.
• The video template may also include beautification parameters, and the target video can be beautified by the beautification parameters to obtain a better effect.
• Alternatively, the aforementioned beautification parameters may not be included in the video template, and may instead be acquired before video rendering, before determining a video segment, and/or before generating a target video.
• The beautification methods described below can also be applied to the target video, video segments, initial images, and initial videos.
  • the beautification parameters may include at least one of filter parameters, animation parameters, and layout parameters.
  • the filter parameter can be to add a global effect filter (such as black and white, retro, bright, etc.) to the target video;
• The animation parameter can specify that, when the target video is spliced from multiple video clips, an animation effect is added between the video clips to make the target video more natural;
• Because the position of the subject differs from segment to segment, the subject position information can be marked in the material (for example, the subject is located in the upper left, upper right, lower left, lower right, etc. of the entire image/video), and the layout parameters combine and arrange the subject position information to make the target video smoother and the subject more prominent.
  • the beautification parameter may also include removing watermark or adding watermark, etc.
• A text layer, a background layer, or a decoration layer and the corresponding loading parameters may also be obtained based on the video configuration information. The layout of the text layer, the background layer, and the decoration layer in the target video is determined according to the loading parameters, and the text layer, the decoration layer, and/or the background layer are embedded into the video clips or the target video according to the layout during the splicing and rendering process.
  • the text layer may be subtitles or additional text introductions.
• The image material sometimes has a transparent background, and a background layer may be required. It is understandable that the above text layer and background layer are added according to the actual situation of the target video.
  • the text layer, the decoration layer, and the background layer may be included in the video template.
  • the initial image and the initial video may come from different sources, and the color difference may be relatively large. Therefore, when the video clip is generated, the difference between the video clips may also be relatively large, so normalization processing is required.
  • the normalization process may be performed when the video segment is determined, that is, the initial image and/or the initial video are normalized to generate a video segment with a uniform style.
  • the normalization process may be performed during splicing rendering, that is, the normalization process is performed on the video segment to generate a target video with a uniform style. Since a video frame can be regarded as an image, the normalization of an image refers to the process of performing a series of standard processing transformations on the image to transform it into a fixed standard form.
  • the standard image is called a normalized image.
  • the grayscale or Gamma value of the video segment, the initial image, and/or the initial video may be normalized.
• The image histogram of the image or video frame may be obtained first, the histogram is subjected to at least equalization processing, and the grayscale or gamma value of the image or video frame is adjusted based on the processed histogram to realize image normalization.
• The normalization processing may also be based on one or more of zoom normalization and rotation normalization of the target shot subject.
• The normalization processing may also be normalization of the brightness, hue, saturation, etc. of the video clips, the initial images, and/or the initial videos.
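• A minimal sketch of the histogram-based normalization above, assuming it is applied per frame to the luminance channel only:

```python
import cv2

def normalize_frame(frame_bgr):
    """Equalize the luminance histogram of one frame so that clips from
    different sources get a uniform look."""
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb)
    ycrcb[:, :, 0] = cv2.equalizeHist(ycrcb[:, :, 0])  # luma channel only
    return cv2.cvtColor(ycrcb, cv2.COLOR_YCrCb2BGR)
```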
  • Fig. 8 is an exemplary flow chart of generating a target video according to some embodiments of the present application.
• One or more steps in the process 800 may be stored as instructions in a storage device (for example, the database 140), and called and/or executed by a processing device (for example, the processing device 112 in the server 110).
  • Step 810 Obtain an initial image or an initial video, and generate multiple video clips based on the initial image or the initial video.
• For step 810, reference may be made to the related description of step 310 in the process 300, which will not be repeated here.
  • step 810 may also be omitted in the process 800, for example, the multiple video clips may be directly obtained.
  • Step 820 Obtain one or more candidate video clips from the multiple video clips based on a first preset condition.
  • the first preset condition may be used to reflect the requirements for the content and/or duration of the target video.
  • the first preset condition may include requirements in multiple elements.
  • the multiple elements include, but are not limited to, the target video contains a specific object (F), the target video contains a specific theme (S), the duration of the video segment (tp), the total duration of the target video (ta), and the target video contains a specific Shot picture (P), the number of shot pictures contained in the target video (Pn), the number of overlaps of a specific subject in the target video (Fn), the focusing time of a specific subject in the target video (St), etc.
• For different target videos, the number of shot pictures (Pn) contained can be the same or different.
  • the focus time (St) of a specific topic refers to the length of time occupied by the content on the specific topic in the video segment or the target video.
  • the number of overlaps of a specific topic (Fn) refers to the number of occurrences of content on a specific topic in a video clip or target video.
  • the number of overlaps (Fn) of a specific topic and the focus time (St) of a specific topic can be related to the degree of prominence of the specific topic. The greater the number of overlaps (Fn) of a specific theme, the longer the focusing time (St) of the specific theme, and the higher the degree of highlighting the specific theme.
  • the first preset condition may be specified by the user, or automatically determined by the multimedia system 100 (for example, the processing device 112) based on the video configuration information, the promotional effect that the target video needs to produce, and the like.
  • the first preset condition may be a constraint on at least one of the above-mentioned multiple elements.
  • the constraints may include qualitative constraints (for example, whether to include a specific object (F), a specific theme (S), a specific shot (P), etc.) or quantitative constraints (for example, the total duration of the target video (ta), the target video contains The number of shots (Pn), the number of overlaps of a specific subject in the target video (Fn), the focusing time of a specific subject in the target video (St), etc.).
  • the screening of video clips can be achieved by limiting the value corresponding to the element (for example, comparing with a preset corresponding threshold).
  • the first preset condition may be that the value of the corresponding element is greater than 0.
  • the first preset condition can be that the value of the corresponding element exceeds the corresponding threshold (for example, the video duration is less than 30 seconds, the topic focus time exceeds 15 seconds, etc.) to filter out those that meet the needs (for example, those that meet the specific group Interest or request) video clips.
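• For illustration, such threshold-based screening against the first preset condition could look like the following, where the clip attributes and threshold values are assumptions of this sketch:

```python
def candidate_clips(clips, max_duration=30, min_topic_focus=15):
    """Keep the clips whose element values satisfy the thresholds of the
    first preset condition (e.g., duration under 30 s, theme focus time
    over 15 s)."""
    return [c for c in clips
            if c["duration"] < max_duration
            and c["topic_focus_time"] > min_topic_focus]

# candidate_clips([{"duration": 15, "topic_focus_time": 20},
#                  {"duration": 45, "topic_focus_time": 20}])
# -> only the first clip survives
```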
• A plurality of video clips that satisfy the first preset condition may be selected from the multiple video clips and determined to be candidate video clips.
• The first preset condition may include constraints on the total duration (ta) of the target video and the number of shot pictures (Pn) contained in the target video; that is, candidate video clips may be determined according to the total duration of the video segments and the number of shot pictures. Exemplarily, if the first preset condition is that the target video needs to contain 3 shot pictures with a total duration of 40 seconds, then 3 video clips that each contain different shot pictures and total 40 seconds can be filtered out, such as a 15-second video segment 1, a 15-second video segment 2, and a 10-second video segment 3, as candidate video segments.
  • the first preset condition may include a constraint on a specific shot picture (P) of the video clip, that is, the candidate video clip may be determined according to the shot picture contained in the video clip.
• For example, if the target video needs to include the shot picture of “surfing on the sea”, one or more video clips containing the shot picture of “surfing on the sea” may be filtered out as candidate video clips.
  • the target video can be made to contain specific content to meet the interests or needs of users.
• The first preset condition may simultaneously include constraints on the target video containing a specific object (F), the total duration of the target video (ta), and the number of shot pictures (Pn) contained in the target video; that is, candidate video segments are determined according to the included objects and the video durations.
• Exemplarily, if the first preset condition is that the target video needs to contain 3 shot pictures including the logo of the “の area”, and the total duration of the target video cannot exceed 70 seconds, the video clip 4 marked with the “の area” logo can first be filtered out as a candidate video clip; then, according to the duration of video clip 4, such as 20 seconds, two further video clips whose durations keep the total video duration within 70 seconds, such as clips of 30 seconds and 20 seconds respectively, are selected as candidate video clips.
• In this way, a specific shot picture has a predetermined degree of exposure, so as to highlight the specific object.
• The first preset condition may include constraints on the number of overlaps (Fn) of a specific theme in the target video and the focus time (St) of a specific theme in the target video; that is, candidate video clips may be determined according to the number of occurrences of the theme and the focus duration.
• Exemplarily, if the first preset condition is that the target video needs to contain two specific themes (the same or different) and the focus time of the specific theme in the target video exceeds 1 minute, then a video clip that contains this number of specific themes and whose focus time exceeds 1 minute, or multiple video clips that together contain this number of specific themes and whose cumulative focus duration exceeds 1 minute, can be filtered out as candidate video clips.
• The first preset condition may include constraints on the total duration of the target video (ta), the number of shot pictures contained in the target video (Pn), and the focus time of a specific theme in the target video (St); that is, candidate video clips may be determined based on the video durations, the number of shot pictures, and the focus time of the specific theme.
• Exemplarily, if the first preset condition is that the total duration of the target video is 30 seconds, 3 shot pictures need to be included, and the focus time of a specific theme in the target video is not less than 15 seconds, then 3 shot pictures whose content on the specific theme totals no less than 15 seconds (for example, 16 seconds, 18 seconds, 20 seconds, etc.) and whose total duration is 30 seconds can be screened out; for example, a 15-second video clip 1 (with 10 seconds of content on the specific theme), a 10-second video clip 2 (with 6 seconds of content on the specific theme), and a 5-second video clip 3 (with 3 seconds of content on the specific theme) are screened out as candidate video clips.
• In this way, the specific theme has a predetermined degree of focus in the target video, so as to better highlight the specific theme.
• the process of determining candidate video segments to generate the target video can be expressed as a model: f(a, b, c, …, n) → y, where f is the first preset condition, a, b, c, …, n are multiple video clips, and y is the target video.
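• A minimal sketch of such a model, assuming f is the duration/shot-count/focus-time condition from the example above (the Clip type and all names are illustrative, not from the patent):

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Tuple

@dataclass(frozen=True)
class Clip:
    name: str
    duration: float        # seconds
    subject_focus: float   # seconds in which the specific subject appears

def f(clips: Tuple[Clip, ...], total: float = 30.0,
      shots: int = 3, min_focus: float = 15.0) -> bool:
    """First preset condition: shot count, total duration, and focus time."""
    return (len(clips) == shots
            and sum(c.duration for c in clips) <= total
            and sum(c.subject_focus for c in clips) >= min_focus)

def candidates(pool: List[Clip], shots: int = 3) -> List[Tuple[Clip, ...]]:
    """Enumerate combinations y of clips for which f holds."""
    return [combo for combo in combinations(pool, shots) if f(combo)]

pool = [Clip("clip1", 15, 10), Clip("clip2", 10, 6), Clip("clip3", 5, 3)]
print(candidates(pool))  # the 15s/10s/5s example above satisfies f
```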
  • a candidate video segment can be determined, and a target video can be generated based on the candidate video segment.
  • the process of generating the target video may also determine a similar model.
  • the model may be executed by the processing device 112 to automatically generate a target video based on multiple video clips.
  • a machine learning model can be used to learn and train f in the model to determine the first preset condition or video configuration information that meets a specific requirement (for example, a playback effect).
  • the candidate video clips meeting the first preset condition may be one or more groups.
• the first preset condition may include constraints on each shot segment (video clip) in the target video; a video clip that meets the constraints of a specific shot segment may accordingly be assigned to that shot segment.
• the first preset condition is related to at least one of the multiple elements, and may be characterized as requirements on those elements (which can also be understood as requirements on the target video). In the case of multiple element constraints, the video clips meeting each requirement can be selected in any reasonable order; this specification imposes no restriction.
  • the processing device 112 may use a machine learning model to determine candidate video segments. For example, a trained machine learning model may be used to determine whether each of the multiple video clips contains a specific object, and a video clip that contains the specific object among the multiple video clips is determined as a candidate video clip.
• the input of the trained machine learning model may be a video clip, and the output may be the objects contained in the video clip, or a binary classification result indicating whether the video clip contains a specific object; this specification does not limit this.
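• A hedged sketch of the classifier-based screening just described; `classify` stands in for any trained binary classifier, and all names are illustrative:

```python
from typing import Callable, List

def screen_candidates(clips: List[str],
                      classify: Callable[[str], bool]) -> List[str]:
    """Keep the clips the trained model classifies as containing the object."""
    return [clip for clip in clips if classify(clip)]

# Usage with a dummy classifier that "detects" the object from the clip name:
clips = ["beach_surfing.mp4", "city_street.mp4", "sea_surfing.mp4"]
print(screen_candidates(clips, classify=lambda c: "surfing" in c))
```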
  • the candidate video segment may be determined in other feasible ways, which is not limited in this specification.
  • the first preset condition further includes element constraint conditions of two or more specific elements in the plurality of elements.
  • the two or more specific elements may be different elements.
• the element constraint condition may involve the priority of the two or more specific elements. For example, based on the difficulty of judgment, the priority of the specific object (F) may be higher than that of the specific subject (S); based on the highlighting effect of the subject, the priority of the subject focus time (St) may be higher than that of the number of repetitions of the subject (Sn); and so on.
  • the element constraint conditions may involve the order in which different elements are considered.
• the element constraint condition may also involve compatibility and matching between different elements. For example, a 15-second total video duration (ta) is not compatible with a 20-second subject focus time (St).
• the first preset condition may include binding conditions on the shot frames in the target video. A binding condition may reflect an association relationship between at least two specified shot frames in the target video. For example, the binding condition (which can also be understood as an association relationship) may be that the specified shot frame a and the specified shot frame b must both appear in the target video, or that the specified shot frame a must appear before the specified shot frame b in the target video, etc.
  • the processing device 112 may determine a video clip containing a specified shot frame from a plurality of video clips, and combine the video clip containing the specified shot frame based on a binding condition to serve as a candidate video clip.
• For example, if the binding condition is that shot frame a must appear before shot frame b in the target video, the video clips containing shot frame a and shot frame b may be combined in that order and used as one candidate video segment.
• Combining shot frames that meet the binding conditions into one candidate video segment helps to process them as a whole in subsequent processing, so as to improve the efficiency of video synthesis. Alternatively, the shot frames that meet the binding condition may not be combined into one candidate video segment, but instead exist separately in continuous or discontinuous (for example, spaced-apart) candidate video segments; a combination sketch follows below.
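• A small sketch of binding-condition handling, under the assumption that clips are plain strings and a binding is an ordered pair meaning "a must precede b":

```python
from typing import List, Tuple

def apply_binding(clips: List[str], binding: Tuple[str, str]) -> List[str]:
    """Merge the two bound clips into one ordered candidate segment,
    keeping the remaining clips as individual candidates."""
    a, b = binding
    rest = [c for c in clips if c not in (a, b)]
    return [f"{a}+{b}"] + rest  # "a+b" is processed as a whole downstream

print(apply_binding(["shot_a", "shot_c", "shot_b"], ("shot_a", "shot_b")))
# ['shot_a+shot_b', 'shot_c']
```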
  • Step 830 Group one or more candidate video clips to determine at least one clip set.
  • the process 800 may be used to generate multiple (eg, target number) target videos at the same time.
  • the grouping can be understood as a combination of candidate video segments, and the corresponding step 830 can include combining the candidate video segments to generate a target number of segment sets.
• Each segment set in the at least one segment set is combined from one or more candidate video segments and simultaneously meets the first preset condition and the other conditions in the video configuration information.
  • other conditions in the video configuration information may be the second preset condition.
  • the second preset condition is related to the content difference of the segment set/video segment.
  • the second preset condition may specifically include a constraint on the combination difference degree of the segment sets, that is, the difference degree of the combination of candidate videos of each segment set in at least one segment set is greater than a preset threshold.
• For the judgment of the second preset condition, refer to Fig. 9 below and the related descriptions.
  • other conditions in the video configuration information may also include, but are not limited to, requirements for one or more combinations of frame, subtitle, hue, saturation, background music, etc. of the target video.
• For example, the at least one segment set may include segment set 1, combined from the aforementioned video segment 1, video segment 2, and video segment 3, and segment set 2, combined from video segment 4, video segment 5, and video segment 6, among others.
  • Step 840 Generate a target video based on each segment set in the at least one segment set.
  • step 840 may include synthesizing a target video based on the set of segments for each set of segments.
• For each segment set in the at least one segment set (or the target number of segment sets), a target video can be synthesized based on that segment set; the at least one segment set can thus yield a corresponding number of target videos.
  • the video clips may be sorted and combined according to the sequence information included in the video configuration information to generate the target video.
  • the target video may be randomly synthesized based on the cohesion of the shots of each video clip in the clip set. For example, a video clip whose presentation content is daytime may be placed before a video clip whose presentation content is night.
  • the target video may be synthesized based on the promotional copy of the product or information to be promoted.
• For example, for a video used to promote garbage classification, the promotional copy may first show the possible consequences of not classifying, then show the benefits of classification, and finally show the classification method; the target video can then be synthesized by arranging the video segments in the segment set in the sequence of the content presented in the promotional copy.
  • the synthesized target number of target videos may be delivered in batches or at the same time.
  • the first preset condition, the second preset condition, and/or other conditions may be adjusted based on the promotion effect of each target video.
• the publicity effect can be obtained based on user feedback, broadcast volume, evaluations, and feedback results (such as product sales, garbage classification results, etc.). This specification does not limit this; for details, refer to the related descriptions of Fig. 17 and Fig. 18 of this application.
  • the second preset condition may include that the difference between any two segment sets in the at least one segment set is greater than a preset difference threshold.
  • the difference degree between any two fragment sets may also be referred to as the candidate video fragment combination difference degree between any two fragment sets.
  • Fig. 9 is an exemplary flowchart of determining a segment set according to some embodiments of the present application. Specifically, it relates to a method of screening out at least one segment set based on a second preset condition.
  • one or more steps in the process 900 may be stored as instructions in a storage device (for example, the database 140), and called by a processing device (for example, the processing device 112 in the server 110) and/or implement.
  • Step 910 Determine a combination difference degree of candidate video segments between each of the two or more segment sets and other segment sets.
  • the multimedia system 100 may randomly select multiple sets of candidate video clips that meet the first preset condition, and randomly combine them to form two or more clip sets.
• one or more candidate video clips may be randomly selected from each group of candidate video clips corresponding to the same element to form a clip set; by repeating this random selection, two or more clip sets can be formed. Video clips with qualitative restrictions can be randomly selected first, and the other video clips selected afterwards. For example, suppose the video configuration information requires that the last shot be a product shot, while the other shots have no order requirements but do have content requirements; the candidate shots can then be grouped according to the content requirements of the other shots, and a video segment determined from each group, to obtain a combination of candidate video segments.
• In step 910, the combination difference degree of candidate video segments between each of the two or more segment sets and the other segment sets can be determined by comparing the combinations of the segment sets, for example, by comparing the difference rate of the combinations, that is, the ratio of video clips that differ between the clip sets.
  • step 910 can also be implemented by assigning values to different video clips or using machine learning algorithms. For specific descriptions, reference may be made to subsequent related descriptions in FIG. 10 and FIG. 11.
• the degree of difference between any two fragment sets may include the combination difference degree of candidate video fragments between the two fragment sets, and/or the combination difference degree of the candidate video fragments together with other conditions such as frames, subtitles, and hues between the two fragment sets.
• For example, segment set 1 may include video segment 1, video segment 2, frame 1, and subtitle 1, while segment set 5 may include video segment 1, video segment 2, frame 2, and subtitle 2, the difference between the two lying in the frame and the subtitle; counting only the video segments, the difference rate is 0%, while counting the frame and subtitle as well, the difference rate is 33%. In this case, the difference rate can be used to characterize the degree of difference; a sketch of one possible difference-rate computation follows below.
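• One plausible reading of the difference rate is a Jaccard-style ratio of non-shared elements; the patent does not give the exact formula (its 33% figure suggests a different counting), so the following is only an illustrative sketch:

```python
def difference_rate(set_a, set_b):
    """Fraction of elements (clips, frame, subtitle, ...) not shared by both sets."""
    a, b = set(set_a), set(set_b)
    union = a | b
    return len(union - (a & b)) / len(union) if union else 0.0

s1 = ["seg1", "seg2", "frame1", "subtitle1"]
s5 = ["seg1", "seg2", "frame2", "subtitle2"]
print(difference_rate(s1, s5))  # 4 differing of 6 distinct elements ≈ 0.67
```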
  • Step 920 Use a segment set whose combination difference with other segment sets is higher than a preset threshold as at least one segment set.
  • Each fragment set can be filtered according to the combination difference degree of each fragment set in the aforementioned two or more fragment sets.
• the filtering condition may be that the combination difference degree of each fragment set in the selected at least one fragment set is higher than a preset threshold (for example, a difference rate higher than 50%).
• the screening order of the first preset condition and the second preset condition can be exchanged. For example, multiple candidate clip sets can first be generated based on the multiple video clips such that they meet the second preset condition; then at least one segment set is selected from the multiple candidate segment sets based on the first preset condition; and a target video is generated based on each segment set in the at least one target segment set. For details, refer to Fig. 12 and its description.
• the beneficial effects that the process 900 may bring include, but are not limited to: (1) the target number of fragment sets is determined based on the differences between fragment sets, so multiple target videos with different content presentation effects can be determined, improving the variety of the generated target videos; (2) no manual operation is required to generate the target videos, which improves the efficiency of video synthesis. It should be noted that different embodiments may have different beneficial effects; in different embodiments, the beneficial effects may be any one or a combination of the above, or any other beneficial effects that may be obtained.
  • Fig. 10 is an exemplary flow chart for determining the degree of difference in a combination of fragment sets according to some embodiments of the present application.
  • one or more steps in the process 1000 may be stored as instructions in a storage device (for example, the database 140), and called by a processing device (for example, the processing device 112 in the server 110) and/or implement.
  • Step 1010 Assign an identification character to each of one or more candidate video clips.
  • different identification characters may be assigned to each candidate video segment.
• the identification character can be determined according to the number of candidate video fragments. For example, when the number of candidate video fragments is small, an English letter can be assigned as the identification character; when the number of candidate video clips is large, an ASCII code can be assigned as the identification character. In particular, no identification character need be assigned to specific candidate video clips that have special requirements or must appear in the target video: for example, if the last shot of the target video must be a product display video clip and there is only one such clip, it can be assigned no identification character.
  • Step 1020 based on the identification characters of one or more candidate video segments, determine character strings corresponding to the segment set and other segment sets.
  • the character string of the segment set is a set of identification characters of each candidate video segment in the segment set. For example, if the identifying character of video segment 1 is A, the identifying character of video segment 2 is B, the identifying character of video segment 3 is C, and the identifying character of video segment 4 is D, then the character string corresponding to segment set 1 is ABC.
  • the fragment set may also include order requirements. For example, the fragment set 2 is a combination with the same content as the fragment set 1 in a different order, and the corresponding character string may be CAB.
  • Step 1030 Determine the edit distance of the character string corresponding to the segment set and other segment sets as the combined difference degree of the candidate video segments between the segment set and the other segment sets.
• Edit distance can reflect the number of differing characters between two strings: the smaller the edit distance, the fewer the differing characters and the smaller the difference between the two fragment sets. For example, if the character string corresponding to segment set 1 is ABC and the character string corresponding to segment set 3 is ABCD, the edit distance between segment set 1 and segment set 3 is 1, that is, the degree of difference is 1. For strings with order requirements, characters in different orders can also be counted in the edit distance; in addition, to avoid double counting, an order difference can be counted once as a whole. For example, the edit distance between segment set 1 and segment set 2 in step 1020 can be 1. A sketch of the standard edit distance follows below.
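• As an illustration, a minimal sketch of the standard Levenshtein edit distance over the identification strings (the patent's variant that counts a reordering once as a whole is not shown):

```python
def edit_distance(s: str, t: str) -> int:
    """Dynamic-programming edit distance (insert/delete/substitute each cost 1)."""
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # delete
                           dp[i][j - 1] + 1,        # insert
                           dp[i - 1][j - 1] + cost) # substitute
    return dp[m][n]

print(edit_distance("ABC", "ABCD"))  # 1, matching the example above
```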
  • the number of fragment sets can be any positive integer, for example, 1, 3, 5, 8, 10, etc.
  • the candidate video fragments may be randomly combined into N candidate fragment sets, and a target number of fragment sets are selected from the N candidate fragment sets based on the second preset condition. By screening the target number of segment sets that meet the second preset condition, a target number of target videos with different content presentation effects can be generated based on the target number of segment sets, thereby achieving different promotional effects.
  • Fig. 11 is an exemplary flow chart for determining the degree of difference in the combination of fragment sets according to some embodiments of the present application.
  • one or more steps in the process 1100 may be stored as instructions in a storage device (for example, the database 140), and called by a processing device (for example, the processing device 112 in the server 110) and/or implement.
  • Step 1110 Generate a segment feature corresponding to each candidate video segment based on the trained feature extraction model and candidate video segments in two or more segment sets.
  • the feature extraction model may also be referred to as a shot feature extraction model.
• the candidate video segments in the two or more segment sets may be processed based on the trained feature extraction model to generate the corresponding segment features. Specifically, this includes obtaining the multiple video frames included in each candidate video segment and determining one or more image features corresponding to each video frame, and then processing, based on the trained feature extraction model, the image features of the multiple video frames and the relationships between them, to determine the segment feature corresponding to the candidate video segment.
  • the feature extraction processing can refer to processing the original information and extracting feature data, and the feature extraction processing can improve the expression of the original information to facilitate subsequent tasks.
• the image features corresponding to a video frame may include at least one of the shape information of an object (for example, the subject) in the video frame, the positional relationship information between multiple objects in the video frame, the color information of objects in the video frame, the completeness of objects in the video frame, or the brightness of the video frame.
  • the feature extraction process may use statistical methods (such as principal component analysis methods), dimensionality reduction techniques (such as linear discriminant analysis methods), feature normalization, data bucketing, and other methods.
• For example, in data bucketing, a brightness value within 0-80 can correspond to [1,0,0], a brightness value within 80-160 can correspond to [0,1,0], and values above 160 can correspond to [0,0,1], as sketched below.
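• A minimal sketch of this one-hot bucketing:

```python
def bucket_brightness(value: float) -> list:
    """Map brightness to [1,0,0] for 0-80, [0,1,0] for 80-160, [0,0,1] above 160."""
    if value < 80:
        return [1, 0, 0]
    if value < 160:
        return [0, 1, 0]
    return [0, 0, 1]

print(bucket_brightness(120))  # [0, 1, 0]
```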
  • the feature extraction process can also rely on machine learning methods (such as using feature extraction models), which can automatically learn the collected information to form a predictable model, thereby obtaining higher accuracy.
  • the feature extraction model can be a generative model, a decision model, or a deep learning model in machine learning.
• it can be a deep learning model using algorithms such as the YOLO series, Faster R-CNN, or EfficientDet.
  • the machine learning model can detect the set objects that need attention in each video frame.
  • the objects that need attention can include creatures (humans, animals, etc.), commodities (cars, decorations, cosmetics, etc.), backgrounds (mountains, roads, bridges, houses, etc.), etc.
  • the object to be paid attention to can be further set, for example, it can be set as a character or a product.
  • Multiple candidate video clips can be input into the machine learning model, and the machine learning model can output image features such as position information and brightness of objects in each candidate video clip.
• the feature extraction model can be a GoogLeNet model, a VGG model, a ResNet model, etc.
• By using the machine learning model to extract features from the video frames, the image features can be determined more accurately.
• the trained feature extraction model can be a sequence-based machine learning model, which can convert a variable-length input into a fixed-length vector expression for output. It can be understood that, since different shots have different durations, the numbers of corresponding video frames also differ; after processing by the trained feature extraction model, each shot can be converted into a fixed-length vector for characterization, which is helpful for subsequent processing.
• the sequence-based machine learning model may be a deep neural network (DNN) model, a recurrent neural network model such as the long short-term memory (LSTM) model, a bi-directional LSTM (Bi-LSTM) model, a gated recurrent unit (GRU) model, etc., or a combination thereof, which is not limited in this specification.
• the image features corresponding to the obtained video frames (for example, the features of frames 1, 2, 3, …, n) and their relationships (such as sequential and/or chronological order) are input into the feature extraction model, which can output the sequence of encoded hidden states at each time step (such as h_1 to h_n), where h_n contains all the information of the shot during that period of time. In other words, the feature extraction model can convert the multiple image features within a period of time (such as a candidate video segment corresponding to a certain shot) into a fixed-length vector expression h_n (i.e., the shot feature).
• the above steps and methods can be applied to different candidate video clips separately to obtain their respective clip features. For example, suppose the candidate video segments are 1, 2, 3, …, m, and the corresponding segment features obtained are R_{c,1}, R_{c,2}, …, R_{c,i}, …, R_{c,m}; the following description follows this setting. A sketch of such a sequence encoder follows below.
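• A hedged sketch of such a sequence encoder: an LSTM that turns a variable number of per-frame feature vectors into one fixed-length shot feature h_n; PyTorch is an assumed choice, as the patent names no framework:

```python
import torch
import torch.nn as nn

class ShotEncoder(nn.Module):
    def __init__(self, frame_dim: int = 128, shot_dim: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(frame_dim, shot_dim, batch_first=True)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, frame_dim); num_frames may vary per shot
        _, (h_n, _) = self.lstm(frames)
        return h_n[-1]  # (batch, shot_dim): fixed-length shot feature

encoder = ShotEncoder()
short_shot = torch.randn(1, 30, 128)   # 30 frames
long_shot = torch.randn(1, 120, 128)   # 120 frames
print(encoder(short_shot).shape, encoder(long_shot).shape)  # both (1, 64)
```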
  • Step 1120 Generate a set feature vector corresponding to each set of fragments based on the characteristics of the fragments.
  • the collection feature vector corresponding to each segment collection may be generated based on the acquisition sequence and the segment characteristics of each candidate video segment in the segment collection. Exemplarily, vector splicing, vector concatenation, etc. may be used to obtain the set feature vector corresponding to the fragment set.
• For example, the set feature vector corresponding to segment set c is R_c = {R_{c,1}, R_{c,2}, …, R_{c,i}, …, R_{c,m}}^T, where the superscript T denotes the matrix transpose.
  • Step 1130 Determine the degree of similarity between each fragment set and other fragment sets based on the trained discriminant model and the set feature vector corresponding to each fragment set.
  • the multimedia system 100 may determine the similarity degree of the set feature vectors corresponding to any two fragment sets based on the trained discriminant model, so as to determine the degree of similarity between each fragment set and other fragment sets.
• For example, assume segment sets a, b, c, …, k, each having a corresponding set feature vector R_a, R_b, R_c, …, R_k. One of the k segment sets can be selected (for example, segment set c), and the similarity between its set feature vector R_c and the set feature vectors of the other (k-1) segment sets can be calculated; the similarity comparison result is regarded as the degree of similarity.
  • step 1130 may also use a vector similarity coefficient to determine the degree of similarity between two fragment sets.
  • the similarity coefficient refers to the calculation of the similarity between samples using a calculation formula. The smaller the value of the similarity coefficient, the smaller the similarity between individuals and the greater the difference. When the similarity coefficient between the two fragment sets is large, it can be judged that the similarity between the two fragment sets is relatively high.
  • the similarity coefficient used includes but is not limited to simple matching similarity coefficient, Jaccard similarity coefficient, cosine similarity, adjusted cosine similarity, Pearson correlation coefficient, and the like.
  • Step 1140 Determine the degree of combined difference between each segment set and other segment sets based on the degree of similarity.
• As noted above, the smaller the similarity coefficient, the greater the difference between individuals; accordingly, when the similarity coefficient between two fragment sets is large, the difference between the two fragment sets is small. In some embodiments, when the similarity coefficient is a real number within [0,1], the combined difference degree can be taken as its reciprocal, its negative, 1 minus the similarity coefficient, etc., as sketched below.
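• A minimal sketch using cosine similarity and the 1-minus-similarity conversion (cosine similarity is one of the options listed above, not the patent's prescribed measure):

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def combined_difference(u, v):
    """Difference degree as 1 - similarity coefficient."""
    return 1.0 - cosine_similarity(u, v)

R_a = [0.2, 0.9, 0.4]
R_b = [0.1, 0.8, 0.5]
print(combined_difference(R_a, R_b))
```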
  • At least one segment set may also be generated based on the set feature vector.
  • the multiple set feature vectors may be clustered based on a clustering algorithm to obtain multiple set cluster clusters, and at least one segment set is generated based on the clustering result.
• For example, suppose the number of video segments required for a segment set is P and the number of clusters actually obtained is Q. If P is less than or equal to Q, P clusters are selected and one candidate video segment is taken from each cluster; if P is greater than Q, several candidate videos far from the cluster center can be selected from each cluster, and P candidate video segments randomly selected from them to form a segment set.
• In some embodiments, a density-based clustering algorithm (such as the DBSCAN density clustering algorithm) may be used to obtain multiple clusters. Specifically, determine the preset value of the required fragment sets, that is, the number of fragment sets (denoted P); further, determine the neighborhood parameters (ε, MinPts) of the clustering, where ε corresponds to the radius of a cluster in the vector space and MinPts corresponds to the minimum number of samples required to form a cluster, the number of clusters obtained being Q. The neighborhood parameters (ε, MinPts) are adjusted multiple times and the video feature vectors re-clustered until the obtained number of clusters Q is greater than or equal to the preset value P; at that point, a number of clusters equal to the preset value P are randomly selected, and the segment sets are determined based on those clusters. A sketch of this loop follows below.
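• A sketch of the parameter-adjusting loop using scikit-learn's DBSCAN; the toy vectors, the value of P, and the eps-shrinking schedule are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import DBSCAN

vectors = np.random.rand(60, 2)   # toy set feature vectors (2-D for clarity)
P = 3                              # required number of fragment sets
eps, min_pts = 0.3, 3

Q = 0
for _ in range(20):                            # bounded parameter search
    labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(vectors)
    Q = len(set(labels) - {-1})                # label -1 is noise, not a cluster
    if Q >= P:
        break
    eps *= 0.8                                 # tighten neighborhood, re-cluster

clusters = sorted(set(labels) - {-1})
if Q >= P:
    chosen = np.random.choice(clusters, size=P, replace=False)
    print(f"Q={Q}, randomly chosen clusters: {chosen}")
```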
  • the process 1100 may also be used to determine the similarity of multiple generated target videos (for example, the multiple target videos in the process 800 or 1200), and recommend the target video based on the similarity.
  • Fig. 12 is an exemplary flow chart of generating a video according to some embodiments of the present application.
  • one or more steps in the process 1200 may be stored in a storage device (for example, the database 140) as instructions, and called by a processing device (for example, the processing device 112 in the server 110) and/or implement.
  • step 1210 and step 1240 are the same as or similar to step 810 and step 840 in the process 800, and reference may be made to FIG. 8 and its related descriptions, which will not be repeated here. Steps 1220 and 1230 in the process 1200 are different from the process 800.
  • Step 1220 Randomly generate multiple candidate segment sets based on multiple video segments.
  • the multimedia system may randomly combine multiple video clips to generate a set of candidate clips.
• For example, the processing device 112 may randomly combine some of the video clips obtained in step 810 to generate M (M greater than or equal to the target number) candidate clip sets. Alternatively, the processing device 112 may combine all the video clips obtained in step 810, and select M (M greater than or equal to the target number) combinations from them as the candidate clip sets, or determine all the combinations as candidate clip sets.
  • a candidate segment set may include one or more video segments.
  • multiple candidate segment sets satisfy the second preset condition.
  • the second preset condition includes that the video segment combination difference between any two candidate segment sets in the plurality of candidate segment sets is greater than the preset difference threshold.
  • the preset difference degree threshold can be any positive integer greater than 0, for example, 1, 2, and so on. For more content about the second preset condition, please refer to FIG. 9, FIG. 10 and related descriptions, which will not be repeated here.
  • Step 1230 Filter out at least one fragment set from the multiple candidate fragment sets based on the first preset condition.
• the first preset condition may include, but is not limited to, the total duration of the target video, the number of shots included in the target video, the target video containing specified shots, the target video containing a specific object, or a combination of one or more of these.
• the processing device may determine at least one (for example, a target number of) segment set based on each candidate segment set as a whole. For example, candidate fragment sets whose total video duration and/or number of shots meet the first preset condition may be filtered out as one or more of the target number of fragment sets.
• the processing device 112 may determine the target number of fragment sets based on the content contained in the video fragments of each candidate fragment set. For example, a trained machine learning model can be used to determine whether a video clip in a candidate clip set contains a specific object, and the candidate clip sets containing the specific object screened out based on the determination result.
• the input of the trained machine learning model can be a candidate fragment set, or a video fragment in the candidate fragment set, and the output can be whether the candidate fragment set contains a specific object, whether the video fragment contains a specific object, etc.; this specification does not restrict this.
  • the target video that meets the requirements can be synthesized through one or more combination operations such as splitting, filtering, combining, cropping, and beautifying the initial video or the initial image.
• the server 110 (for example, the processing device 112) may obtain multiple video clips by splitting the initial video or initial image, filter out one or more candidate video segments from the multiple video clips based on constraints such as video duration and video content, combine the candidate video segments to obtain at least one segment set (or a target number of segment sets) whose mutual differences meet a preset threshold, and generate the target number of target videos based on the segment sets.
• Alternatively, the server 110 may randomly combine the multiple video clips obtained by splitting to obtain multiple candidate clip sets whose mutual differences meet a preset threshold, filter out the target number of fragment sets from the candidate clip sets based on constraints such as video duration and video content, and generate the target number of target videos based on the target number of fragment sets.
• the video synthesis method and/or system provided in the embodiments of this specification can be used to synthesize promotional videos. For example, pre-shot video files for products, culture, public welfare, etc. can be processed through splitting, screening, beautifying, compositing, and other operations to generate diversified promotional videos.
  • the target video usually has background music or theme music.
• Background music or theme music is a kind of music used to adjust the atmosphere; it can be inserted into the video to enhance the expression of emotion and give the audience an immersive experience.
  • background music or theme music has time attributes, and elements such as the duration and rhythm of the background music or theme music can be used as time parameters in some embodiments of the present application.
  • Fig. 13 is an exemplary flowchart of adding audio material according to some embodiments of the present application.
  • one or more steps in the process 1300 may be stored as instructions in a storage device (for example, the database 140), and called by a processing device (for example, the processing device 112 in the server 110) and/or implement.
  • Step 1310 Obtain initial audio.
  • the initial audio (also referred to as music to be processed) may be imported by the user or selected by the user in the database 140.
  • the initial audio can have different types, such as warm, soothing, brisk, focused, agitated, angry, scary, etc.
  • the multimedia system 100 (for example, the processing device 112) may select different types of initial audio based on the subject, theme, video effect, or video configuration information of the target video.
  • the initial audio of a public welfare promotion video can be a warm and soothing initial audio.
  • multiple initial audios can be selected and the audios are connected end to end.
  • only part of the initial audio (such as a chorus) can be selected.
  • Step 1320 Mark the initial audio based on the rhythm to obtain at least one audio segmentation point.
• Marking based on rhythm can be based on the structure of the entire song, such as marking the intro, verse, and chorus, or at a finer granularity, such as marking segmentation points based on drum beats or beats.
• the marking granularity of the initial audio may be determined by the number of initial images and/or initial videos. Just as an example, suppose the number of image and video materials is moderate: if, after marking the initial audio by drum beats, some cut points cannot be matched with material, the initial audio can first be marked as intro, verse, and chorus, and then only the chorus part marked by drum beats, so as to obtain the right number of cut points.
  • marking the initial audio based on the rhythm can be implemented by software (such as Adobe Audition, FL Studio, etc.) or plug-ins (such as audio wave plugin based on Vue.js, etc.).
  • the automatic marking of the initial audio can be realized by an audio rhythm analysis algorithm based on signal analysis. It should be noted that there are various audio tag processing methods, which are not limited in this embodiment.
• Step 1330 Determine at least one video segmentation point of the target video based at least in part on the video configuration information.
  • the video configuration information may be used to determine the video segmentation point of the target video.
  • different themes, different shots, and splicing methods can be determined according to the video configuration information.
  • At least one video segmentation point of the target video is determined according to different themes, different shots, and splicing methods, where the splicing time point of each joint can be used as the video segmentation point.
• For example, if the target video involves shots including racing cars, tire changing, an award ceremony, a tire display, etc., at least one video segmentation point of the target video may be determined based on these different shots and used as optional editing points.
• At a single optional editing point, one can choose to add a video clip or not to add material; whether to add material depends on the number of optional editing points and the time interval between two optional editing points. Just as an example, if no material is added at a single optional editing point, the duration of the previous or next material can be appropriately extended. Since the optional editing points are associated with the rhythm, adding material through them makes it easy to arrange the material while providing a good rhythm pattern and improving the effect of the target video. In some other embodiments, an optional editing point may also be used as the starting or ending point of the target video.
  • Step 1340 Match at least one audio segmentation point with at least one video segmentation point.
  • the matching of the video clip with the optional cut points may be performed according to the interval time between the two optional cut points.
• In some cases, the interval between two editing points may be only a few seconds. A threshold can therefore be set: for example, if the interval between two editing points is less than the threshold (such as 3 seconds or 5 seconds), a static material or a material with a short duration is inserted at that point.
• Because the lengths of the video clips differ, it may happen that some video clips cannot be matched to the optional editing points because of their durations. In this case, the video can be split or speed-changed. For example, a video clip with a duration of 15s can be split into a 10s material and a 5s material, and the split materials matched with the optional editing points.
• For another example, if the duration of the video clip is 22s and the interval between two optional editing points is 20s, the video clip can be played at an accelerated rate; after the duration is shortened to 20s, it is inserted between the two optional editing points.
• In some embodiments, a threshold (such as ±5% or ±10%) may be set for the speed change of a video clip, and video clips whose required speed change exceeds the threshold may be processed by splitting instead, as sketched below.
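• A sketch of this duration-to-interval matching with a bounded speed change, following the 22s-into-20s example; the structure and the 10% tolerance are assumptions:

```python
def fit_clip(clip_len: float, interval: float, tolerance: float = 0.10):
    """Return ('speed', factor) if a <=tolerance speed change fits the clip
    into the interval, else ('split', remainder) to split the clip instead."""
    factor = clip_len / interval          # >1 means the clip must be sped up
    if abs(factor - 1.0) <= tolerance:
        return ("speed", factor)
    return ("split", clip_len - interval)

print(fit_clip(22, 20))   # ('speed', 1.1): play 10% faster
print(fit_clip(15, 10))   # ('split', 5): split into 10s + 5s materials
```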
• In some embodiments, a video clip may include its original audio track (such as background sound, monologue, etc.). According to actual needs, the original audio track can be removed from the video clip, or kept and played simultaneously with the target video; this application does not restrict this.
  • the process 1300 may be completed by an audio matching model.
• By inputting the target video into the audio matching model, audio can be added to the target video; that is, the audio matching model can complete the operations of steps 1310-1340 in the process 1300.
  • the audio matching model may be a machine learning model, such as a neural network model, a deep learning model, etc.
  • the generated target video can be simultaneously played on different playback media.
  • exemplary playback media may include a horizontal video interface of a video website, a vertical video interface in a short video application of a mobile phone, an outdoor electronic large screen, and the like.
  • the method provided in this application can also post-process the target video after the target video is generated.
• the target video is post-processed to meet at least one video output condition, and the at least one video output condition is related to the playback medium of the target video.
  • the at least one video output condition may include a video size condition
  • the post-processing may include cropping a picture of the target video according to the video size condition.
  • FIG. 14 is a schematic diagram of target video post-processing (screen cropping) shown in some embodiments of the present application.
• In some approaches, the post-processing (screen cropping) of the target video sets a corresponding cropping frame based on the target video at each different video size, makes the center of the cropping frame coincide with the center of the target video's picture, and then crops based on the cropping frame. Such a screen cropping method may cause the main information of the target video to be missing after cropping (for example, the picture of the advertised product is cropped away).
  • the target video post-processing system of the present application can achieve the purpose of converting the size of the video screen by executing the method of target video cropping.
• the target video screen cropping system can crop the video picture based on the information of the cropping subject and the preset picture size, which makes the main information of the target video less likely to be missing after cropping (for example, the picture of the advertised product is retained).
  • the to-be-processed video 14 in FIG. 14 may be a target video, or may be one or more of an initial video, an initial image, and a video segment.
  • the process 1400 can be used alone, and the to-be-processed video 14 can be any video that needs to be processed (size changed).
  • the system for cropping the image of the to-be-processed video 14 can be integrated in the system 100 for generating the video, and the post-processing of the target video can be achieved through the processing device 112 (or a processing terminal).
  • the processing device 112 can be used to convert the screen size of the video in a variety of application scenarios.
  • the processing device 112 may be used to convert the screen size of the target video originally placed on an outdoor electronic screen to make it suitable for placement on a subway TV screen.
  • the processing device 112 may be used to adjust the screen size of a video shot by a mobile phone or a camera to a preferred size for playing on a video website.
  • the processing device 112 may be used to convert a horizontal screen video into a vertical screen video.
• For example, the processing device 112 can obtain the to-be-processed video 14 (a horizontal screen video) and split it into multiple video segments 16 based on the model 12; the processing device 112 can identify the subject 15 in the to-be-processed video 14; based on the subject 15 and the preset picture size of the vertical screen video, the processing device 112 can configure a cropping frame 13 for the to-be-processed video 14, crop the pictures 17 of the multiple video clips 16 according to the cropping frame 13, and then re-splice the cropped video clips 16 into a complete video to obtain the vertical screen video.
  • the model 12 may be included in the processing device 112.
  • the processing device 112 obtains the cropped subject 15 and/or the video clip 16 based on the model 12.
• the model 12 may be a machine learning model used to identify the cropped subject 15 in a video clip 16. For example, the cropped subject 15 can be the aforementioned specific object and/or subject; the cropped subject 15 can be a person, a car, cosmetics, etc.
  • the model 12 may be stored in the processing device 112, and when the related functions of the model 12 need to be used, the processing device 112 performs an operation of calling the model 12.
  • the model 12 may refer to a collection of several methods performed based on the processing device 112. These methods can include a large number of parameters.
  • the parameters in the model 12 can be preset or can be dynamically adjusted. Some parameters can be obtained through training, and some parameters can be obtained during execution. For the specific description of the model in this manual, please refer to the relevant part of this manual.
  • the processing device 112 may perform post-processing on the target video by setting the cropping frame 13.
  • the cropping frame 13 can be understood as a cropping boundary determined according to the target size of the video to be processed to be converted.
  • the screens in the cropping frame 13 can be retained and the screens outside the cropping frame 13 can be deleted, so that the target video can be cropped to the size corresponding to each output requirement.
• In some embodiments, the processing device 112 may itself be the playback device that plays the target video to be cropped; in that case, the playback device can obtain the target video, crop its picture based on the device's own playback size, and automatically play the cropped target video. In some embodiments, the processing device 112 may be a smart device (such as a computer, a mobile phone, a smart wearable device, etc.) capable of performing the screen cropping operation, and the smart device may send the cropped target video to the playback device that plays it.
  • the process of cropping the target video screen as shown in FIG. 14 may be executed by the processing device 112.
  • the method for cropping a target video frame may be stored in a storage device (such as a storage device or memory of a processing terminal) in the form of a program or instruction.
  • the process 1400 of the method for cropping a target video frame may include the following steps:
  • Step 1410 Obtain the target video to be cropped.
• In some embodiments, the target video can be used as an advertising video. An advertising video can be understood as video content that uses flexible creativity to engage the audience group associated with it, so as to disseminate information to or market products to that audience group.
  • the target video may be presented to the audience in the form of a TV, an outdoor advertising screen, a web page or a pop-up window of an electronic device (such as a mobile phone, a computer, a smart wearable device, etc.).
• Screen cropping can be understood as cropping a video picture according to a preset picture size. Specifically, a cropping frame can be set based on the main information in the picture and the preset picture size (which can be understood as the target picture size), and the picture cropped based on the cropping frame.
  • the main information in the screen can include scenes, characters, commodities, and so on.
  • the target video generated in the foregoing steps can be directly obtained.
  • the process 1400 can run independently, that is, videos from other channels can be obtained as the to-be-processed video 14.
  • Step 1420 Determine one or more shot segments based on the target video.
  • Step 1430 Obtain the cropping subject information of each video segment contained in the target video, and the cropping subject information reflects the specific cropping subject of the video segment and the position information of the specific cropping subject.
  • the cropped subject may also be one of the aforementioned specific object and subject, and the method of obtaining may refer to the corresponding description.
  • the tailoring subject information can be determined by a machine learning method.
  • step 1430 can include obtaining the tailoring subject information in each shot segment (video segment) by using a machine learning model.
  • the machine learning model can identify the cropped subject in each of the shot segments, and the machine learning model can also obtain the cropped subject information while recognizing the cropped subject.
  • the cropped subject information can represent some information related to the cropped subject, and the cropped subject information is used to at least reflect the position of the cropped subject.
• at a minimum, the cropping subject information includes the location information and name information of the cropping subject.
  • the crop subject information may include position information, size information, name information, category information, etc. of the crop subject.
• the position information of the cropped subject can be understood as information about the subject's position within the picture of the target video; for example, it can be the coordinate information of a reference point.
  • the size information of the cropped subject may include actual size information of the cropped subject and information on the proportion of the cropped subject to the size of the target video frame.
  • the category information of the crop subject can be understood as the category of the crop subject.
  • the category information of the crop subject includes information about whether the subject of the crop is classified as a person or an object.
• Further, the category information of the crop subject can also include whether the subject is a skin care product, a toiletry, a car, etc. For example, if the crop subject is shampoo, the name information of the crop subject may be "shampoo", and the category information of the crop subject may be "toiletries".
  • the machine learning model may be a generative model, a decision model, or a deep learning model in machine learning.
• it may be a deep learning model using algorithms such as the YOLO series, Faster R-CNN, or EfficientDet.
  • the machine learning model can detect the set objects that need attention in each video frame. Objects that need attention can include creatures (humans, animals, etc.), commodities (cars, decorations, cosmetics, etc.), backgrounds (mountains, roads, bridges, houses, etc.), and so on. Further, for the target video, the object to be paid attention to can be further set, for example, it can be set as a human face or a product. Multiple shot fragments can be input into the machine learning model, and the machine learning model can output data such as name information and position information of the cropped subject in each shot fragment.
• the machine learning model can be trained based on a large number of labeled training samples. Specifically, the labeled training samples are input into the machine learning model, and training is performed by a common method (for example, gradient descent) to update the relevant parameters of the machine learning model. The training samples may be shot fragments and the cropped subjects included in the shot fragments, and can be obtained by calling data in the memory or the database.
• the label of a sample may indicate whether the object in the shot segment is the cropping subject: if yes, it is marked "1", otherwise "0". The labeling method may be manual marking, automatic marking by machine, or other methods, which is not limited in this embodiment.
  • the cropping theme information of the target video can also be obtained.
• The cropping theme information may be the aforementioned specific theme of the target video. For example, it can be keyword information in the title or introduction of the to-be-processed target video, tag information of the target video, user-defined information, or information already stored in a database.
• Step 1440 Crop the picture of each video segment contained in the target video according to the picture size preset by the video size condition and the cropping subject information.
  • step 1440 may include cropping the frame of the shot fragment according to the preset frame size and cropping subject information.
  • the picture size preset by the video size condition can be understood as the target size for cropping the picture of the target video.
  • the preset screen size may include the target aspect ratio of the screen, and may also include the target width and/or target height of the screen.
• Specifically, the aspect ratio and size of the cropping frame of each video frame are set according to the preset picture size, and each video frame is cropped based on its cropping frame: the picture outside the cropping frame is cropped away and the picture inside the cropping frame is retained.
• The user can manually input the preset picture size into the system according to the display and playback size of the target playback terminal of the target video. The system can also automatically obtain the optimal playback size of the target playback terminal for the target video; that optimal size can be stored in the device where the target video is to be played.
  • the cropping frame can be understood as the cropping boundary determined according to the target screen size of the screen cropping.
  • the cropping frame can be rectangular, parallelogram, circular, etc.
• step 1440 may further include the following steps: determining the size and initial position of the cropping frames of several video frames in the video segment according to the cropping subject information and the preset picture size; processing the initial positions of the cropping frames of the several video frames to determine the final positions corresponding to the cropping frames; and cropping the picture of each video frame of the video clip according to the final position of its cropping frame, so as to keep the picture inside the cropping frame.
  • the final positions of the cropping boxes of several video frames are determined according to the initial positions of the cropping boxes of several video frames contained in the video clip.
  • the initial position of the cropping frame can be understood as the position of the cropping frame preliminarily determined based on the cropping subject information and the preset screen size, and the final position of the cropping frame can be understood as the new position determined after processing the initial position information.
  • the initial position information may include the initial coordinate information of the reference point, and the final position information may include the final coordinate information of the reference point.
• In some embodiments, when determining the size and initial position of the cropping frames of several video frames in the video clip according to the cropping subject information and the preset picture size, the relevance between each cropping subject and the cropping theme information can first be determined according to the cropping theme information and the cropping subject information, and then the initial position and size of the cropping frame determined according to that relevance, the cropping subject information, and the preset picture size.
• For details, please refer to the relevant description in the part of FIG. 16.
• In some embodiments, when the size and initial position of the cropping frames of several video frames are determined according to the cropping subject information and the preset picture size, the aspect ratio of the cropping frame can first be determined according to the preset picture size, the initial position and size of the cropping frame then determined based on the position and size of the cropping subject and that aspect ratio, and the cropping frame finally scaled equally according to the preset picture size. For example, if the preset picture size is 800×800, the aspect ratio of the cropping frame is set to 1:1; if the preset picture size is 720×540, the aspect ratio of the cropping frame is set to 4:3.
  • the initial position and size of the cropping frame can be determined based on the overlapping area of the cropping frame with the same aspect ratio and the area where the crop subject is located in the picture of the video frame.
  • the cropping frame and the screen within the cropping frame are reduced or enlarged in proportion to width and height, so that the size of the cropping frame meets the preset screen size, thereby preventing black borders from appearing in the cropped screen.
  • for example, if the picture size of the video frame is 1024×768 and the preset picture size is 960×960, a 768×768 cropping frame can first be determined, the video frame cropped according to that cropping frame, and the cropped picture then enlarged at the same scale to 960×960, as illustrated in the sketch below.
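  • as a hedged illustration of this cropping-and-scaling step (a minimal sketch in Python; the function name and the integer arithmetic are our assumptions, not part of the original disclosure):

        from math import gcd

        def crop_and_scale(frame_w, frame_h, preset_w, preset_h):
            """Largest crop box matching the preset aspect ratio, plus the
            uniform scale factor needed to reach the preset picture size."""
            g = gcd(preset_w, preset_h)
            ar_w, ar_h = preset_w // g, preset_h // g      # e.g. 960x960 -> 1:1
            k = min(frame_w // ar_w, frame_h // ar_h)      # largest box that fits the frame
            crop_w, crop_h = ar_w * k, ar_h * k            # 1024x768 -> 768x768
            return crop_w, crop_h, preset_w / crop_w       # scale factor, here 1.25

        print(crop_and_scale(1024, 768, 960, 960))         # (768, 768, 1.25)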
  • determining the final position corresponding to the crop box of several video frames in the video clip may specifically include: selecting several video frames from all the video frames contained in the video clip, and judging whether the distance between the reference points (such as the center points) of the cropping frames of each pair of video frames separated by a preset number of frames is less than a preset distance; if the number of pairs of cropping frames whose distance is less than the preset distance is greater than a preset number of pairs, the position of the crop subject in the video clip can be understood to be relatively static.
  • in that case, the average position of the reference points of the crop boxes of all video frames contained in the video clip can be obtained, and the position of the crop box of each video frame adjusted based on the average position; if the number of pairs of crop frames whose distance is less than the preset distance is less than the preset number of pairs, the position of the crop subject in the video clip can be understood to be dynamically changing.
  • in that case, a smooth trajectory line can be determined based on the positions of the reference points of the crop boxes of all video frames contained in the video clip, and the position of the crop box of each video frame adjusted based on the trajectory line (for example, so that the reference point of the crop box of each video frame is located on the trajectory line).
  • the preset number of frames may be 2 frames, 3 frames, 5 frames, or the like. In other embodiments, a pair of video frames separated by a preset number of frames may also be a pair of adjacent video frames.
  • the reference point in this specification can be the center point of the cropping frame, the top left corner of a rectangular frame, the bottom right corner of a rectangular frame, and so on; the reference point is preferably the center point of the cropping frame, so as to reduce changes in the relative positions of the cropping frame and each crop subject within it when the cropping frame is moved. The static-versus-dynamic judgment is sketched below.
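  • a minimal sketch of the static-versus-dynamic judgment described above (Python; the relative-coordinate convention and all threshold values are illustrative assumptions):

        import numpy as np

        def subject_is_static(centers, frame_gap=3, dist_thresh=0.02, pair_thresh=5):
            """centers: (N, 2) array of crop-box reference points in relative
            coordinates; counts pairs `frame_gap` frames apart whose reference
            points moved less than `dist_thresh`."""
            pts = np.asarray(centers, dtype=float)
            deltas = np.linalg.norm(pts[frame_gap:] - pts[:-frame_gap], axis=1)
            return int((deltas < dist_thresh).sum()) > pair_thresh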
  • adjusting the position of the cropping frame of each video frame in the video clip may specifically include the following steps: smoothing the initial coordinate information of the reference points of the cropping frames of several video frames of the video clip over time; determining the final coordinate information of the reference points according to the result of the smoothing; and determining the position of each reference point based on the final coordinate information.
  • the initial coordinate information of the reference points of the crop frames of several video frames of the video clip is smoothed over time, which may be a linear regression processing of the coordinate values of the reference points.
  • for details of the linear regression processing, please refer to the relevant description of FIG. 15.
  • FIG. 15 is a schematic diagram of smoothing processing according to some embodiments of the present application.
  • smoothing the initial coordinate information of the reference points includes performing linear regression processing to obtain the linear regression equation and its slope.
  • specifically, linear regression can be performed on the initial coordinate information of the reference point of each cropping frame over time to obtain the linear regression equation, the fitted straight line segment (see FIG. 15), and the slope of the linear regression equation; based on the fitted straight line segment and the slope, the final coordinate information of the reference point of each cropping frame is obtained.
  • the absolute value of the slope can be compared with a slope threshold.
  • if the absolute value of the slope is less than the slope threshold, the position of the cropped subject in the video segment is considered to be relatively static, and the position of the midpoint of the fitted straight line segment is taken as the final position of the reference point of the cropping frame of each video frame; if the absolute value of the slope is greater than the slope threshold, the position of the cropped subject in the video segment is considered to be dynamically changing, and the position of each time point on the fitted straight line segment is taken as the final position of the reference point of the cropping frame of each video frame at the corresponding time point.
  • the slope threshold can be set to 0.1, 0.05, or 0.01, etc.; those skilled in the art can set the slope threshold according to the actual situation of the target video, for example, setting it higher (such as 0.1) when slow subject drift should still be treated as static.
  • as an example, linear regression is performed on a video segment consisting of 12 video frames.
  • in this example, a horizontal-screen video is converted to a vertical-screen video, so the vertical coordinate of the center of the cropping frame can be fixed at the center position 0.5, and only the horizontal coordinate needs to be smoothed.
  • the specific process of the linear regression processing is as follows: the 12 video frames correspond to 12 time points 1, 2, 3, ..., 12, and the initial relative abscissas of the reference points of the cropping frames of the video frames are 0.91, 0.87, 0.83, 0.74, 0.68, 0.61, 0.55, 0.51, 0.43, 0.39, 0.37, 0.34.
  • 12 data points are thus obtained, whose coordinates are: (1,0.91), (2,0.87), (3,0.83), (4,0.74), (5,0.68), (6,0.61), (7,0.55), (8,0.51), (9,0.43), (10,0.39), (11,0.37), (12,0.34).
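  • reproducing this example as a sketch (Python; the slope threshold value is illustrative):

        import numpy as np

        t = np.arange(1, 13)                               # time points 1..12
        x = np.array([0.91, 0.87, 0.83, 0.74, 0.68, 0.61,
                      0.55, 0.51, 0.43, 0.39, 0.37, 0.34]) # initial abscissas

        slope, intercept = np.polyfit(t, x, 1)             # fitted line x = slope*t + intercept
        print(round(slope, 3))                             # about -0.056

        slope_threshold = 0.01
        if abs(slope) <= slope_threshold:                  # relatively static subject
            final_x = np.full_like(x, slope * t.mean() + intercept)  # midpoint of fitted segment
        else:                                              # dynamically changing subject
            final_x = slope * t + intercept                # points on the fitted segment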
  • alternatively, the initial coordinate information of the reference points of the crop boxes of the multiple video frames included in each video segment is smoothed over time by a polynomial regression processing of the coordinate values. Specifically, a polynomial regression can be performed on the coordinate values of the reference point of each cropping frame over time to obtain a fitting curve. Then, the position of each time point on the fitting curve can be taken as the final position of the reference point of the crop frame of each video frame at the corresponding time point.
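  • continuing the sketch above, the polynomial variant only changes the fit degree (degree 3 here is an arbitrary choice, not specified by the text):

        coeffs = np.polyfit(t, x, deg=3)                   # polynomial regression over time
        final_x_poly = np.polyval(coeffs, t)               # positions on the fitting curve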
  • FIG. 16 is a flowchart of determining the size and position of the crop box of each video frame according to some embodiments of the present application.
  • one or more steps in the process 1600 may be stored as instructions in a storage device (for example, the database 140) and called and/or executed by a processing device (for example, the processing device 112 in the server 110).
  • Step 1610 Determine the correlation between one or more cropped subjects in the cropped subject information and the subject information according to the subject information and the cropped subject information of the target video.
  • the subject information may include a specific subject of the target video.
  • the correlation between a cropped subject and the subject information can be used to indicate the degree of relevance between the two; the higher the correlation, the more closely the cropped subject is related to the subject information.
  • for example, the correlation between "steering wheel cover" and "car interior" is greater than the correlation between "car door" and "car interior", and the correlation between "car door" and "car interior" is greater than the correlation between "hand cream" and "car interior".
  • the correlation between a cropped subject and the subject information can be obtained based on the explanatory texts of the two.
  • an explanatory text can be a textual description of the cropped subject or of the subject information.
  • for example, the explanatory text of "car interior" is: "car interior mainly refers to the car products used in the interior modification of a car, involving all aspects of the car interior; for example, car steering wheel covers, car seat cushions, car floor mats, car perfumes, car pendants, interior decorations, storage boxes, etc. are all car interior products."
  • the explanatory text of "steering wheel cover" is: "a steering wheel cover refers to a cover fitted on the steering wheel; the steering wheel cover is very decorative."
  • the explanatory texts of the cropped subjects and the subject information can be pre-stored in the system, or can be obtained from the Internet in real time based on the cropped subject's name and the subject information.
  • the representation vector of an explanatory text can be obtained based on a text embedding model such as word2vec, and the correlation between the cropped subject and the subject information can be obtained based on the distance between the representation vectors, as sketched below.
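  • one possible concrete form of this step (a sketch assuming a pretrained word2vec file loadable by gensim; the file name, the whitespace tokenizer, and the choice of cosine similarity are all our assumptions):

        import numpy as np
        from gensim.models import KeyedVectors

        wv = KeyedVectors.load_word2vec_format("word2vec.bin", binary=True)  # hypothetical file

        def embed(text):
            """Average the vectors of in-vocabulary tokens as a simple text embedding."""
            vecs = [wv[w] for w in text.lower().split() if w in wv]
            return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

        def correlation(subject_text, topic_text):
            """Cosine similarity of the two explanatory-text embeddings."""
            a, b = embed(subject_text), embed(topic_text)
            denom = np.linalg.norm(a) * np.linalg.norm(b)
            return float(a @ b / denom) if denom else 0.0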
  • the crop subjects can also be confirmed through candidate crop subjects.
  • a candidate crop subject is similar to the aforementioned candidate subject and can be input by the user or determined from each video clip through machine learning.
  • the method of determining the crop subjects can refer to the method of determining a specific subject based on the candidate subjects in the foregoing process, that is, using a machine learning model to obtain the candidate crop subjects in each of the video clips, and then selecting the one or more specific crop subjects from the candidate crop subjects according to the subject information of the target video.
  • Step 1620 Determine multiple candidate cropping frames corresponding to at least one video frame according to the preset picture size and the specific cropping subject.
  • at least one candidate cropping frame can be set in each video frame according to the preset picture size and the specific crop subjects. In a video frame that does not contain any crop subject, only one candidate cropping frame may be set, centered by default. In a video frame containing at least one crop subject, multiple candidate cropping frames may be set; the positions and/or sizes of the reference points of the multiple candidate cropping frames differ, while their aspect ratios are the same.
  • Step 1630 Score at least one candidate cropping frame according to the cropping subject information and the correlation degree.
  • for each candidate cropping frame, each crop subject it contains can first be scored based on the correlation between that crop subject and the subject of the target video; the score of each crop subject is determined, and the score of the candidate cropping frame is then calculated.
  • specifically, the correlation between each cropped subject and the video theme can be used as a weight and multiplied by the score of the corresponding cropped subject, and the products summed to obtain the score of each candidate cropping frame.
  • the score of each cropped subject may be the ratio of the area occupied by the cropped subject to the total area of the video frame.
  • for example, the subject of a video is "toilet care products".
  • a candidate cropping box in a certain frame of the video contains three complete cropped subjects: shampoo 1, shampoo 2, and face 1.
  • the correlations between shampoo 1, shampoo 2, and face 1 and "toilet care products" are 0.86, 0.86, and 0.45, respectively.
  • the crop subject scores of shampoo 1, shampoo 2, and face 1 are 0.35, 0.1, and 0.1, respectively, so the score of the candidate cropping box is 0.86×0.35+0.86×0.1+0.45×0.1=0.432, as computed in the sketch below.
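  • the weighted scoring of a candidate cropping box reduces to a sum of products, as in this sketch (Python; the data reproduce the example above):

        def score_crop_box(subjects):
            """subjects: (correlation_to_theme, area_ratio) pairs for the crop
            subjects fully contained in a candidate cropping box."""
            return sum(corr * area for corr, area in subjects)

        # shampoo 1, shampoo 2, face 1 from the example:
        print(score_crop_box([(0.86, 0.35), (0.86, 0.10), (0.45, 0.10)]))  # ~0.432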
  • Step 1640 Determine the size and position of the crop frame of at least one video frame based on the scoring result of the candidate crop frame.
  • the method for determining the size and position of the cropping frames of the video frames may take the candidate cropping frame with the highest score in the video segment as the standard.
  • in that case, the position of the reference point of that candidate cropping frame is taken as the final position of the reference points of the cropping frames of all video frames in the video clip, and its size is taken as the size of the cropping frames of all video frames in the video clip.
  • the method for determining the size and position of the crop frame of a video frame may also be to select the candidate crop frames ranked in the top Y by score in each video frame and perform a calculation over those Y candidate crop frames.
  • specifically, the average position of their reference points is taken as the position of the crop frame of the video frame, and the size of the highest-scoring candidate crop frame is taken as the size of the crop frame of the video frame.
  • the value of Y can be selected as 3, 4, 5, or 8, etc., and those skilled in the art can determine the value of Y according to the number of candidate cropping frames in each video frame.
  • in this way, the size and position of the crop frame are determined based on the cropped subject information and the correlation between the subject information of the target video and the crop subjects.
  • as a result, the cropped target video loses as little main information (information related to the subject information of the target video) as possible.
  • the method 1400 for cropping a target video frame may further include: step 1450: splicing the cropped shot fragments into a new target video in a predetermined order.
  • the predetermined sequence may be the sequence determined by the sequence information of the video configuration information, or may be a new splicing sequence set by the user.
  • each initial video and/or initial image can be cropped to be the same size.
  • this application may generate multiple target videos for delivery, and optimize the video generation algorithm of this application according to the feedback results of different videos.
  • the aforementioned multiple target videos may be generated based on different audiences, and correspondingly, the target videos are delivered to a specific audience group, where the audience refers to the target video delivery group.
  • the specific audience group may be a group of people of a specific age, gender, and behavior characteristics.
  • the specific age may refer to a younger audience (for example, if the proportion of users aged 15-40 on the delivery platform is 80%, the platform's users are considered to be younger), a middle-aged audience, an aging audience, and so on.
  • the gender can be characterized by a male to female ratio (for example, a male to female ratio of 1:3).
  • the behavior characteristics may include browsing habits, shopping preferences, and so on. For example, what kind of videos the users of the platform prefer to browse.
  • the delivery duration may be short (for example, less than one week) or long (for example, more than one week), etc.
  • the launch period can be a peak period (the 618 promotion period, the Double Eleven period), an idle period (a non-promotion period), etc.
  • the platform placement position can relate to, for example, whether the video is placed as a homepage recommendation.
  • the characteristics of the delivery platform can include the platform type (online platform (APP) or offline platform (airport, subway, etc.)), the APP type (video playback, music playback, learning, etc.), platform traffic characteristics (such as large traffic), and so on.
  • FIG. 17 is a flowchart of generating a target video according to a target video audience group shown in some embodiments of the present application.
  • one or more steps in the process 1700 may be stored in a storage device (for example, the database 140) in the form of instructions and called and/or executed by a processing device (for example, the processing device 112 in the server 110).
  • Step 1710 Obtain the audience acceptance of multiple video clips.
  • the video clip may have obvious relevance to the video audience.
  • a video clip containing Ultraman may be popular with a child audience, while a middle-aged audience may not be interested.
  • Audience acceptance is an indicator that describes the relevance of video clips and video audiences.
  • the audience acceptance of a video clip can be obtained by inputting the video clip or the elements of the video clip (for example, each tag ID and tag value) into a trained machine learning model.
  • the machine learning model can be determined according to the delivery effect of each video clip under different audiences, and the relevant description of the delivery effect can refer to the subsequent step 1740.
  • Step 1720 For a specific audience group, determine candidate fragments whose audience acceptance is higher than the threshold from a plurality of video fragments according to corresponding demographic conditions, so as to generate a target video.
  • audience acceptance may be specifically expressed as a specific tag of the video clip and its corresponding tag value.
  • for example, the tag ID of the audience acceptance of female audiences may be #61, and the tag ID of the audience acceptance of male audiences may be #62.
  • the corresponding tag value is the specific audience acceptance.
  • the audience acceptance can be quantitatively described.
  • for example, the tag value corresponding to the audience acceptance can be any real number within [-1,1], where a positive value means like, a negative value means dislike, and 0 indicates no interest; the greater the absolute value of the tag value, the higher the degree.
  • the threshold in step 1720 may be a value in the range of tag values that represents a preference, such as 0.5.
  • the audience acceptance can be qualitatively described.
  • for example, the tag value corresponding to the audience acceptance can be -3, -2, -1, 0, 1, 2, 3.
  • such values can be expressed with a four-bit binary tag, where the first two bits of the tag value indicate dislike, the last two bits indicate like, and a tag value of 0 indicates no interest.
  • in this case, the threshold may be a value indicating a preference, for example, the first two bits of the tag value being 00 and the last two bits being greater than 01; a toy encoding is sketched below.
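  • a toy encoding of the two acceptance schemes just described (Python; the function names and the exact bit layout are our reading of the text, not a normative format):

        def accepts_quantitative(tag_value, threshold=0.5):
            """Tag value in [-1, 1]; a positive value means like."""
            return tag_value > threshold

        def accepts_qualitative(tag_bits):
            """Four-bit tag: first two bits = dislike, last two bits = like;
            acceptance here means no dislike and a like level above 01."""
            return tag_bits[:2] == "00" and int(tag_bits[2:], 2) > 0b01

        print(accepts_quantitative(0.7), accepts_qualitative("0001"))  # True False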
  • the first preset condition in the process 800 may include that the audience acceptance in step 1720 is higher than the threshold.
  • the implementation of step 1720 can refer to the related description of step 820 in the process 800.
  • Step 1730 Obtain the delivery effect feedback of the target video, and adjust at least one of demographic conditions or audience acceptance according to the delivery effect feedback.
  • the delivery effect may be determined by at least one of related indicators such as the click-through rate, the completion rate, the number of replays, or the number of viewers of the target video. For example, a higher completion rate of a target video can be taken to mean that the audience likes the video, and a higher number of replays can indicate a higher degree of liking.
  • the delivery effect of each video segment of the target video can also be determined. For example, it can be determined based on the ratio of each video segment's switch-away amount (the number of users who stop playback of the video at this segment and switch to the next video) to its playback volume; this ratio reflects how much users like the video clip.
  • the delivery effect of a video clip may also be determined according to the parts users skip or replay after delivery, where a skipped video clip indicates a poor playback effect.
  • the delivery effect may be input to the machine learning model in step 1710 as feedback to realize the re-determination of demographic characteristics or audience acceptance.
  • the corresponding tag value of the corresponding video segment can be directly modified according to the delivery effect of each video segment.
  • the delivery effect of the target video can be estimated based on the aforementioned determined delivery effect and the machine learning model, and the target video with the delivery effect higher than a preset value can be delivered.
  • the estimated effect data of the target video may be determined based on the element effect parameters of at least one advertisement element of the advertisement creative.
  • advertisement elements can be understood as the component units of an advertisement creative, which may specifically include main shot elements (the aforementioned specific objects, subjects, cropped subjects, etc.), decorative elements (background pictures, models, copywriting, trademarks, product pictures, and/or promotional logos, etc.), and element presentation methods (animation actions, AE templates, etc.).
  • the data related to the delivery effect of the advertisement creative may also include click-through rate, exposure rate, conversion rate, return on investment, and the like.
  • Click-through rate can be understood as the click volume of an online advertisement creative, that is, the actual number of clicks on the advertisement.
  • Exposure rate can be understood as the number of times an ad creative is displayed per unit time. For example, suppose a certain network media has 3000 views per day. If an advertisement exclusively occupies a certain advertising space, the daily exposure of the advertisement will be 3000; if the advertising space displays 3 advertisements in turn, the daily exposure of each advertisement will be 3000/3 = 1000.
  • Conversion rate can be understood as the ratio of the number of times an advertisement is clicked and converted to further effects (such as purchase and payment orders) to the number of advertisement clicks.
  • the return on investment can be understood as the ratio of the return on advertising to the cost of input.
  • the delivery effect data of the placed advertising creative may include the effect data of a preset time period, such as one day, one week, one month, or one quarter.
  • the placement effect data of the advertisement creative that has been placed may also include the placement effect change trend of the advertisement creative over time, season, and platform audience characteristics.
  • FIG. 18 is a flowchart of confirming the estimated effect data of the target video according to some embodiments of the present application.
  • one or more steps in the process 1800 may be stored as instructions in a storage device (for example, the database 140) and called and/or executed by a processing device (for example, the processing device 112 in the server 110).
  • Step 1810 Obtain an advertisement element effect prediction model.
  • the advertising element effect prediction model may be a model that can score each advertising element.
  • the advertising element effect prediction model can be a trained machine learning model: several placed advertisement creatives containing an advertising element, together with their delivery effect data, are input as features to the trained advertising element effect prediction model, which outputs the score of the advertising element.
  • the advertising element effect estimation model can be determined based on the aforementioned specific audience group and the specific theme of the target video.
  • for example, the specific audience group may be women, and the specific theme of the target video may correspond to a mouthwash advertisement.
  • the corresponding advertising element effect prediction model is then one for cleaning and/or washing-and-care daily necessities targeted at women.
  • Step 1820 Input the advertisement element marked with at least one element tag into the advertisement element effect prediction model, and determine the element effect parameter of the advertisement element.
  • the at least one element tag includes the relationship between the advertisement element and the creative advertisement, where the element effect parameter of the advertisement element may refer to the contribution amount of the advertisement element in a certain period of time.
  • the element tags may be equivalent to the various elements in the foregoing step 230, where the relationship between the advertisement element and the advertisement creative may be the relationship between a specific object, subject, or cropped subject in the video clip and a specific subject of the target video, which may be represented by a correlation.
  • the relationship between the advertisement element and the creative advertisement may also include the relationship between the advertisement element and a specific occasion (such as a subway station, a railway station, a giant screen in a city center), and a specific time (such as Double Eleven, Valentine's Day).
  • step 1820 may first determine a number of placed advertisement creatives containing the advertisement element; based on the delivery effect data of the advertisements that have been placed, the delivery effect data of the several placed advertising creatives containing the advertisement element is determined; the server then takes the average, median, cumulative sum, or weighted cumulative sum of the delivery effect data of those creatives as the element effect parameter of the advertisement element.
  • for example, the placed advertisement creatives containing the advertisement element a are M1, M2, M3, and M4.
  • the average click-through rates of M1, M2, M3, and M4 are 1000, 2000, 500, and 3500, respectively, so the element effect parameter of element a, taken as the average, is (1000+2000+500+3500)/4 = 1750, as computed below.
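  • the aggregation in this example is simple descriptive statistics, for instance (Python; the function name is illustrative):

        from statistics import mean, median

        def element_effect(effect_data, mode="mean"):
            """Aggregate the delivery-effect data of creatives containing an element."""
            return mean(effect_data) if mode == "mean" else median(effect_data)

        print(element_effect([1000, 2000, 500, 3500]))     # 1750 for element a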
  • the element effect parameter of the advertisement element may include data obtained by numerical statistical calculation through the placement effect data of the advertisement creative that contains the advertisement element, which can intuitively reflect the contribution of the advertisement element over a period of time.
  • the element effect parameter of each advertisement element may be determined through a placement experiment.
  • orthogonal experimental design can be used to calculate the smallest set of advertising creatives that covers the most advertising elements, so that effect data for the most advertising elements can be obtained by placing the fewest creatives. Further, advertising elements can first be classified, with the requirement that all advertising elements of certain specified categories (such as models or copywriting) must be delivered; the orthogonal experiment algorithm then calculates the minimum advertising creative set covering all advertising elements of the specified categories.
  • Step 1830 Based on the element effect parameter of the at least one advertisement element, determine an advertisement element that meets expectations among the at least one advertisement element, where the element effect parameter of an advertisement element that meets expectations is greater than a parameter threshold.
  • Step 1840 Determine the proportion of advertisement elements that meet the expectations in at least one advertisement element in the creative advertisement.
  • the proportion of advertisement elements that meet expectations in the at least one advertisement element of the advertisement creative may be the proportion that each advertisement element meeting expectations occupies in the combination of video clips making up the target video.
  • Step 1850 Determine the estimated effect data of the target video based on the ratio.
  • the estimated effect data of the target video can be obtained by estimation based on the delivery effect data of each advertising element in the target video and the delivery effect data of the several advertisement creatives containing those advertising elements.
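  • one simple way to fold the proportions into an estimate, offered only as a toy sketch (the share-weighted aggregation is our assumption; the text does not fix a formula):

        def estimate_video_effect(elements, parameter_threshold):
            """elements: (effect_parameter, share_in_clip_combination) pairs.
            Share-weighted average over the elements meeting expectations."""
            meeting = [(p, s) for p, s in elements if p > parameter_threshold]
            total_share = sum(s for _, s in meeting)
            return sum(p * s for p, s in meeting) / total_share if total_share else 0.0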
  • FIG. 19 is an exemplary flowchart of a model training process according to some embodiments of the present application.
  • one or more steps in the process 1900 may be stored as instructions in a storage device (for example, the database 140) and called and/or executed by a processing device (for example, the processing device 112 in the server 110).
  • the process 1900 may be used to train the first initial model.
  • Step 1910 Obtain the first training set.
  • the first training set refers to the training sample set used to train the first initial model.
  • the first training set includes multiple video pairs, where each video pair includes the image features corresponding to a first sample video, the image features corresponding to a second sample video, and the label value corresponding to the two sample videos. The image features corresponding to the first sample video and the second sample video can be obtained through a feature extraction process.
  • the tag value in the video pair reflects the degree of similarity between the first sample video and the second sample video.
  • the label value in the sample set can be manually annotated, or the video pair can be automatically annotated by the corresponding machine learning model.
  • the degree of similarity of each video pair can be obtained from the trained classifier model.
  • the method for obtaining the first training set may be from an image collector such as a camera, a video camera, a smart phone, or the terminal device 130. In some embodiments, the method of obtaining the first training set may be to read directly from a storage system that stores a large number of pictures. In some embodiments, the first training set may also be obtained in any other manner, which is not limited in this embodiment.
  • the first initial model can be understood as an untrained neural network model or a neural network model whose training has not been completed.
  • the first initial model may be or include the trained feature extraction model and/or the initial model corresponding to the discriminant model described in the process 1100.
  • Each layer of the initial model can be set with initial parameters, and the parameters can be adjusted continuously during the training process until the training is completed.
  • Step 1920 Based on the first training set, train a first initial model through multiple iterations to generate a trained first neural network model.
  • the first neural network model may be or include the trained feature extraction model and/or the discriminant model described in the process 1100. Each iteration further includes the following steps.
  • Step 1921 Use the updated first feature extraction model to process the image feature corresponding to the first sample video in the video pair to obtain the corresponding first segment feature.
  • Step 1922 Use the updated second feature extraction model to process the image feature corresponding to the second sample video in the same video pair to obtain the second segment feature.
  • Step 1923 Use the updated discriminant model to process the first segment feature and the second segment feature to generate a discrimination result, and the discrimination result is used to reflect the degree of similarity between the first segment feature and the second segment feature.
  • Step 1924 Determine whether to perform the next iteration or determine the trained first neural network model based on the discrimination result and the label value.
  • a loss function can be constructed based on the discriminant result and the sample label, and the model parameters can be updated based on the loss function backpropagation.
  • the training sample label data can be expressed as y 1 , the discrimination result can be expressed as ŷ 1 , and the calculated loss function value is expressed as Loss 1 .
  • different loss functions can be selected according to the type of the model, such as the mean square error loss function or the cross entropy loss function as the loss function, which is not limited in this specification.
  • a gradient backpropagation algorithm can be used to update the model parameters.
  • the backpropagation algorithm compares the prediction results of a specific training sample with the label data, and determines the update range of each weight of the model.
  • the backpropagation algorithm is used to determine the change of the loss function relative to each weight (also called the gradient or error derivative), which can be recorded as ∂Loss/∂w.
  • the gradient backpropagation algorithm can pass the value of the loss function through the output layer, pass it back to the hidden layer and the input layer layer by layer, and determine the correction value (or gradient) of the model parameter of each layer in turn.
  • the correction value (or gradient) of the model parameters of each layer includes multiple matrix elements (such as gradient elements), which correspond to the model parameters one-to-one; each gradient element reflects the correction direction (increase or decrease) of the parameter and the correction amount.
  • after the gradient has been propagated back through the discriminant model, it is further transferred to the first segment feature extraction model and the second segment feature extraction model, which return the model parameters layer by layer to complete one round of iterative update.
  • the joint training of the first segment feature extraction model, the second segment feature extraction model, and the discriminant model adopts a unified loss function for training, and the training efficiency is higher.
  • based on the discrimination result and the label value, it is possible to determine whether to perform the next iteration or to finalize the trained first neural network model.
  • the criterion for the judgment may be whether the number of iterations has reached a preset number, whether the updated model meets a preset performance index threshold, or whether an instruction to terminate training has been received, etc. If it is determined that the next iteration is required, the next iteration is performed based on the model updated in the current iteration; in other words, the updated model obtained in the current iteration serves as the initial model for the next iteration. If the next iteration is not necessary, the updated model obtained in the current iteration is used as the final trained model.
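  • a compact sketch of one such training round (PyTorch; the GRU extractors, layer sizes, and the MSE loss are illustrative stand-ins for the unspecified architectures):

        import torch
        import torch.nn as nn

        feat_1 = nn.GRU(input_size=128, hidden_size=64, batch_first=True)  # first extractor
        feat_2 = nn.GRU(input_size=128, hidden_size=64, batch_first=True)  # second extractor
        disc = nn.Sequential(nn.Linear(128, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
        params = [*feat_1.parameters(), *feat_2.parameters(), *disc.parameters()]
        opt = torch.optim.Adam(params, lr=1e-3)
        loss_fn = nn.MSELoss()                      # one unified loss for all three parts

        def train_step(img_feats_1, img_feats_2, label):
            """img_feats_*: (batch, frames, 128) image features; label: similarity in [0, 1]."""
            _, h1 = feat_1(img_feats_1)             # first segment feature
            _, h2 = feat_2(img_feats_2)             # second segment feature
            pred = disc(torch.cat([h1[-1], h2[-1]], dim=1)).squeeze(1)
            loss = loss_fn(pred, label)             # Loss_1 in the text
            opt.zero_grad()
            loss.backward()                         # gradients flow back through all three models
            opt.step()
            return loss.item()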
  • FIGS. 20A-20E are schematic diagrams of video synthesis systems according to some embodiments of the present application.
  • the multimedia system 2000 may include an acquisition module 2010, a configuration module 2020, and a generation module 2030.
  • the obtaining module 2010 may be used to obtain multiple video clips.
  • the configuration module 2020 may be used to obtain video configuration information.
  • the generating module 2030 may be configured to generate a target video based on the at least part of the video clip and the video configuration information.
  • the generating module 2030 may also be referred to as a target video generating module.
  • step 210 may be implemented by the acquisition module 2010, step 220 may be implemented by the configuration module 2020, and step 230 may be implemented by the generation module 2030.
  • the acquisition module 2010 may further include a media acquisition module 2011, a segmentation module 2013, and a material processing module 2015.
  • the material processing module 2015 also includes a video processing module 2015a and an image processing module 2015b.
  • the configuration module 2020 may further include an identification module 2021, where the identification module 2021 may also be referred to as a subject acquisition module.
  • the generating module 2030 may also include a screening module 2031, a combination module 2033, and a video synthesis module 2035.
  • the multimedia system 2000 may further include a post-processing module 2040, where the post-processing module 2040 may include a cropping module 2041 and an effect estimation module 2043.
  • the media acquisition module 2011 may be used to acquire the initial video or initial image to implement steps 310, 610, and 810 related to the initial video or initial image.
  • the media acquisition module 2011 may also be used to acquire initial audio to implement step 1310.
  • the segmentation module 2013 may be used to segment the video file according to the lens frame, so as to implement steps 320, steps 420 to 440, step 1330 and other steps related to segmentation of the lens frame.
  • the segmentation module 2013 may also be used to determine the clipping point of the audio file, so as to implement step 1320.
  • the material processing module 2015 can be used to generate video clips, and can also be used to process segmented video materials, such as rendering, beautifying, and applying video templates; it can also be used to combine different types of materials, for example, merging the audio file with the video file in step 1340.
  • the material processing module 2015 may specifically include a video processing module 2015a for processing video materials corresponding to the initial video and a picture processing module 2015b for processing image materials corresponding to the initial image.
  • the configuration module 2020 may further include an identification module 2021 for identifying a subject.
  • the identification module 2021 can be combined with other modules, and the subject it identifies changes to the corresponding content accordingly; for example, when combined with the cropping module 2041, the corresponding subject is a cropped subject.
  • the screening module 2031 may be specifically configured to screen video files according to conditions. For example, in step 820, candidate video clips may be screened from clips according to a first preset condition. The screening module 2031 can also determine the initial image, the initial video, and the video segment related to the target video according to whether the subject is included.
  • the combining module 2033 may be specifically configured to generate a segment combination according to the candidate video segments, and the combining module 2033 may also be configured to determine a set of segments used to generate the target video according to a second preset condition.
  • the video synthesis module 2035 may specifically generate the target video according to the segment set.
  • the post-processing module 2040 may be used to reprocess the target video after the target video is generated, for example, to target the target video to a specific audience.
  • the cropping module 2041 may be used to modify the size of the video file, for example, modify the size of the target video according to the size of the playback medium.
  • the effect estimation module 2043 is used to estimate the playback effect of the target video.
  • the media acquisition module 2011 can also be regarded as the acquisition module 2010. It is understandable that the above modules can be combined according to actual needs to implement different methods. For example, as shown in FIG. 20C, the media acquisition module 2011, the subject acquisition module (i.e., the recognition module 2021), the video processing module 2015a, the image processing module 2015b, and the target video generation module (i.e., the generation module 2030) can be combined to generate video using video material and image material.
  • for another example, as shown in FIG. 20D, the acquisition module 2010, the segmentation module 2013, the screening module 2031, the combining module 2033, and the video synthesis module 2035 can be combined to realize the splitting and reorganization of video files.
  • for another example, the acquisition module 2010, the segmentation module 2013, the recognition module 2021, and the cropping module 2041 can be combined to achieve precise cropping of a specific video file.
  • a computer storage medium may contain a propagated data signal containing a computer program code, for example on a baseband or as part of a carrier wave.
  • the propagated signal may have multiple manifestations, including electromagnetic forms, optical forms, etc., or suitable combinations.
  • the computer storage medium may be any computer-readable medium other than a computer-readable storage medium, and the medium may be connected to an instruction execution system, apparatus, or device to realize communication, propagation, or transmission of the program for use.
  • the program code located on the computer storage medium can be transmitted through any suitable medium, including radio, cable, fiber optic cable, RF, or similar medium, or any combination of the above medium.
  • the computer program code required for the operation of each part of this specification can be written in any one or more programming languages, including object-oriented programming languages such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB.NET, and Python, conventional procedural programming languages such as C, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP, and ABAP, dynamic programming languages such as Python, Ruby, and Groovy, or other programming languages.
  • the program code can run entirely on the user's computer, or as an independent software package on the user's computer, or partly on the user's computer and partly on a remote computer, or entirely on the remote computer or processing equipment.
  • the remote computer can be connected to the user's computer through any form of network, such as a local area network (LAN) or a wide area network (WAN), or connected to an external computer (for example, via the Internet), or used in a cloud computing environment, or as a service such as software as a service (SaaS).
  • in some embodiments, numbers describing quantities of components and attributes are used. It should be understood that such numbers used in the description of the embodiments are modified in some examples by the terms "about", "approximately", or "substantially". Unless otherwise stated, "about", "approximately", or "substantially" indicates that the number is allowed to vary by ±20%.
  • accordingly, in some embodiments, the numerical parameters used in the description and claims are approximate values, and the approximate values can change according to the required characteristics of individual embodiments. In some embodiments, the numerical parameters should take into account the prescribed significant digits and adopt a general digit-retention method. Although the numerical ranges and parameters used to confirm the breadth of ranges in some embodiments of this specification are approximate values, in specific embodiments such numerical values are set as accurately as feasible.

Abstract

A system and a method for generating a video. Said method comprises acquiring a plurality of video segments (210). Said method further comprises acquiring video configuration information (220), the video configuration information comprising one or more configuration features of at least some video segments among the plurality of video segments, and the configuration features including at least one of a content feature or an arrangement feature. Said method further comprises generating a target video on the basis of at least some video segments and the video configuration information (230).

Description

System and method for generating video

Cross Reference

This application claims priority to Chinese patent application 202010578632.1 filed on June 23, 2020, Chinese patent application 202010741962.8 filed on July 29, 2020, Chinese patent application 202010738213.X filed on July 28, 2020, and Chinese patent application 202110503297.3 filed on May 10, 2021, the entire contents of which are incorporated herein by reference.

Technical Field

This application relates to video processing, and in particular, to a system and method for generating video.

Background

With the development of multimedia technology, video has become one of the information media people encounter most often. However, many problems remain in the processes of video cropping, shot combination, and terminal playback. For example, automatic shot combination by a system suffers from problems such as monotonous content, high repetition, and low subject focus. For another example, when the same video is played on different playback terminals, target objects (for example, people or products) may be covered or displayed incompletely. Therefore, there is an urgent need for a method and system for generating video that meets diverse video production and playback needs.

Summary of the Invention
One aspect of the present application provides a system for generating video. The system includes at least one storage medium storing a set of instructions and at least one processor configured to communicate with the at least one storage medium, wherein when the set of instructions is executed, the at least one processor is caused to perform one or more operations, the operations including: obtaining multiple video clips; obtaining video configuration information, where the video configuration information includes one or more configuration features of at least some of the multiple video clips, and the configuration features include at least one of content features or arrangement features; and generating a target video based on the at least some video clips and the video configuration information.
Another aspect of the present application provides a method for generating video. The method is executed by a processing device including at least one memory and at least one processor, and the method includes: obtaining multiple video clips; obtaining video configuration information, where the video configuration information includes one or more configuration features of at least some of the multiple video clips, and the configuration features include at least one of content features or arrangement features; and generating a target video based on the at least some video clips and the video configuration information.
Another aspect of the present application provides a non-transitory computer-readable medium. The non-transitory computer-readable medium includes at least one set of instructions for determining an optimal strategy, wherein when executed by one or more processors of a computing device, the at least one set of instructions causes the computing device to execute a method, the method including: obtaining multiple video clips; obtaining video configuration information, where the video configuration information includes one or more configuration features of at least some of the multiple video clips, and the configuration features include at least one of content features or arrangement features; and generating a target video based on the at least some video clips and the video configuration information.
In some embodiments, the obtaining multiple video clips includes: obtaining at least one of an initial image or an initial video; and performing editing processing on the initial image or the initial video to obtain the multiple video clips.
In some embodiments, the performing editing processing on the initial image or the initial video to obtain the multiple video clips includes: obtaining the features of each pair of adjacent images or video frames in the initial image or the initial video; determining the similarity of each pair of adjacent images or video frames; identifying segment boundaries based on the similarity of each pair of adjacent images or video frames; and dividing the initial image or the initial video based on the segment boundaries to obtain the multiple video segments.
In some embodiments, each video segment of the multiple video segments is a shot segment.
In some embodiments, the performing editing processing on the initial image or the initial video to obtain the multiple video clips includes: determining subject information of the initial image or the initial video, the subject information including at least a subject and a subject position; and performing editing processing on the initial image or the initial video based on the subject information to obtain the multiple video clips.
In some embodiments, the determining the subject information in the initial image or the initial video includes: obtaining a subject information determination model; and determining the subject information by inputting the initial image or the initial video into the subject information determination model.
In some embodiments, the performing editing processing on the initial image or the initial video based on the subject information includes: recognizing the outer contour of the subject in the initial image or the initial video based on the subject information; and cropping or scaling the initial image or the initial video according to the outer contour of the subject.
在一些实施例中,所述视频配置信息包括第一预设条件和第二预设条件。In some embodiments, the video configuration information includes a first preset condition and a second preset condition.
在一些实施例中,所述基于所述至少部分视频片段和所述视频配置信息,生成一个目标视频包括:基于所述第一预设条件从所述多个视频片段中获取一个或多个候选视频片段。对所述一个或多个候选视频片段进行分组以确定至少一个片段集合。基于所述至少一个片段集合中的每个片段集合,生成一个目标视频。In some embodiments, the generating a target video based on the at least part of the video clips and the video configuration information includes: obtaining one or more candidates from the multiple video clips based on the first preset condition Video clips. The one or more candidate video segments are grouped to determine at least one segment set. Based on each segment set in the at least one segment set, a target video is generated.
在一些实施例中,所述第一预设条件与多个要素中的至少一个相关,所述多个要素包括目标视频包含特定对象,目标视频包含特定主题,目标视频的总时长,目标视频所包含的镜头画面数量,目标视频所包含特定的镜头画面,目标视频中特定主题的重叠数量,或目标视频中特定主题的聚焦时间。In some embodiments, the first preset condition is related to at least one of a plurality of elements, the plurality of elements including that the target video contains a specific object, the target video contains a specific subject, the total duration of the target video, and the target video The number of shots included, the specific shots contained in the target video, the number of overlaps of a specific subject in the target video, or the focusing time of a specific subject in the target video.
在一些实施例中,所述第一预设条件包括所述至少一个要素的值大于相应的阈值。In some embodiments, the first preset condition includes that the value of the at least one element is greater than a corresponding threshold.
在一些实施例中,所述第一预设条件还包括所述多个要素中两个或以上特定要素之间的要素约束条件。In some embodiments, the first preset condition further includes element constraint conditions between two or more specific elements in the plurality of elements.
在一些实施例中,所述第一预设条件包括目标视频中镜头画面的绑定条件,所述绑定条件反映至少两个特定镜头画面在目标视频中的关联关系,所述基于第一预设条件从所述多个视频片段中获取一个或多个候选视频片段包括:从所述多个视频片段中确定包含指定镜头画面的视频片段。基于所述绑定条件将包含指定镜头画面的视频片段组合,以作为一个候选视频片段。In some embodiments, the first preset condition includes a binding condition of a shot frame in the target video, and the binding condition reflects the association relationship of at least two specific shot frames in the target video, and the binding condition is based on the first preset condition. It is assumed that obtaining one or more candidate video clips from the plurality of video clips includes: determining a video clip containing a specified shot picture from the plurality of video clips. Based on the binding condition, the video clips including the specified shots are combined to serve as a candidate video clip.
在一些实施例中,所述至少一个片段集合包括两个及以上片段集合,所述两个及以上片段集合满足所述第二预设条件,所述第二预设条件与所述两个及以上片段集合之间的候选视频片段的组合差异度相关。In some embodiments, the at least one fragment set includes two or more fragment sets, and the two or more fragment sets satisfy the second preset condition, and the second preset condition is consistent with the two and The combination difference degree of the candidate video clips between the above clip sets is related.
在一些实施例中,所述对所述一个或多个候选视频片段进行分组以确定所述至少一个片段集合包括:确定所述两个及以上片段集合中的每个片段集合与其他片段集合之间的候选视频片段的组合差异度。将与其他片段集合的组合差异度高于预设阈值的片段集合作为所述至少一个片段集合。In some embodiments, the grouping the one or more candidate video clips to determine the at least one clip set includes: determining the difference between each of the two or more clip sets and other clip sets. The degree of difference in the combination of candidate video clips between. A segment set whose combination difference with other segment sets is higher than a preset threshold is used as the at least one segment set.
在一些实施例中,所述确定所述两个及以上片段集合中的每个片段集合与其他片段集合之间的候选视频片段的组合差异度包括:为所述一个或多个候选视频片段中的每一个赋予一个标识字符。基于所述一个或多个候选视频片段的标识字符,确定对应于所述片段集合与其他片段集合的字符串。将所述片段集合与其他片段集合对应的字符串的编辑距离确定为所述片段集合与其他片段集合之间的候选视频片段的组合差异度。In some embodiments, the determining the degree of difference in combination of candidate video segments between each of the two or more segment sets and other segment sets includes: Each of the is assigned an identifying character. Based on the identification characters of the one or more candidate video segments, a character string corresponding to the segment set and other segment sets is determined. The edit distance of the character strings corresponding to the segment set and the other segment sets is determined as the combination difference degree of the candidate video segments between the segment set and the other segment sets.
在一些实施例中,所述确定所述两个及以上片段集合中的每个片段集合与其他片段集合之间的候选视频片段的组合差异度包括:基于训练好的特征提取模型以及所述两个及以上片段集合中的候选视频片段,生成 每个候选视频片段对应的片段特征。基于所述片段特征生成每个片段集合对应的集合特征向量。基于训练好的判别模型以及所述每个片段集合对应的集合特征向量确定每个片段集合与其他片段集合之间的相似程度。基于所述相似程度确定每个片段集合与其他片段集合之间的组合差异度。In some embodiments, the determining the combination difference of candidate video segments between each of the two or more segment sets and other segment sets includes: based on a trained feature extraction model and the two Generate a segment feature corresponding to each candidate video segment from the candidate video segments in the set of more than one segment. Based on the segment features, a set feature vector corresponding to each segment set is generated. The degree of similarity between each segment set and other segment sets is determined based on the trained discriminant model and the set feature vector corresponding to each segment set. Based on the degree of similarity, the degree of combined difference between each segment set and other segment sets is determined.
在一些实施例中,所述基于训练好的判别模型以及所述每个片段集合对应的集合特征向量确定每个片段集合与其他片段集合之间的相似程度包括:基于聚类算法对所述集合特征向量进行聚类,获得多个集合聚类簇。In some embodiments, the determining the degree of similarity between each fragment set and other fragment sets based on the trained discriminant model and the set feature vector corresponding to each fragment set includes: comparing the set based on a clustering algorithm The feature vector is clustered, and multiple clusters are obtained.
In some embodiments, the feature extraction model is a sequence-based machine learning model, and the generating of the segment feature corresponding to each candidate video segment based on the trained feature extraction model and the candidate video segments in the two or more segment sets includes: obtaining the multiple video frames contained in each candidate video segment; determining one or more image features corresponding to each video frame; and processing, with the trained feature extraction model, the image features of the multiple video frames and the interrelationships between the image features of the multiple video frames to determine the segment feature corresponding to the candidate video segment.
In some embodiments, the image features corresponding to a video frame include at least one of shape information of an object in the video frame, positional relationship information between multiple objects in the video frame, color information of an object in the video frame, the degree of completeness of an object in the video frame, or the brightness of the video frame.
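One possible reading of the sequence-based feature extraction model above is a recurrent network that consumes per-frame image features; the following PyTorch sketch is illustrative only, and its dimensions and module names are assumptions rather than the application's implementation:

```python
import torch
import torch.nn as nn

class SegmentFeatureExtractor(nn.Module):
    """Sequence model turning per-frame image features into one segment feature."""

    def __init__(self, frame_feat_dim=256, segment_feat_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(frame_feat_dim, segment_feat_dim, batch_first=True)

    def forward(self, frame_feats):
        # frame_feats: (batch, num_frames, frame_feat_dim); the recurrence
        # models the interrelationship between features of successive frames.
        _, (h_n, _) = self.lstm(frame_feats)
        return h_n[-1]  # (batch, segment_feat_dim): one vector per segment

# Hypothetical usage: 8 candidate segments, 30 frames each, 256-dim
# per-frame features (shape, color, brightness, etc., already encoded).
frames = torch.randn(8, 30, 256)
segment_features = SegmentFeatureExtractor()(frames)
print(segment_features.shape)  # torch.Size([8, 128])
```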
In some embodiments, the generating of a target video based on the at least part of the video segments and the video configuration information includes: generating multiple candidate segment sets based on the multiple video segments, the multiple candidate segment sets satisfying a second preset condition; selecting at least one segment set from the multiple candidate segment sets based on a first preset condition; and generating a target video based on each segment set of the at least one segment set.
In some embodiments, the video configuration information further includes sequence information, and the generating of a target video based on each segment set of the at least one segment set includes: sorting and combining, based on the sequence information, the candidate video segments in each segment set to generate a target video.
In some embodiments, the video configuration information further includes beautification parameters, and the beautification parameters include at least one of filter parameters, animation parameters, or layout parameters.
In some embodiments, the method further includes: obtaining, based on the video configuration information, a text layer, a background layer, or a decoration layer, together with loading parameters; and determining, according to the loading parameters, the layout of the text layer, the background layer, and the decoration layer in the target video.
In some embodiments, the method further includes: normalizing the multiple video segments.
In some embodiments, the method further includes: obtaining initial audio; marking the initial audio based on its rhythm to obtain at least one audio segmentation point; determining at least one video segmentation point of the target video based at least in part on the video configuration information; matching the at least one audio segmentation point with the at least one video segmentation point; and synthesizing the segmented audio with the target video based on the matching result.
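A minimal sketch of the rhythm marking and matching described above, assuming the librosa library for beat tracking; the file name, the video segmentation points, and the nearest-point matching rule are hypothetical:

```python
import librosa
import numpy as np

# Hypothetical inputs: path to the initial audio and the video segmentation
# points (in seconds) derived from the video configuration information.
y, sr = librosa.load("initial_audio.wav")
_, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
audio_points = librosa.frames_to_time(beat_frames, sr=sr)  # rhythm marks

video_points = np.array([3.0, 6.5, 10.0])  # e.g. boundaries between clips

# Match each video segmentation point to the nearest rhythm mark; the audio
# would then be cut at these times and synthesized with the target video.
matched = [audio_points[np.argmin(np.abs(audio_points - t))] for t in video_points]
print(matched)
```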
In some embodiments, the method further includes: post-processing the target video to satisfy at least one video output condition, the at least one video output condition being related to the playback medium of the target video.
In some embodiments, the at least one video output condition includes a video size condition, and the post-processing of the target video includes: cropping the picture of the target video according to the video size condition.
In some embodiments, the cropping of the picture of the target video according to the video size condition includes: obtaining cropping subject information of each video segment contained in the target video, the cropping subject information reflecting the specific cropping subject of the video segment and the position information of the specific cropping subject; and cropping the pictures of each video segment contained in the target video according to the preset picture size corresponding to the video size condition and the cropping subject information.
In some embodiments, the cropping of the pictures of each video segment contained in the target video according to the preset picture size corresponding to the video size condition and the cropping subject information includes: for each video segment contained in the target video, determining the size and initial position of a crop box for at least one video frame of the video segment according to the cropping subject information and the preset picture size; processing the initial position of the crop box of the at least one video frame to determine the final position corresponding to the crop box of the at least one video frame; and cropping the picture of each video frame contained in the video segment according to the final position of the crop box, so as to retain the picture inside the crop box.
In some embodiments, the processing of the initial position of the crop box of the at least one video frame to determine the final position corresponding to the crop box of the at least one video frame includes: smoothing, over time, the initial coordinate information of the reference point of the crop box of the at least one video frame of the video segment; determining the final coordinate information of the reference point according to the result of the smoothing; and determining the position of the reference point based on the final coordinate information.
In some embodiments, the smoothing of the initial coordinate information of the reference point includes: performing linear regression on the initial coordinates of the reference point to obtain a linear regression equation and its slope.
In some embodiments, the determining of the final coordinate information of the reference point according to the result of the smoothing includes: comparing the absolute value of the slope with a slope threshold; in response to the absolute value of the slope being less than the slope threshold, taking the position of the midpoint of the trend line of the linear regression equation as the final position of the reference point of the crop box for every video frame; and in response to the absolute value of the slope being greater than or equal to the slope threshold, taking the position corresponding to each video frame's time point on the trend line of the linear regression equation as the final position of the reference point of the crop box of that video frame.
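The smoothing of the crop-box reference point could, for example, be sketched as follows; the slope threshold, frame rate, and noise model are hypothetical, and np.polyfit stands in for whatever linear-regression routine an implementation actually uses:

```python
import numpy as np

def smooth_reference_points(times, xs, slope_threshold=0.5):
    """Fit x = k*t + b to the reference-point coordinates of one clip and
    return the final x coordinate for each frame."""
    k, b = np.polyfit(times, xs, deg=1)  # least-squares linear regression
    if abs(k) < slope_threshold:
        # Subject is essentially static: pin every frame's crop box to the
        # midpoint of the trend line.
        mid = k * (times[0] + times[-1]) / 2 + b
        return np.full_like(xs, mid, dtype=float)
    # Subject moves: place each frame's crop box on the trend line at its
    # own time point, which removes per-frame jitter.
    return k * np.asarray(times) + b

times = np.arange(10) / 24.0                    # hypothetical frame timestamps
xs = 5 * times + np.random.normal(0, 0.2, 10)   # noisy detected positions
print(smooth_reference_points(times, xs))
```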
In some embodiments, the determining of the size and position of the crop box of at least one video frame in the video segment according to the cropping subject information and the preset picture size includes: determining, according to the theme information of the target video and the cropping subject information, the degree of relevance between one or more specific cropping subjects in the cropping subject information and the theme information; determining, according to the preset picture size and the specific cropping subjects, at least one candidate crop box corresponding to the at least one video frame; scoring the at least one candidate crop box according to the cropping subject information and the degree of relevance; and determining the size and position of the crop box of the at least one video frame based on the scoring results of the candidate crop boxes.
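As an illustrative sketch of scoring candidate crop boxes against theme relevance; the data layout, the overlap-based scoring rule, and all values are hypothetical:

```python
def score_crop_box(box, subjects):
    """Score one candidate crop box.

    box: (x0, y0, x1, y1); subjects: list of dicts with a bounding 'box'
    and a 'relevance' to the video theme (all hypothetical fields).
    """
    def overlap(a, b):
        w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
        h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
        return w * h

    score = 0.0
    for s in subjects:
        sb = s["box"]
        area = (sb[2] - sb[0]) * (sb[3] - sb[1])
        if area > 0:
            # Reward boxes that keep theme-relevant subjects fully inside.
            score += s["relevance"] * overlap(box, sb) / area
    return score

subjects = [{"box": (100, 50, 300, 400), "relevance": 0.9},
            {"box": (500, 60, 620, 300), "relevance": 0.2}]
candidates = [(0, 0, 540, 960), (200, 0, 740, 960)]
best = max(candidates, key=lambda b: score_crop_box(b, subjects))
print(best)  # the higher-scoring box fixes the crop size and position
```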
In some embodiments, the obtaining of the cropping subject information of each video segment contained in the target video includes: obtaining candidate cropping subjects in each video segment using a machine learning model; and selecting the one or more specific cropping subjects from the candidate cropping subjects according to the theme information of the target video.
In some embodiments, the method further includes: delivering the target video to a specific audience group.
In some embodiments, the specific audience group meets specific demographic conditions, and the generating of a target video based on the at least part of the video segments and the video configuration information includes: obtaining the audience acceptance of the multiple video segments; and, for the specific audience group, determining from the multiple video segments, according to the corresponding demographic conditions, candidate segments whose audience acceptance is higher than a threshold, for use in generating the target video.
In some embodiments, the method further includes: obtaining delivery effect feedback of the target video, and adjusting at least one of the demographic conditions or the audience acceptance according to the delivery effect feedback.
In some embodiments, the delivery effect feedback is related to at least one of the completion rate, the number of replays, or the number of viewers of the target video.
In some embodiments, the target video includes a creative advertisement, and the method further includes determining estimated effect data of the target video based on an element effect parameter of at least one advertisement element of the creative advertisement.
In some embodiments, the determining of the estimated effect data of the target video based on the element effect parameter of at least one advertisement element of the creative advertisement includes: obtaining an advertisement element effect prediction model; inputting the advertisement element, marked with at least one element tag, into the advertisement element effect prediction model to determine the element effect parameter of the advertisement element, the at least one element tag including the relationship between the advertisement element and the creative advertisement; determining, based on the element effect parameter of the at least one advertisement element, advertisement elements that meet expectations among the at least one advertisement element, the element effect parameter of an advertisement element that meets expectations being greater than a parameter threshold; determining the proportion of the advertisement elements that meet expectations among the at least one advertisement element of the creative advertisement; and determining the estimated effect data of the target video based on the proportion.
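The final two steps, computing the proportion of advertisement elements that meet expectations and mapping it to estimated effect data, might be sketched as follows; the threshold, the scores, and the use of the raw proportion as the estimate are hypothetical simplifications:

```python
def estimate_creative_effect(element_params, threshold=0.6):
    """element_params: predicted effect parameter per ad element (hypothetical
    output of the element-effect model). Returns the share of elements that
    meet expectations, used here as the creative's estimated effect data."""
    good = [p for p in element_params if p > threshold]
    return len(good) / len(element_params)

# Hypothetical scores for five tagged advertisement elements of one creative.
print(estimate_creative_effect([0.8, 0.7, 0.4, 0.9, 0.5]))  # 0.6
```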
Another aspect of this specification provides a system for generating a video. The system includes: an acquisition module configured to obtain multiple video segments; a configuration module configured to obtain video configuration information, the video configuration information including one or more configuration features of at least part of the multiple video segments, the configuration features including at least one of content features or arrangement features; and a generation module configured to generate a target video based on the at least part of the video segments and the video configuration information.
Another aspect of this specification provides a computer-readable storage medium storing computer instructions; after a computer reads the computer instructions in the storage medium, the computer executes the method described above.
Additional features will be set forth in part in the description that follows, and in part will become apparent to those skilled in the art upon examination of the following description and the accompanying drawings, or may be learned by the production or operation of the examples. The features of the present application may be realized and attained by practicing or using the various aspects of the methods, means, and combinations set forth in the detailed examples discussed below.
Description of the Drawings
This specification will be further described by way of exemplary embodiments, which are described in detail with reference to the accompanying drawings. These embodiments are not restrictive; in these embodiments, the same reference numerals denote the same structures, wherein:
Fig. 1 is a schematic diagram of a scene of a system for generating videos according to some embodiments of this specification;
Fig. 2 is an exemplary flowchart of generating a video according to some embodiments of the present application;
Fig. 3 is an exemplary flowchart of a method for determining video segments according to some embodiments of the present application;
Fig. 4 is an exemplary flowchart of a method for editing an initial image or initial video according to some embodiments of the present application;
Fig. 5 is an exemplary flowchart of a method for editing an initial image or initial video according to other embodiments of the present application;
Fig. 6 is an exemplary flowchart of a method for generating a video according to other embodiments of the present application;
Fig. 7 is an exemplary flowchart of a method for generating a video according to other embodiments of the present application;
Fig. 8 is an exemplary flowchart of a method for generating a target video according to some embodiments of the present application;
Fig. 9 is an exemplary flowchart of a method for determining segment sets according to some embodiments of the present application;
Fig. 10 is an exemplary flowchart of a method for determining the combination difference degree according to some embodiments of the present application;
Fig. 11 is an exemplary flowchart of another method for determining the combination difference degree according to some embodiments of the present application;
Fig. 12 is an exemplary flowchart of a method for generating a video according to other embodiments of the present application;
Fig. 13 is an exemplary flowchart of adding audio material according to some embodiments of the present application;
Fig. 14 is a diagram of an application scenario of a picture cropping method according to some embodiments of the present application;
Fig. 15 is a schematic diagram of a smoothing method according to some embodiments of the present application;
Fig. 16 is a flowchart of a method for determining the size and position of the crop box of each video frame according to some embodiments of the present application;
Fig. 17 is a flowchart of a method for generating a target video based on an audience according to some embodiments of the present application;
Fig. 18 is a flowchart of confirming the estimated effect data of a target video according to some embodiments of the present application;
Fig. 19 is an exemplary flowchart of a model training process according to some embodiments of the present application; and
Figs. 20A-20E are schematic diagrams of a video synthesis system according to some embodiments of the present application.
Detailed Description
To describe the technical solutions of the embodiments of this specification more clearly, the following briefly introduces the accompanying drawings used in the description of the embodiments. Obviously, the drawings in the following description are only some examples or embodiments of this specification; those of ordinary skill in the art can also apply this specification to other similar scenarios based on these drawings without creative effort. Unless obvious from the context or otherwise stated, the same reference numerals in the figures denote the same structure or operation.
It should be understood that "system", "device", "unit", and/or "module" as used herein is a way to distinguish different components, elements, parts, portions, or assemblies at different levels. However, if other words can achieve the same purpose, those words may be replaced by other expressions.
As used in this specification and the claims, unless the context clearly indicates otherwise, the words "a", "an", and/or "the" do not specifically refer to the singular and may also include the plural. In general, the terms "include" and "comprise" only indicate that the clearly identified steps and elements are included; these steps and elements do not constitute an exclusive list, and the method or device may also include other steps or elements.
Although this specification makes various references to certain modules or units in the systems according to its embodiments, any number of different modules or units may be used and run on a client and/or a server. The modules are merely illustrative, and different aspects of the systems and methods may use different modules.
Flowcharts are used in this specification to illustrate the operations performed by the systems according to its embodiments. It should be understood that the preceding or following operations are not necessarily performed exactly in order. Instead, the steps may be processed in reverse order or simultaneously. Other operations may also be added to these processes, or one or more operations may be removed from them.
Industries such as the Internet and content creation (e.g., media, advertising) need to generate large numbers of videos of various kinds in their daily work. Manually screening and cropping assorted materials and then splicing and rendering them with software is inefficient and labor-intensive. As multimedia materials multiply and a single video contains more and more video elements, the screening and processing become increasingly difficult, further reducing efficiency.
Some embodiments of the present application propose a multimedia system. The multimedia system can obtain multiple video segments and video configuration information. The video configuration information may be generated based on script information and/or a video template, and may be used to determine one or more configuration features of at least part of the multiple video segments. The configuration features include at least one of content features or arrangement features. The content features may include the video segment or the finally generated video (also called the target video) containing a specific subject (object), a specific theme, specific shot pictures, specific audio, and the like. The arrangement features may include the size of the video segment, the layout of the target object in the video segment, the duration of the video segment, the specific position of the video segment in the target video, the visual effect features of the video segment, and the like. The multimedia system may generate a target video based on the at least part of the video segments and the video configuration information. The multimedia system can perform the processing automatically and generate target videos, which is efficient and saves labor costs.
Fig. 1 is a schematic diagram of a scene of a multimedia system according to some embodiments of the present application.
The multimedia system 100 can be used for media, advertising, the Internet, and the like, and can quickly generate targeted videos for delivery. The multimedia system 100 may include a server 110, a network 120, a terminal device 130, a database 140, and other data sources 150.
The server 110 and the terminal device 130 may be connected through the network 120 or connected directly; the database 140 may be connected to the server 110 through the network 120, connected directly to the server 110, or located inside the server 110. The database 140 and the other data sources 150 may be connected to the network 120 to communicate with one or more components of the multimedia system 100. One or more components of the multimedia system 100 may access data or instructions stored in the terminal device 130, the database 140, and the other data sources 150 through the network 120.
The components of the multimedia system 100 may be integrated in the same device, in which case the above communication relationships may be realized through the device's internal bus. Alternatively, at least some components of the multimedia system 100 may be integrated in the same device, and the devices may be connected through their communication ports so that the components of the multimedia system 100 are communicatively connected, thereby realizing the above communication relationships.
The server 110 may be used to manage resources and to process data and/or information from at least one component of the system or from an external data source (e.g., a cloud data center). In some embodiments, the server 110 may be a single server or a server group. The server group may be centralized or distributed (e.g., the server 110 may be a distributed system), may be dedicated, or may provide services jointly with other devices or systems. In some embodiments, the server 110 may be regional or remote. In some embodiments, the server 110 may be implemented on a cloud platform or provided virtually. Merely by way of example, the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an internal cloud, a multi-tier cloud, or the like, or any combination thereof.
In some embodiments, the server 110 may include a processing device 112. The processing device 112 may process data and/or information obtained from other devices or system components, and may execute program instructions based on these data, information, and/or processing results to perform one or more of the functions described in this application. In some embodiments, the processing device 112 may include one or more sub-processing devices (e.g., single-core or multi-core processing devices). Merely by way of example, the processing device 112 may include a central processing unit (CPU), an application-specific integrated circuit (ASIC), an application-specific instruction-set processor (ASIP), a graphics processing unit (GPU), a physics processing unit (PPU), a digital signal processor (DSP), a field-programmable gate array (FPGA), a programmable logic device (PLD), a controller, a microcontroller unit, a reduced instruction set computer (RISC), a microprocessor, or the like, or any combination thereof.
The network 120 may connect the components of the system and/or connect the system with external resources. The network 120 enables communication between the components, and between the system and other parts outside it, facilitating the exchange of data and/or information. In some embodiments, the network 120 may be any one or more of a wired network or a wireless network. For example, the network 120 may include a cable network, a fiber-optic network, a telecommunications network, the Internet, a local area network (LAN), a wide area network (WAN), a wireless local area network (WLAN), a metropolitan area network (MAN), a public switched telephone network (PSTN), a Bluetooth network, a ZigBee network, near field communication (NFC), an in-device bus, in-device wiring, a cable connection, or the like, or any combination thereof. The network connection between the parts may use one of the above ways or several of them. In some embodiments, the network may have various topologies, such as point-to-point, shared, or centralized, or a combination of multiple topologies. In some embodiments, the network 120 may include one or more network access points. For example, the network 120 may include wired or wireless network access points, such as base stations and/or network switching points 120-1, 120-2, ..., through which one or more components of the system may connect to the network 120 to exchange data and/or information.
The terminal device 130 refers to one or more terminal devices or software used for data query and/or multimedia display. In some embodiments, the terminal device 130 may be used by one or more users, including users who use the service directly as well as other related users. In some embodiments, the terminal device 130 may be one of a mobile device 130-1, a tablet computer 130-2, a laptop computer 130-3, or another device with input and/or output functions, or any combination thereof.
In some embodiments, the terminal device 130 may also include a user terminal that can be used to input and/or obtain data or information. In some embodiments, a user may generate or obtain an original video or original image through the user terminal. For example, the user may use the camera of the user terminal to record footage or take photos and store them as the original video or original image, or may download the original video from video software through the user terminal. In some embodiments, the user may input constraints on the target video (e.g., video configuration information) through the user terminal. In some embodiments, the user may obtain or browse the synthesized target video through the user terminal.
The database 140 may be used to store data and/or instructions. The database 140 may be implemented in a single central server, in multiple servers connected by communication links, or in multiple personal devices. In some embodiments, the database 140 may include mass storage, removable storage, volatile read-write memory (e.g., random access memory, RAM), read-only memory (ROM), or the like, or any combination thereof. Exemplarily, the mass storage may include magnetic disks, optical disks, solid-state drives, and the like. In some embodiments, the database 140 may be implemented on a cloud platform.
The other data sources 150 may be one or more sources used to provide other information for the system. The other data sources 150 may be one or more devices, such as camera devices that directly capture initial images or initial videos; one or more application program interfaces; one or more database query interfaces; one or more protocol-based information acquisition interfaces; other ways of obtaining information; or a combination of two or more of the above. The information provided by an information source may already exist when the information is extracted, may be generated on the fly when the information is extracted, or may be a combination of the above. In some embodiments, the other data sources 150 may be used to provide the system with multimedia information such as pictures, videos, and music.
Fig. 2 is an exemplary flowchart of generating a video according to some embodiments of the present application. In some embodiments, one or more steps of the process 200 may be stored as instructions in a storage device (e.g., the database 140) and invoked and/or executed by a processing device (e.g., the processing device 112 in the server 110). As shown in Fig. 2, the process 200 may include:
Step 210: obtain multiple video segments.
A video segment may refer to a piece of video composed of video frames. Each video segment may be a subsequence of the image sequence constituting a video. A video segment may be a short video of, for example, 3, 4, or 5 seconds. A video frame can be understood as one of the frame images obtained by decomposing a continuous image stream at fixed time intervals. For example, the time interval between frames may illustratively be set to 1/24 s (i.e., 24 frames are obtained per second).
In some embodiments, a video segment may be or include one or more shot segments. A shot segment may refer to a continuous sequence of video frames between two editing points. For example, a shot segment may be the sum of the footage captured without interruption from the moment a camera starts until it stops. Illustratively, if the first picture in a video file is a seaside, the picture then switches to a girl drinking yogurt, and then switches to the girl surfing at sea, then the girl drinking yogurt is one shot segment, the preceding seaside footage is another shot segment, and the following footage of the girl surfing is yet another shot segment. For clearer understanding, the embodiments of this specification are described by taking a video segment containing one shot picture as an example.
In some embodiments, the database 140 and/or the other data sources 150 may store the multiple video segments, and step 210 may be implemented by obtaining the multiple video segments directly from the database 140 and/or the other data sources 150.
In some embodiments, the database 140 and/or the other data sources 150 may store unprocessed initial data. The initial data may include initial images (also called images to be processed) and/or initial videos (also called videos to be processed). Step 210 may obtain the multiple video segments by processing the initial data (e.g., determining editing points and editing). Illustratively, video segments may be generated based on the shot segments contained in a video file (e.g., an initial video and/or initial image); for example, if a video file contains five shot segments, five video segments may be generated. In some embodiments, the video file may be split manually or by machine to generate multiple video segments. For example, a user may edit manually based on the number of shot pictures in the video file, or a trained machine learning model may split the video file into multiple video segments according to preset conditions (e.g., the number of shot segments, duration, etc.). In some alternative embodiments, the processing device 112 may also obtain multiple video segments intercepted from the video file based on a time window; this specification does not limit the means of splitting video segments. For methods of determining the multiple video segments, reference may be made to Figs. 3-7 and their related descriptions.
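By way of illustration, machine splitting at shot boundaries could be sketched with OpenCV as below; the histogram-correlation criterion and threshold are assumptions for this example, not the application's prescribed method:

```python
import cv2

def split_at_shot_boundaries(path, threshold=0.6):
    """Return frame indices where the shot appears to change, based on a
    drop in histogram correlation between consecutive frames."""
    cap = cv2.VideoCapture(path)
    cuts, prev_hist, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [50, 60], [0, 180, 0, 256])
        cv2.normalize(hist, hist)
        if prev_hist is not None:
            if cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL) < threshold:
                cuts.append(idx)  # low correlation -> likely an editing point
        prev_hist, idx = hist, idx + 1
    cap.release()
    return cuts

print(split_at_shot_boundaries("initial_video.mp4"))  # hypothetical file
```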
Step 220: obtain video configuration information.
The video configuration information refers to information related to the configuration of the finally generated video (also called the target video) and of the video segments constituting the target video. The video configuration information may reflect requirements on the content or form of the target video and of the video segments constituting it. The video segments constituting the target video may be at least part of the multiple video segments. In some embodiments, the video configuration information may include one or more configuration features of each video segment constituting the target video (i.e., the at least part of the video segments).
In some embodiments, the configuration features may include content features and/or arrangement features. The content features are related to the content of the at least part of the video segments. For example, the content features may include the video segment or the target video containing a specific subject (object), a specific theme, specific shot pictures, specific audio, and the like. The arrangement features are related to the presentation form of the at least part of the video segments. For example, the arrangement features may include the size of the video segment, the duration of the video segment, the specific position of the video segment in the target video, the visual effect features of the video segment, and the like.
The specific subject may also be called a specific object. For example, the specific subject may be a specific object contained in each video segment, or a specific object, among multiple objects, that is related to the specific theme of the target video. Exemplarily, the specific subject may be one or more of a product (electronics, daily necessities, decorations, etc.), a living thing (a person, an animal, etc.), a sign (e.g., a trademark, a regional mark, etc.), or a landscape (a mountain, a house, etc.).
The specific theme may be the main content of the video segment. The theme may be determined from keyword information in the title or synopsis of the video segment, tag information, user-defined information, or information already stored in a database. The specific theme may consist of specific content and a video type; for example, in a perfume advertisement, perfume is the specific content and advertisement is the video type. The specific content may also include specific activities (e.g., a live stream, a gala, etc.) and specific dates (e.g., Valentine's Day, Children's Day, Double Eleven, etc.). In some embodiments, the specific theme may be user-defined or selected by the user from a list. For example, the user directly enters the specific theme of the target video as "advertisement for car interiors". As another example, the user may first select the target video to be an advertising video and then select "toiletries" -> "shampoo" from a product category list. In some embodiments, the specific theme may also be identified and determined by a model. For example, when a user uploads videos (including video segments, initial videos, initial images, and other video materials used to generate the target video) without specifying or selecting a specific theme, all objects in the image of each video frame may be obtained based on image recognition technologies such as object detection, and the object that appears most frequently or carries the greatest weight is taken as the default specific theme of the advertising video. For example, if the image of a car tire occupies the largest area in the pictures of the video frames, the default theme of the target video may be set to "car" or "car tire".
The specific shot picture refers to a video frame or sequence of video frames containing a specific picture. For example, specific shot pictures may include a child drinking milk, a model applying cosmetics, people playing volleyball on a beach, or someone browsing a mobile shopping page during the Double Eleven shopping festival. In some embodiments, the specific shot picture may be related to a specific subject or a specific theme. For example, when the specific theme is tooth protection, the promotional video may include a shot of the specific action "brushing teeth". As another example, when the specific subject is perfume, the advertising video may include a shot of the specific action "spraying perfume".
The specific audio may contain a specific sound. For example, the specific audio may contain dialogue, monologue, theme music, background music, or other specific sounds (e.g., wind, rain, birdsong, braking, impacts, etc.). The theme music and/or background music may be of different types, such as soothing, brisk, focused, or aggressive.
The video segment size may be the width and height of the video frames in the video segment. For example, the size of the video frames in a video segment may be 512×384, 1024×768, 1024×1024, 1920×1080, 2560×2560, etc. In some embodiments, the video segment size may also be the aspect ratio of the video frames in the video segment, for example, 9:16, 1:1, 5:4, 4:3, or 16:9.
The duration of a video segment refers to the length of time required to play the video segment, for example, 3 seconds, 5 seconds, 30 seconds, or 2 minutes.
The specific position of a video segment in the target video may refer to the specific range the video segment occupies in the target video. In some embodiments, the specific range may be expressed in time; for example, a video segment may be located between 2 min 15 s and 2 min 30 s of the target video. In some embodiments, the specific range may also be expressed as a range of video frames; for example, a video segment may occupy frames 1 to 50 of the target video. In some embodiments, the specific range may also be expressed relative to other video segments; for example, a video segment may be located between the third and fifth video segments of the target video.
The visual effect features of a video segment may be used to describe operations performed on the video segment that relate to its visual effect. The operations may include beautification, normalization, template decoration, and the like. Beautification may include filters, animations, and other operations that enhance the video's appearance.
In some embodiments, the video configuration information may be determined by the user, according to system default settings, or the like. In some embodiments, the database 140 and/or the other data sources 150 may store the video configuration information, and step 220 may be implemented by obtaining the video configuration information directly from the database 140 and/or the other data sources 150; for example, after the specific theme/subject of the target video is determined, the corresponding video configuration information is obtained from the database 140 and/or the other data sources 150 according to that specific theme/subject.
In some embodiments, the video configuration information may be determined based on script information and/or a video template. For example, the multimedia system 100 determines the video configuration information by parsing the script information and/or the video template. In some embodiments, the script information and/or video template may be pre-stored in the database 140 and/or the other data sources 150, and the corresponding script information and/or video template may be automatically obtained or determined from the database 140 and/or the other data sources 150 after the user inputs information related to the target video. For example, after the user inputs a picture of a specific perfume, the system 100 may automatically recognize the specific theme "perfume advertisement" of the target video and call the script information and/or video template related to perfume advertisements from the database 140 and/or the other data sources 150.
The script information refers to information that determines, for each video segment and/or video frame in a video (e.g., the target video), the picture content (e.g., a specific subject, a specific background, a specific action, etc., or a combination thereof), the picture duration, the shot scale (e.g., panorama, close shot, medium shot, close-up, etc.), shot transitions, appearance times, and the like. Script information can define the content and/or arrangement of video segments. For example, for a target video of the advertisement type, the script information may specify that the target video includes the following video segments: (1) an audience-interaction segment, (2) a usage-experience segment, (3) a product-selling-point segment, (4) a usage-method segment, (5) an efficacy segment, and (6) a call-to-action segment. By parsing the script information, the requirements on each element (e.g., the target video containing a specific object, the target video containing a specific theme, the total duration of the target video, the number of shot pictures contained in the target video, the specific shot pictures contained in the target video, the number of overlaps of a specific theme in the target video, or the focus time of a specific theme in the target video) can be determined.
In some embodiments, the script information may also include the order in which the shots are arranged. For example, in the script information of an advertisement-type target video, the included video segments may be arranged in a preset order, such as the order (1) to (6) above. The target video may be generated based on this order.
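As an illustrative sketch only, script information of this kind could be represented as the following data structure; the field names, slot contents, and durations are hypothetical:

```python
# Hypothetical script information for an advertisement-type target video:
# each entry fixes a slot's content requirement, duration bound, and order.
script_info = [
    {"order": 1, "content": "audience interaction",  "max_seconds": 5},
    {"order": 2, "content": "usage experience",      "max_seconds": 8},
    {"order": 3, "content": "product selling point", "max_seconds": 8},
    {"order": 4, "content": "usage method",          "max_seconds": 10},
    {"order": 5, "content": "efficacy",              "max_seconds": 8},
    {"order": 6, "content": "call to action",        "max_seconds": 5},
]

# Parsing the script yields configuration features: required shot count,
# a total duration bound, and the sequence in which segments are assembled.
total = sum(slot["max_seconds"] for slot in script_info)
print(len(script_info), "slots, at most", total, "seconds")
```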
The script information may be of different types. In some embodiments, the script information may be general-purpose script information, applicable to different subjects (e.g., products) or themes. For example, general-purpose script information may in turn comprise audience interaction, usage experience, product selling points, usage method, efficacy, and a call to action; or a specific theme, product selling points, usage method, usage experience, efficacy, and product cost performance; or applicable scenarios/audiences, product selling points, efficacy, product cost performance, and a call to action, among other sequences of video segments or shots.
In some embodiments, the script information may be related to a subject (e.g., a product) or theme. For example, for beauty products, the corresponding script information may in turn include a specific theme, efficacy, appearance design, product ingredients, and a call to action; or audience interaction, product/brand recommendation, applicable scenarios/audiences, efficacy 1, efficacy 2, product ingredients, and audience interaction; or target-audience pain point 1, target-audience pain point 2, product/brand recommendation, product ingredients, efficacy, a specific theme, usage experience, and a specific theme, among other sequences of video segments or shots. As another example, for food, the corresponding script information may in turn include eating experience, food attributes, cooking method/process, product ingredients, and eating experience; or product recommendation, cooking method/process, eating experience, brand promotion, and product utility; or brand promotion, product recommendation, packaging design, cooking method/process, eating experience, and a call to action. As another example, for mother-and-baby products, the corresponding script information may in turn include a specific theme, appearance introduction, efficacy, product attribute 1, production process, product attribute 2, a call to action, and appearance introduction; or product cost performance, product/brand recommendation, applicable scenarios/audiences, product texture, and product/brand recommendation; or audience interaction, product attributes, efficacy, and product/brand recommendation, among other sequences of video segments or shots.
The video template may refer to the formal packaging of a video. The video template may include an opening, an ending, a watermark, subtitles, a title, borders, filters, and the like. The opening/ending refers to audio-visual material added at the beginning and end of the video, respectively, to create atmosphere, build momentum, attract attention, and present the work's name, the creator, product information, and the like. A watermark refers to a pattern attached to the video that reflects company or product information or a personalized design. Subtitles refer to non-image content, such as dialogue or product/theme introductions, displayed in the video as text. A title is a short statement indicating the content of the video. A border refers to one or more patterns of a specific shape (e.g., strips) surrounding the video page. A filter refers to an operation used to achieve a special display effect on the image.
In some embodiments, the video template may be a template asset in Adobe After Effects (AE), software commonly used in the field of video production, which will not be elaborated here.
In some embodiments, the video template may be related to a subject (e.g., a product, a model, etc.), a theme (e.g., public welfare, entertainment, education, lifestyle, shopping, etc.), a video effect (e.g., a promotional/advertising effect), and the like. In some embodiments, the database 140 and/or the other data sources 150 may be preset with multiple video templates corresponding to different subjects, themes, video effects, etc.; after the specific theme, specific subject, and video effect of the target video are determined, the corresponding video template is called from the database 140 and/or the other data sources 150. In some embodiments, the user determines the video template by presetting its opening, ending, watermark, subtitles, title, borders, and/or filters based on different subjects, themes, and/or video effects.
In some embodiments, the video configuration information may include at least one preset condition (e.g., a first preset condition, a second preset condition, etc.) related to the content of the video segments contained in the target video. The first preset condition may be related to at least one of multiple elements of the content features among the configuration features. The multiple elements may include the target video containing a specific object, the target video containing a specific theme, the total duration of the target video, the number of shot pictures contained in the target video, the specific shot pictures contained in the target video, the number of overlaps of a specific theme in the target video, or the focus time of a specific theme in the target video, etc. The first preset condition may screen the multiple video segments by constraining at least one of the multiple elements. For example, the first preset condition may require that the last shot of the target video be a product display shot, so that a video segment containing a product display shot is selected from the multiple video segments as the last video segment of the target video. For details of the first preset condition, reference may be made to the process 800 and its related description.
The second preset condition may be related to a difference feature of the content features among the configuration features. For example, the second preset condition may be related to the degree of difference between segment sets. A segment set refers to one of one or more sets formed by grouping video segments that satisfy particular conditions (e.g., video segments satisfying the first preset condition, also called candidate video segments). The second preset condition may screen the segment sets by constraining the degree of difference between them, and the target video may be generated based on the screened segment sets. For details of the second preset condition, reference may be made to the process 900 and its related description.
In some embodiments, the video configuration information includes sequence information. The sequence information may be related to the arrangement features among the configuration features; that is, the sequence information may determine the arrangement of the video segments in the target video. For example, when generating the target video, the candidate video segments in a segment set may be sorted based on the sequence information to generate the target video. For details of the sequence information, reference may be made to the process 800 and its related description.
In some embodiments, the video configuration information includes beautification parameters. The beautification parameters may be related to the arrangement features among the configuration features; the target video is beautified through the beautification parameters to obtain a better visual effect. For example, the beautification parameters may include at least one of filter parameters, animation parameters, or layout parameters. The beautification parameters may be used to beautify at least part of the multiple video segments. In some embodiments, the beautification parameters may also be applied directly to the target video, the initial images, the initial videos, and the like.
Step 230: generate a target video based on at least some of the video segments and the video configuration information.
The multimedia system 100 may determine at least one segment set from the multiple video segments based on the video configuration information to generate the target video. For example, the multimedia system 100 may obtain one or more candidate video segments from the multiple video segments based on the first preset condition, group the one or more candidate video segments to determine at least one segment set satisfying the second preset condition, and, for each of the at least one segment set, generate a target video according to the corresponding sequence information. As another example, the multimedia system 100 may generate multiple candidate segment sets based on the multiple video segments, the multiple candidate segment sets satisfying the second preset condition; screen out at least one segment set from the multiple candidate segment sets based on the first preset condition; and, for each of the at least one target segment set, generate a target video according to the corresponding sequence information. The type of the target video may include, but is not limited to, an advertising video, a promotional video, a video web log (vlog), a short entertainment video, or the like. For details of generating the target video based on the first preset condition and the second preset condition, refer to process 800 or 1200 and their related descriptions.
The multimedia system 100 may beautify at least some of the multiple video segments or the target video based on the beautification parameters to obtain a better visual effect. In some embodiments, the multimedia system 100 may also perform conventional video processing, such as cropping, scaling, and template-based editing, on at least some of the multiple video segments or the target video.
In some embodiments, the multimedia system 100 may further post-process the target video to satisfy at least one video output condition. The at least one video output condition is related to the playback medium of the target video. For example, the at least one video output condition may be a video size condition, which may be determined based on the size of the video playback medium. Post-processing the target video may include cropping the frames of the target video according to the video size condition. For details of the post-processing of the target video, refer to FIGS. 14-16 and their related descriptions.
FIG. 3 is an exemplary flowchart of a method for determining video segments from an initial image or an initial video according to some embodiments of the present application. In some embodiments, one or more steps of process 300 may be stored as instructions in a storage device (e.g., the database 140) and invoked and/or executed by a processing device (e.g., the processing device 112 in the server 110). As shown in FIG. 3, process 300 may include:
Step 310: obtain at least one of an initial image or an initial video.
The initial video may refer to a moving picture composed of a series of video frames. For example, the initial video may include video files in various formats such as MPEG, AVI, ASF, MOV, 3GP, WMV, DivX, XviD, RM, RMVB, and FLV/F4V. In some embodiments, the initial video may also contain an audio file (audio track) corresponding to the moving picture. In some embodiments, the initial video may include promotional videos, personally recorded videos, audiovisual footage, online videos, advertisement clips, product demos, movies, and short films or movies containing relevant products or models.
The initial image may refer to a still picture. For example, the initial image may include picture files in various formats such as bmp, jpg, png, tif, gif, pcx, tga, exif, fpx, svg, psd, cdr, pcd, dxf, ai, and raw. In some embodiments, the initial image may include photos taken by a camera, advertising images, product renderings, posters, and the like.
In some embodiments, the initial image and the initial video may be obtained by a camera or a video/image processing device and stored in the database 140 and/or other data sources 150; specifically, step 310 may be implemented by retrieving the corresponding initial image and/or initial video from the database 140 and/or other data sources 150.
In some embodiments, the initial video and the initial image may be public network material resources, such as image and video resources in various open databases, and step 310 may be implemented by obtaining public materials. In some embodiments, the multimedia system 100 may also obtain the initial video and the initial image in other direct or indirect ways. For example, the multimedia system 100 may directly obtain a video file or an image file uploaded by a user, or obtain a video file or an image file based on a link input by the user.
Step 320: edit the initial image or the initial video to obtain multiple video segments.
In some embodiments, step 320 may be implemented by determining segment boundaries of the initial video or the initial image, and segmenting or grouping the initial video or the initial image based on the segment boundaries to determine multiple shot segments, which are then used as the multiple video segments. For example, if the initial video is a cooking video, the cooking steps, the plating steps, and the tasting steps in the initial video may be split into different shot segments, and each shot segment may be used as one of the multiple video segments. For details of determining video segments based on the initial video, refer to FIG. 4 of the present application and its related description.
In some embodiments, to improve the accuracy of the configuration feature description of the video segments, each video segment may be one shot segment. The initial video or the initial image often contains multiple shot segments; therefore, the initial image or the initial video needs to be split. It can be understood that a video segment may also contain multiple shot segments. In some embodiments, for a video segment containing multiple shot segments, the initial video or the initial image may be split directly according to the number of shot segments. In some embodiments, a video segment may first be split into multiple video segments each containing only one shot segment, and then multiple video segments may be spliced into one video segment according to constraint conditions of the video segments (e.g., binding conditions).
In some embodiments, an initial video may include only one shot segment; the entire initial video is then treated as one shot segment and output as one video segment. In some embodiments, an initial video may be formed by splicing multiple shot segments, and the one or more consecutive video frames at the junction of two adjacent shot segments may be called a segment boundary (also called shot boundary frames). In some embodiments, the initial video may be divided in units of shot segments to obtain the video segments. When dividing the initial video into multiple shot segments, the division may be performed at the segment boundaries; that is, the segment boundaries are used as cutting points to split the initial video into multiple video segments.
FIG. 4 is an exemplary flowchart of a method for editing an initial image or an initial video according to some embodiments of the present application, which specifically involves splitting the initial image or the initial video. In some embodiments, one or more steps of process 400 may be stored as instructions in a storage device (e.g., the database 140) and invoked and/or executed by a processing device (e.g., the processing device 112 in the server 110). As shown in FIG. 4, process 400 may specifically include the following steps:
Step 410: obtain the features of each pair of adjacent images or video frames in the initial image or the initial video.
In some embodiments, an image embedding model may be used to obtain the feature information of each pair of adjacent images or video frames (i.e., pairs among the multiple video frames) in the initial image or the initial video (e.g., an advertising video). In some embodiments, the processing method for the initial image is similar to that for the initial video; for ease of understanding, the following description takes the initial video as an example, which does not limit the scope of the present application. The multimedia system 100 may input the initial video into the image embedding model. The image embedding model may extract the image of each video frame that constitutes the initial video, extract the features of the image of each video frame, and generate a vector corresponding to the image of each video frame. In some embodiments, images of video frames that have already been extracted may be input into the image embedding model, and the image embedding model may output a vector for the image of each video frame accordingly.
In other embodiments, the feature information of the video frames may be obtained based on a mobilenet model (e.g., a mobilenetV2 model) pre-trained on the imagenet picture library. The mobilenetV2 model can extract the image features of each video frame accurately and quickly. For example, each video frame may be input into the mobilenetV2 model, which may output a normalized 1280-dimensional vector for each video frame. In some embodiments, other machine learning models with similar functions, such as the GoogLeNet model, the VGG model, and the ResNet model, may also be used to obtain the feature information of the video frames, which is not limited in the present application. By using a machine learning model to extract the features of the video frames, the shot boundary frames can be determined more accurately, so that the shot segments are divided accurately, which makes subsequent frame cropping of the initial image or the initial video easier to operate and avoids cropping out the main information of the initial image or the initial video.
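By way of illustration only, the following Python sketch shows per-frame feature extraction with an ImageNet-pretrained MobileNetV2 as described above. The use of OpenCV for frame decoding, torchvision for the model, and all helper names are assumptions of this sketch, not requirements of the embodiments.

```python
# Illustrative sketch: one normalized 1280-dim feature vector per video frame,
# using an ImageNet-pretrained MobileNetV2 (torchvision) and OpenCV decoding.
import cv2
import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms as T

model = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.IMAGENET1K_V1)
model.eval()

preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def frame_features(video_path: str) -> np.ndarray:
    """Return a (num_frames, 1280) array of L2-normalized frame features."""
    cap = cv2.VideoCapture(video_path)
    feats = []
    with torch.no_grad():
        while True:
            ok, bgr = cap.read()
            if not ok:
                break
            rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)
            x = preprocess(rgb).unsqueeze(0)      # (1, 3, 224, 224)
            f = model.features(x)                 # (1, 1280, 7, 7)
            f = f.mean(dim=(2, 3)).squeeze(0)     # global average pool -> (1280,)
            feats.append((f / f.norm()).numpy())  # normalize: dot product = cosine
    cap.release()
    return np.stack(feats)
```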
Step 420: determine the similarity of each pair of adjacent images or video frames.
In some embodiments, this may be implemented by calculating, according to the feature information of the video frames, the similarity between each video frame and a video frame preselected from the multiple video frames.
In some embodiments, the inner product of the feature vectors of two video frames may be used as the similarity between the two video frames. In some embodiments, calculating the similarity between adjacent images (e.g., between a video frame and a video frame preselected from the multiple video frames) may be calculating the similarity between each video frame and its directly preceding and/or succeeding video frame, or calculating the similarity between each video frame and the video frame a preset number of interval frames before and/or after it.
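Assuming the normalized per-frame features from the sketch above, the similarity of this step reduces to a single dot product; the helper name below is illustrative:

```python
def pair_similarity(feats, i, j):
    """Similarity of frames i and j as the inner product of their
    (already L2-normalized) feature vectors, i.e. cosine similarity."""
    return float(feats[i] @ feats[j])
```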
Step 430: identify segment boundaries based on the similarity of each pair of adjacent images or video frames.
Exemplarily, the segment boundaries may include hard-cut boundary frames or soft-cut boundary frames.
In some embodiments, identifying segment boundaries may include determining the hard-cut boundary frames of shot segments. If no transition effect is used between two adjacent shot segments, the adjacent video frames of the two shot segments jump directly, and these two adjacent video frames can be understood as hard-cut boundary frames. In the process of determining hard-cut boundary frames, the similarity between each video frame and its preceding and/or succeeding adjacent video frame may be calculated; if the similarity between two adjacent video frames is lower than a similarity threshold, the two adjacent video frames are determined to be hard-cut boundary frames.
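A minimal reading of hard-cut detection under the features above might look as follows, reusing pair_similarity from the sketch; the threshold value is an assumption for illustration:

```python
def hard_cut_boundaries(feats, threshold=0.8):
    """Indices i such that frames i and i+1 form a hard-cut boundary pair
    (similarity of directly adjacent frames below the threshold)."""
    return [i for i in range(len(feats) - 1)
            if pair_similarity(feats, i, i + 1) < threshold]
```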
In some embodiments, identifying segment boundaries may include determining the soft-cut boundary frames of shot segments. If a transition effect is used between two adjacent shot segments, the adjacent video frames of the two shot segments do not jump directly, and the several video frames used for the transition between the two shot segments can be understood as soft-cut boundary frames. The soft-cut boundary frames may be determined by the following method:
First, candidate segmentation regions may be determined by calculating the similarity between each video frame and the video frame a preset number of interval frames before and/or after it. In the process of determining candidate segmentation regions, the preset number of interval frames may be set to 2, 3, 5, or the like. If the similarity between two video frames is calculated to be less than a preset threshold, the video frames between these two video frames are taken as a candidate segmentation region, and the two video frames serve as the boundary frames of the candidate segmentation region. For example, if the preset number of interval frames is 2, the similarity between frame 10 and frame 14 may be calculated; if the similarity is less than the similarity threshold, frames 12 and 13 are taken as a candidate segmentation region, and frames 10 and 14 are the boundary frames of the candidate segmentation region. Then, the candidate segmentation regions may be further fused, that is, overlapping candidate segmentation regions are merged together. For example, if frames 12 and 13 form a candidate segmentation region and frames 13 and 14 also form a candidate segmentation region, frames 12, 13, and 14 are merged into one candidate segmentation region.
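The candidate-region step can be sketched as follows, continuing to reuse pair_similarity. The exact indexing of the region between the compared pair is one illustrative reading of the example above (whose frame counting admits more than one interpretation), and the gap and threshold values are assumptions:

```python
def candidate_regions(feats, gap=2, threshold=0.8):
    """Candidate soft-cut regions: for each frame pair (i, i+gap) whose
    similarity is below the threshold, the frames strictly between them form
    a candidate region; overlapping or touching regions are then merged
    (mirroring the frame 12/13/14 merging example above)."""
    regions = []
    for i in range(len(feats) - gap):
        if pair_similarity(feats, i, i + gap) < threshold:
            start, end = i + 1, i + gap - 1
            if regions and start <= regions[-1][1] + 1:
                regions[-1][1] = max(regions[-1][1], end)
            else:
                regions.append([start, end])
    return [tuple(r) for r in regions]
```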
Since the preceding steps may pick up some video frames that belong to the same shot segment but whose pictures change drastically, the candidate segmentation regions may be further screened afterwards. In the process of screening the candidate segmentation regions, the candidate segmentation regions may be screened based on the similarity S1 within the candidate segmentation region and the similarity S2 outside the candidate segmentation region.
The method of calculating the similarity S1 within a segmentation region may be: calculating the similarity between a boundary frame of the candidate segmentation region and a video frame that is located within the candidate segmentation region and separated from that boundary frame by the preset number of interval frames, so as to obtain the similarity S1 within the candidate segmentation region. For example, if the candidate segmentation region consists of frames 12 and 13 and the preset number of interval frames is 2, the similarity between frame 11 and frame 13 and the similarity between frame 12 and frame 14 are calculated, and the minimum of the two similarities is taken as the similarity S1 within the segmentation region.
The method of calculating the similarity S2 outside a candidate segmentation region may be: calculating the similarity between a video frame at the front of the candidate segmentation region and the video frame the preset number of interval frames before it, and calculating the similarity between a video frame at the rear of the candidate segmentation region and the video frame the preset number of interval frames after it, so as to obtain the similarity S2 outside the candidate segmentation region. For example, if the candidate segmentation region consists of frames 12 and 13 and the preset number of interval frames is 2, the similarity between frame 10 and frame 12 and the similarity between frame 13 and frame 15 are calculated, and the minimum of the two similarities is taken as the similarity S2 outside the segmentation region. If S2 exceeds S1 by more than a threshold, the candidate segmentation region is confirmed as a final segmentation region, and the shot segments are divided based on the final segmentation region.
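Continuing the sketch, the S1/S2 screening of a candidate region might be implemented as follows, with the frame pairs chosen to mirror the 11/13 and 12/14 (S1) and 10/12 and 13/15 (S2) example; the margin threshold and the clamping at the clip edges are assumptions:

```python
def confirm_region(feats, region, gap=2, margin=0.15):
    """Keep a candidate region (start, end) only if the similarity outside it
    (S2) exceeds the similarity spanning it (S1) by more than the margin."""
    start, end = region
    n = len(feats)
    s1 = min(pair_similarity(feats, max(start - 1, 0), min(start + gap - 1, n - 1)),
             pair_similarity(feats, max(end - gap + 1, 0), min(end + 1, n - 1)))
    s2 = min(pair_similarity(feats, max(start - gap, 0), start),
             pair_similarity(feats, end, min(end + gap, n - 1)))
    return (s2 - s1) > margin
```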
Step 440: divide the initial image or the initial video based on the segment boundaries to obtain multiple video segments.
The initial video may be split by using the segment boundaries as cutting points, so that each split video segment contains one shot segment.
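Given boundary indices (e.g., from the hard-cut sketch above, where index i marks the last frame before a cut), the split itself is a simple partition; this helper is illustrative:

```python
def split_at_boundaries(num_frames, boundaries):
    """Turn cut indices (last frame of each shot) into (start, end) frame
    ranges, one range per resulting video segment."""
    shots, start = [], 0
    for b in sorted(boundaries):
        shots.append((start, b))
        start = b + 1
    if start < num_frames:
        shots.append((start, num_frames - 1))
    return shots
```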
It should be noted that the initial image may also be processed based on the method shown in FIG. 4, so as to obtain images determined to be non-repetitive in the initial image, which serve as corresponding video segments directly or after other editing processing.
In some embodiments, to facilitate processing video files based on a specific object, the present application may also determine, according to the specific theme of the target video, a specific subject (also called an object) related to the specific theme. The specific subject here can be understood as an object or main thing contained in a video frame/shot segment that is related to the specific theme (which may also be expressed as the target video theme, the cropping theme, etc.), and may include living beings (people, animals, etc.), commodities (cars, daily necessities, decorations, cosmetics, etc.), backgrounds (mountains, roads, bridges, houses, etc.), and the like. For example, when the specific theme of the target video is advertising, the specific object may include one or a combination of people, items, logos, etc. Specifically, the person may be an event/product spokesperson, the item may be the corresponding product, and the logo may be a product trademark, a regional mark, or the like.
It can be understood that in different processes of the present application, the specific object may be expressed under different names. For example, in the aforementioned process 200, the specific object may be expressed as the specific subject; in the subsequent process 500, the specific object may be expressed as the subject; in the subsequent process 1400, the specific object may be expressed as the cropping subject.
In some embodiments, the process of determining the specific subject according to the specific theme may be implemented by determining the specific subject from candidate subjects based on the specific theme. Specifically, the specific subject may be selected automatically based on the degree of association between the multiple candidate subjects contained in the video segments and the specific theme. For example, the candidate subjects may be ranked by their degree of association with the specific theme, and the top X candidate subjects may be selected, where X may be set to 1, 2, 4, or the like. The top X candidate subjects are determined as the specific subjects of the video segment. For the degree of association between a candidate subject and a specific theme, refer to the description in process 1600 of the correlation between one or more cropping subjects in the cropping subject information and the theme information.
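A hedged sketch of this top-X selection; the relevance scores are made up for illustration (cf. the lipstick example described below):

```python
def pick_specific_subjects(relevance, x=2):
    """relevance: {candidate_subject: association with the specific theme}.
    Rank candidates and keep the top-X as the segment's specific subjects."""
    ranked = sorted(relevance.items(), key=lambda kv: kv[1], reverse=True)
    return [subject for subject, _ in ranked[:x]]

# Illustrative scores only:
print(pick_specific_subjects(
    {"mouth": 0.92, "lipstick": 0.97, "nose": 0.35, "trees": 0.05}, x=2))
# -> ['lipstick', 'mouth']
```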
In some embodiments, the candidate subjects may be subjects preset in the database 140. Candidate subjects may be set specifically for a particular type of theme. For example, for cosmetics advertising videos, the candidate subjects may be commodities and human faces (including facial features such as eyes, nose, and mouth).
In some embodiments, the candidate subjects may also be the subjects in each video frame/shot segment, and the process of obtaining the candidate subjects may be implemented by determining the candidate subjects of each video segment through a machine learning model. For example, the machine learning model determines the main content in the video frames of each video segment and takes the main content as candidate subjects; then the correlation between the candidate subjects and the specific theme is determined according to semantic analysis, and the specific subjects are determined accordingly. This method can make the selected specific subjects highly relevant to the specific theme.
In some embodiments, the specific subjects of a video segment may be determined according to the specific theme. For example, if the theme information of the target video is lipstick, a video segment may contain one or more subjects, and the subjects of each video segment identified by the machine learning model serve as candidate subjects, which may include the nose of a face, the mouth of a face, the eyes of a face, a commodity (lipstick), trees, roads, and houses. Given that the specific theme is lipstick, the candidate subjects highly correlated with lipstick, such as the mouth of the face and the commodity (lipstick), can be further selected as the specific subjects of the corresponding video segment.
In some embodiments, the target video may include multiple video themes. For example, to highlight the effect of a mouthwash, a tooth-protection promotional video may be spliced before the mouthwash advertising video. Correspondingly, different video themes may contain different specific subjects. For each video file, the overlap count of specific themes may be determined according to the specific subjects contained in the video segment and the correspondence between specific subjects and specific themes (different video themes). For example, in the aforementioned mouthwash advertising video spliced with the tooth-protection promotion, shots related to effect demonstration may relate to two specific themes at the same time; for instance, the effect demonstration shots may include demonstrations of not brushing, brushing with dental tools, and using mouthwash, highlighting the cleaning effect of the mouthwash while promoting tooth protection. In addition, with multiple video themes, the focus time of a specific theme in the target video may be determined by counting the focus time corresponding to that specific theme (i.e., the time during which that specific theme is the focus or main content of the video segments).
FIG. 5 is an exemplary flowchart of a method for editing an initial image or an initial video according to some embodiments of the present application. In some embodiments, one or more steps of process 500 may be stored as instructions in a storage device (e.g., the database 140) and invoked and/or executed by a processing device (e.g., the processing device 112 in the server 110).
Step 510: determine the subject information of the initial image or the initial video, the subject information including at least the subject and the subject position.
The initial image or the initial video usually includes one or more subjects used to highlight the theme. In some embodiments, the subject may specifically be the object most relevant to the target video theme among the objects appearing in the video segment. In some embodiments, the subject may also be the object occupying the largest proportion of the frame. Exemplarily, the subject may be one or more of a product (electronic products, daily necessities, decorations, etc.), a living being (a person, an animal, etc.), a landscape (a mountain, a house, etc.), or the like. For ease of description, this part is described with a single subject, the subject being a model.
In some embodiments, the subject may be imported manually. For example, the user may select the subject from the database 140 or the terminal device 130. Continuing with the model as the subject as an example: when the user wants to generate a target video related to the model, the model can serve as the subject (specific theme) of the target video, and the corresponding video segments, initial videos, and initial images should contain this subject. After the user selects the model in the database 140, the processor further processes the initial image or the initial video to determine the position of the model in each video frame. Selecting the subject may be implemented by uploading a specific image, manually selecting a specific image in a video frame, semantic recognition, or similar methods. For example, after the user inputs the text "model", the multimedia system 100 can automatically identify the "model image" in each initial video and initial image as the subject through semantic recognition.
In some embodiments, the subject information includes at least the presence or absence of a subject and the subject position of the corresponding subject. The subject information may also include the subject's color information, size information, name information, category information, facial recognition data, or the like. The subject position can be understood as information about where the subject is located in the picture of the image and/or video, for example, the coordinates of a reference point. The size information of the subject may include the actual size of the subject and the proportion of the subject relative to the size of the frame of the advertising video. The category information of the subject can be understood as the classification of the subject. For example, the category information of a subject includes information on whether the subject is classified as a product or a model, or may be further refined into a certain category of products; for instance, the category information covering multiple subjects that are mobile phones may be "mobile devices".
In some embodiments, the subject information may be characterized by tags (e.g., a tag ID and a tag value). Correspondingly, tags may be added to the video frames in the initial video, and a tag may represent the name of a subject included in the image or video. For example, if a poster includes product A, product B, and model A, tags for product A, product B, and model A may be added to the poster (e.g., by adding or modifying the tag IDs corresponding to product A, product B, and model A and setting the corresponding tag values to 1).
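One possible in-memory layout for such tags, purely as an illustration of the poster example (the tag IDs are assumed names, not part of the disclosure):

```python
# Tag values of 1 mark subjects present in the poster; 0 marks absence.
poster_tags = {
    "product_A": 1,
    "product_B": 1,
    "model_A": 1,
    "product_C": 0,
}
```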
In some embodiments, in the method of determining the initial image or the initial video (e.g., step 310), each image or video to be selected in the database 140 may hold its own tags. When the user selects product A, product B, or model A in the database 140 as the subject (e.g., after determining the specific theme of the target video, selecting the specific object corresponding to that theme), the database 140 can automatically associate this selection with the aforementioned poster that includes product A, product B, and model A, and extract the poster as the initial image. When extracting the initial video, the part of the video content containing such video frames can be processed further directly (e.g., the video frames of each video are analyzed by the subject information determination model), thereby obtaining tags containing the subject and the subject position.
In some embodiments, the method of screening video frames may also be implemented by a machine learning algorithm, that is, a machine learning model identifies whether each video segment contains a specific object. For example, the theme of the target video may be a tooth-protection promotion, and the corresponding specific objects may be "teeth", "doctor", "dental tools", and other specific things related to the theme (tooth protection); based on the determined specific objects, a machine learning algorithm can determine whether each video segment contains a specific object.
In some embodiments, the initial image or the initial video may be processed by a subject information determination model (e.g., a machine learning model) to obtain the subject and the position of the subject.
In some embodiments, for the initial video, the subject information determination model may be a generative model, a discriminative model, or a deep learning model in machine learning; for example, it may be a deep learning model using an algorithm such as the yolo series of algorithms, the faster R-CNN algorithm, or the EfficientDet algorithm. The subject information determination model may perform subject information determination alone, or may be combined with other processes/steps (e.g., process 400) to determine the subject and the position of the subject. Correspondingly, the subject information determination model may be trained separately or together with other steps. For example, when a deep learning model is used for editing processing (e.g., process 400), manually annotated object positions and categories may be used as training samples to train the model, so that the model can accurately annotate the subject in the initial video. In some embodiments, an image embedding model may further be used as the subject information determination model to extract the images of the video frames constituting the initial video and extract the image features of the video frames, so as to determine the subject of the initial video.
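As one hedged example of such a detector, the sketch below runs torchvision's pretrained Faster R-CNN (one of the model families named above) on a single frame; the score threshold and helper names are assumptions of this illustration:

```python
import torch
from torchvision.models.detection import (
    fasterrcnn_resnet50_fpn, FasterRCNN_ResNet50_FPN_Weights)

weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
detector = fasterrcnn_resnet50_fpn(weights=weights)
detector.eval()

def detect_subjects(frame, score_threshold=0.6):
    """frame: (3, H, W) float tensor in [0, 1].
    Returns (box, category name, score) triples for confident detections."""
    with torch.no_grad():
        out = detector([frame])[0]
    names = weights.meta["categories"]
    return [(box.tolist(), names[label], float(score))
            for box, label, score in zip(out["boxes"], out["labels"], out["scores"])
            if float(score) >= score_threshold]
```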
In some embodiments, for the initial image, the initial image may be processed by the subject information determination model to obtain the position of the subject. Specifically, the image embedding model may continue to be used to determine the position of the subject. It can be understood that a single video frame can be regarded as a picture, so an image embedding model capable of processing multiple video frames can also process the initial image. In some embodiments, the image embedding models for initial image processing and initial video processing may be trained separately or together. In addition, in other specific implementations, the position of the subject in the initial image may also be determined by the deep learning model used for the initial video, for example, a deep learning model using an algorithm such as the yolo series of algorithms, the R-CNN algorithm, or the EfficientDet algorithm.
The subject information determination model may perform the operation of determining subject information alone; the subject information determination model may also be combined with other operations, so that the subject information is determined during the execution of those other operations.
In some embodiments, the initial image or the initial video may be input directly into the subject information determination model, so that the subject information determination model annotates the corresponding subject and subject position in the corresponding video frames, thereby determining the subject information of the initial image or the initial video.
In some embodiments, a long initial video may yield one or more video segments after being edited as shown in process 400. To improve the relevance of the video segments to the target video, the subject information determination model may be combined with the relevant operations of process 400, so that video segments with a specific subject can be retained during the splitting or editing of process 400; for example, the retained subject may be a specific object related to the specific theme of the target video. In some embodiments, the subject information determination model may be used to annotate the subject before cropping, so as to ensure that the cropped video includes the subject.
In some embodiments, the subject information determination model may be combined with the relevant operations of step 310. When the image embedding model is used to determine the subject information, the image embedding model extracts the image features of the subject, for example, the image features of the specific object determined by the specific theme of the target video in step 220, such as the image features of a "model". Based on the image features of the video frames and the image features of the subject, a series of video frames containing the subject are determined in the database 140, and the shot segment composed of this series of video frames is the initial video or initial image containing the subject.
Step 520: edit the initial image or the initial video based on the subject information to obtain multiple video segments.
In some embodiments, the editing processing can avoid the region where the subject is located according to the determined position of the subject, thereby generating video segments that meet the requirements.
In some embodiments, to improve the processing accuracy, the influence of the editing processing on the subject can be avoided according to the outer contour of the subject. After the subject position is determined, the outer contour of the subject is determined based on the subject position, so as to distinguish the subject from the background part of the initial image or the initial video. It should be noted that, in some other embodiments, the subject information may also include color information, size information, etc.; obviously, based on the color information and size information in addition to the subject position, the outer contour of the subject can be determined more quickly and efficiently.
In some embodiments, the outer contour of the subject may be determined by the size of the subject. For example, the smallest rectangular bounding box containing the subject may be determined according to the size of the subject, and this smallest rectangular bounding box may be used as the outer contour of the subject.
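A minimal sketch of the smallest-bounding-rectangle contour, assuming a binary subject mask is available (e.g., from a detection or matting step); the helper name is illustrative:

```python
import numpy as np

def min_bounding_rect(subject_mask: np.ndarray):
    """subject_mask: 2-D array, nonzero where the subject's pixels are.
    Returns (x, y, w, h): the smallest axis-aligned rectangle containing
    the subject, used here as the subject's outer contour."""
    ys, xs = np.nonzero(subject_mask)
    x, y = int(xs.min()), int(ys.min())
    return x, y, int(xs.max()) - x + 1, int(ys.max()) - y + 1
```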
In some embodiments, the outer contour of the subject may be determined by the edge of the subject, where the edge refers to the junction between the subject and the image background in the image. For example, after the position of the subject is determined, an image recognition algorithm (e.g., an edge detection algorithm) is used to determine the edge of the subject, and the edge of the subject is taken as its outer contour. In some embodiments, the region obtained by applying preset processing to the edge of the subject may also be used as the outer contour of the subject; for example, the preset processing may include one or a combination of smoothing, region scaling, and the like.
In some embodiments, the initial image or the initial video is cropped or scaled according to the outer contour of the subject to obtain video segments that meet the requirements. A video segment that meets the requirements may be a video segment in which the editing processing performed on the initial image or the initial video does not affect the subject. This can be implemented by cropping the initial image or the initial video while avoiding the outer contour of the subject, and by scaling the initial image or the initial video while keeping the aspect ratio within the outer contour of the subject. It can be understood that the editing processing may also include any editing method mentioned in the present application; for example, during beautification, the image outside the outer contour of the subject can be blurred while the outer contour of the subject is avoided, so as to highlight the subject.
In some embodiments, cropping the initial image or the initial video while avoiding the outer contour of the subject can be implemented by matting. Specifically, once the outer contour of the subject in the initial image or the initial video has been identified, a matting algorithm can be used to avoid the outer contour of the subject and separate the subject from the initial image or the initial video. Processing methods for the separated subject include, but are not limited to, locking it or placing it on a new layer; after the subject is locked or placed on a new layer, the background part can be processed further.
It should be noted that, in some embodiments, the matting algorithm may be a deep-learning-based matting algorithm, such as Learning Based Digital Matting or K-nearest-neighbor matting (KNN matting). In some other embodiments, the matting algorithm may also be at least one of Cluster-Based Sampling matting (CBS) and Iterative Transductive Learning for alpha matting (ITL).
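The following is not one of the matting algorithms named above, but a simpler OpenCV GrabCut stand-in that illustrates separating the subject layer from the background given the subject's outer-contour rectangle; the iteration count and helper name are assumptions:

```python
import cv2
import numpy as np

def separate_subject(image_bgr: np.ndarray, rect):
    """rect: (x, y, w, h) outer contour of the subject.
    Returns (subject_layer, background_layer) for further processing."""
    mask = np.zeros(image_bgr.shape[:2], np.uint8)
    bgd = np.zeros((1, 65), np.float64)
    fgd = np.zeros((1, 65), np.float64)
    cv2.grabCut(image_bgr, mask, rect, bgd, fgd, 5, cv2.GC_INIT_WITH_RECT)
    fg = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0).astype(np.uint8)
    return image_bgr * fg[:, :, None], image_bgr * (1 - fg)[:, :, None]
```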
In some embodiments, scaling the initial image or the initial video while keeping the aspect ratio within the outer contour of the subject can be implemented by separating the subject from the background. Specifically, to prevent the subject from being deformed or distorted during scaling, the subject and the background part are scaled separately, and the aspect ratio within the outer contour of the subject is kept during scaling. Merely by way of example, the initial image may be a poster with a pixel size of 800×600, and the subject may be a mobile phone with a pixel size of 150×330 within the poster (the subject 350 has an aspect ratio of 5:11). When the size of the target video or video segment is 1200×800, the initial image needs to be scaled to 1200×800. If the subject is scaled directly, the resulting size is 225×440, an aspect ratio of about 5:9.8; the subject is clearly deformed, and deformation of the subject in the target video or video segments may adversely affect the effect of the video and the customers' perception of the product. In some embodiments, the method of keeping the aspect ratio within the outer contour of the subject may be to obtain the scale factors of the initial image or the initial video in the width direction and in the height direction of the target video size (or video segment size) respectively. Continuing the above example, the initial image is scaled by a factor of 1.5 in the width direction and about 1.33 in the height direction; to ensure that the subject is not deformed, a single factor of 1.5 or 1.33 can be chosen for both directions. It should be noted that, in some other embodiments, the outer contour of the subject may not be rectangular; the above scaling method is equally applicable in that case, and the image processing is similar to the video processing, which will not be repeated here.
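The poster example above can be restated numerically; the helper below computes the per-direction factors and one uniform factor for the subject (choosing the larger factor here is an illustrative assumption, since the text allows either):

```python
def uniform_subject_scale(src_w, src_h, dst_w, dst_h):
    """Per-direction scale factors implied by the target size, plus a single
    uniform factor applied to the subject to preserve its aspect ratio."""
    sx, sy = dst_w / src_w, dst_h / src_h
    return sx, sy, max(sx, sy)

sx, sy, s = uniform_subject_scale(800, 600, 1200, 800)
# sx = 1.5, sy ~= 1.33; scaling the 150x330 phone by the single factor s
# keeps its 5:11 aspect ratio, unlike the non-uniform (sx, sy) pair.
```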
In some embodiments, since the size ratio of the background in the initial image or the initial video may not match the size of the target video or video segment, direct scaling may change the ratio. When the ratio needs to be kept consistent, the background part may be cropped first and then scaled after cropping.
FIG. 6 is an exemplary flowchart of generating a video according to some embodiments of the present application. In some embodiments, one or more steps of process 600 may be stored as instructions in a storage device (e.g., the database 140) and invoked and/or executed by a processing device (e.g., the processing device 112 in the server 110). As shown in FIG. 6, process 600 specifically includes the following steps:
Step 610: obtain at least one of an initial video and an initial image.
In some embodiments, the initial video and the initial image may also be called the to-be-processed video and the to-be-processed image, respectively. Step 610 may be implemented with reference to step 310 in process 300, or with reference to step 510 in process 500.
Step 620: obtain the subject of the target video in the initial video.
In some embodiments, the subject of the target video here can be understood as each specific object corresponding to the specific theme of the target video. For example, for the aforementioned target video with "tooth-protection promotion" as its theme, the subjects of the target video (the specific objects corresponding to the specific theme) may be "teeth", "doctor", "dental tools", and other specific things related to the theme (tooth protection).
Step 630: crop, scale, and/or edit the initial video based on the preset size of the target video and the subject, to obtain video materials that all include the subject.
In some embodiments, the video material may be a video segment determined from the initial video; video materials that all include the subject refer to video segments determined based on the initial video that all contain the subject (specific object) of the target video.
In some embodiments, the target video may be preset with a preset size as required. When the size of the initial video does not match the size of the target video, or their size ratios do not match, the initial video may be cropped, scaled, and/or edited. Merely by way of example, suppose the target video size is FHD (Full High Definition, 1920×1080). When the initial video does not match the target video in size but has the same ratio (e.g., both 16:9), the initial video can be scaled to obtain a video with the same 1920×1080 size as the target video. When the ratio of the initial video does not match that of the target video (e.g., the initial video is 1:1), if the initial video size is 1024×1024, the cropping target size obtained from the target video's aspect ratio is 1024×576; that is, the initial video is first cropped frame by frame, and the cropped 1024×576 video is then enlarged proportionally to 1920×1080. It should be noted that, in some embodiments, when the initial video size is larger than the target video size, e.g., 2560×2560, the initial video may be cropped directly to the target video size of 1920×1080, or, following the above steps, first cropped to 2560×1440 and then scaled proportionally. Since a video frame can be regarded as a picture, for the way of cropping the video frame by frame in this step, refer to the image cropping processing shown in FIG. 7 of the present application. In some embodiments, the way of cropping the video frame by frame in step 630 may also refer to the video cropping method shown in FIG. 14 of the application.
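The crop-then-scale arithmetic of this example can be sketched as follows. The helper computes sizes only; where the crop window is actually placed (e.g., around the subject) is a separate choice, and the helper name is illustrative:

```python
def crop_then_scale_size(src_w, src_h, dst_w, dst_h):
    """Largest crop of the source matching the target aspect ratio, and the
    uniform factor that then scales the crop to the target size."""
    target_ratio = dst_w / dst_h
    if src_w / src_h > target_ratio:                    # source too wide
        crop_w, crop_h = round(src_h * target_ratio), src_h
    else:                                               # source too tall
        crop_w, crop_h = src_w, round(src_w / target_ratio)
    return (crop_w, crop_h), dst_w / crop_w

# The 1:1 example above: 1024x1024 aimed at FHD (1920x1080, 16:9)
print(crop_then_scale_size(1024, 1024, 1920, 1080))
# -> ((1024, 576), 1.875): crop to 1024x576, then scale by 1.875
```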
In some embodiments, when the initial video size matches the target video size, or matches it after cropping and scaling, a long initial video (e.g., longer than 15 or 20 seconds) may be edited to avoid the problem of a single video material (video segment) lasting too long. Usually one video material (video segment) corresponds to one scene (shot segment), and playing footage of the same scene for a long time may bore the viewer, so each video material is shortened to highlight the key points. In some implementations, the duration of the video material may also be adjusted by interpolation or sampling. It should be noted that if the initial video needs both cropping/scaling and editing, it may first be edited to obtain a video with the subject and then cropped and scaled, or first cropped and scaled to obtain videos of a consistent size and then edited; the present application does not limit this.
Step 640: crop and/or scale the initial image based on the preset size of the target video to obtain image material including the subject.
In some embodiments, the image material may be determined from the initial images. "Image material all including the subject" refers to images, determined based on the initial images, that each contain the subject (specific object) of the target video.
In some embodiments, so that the images also satisfy the size requirement of the target video, the image files among the initial images whose size does not match the target video size are cropped or scaled. Continuing with the example where the target video size is FHD, cropping and/or scaling yields images of size 1920×1080 that include the subject.
In some embodiments, at least one of an initial image and an initial video is acquired in step 610. When both an initial image and an initial video are acquired, step 630 and step 640 are both performed, and there is no required order between the two steps. When only an initial video is acquired, step 630 may be performed and step 640 skipped; when only an initial image is acquired, step 630 may be skipped and step 640 performed.
In some embodiments, step 640 may be implemented by the method shown in FIG. 7, where FIG. 7 is an exemplary flowchart of generating a video according to some embodiments of the present application. One or more sub-steps of step 640 may be stored in the form of instructions in a storage device (e.g., the database 140) and called and/or executed by a processing device (e.g., the processing device 112 in the server 110).
Step 642: obtain subject information of the target video in the initial image.
In some embodiments, the initial image may be processed by a subject information determination model (e.g., a machine learning model) to obtain the subject information. In some embodiments, for the initial image, the subject information determination model may be a generative model, a discriminative model, or a deep learning model in machine learning, for example, a deep learning model using an algorithm of the YOLO series, the Faster R-CNN algorithm, or the EfficientDet algorithm.
The subject information is determined by inputting the initial image or the initial video into the subject information determination model. In some embodiments, the initial image or the initial video may be input directly into the subject information determination model, so that the model marks the corresponding subject and subject position in the corresponding video frame, thereby determining the subject information of the initial image or the initial video. In some embodiments, step 642 may refer to the related description of step 510 in the process 500.
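Purely as an illustrative sketch (the third-party package, the weights file, and the input file name below are assumptions, not part of this application), a YOLO-series detector could return the subject label and bounding box roughly as follows:

```python
# Illustrative only: assumes the third-party "ultralytics" package and a
# pretrained "yolov8n.pt" weights file are available.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")             # stand-in for the subject information determination model
results = model("initial_image.jpg")   # hypothetical input file

for box in results[0].boxes:
    cls_id = int(box.cls)                     # class index of the detected subject
    label = results[0].names[cls_id]          # human-readable subject label
    x1, y1, x2, y2 = box.xyxy[0].tolist()     # subject position as a bounding box
    print(label, (x1, y1, x2, y2), float(box.conf))
```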
Step 644: identify the outer contour of the subject based on the subject information.
After the subject information (including the subject and the position of the subject) is determined, the outer contour of the subject is determined based on the subject position, so as to distinguish the subject from the background portion of the initial image. It should be noted that in some other embodiments, the subject information may further include color information, size information, etc.; clearly, with the color information and size information in addition to the subject position, the outer contour of the subject can be determined more quickly and efficiently.
In some embodiments, the outer contour of the subject may be determined by the size of the subject. For example, the minimum bounding rectangle containing the subject may be determined according to the size of the subject, and that rectangle may be used as the outer contour of the subject.
In some embodiments, the outer contour of the subject may be determined by the edge of the subject, where the edge refers to the boundary between the subject and the image background. For example, after the position of the subject is determined, an image recognition algorithm (e.g., an edge detection algorithm) may be used to determine the edge of the subject, and the edge is taken as the subject's outer contour. In some embodiments, a region obtained by applying preset processing to the edge of the subject may also be used as the outer contour; for example, the preset processing may include one or more combinations of smoothing, region scaling, and the like.
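As a minimal sketch of the edge-based approach (OpenCV is assumed here merely for illustration, and the detector's bounding box is hard-coded; this application does not prescribe a particular library), one could extract the edge of the subject inside the detected region and take its bounding rectangle:

```python
# Illustrative only: assumes OpenCV (cv2) is available.
import cv2

img = cv2.imread("initial_image.jpg")           # hypothetical input
x1, y1, x2, y2 = 100, 80, 400, 520              # subject position from the detector (assumed)
roi = img[y1:y2, x1:x2]

gray = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, 50, 150)                # edge detection within the subject region
contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

largest = max(contours, key=cv2.contourArea)    # treat the largest contour as the subject edge
bx, by, bw, bh = cv2.boundingRect(largest)      # upright bounding rectangle = outer contour
print("outer contour:", (x1 + bx, y1 + by, bw, bh))
```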
Step 646: crop the initial image while avoiding the outer contour of the subject.
The initial image may be cropped while avoiding the outer contour of the subject. In some embodiments, this may be achieved by matting. Specifically, after the outer contour of the subject in the initial image is identified, a matting algorithm may be used to avoid the outer contour and separate the subject from the initial image. Processing methods for the separated subject include, but are not limited to, locking it or placing it on a new layer; once the subject is locked or placed on a new layer, the background portion can be processed further.
It should be noted that, in some embodiments, the matting algorithm may be a learning-based matting algorithm, such as Learning Based Digital Matting or KNN matting. In some other embodiments, the matting algorithm may also be at least one of Cluster-Based Sampling matting (CBS) and Iterative Transductive Learning for alpha matting (ITL). In some embodiments, step 646 may refer to the related description of step 520 in the process 500.
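As an illustrative stand-in for the matting algorithms named above (none of which is implemented here), OpenCV's GrabCut can demonstrate the same subject/background separation given the outer-contour rectangle; this is a sketch under that assumption, not the claimed matting step:

```python
# Illustrative only: GrabCut as a simple stand-in for the matting step.
import cv2
import numpy as np

img = cv2.imread("initial_image.jpg")      # hypothetical input
rect = (100, 80, 300, 440)                 # (x, y, w, h) from the outer contour (assumed)

mask = np.zeros(img.shape[:2], np.uint8)
bg_model = np.zeros((1, 65), np.float64)
fg_model = np.zeros((1, 65), np.float64)
cv2.grabCut(img, mask, rect, bg_model, fg_model, 5, cv2.GC_INIT_WITH_RECT)

# Pixels marked as sure/probable foreground form the separated subject "layer".
fg_mask = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0).astype("uint8")
subject = img * fg_mask[:, :, None]        # subject kept; background zeroed for further processing
```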
Step 648: scale the image to be processed while maintaining the aspect ratio within the outer contour of the subject.
In some embodiments, scaling the initial image while maintaining the aspect ratio within the outer contour of the subject may be achieved by separating the subject from the background. Specifically, to prevent the subject from being deformed or distorted during scaling, the subject and the background portion are scaled separately, and the aspect ratio within the outer contour of the subject is preserved during scaling. Merely by way of example, the initial image may be a poster with a pixel size of 800×600, and the subject may be a mobile phone within the poster with a pixel size of 150×330 (subject aspect ratio 5:11). When the size of the target video or video segment is 1200×800, the initial image needs to be scaled to 1200×800. If the subject were scaled directly with the image, its scaled size would be 225×440, i.e., an aspect ratio of about 5:9.8; the subject is clearly deformed, and a deformed subject in the target video or a video segment may adversely affect the effect of the video and customers' perception of the product. In some embodiments, the method of maintaining the aspect ratio within the outer contour of the subject may be to separately obtain the scaling ratios of the initial image in the width direction and in the length direction of the target video size (or video segment size). For example, continuing the above example, the initial image is scaled by 1.5 times in the width direction and by about 1.33 times in the length direction; to ensure that the subject is not deformed, a single factor, e.g., 1.5 or about 1.33, may be chosen for both the length and width directions of the subject. It should be noted that, in some other embodiments, the outer contour of the subject may not be rectangular; the above scaling method still applies in that case. The processing of images is similar to the processing of videos and is not repeated here. In some embodiments, step 648 may refer to the related description of step 520 in the process 500.
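The per-axis arithmetic above can be sketched as follows (a minimal illustration; the helper function and its choice of the smaller factor are assumptions): the background may be stretched anisotropically, while the subject gets one uniform factor.

```python
def subject_safe_scale(img_w, img_h, tgt_w, tgt_h, sub_w, sub_h):
    """Background is stretched per axis; the subject keeps its aspect ratio
    by using a single uniform factor for both of its directions."""
    sx, sy = tgt_w / img_w, tgt_h / img_h   # per-axis factors for the background
    s = min(sx, sy)                          # one uniform factor for the subject
    return (sx, sy), (round(sub_w * s), round(sub_h * s))

# Poster example from the text: 800x600 -> 1200x800, subject 150x330 (5:11).
(bg_sx, bg_sy), (sw, sh) = subject_safe_scale(800, 600, 1200, 800, 150, 330)
print(bg_sx, bg_sy)   # 1.5, 1.333... (anisotropic background stretch)
print(sw, sh)         # 200 x 440 -- still 5:11, so the phone is not deformed
```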
Step 650: splice the video segments based on the video configuration information to generate the target video.
The video segments in step 650 may include video material each including the subject and/or image material including the subject.
In some embodiments, the video configuration information may be determined based on script information and/or a video template. The video template may include an overall video template for the target video, and may also include segment video templates for the individual video segments that make up the target video. The video template may include at least a time parameter. In some embodiments, the time parameter reflects at least the length of the target video or of a video segment (the total duration of the target video). In some embodiments, the preceding steps have already processed the initial images and/or initial videos into video segments (including image material and/or video material) consistent with the target video size; splicing may therefore consist of playing the video segments in order, either randomly or according to a predetermined rule, based on the time parameter. Merely by way of example, the predetermined rule may be alternating image material and video material, concentrating the image material in the middle of the target video, or playing the segments in a preset order corresponding to each video segment. For example, for the aforementioned mouthwash advertising video, the video segment containing the subject (specific object) "model" is played first, then the segment containing the subject "teeth", and finally the segment containing the subject "mouthwash product". It should be noted that since pictures have no time attribute, the display time of a single picture (e.g., 3, 5, or 10 seconds) may be defined during splicing, switching to the next material once the display time is reached. In some embodiments, the duration of a video segment may differ from the time parameter in the segment video template; one or more of the following methods may be combined to adjust the duration: cutting part of the segment, merging it with other segments, sampling its frames for playback, or interpolating between frames.
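A minimal splicing sketch is shown below (the third-party moviepy package, the file names, and the 3-second image display time are all assumptions for illustration; the segments are assumed to have already been reconciled to the target video size):

```python
# Illustrative only: assumes the third-party "moviepy" package and these media files.
from moviepy.editor import ImageClip, VideoFileClip, concatenate_videoclips

segments = [
    VideoFileClip("model.mp4"),                    # subject: "model"
    VideoFileClip("teeth.mp4"),                    # subject: "teeth"
    ImageClip("mouthwash.png").set_duration(3),    # pictures need an explicit display time
]
target = concatenate_videoclips(segments)          # splice in the preset order
target.write_videofile("target_video.mp4", fps=25)
```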
In some embodiments, the video template may also be used in steps such as obtaining the initial videos or initial images and generating video segments. Correspondingly, the object to which the video template is applied in this step may be the target video, a video segment, an initial image, or an initial video. For example, for the initial images and/or initial videos, a video template corresponding to the subject of the target video may be selected and applied. Specifically, for an initial image of a specific product (e.g., mouthwash), the background of the image may be replaced with a display background, and text (e.g., a product introduction, price information, a purchase link) and dynamic effects (e.g., an arrow pointing to the purchase link) may be added to the image, thereby generating a product display video segment.
In some embodiments, the video template may also include beautification parameters, which are used to beautify the target video for a better effect. In some other embodiments, the beautification parameters may not be included in the video template, but may instead be obtained in a step before video rendering, before determining the video segments, and/or before generating the target video. Correspondingly, the beautification methods described below may also be applied to the target video, the video segments, the initial images, and the initial videos.
In some embodiments, specifically, the beautification parameters may include at least one of filter parameters, animation parameters, and layout parameters. A filter parameter may add a global effect filter (e.g., black-and-white, retro, vivid) to the target video. An animation parameter may add animation effects between video segments when the target video is spliced from multiple segments, making the target video more natural. As for layout parameters, since the subject position differs among the segments, in some embodiments the subject position may be marked in the material (e.g., the subject is in the upper-left, upper-right, lower-left, or lower-right of the whole image/video), and the layout parameters combine and arrange this subject position information to make the target video smoother and the subject more prominent. In some other embodiments, the beautification parameters may also include removing or adding a watermark, etc.
In some embodiments, before splicing or when determining the video segments, at least one of a text layer, a background layer, or a decoration layer, together with corresponding loading parameters, may also be obtained based on the video configuration information. The layout of the text layer, background layer, and decoration layer in the target video is determined according to the loading parameters, and during splicing and rendering the text layer, decoration layer, and/or background layer are embedded into the video segments or the target video according to that layout. In some embodiments, the text layer may be subtitles or an additional text introduction. In addition, image material sometimes has a transparent background, in which case a background layer may be needed. It should be understood that the text layer and the background layer are both added according to the actual situation of the target video. In some embodiments, the text layer, decoration layer, and background layer may be included in the video template.
In some embodiments, the initial images and initial videos may come from different sources and may differ considerably in color; the generated video segments may then also differ considerably, so normalization is needed. In some embodiments, the normalization may be performed when the video segments are determined, i.e., the initial images and/or initial videos are normalized to generate video segments with a uniform style. In some embodiments, the normalization may be performed during splicing and rendering, i.e., the video segments are normalized to generate a target video with a uniform style. Since a video frame can be regarded as an image, image normalization refers to the process of applying a series of standard processing transformations to an image to convert it into a fixed standard form, the resulting standard image being called a normalized image. Merely by way of example, in some embodiments the grayscale or gamma values of the video segments, initial images, and/or initial videos may be normalized. Specifically, the image histogram of an image or video frame may first be obtained, the histogram is at least equalized, and the grayscale or gamma value of the image or video frame is adjusted based on the at-least-equalized histogram, thereby achieving image normalization. In some other embodiments, the normalization may also be one or more of scale normalization and rotation normalization based on the subject of the target shot; in addition, normalization may also be applied to the brightness, hue, saturation, etc. of the video segments, initial images, and/or initial videos.
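As a minimal sketch of the grayscale/gamma normalization described above (OpenCV is assumed, and the gamma value is an arbitrary illustration), each frame's luminance histogram can be equalized and a gamma adjustment applied:

```python
# Illustrative only: histogram equalization plus a gamma adjustment per frame.
import cv2
import numpy as np

def normalize_frame(frame_bgr, gamma=1.2):
    # Equalize the luminance histogram so frames from different sources match better.
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb)
    ycrcb[:, :, 0] = cv2.equalizeHist(ycrcb[:, :, 0])
    out = cv2.cvtColor(ycrcb, cv2.COLOR_YCrCb2BGR)
    # Gamma adjustment via a lookup table over the 256 intensity levels.
    lut = np.array([255 * (i / 255) ** (1 / gamma) for i in range(256)], dtype="uint8")
    return cv2.LUT(out, lut)
```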
FIG. 8 is an exemplary flowchart of generating a target video according to some embodiments of the present application. In some embodiments, one or more steps in the process 800 may be stored in the form of instructions in a storage device (e.g., the database 140) and called and/or executed by a processing device (e.g., the processing device 112 in the server 110).
Step 810: obtain initial images or an initial video, and generate multiple video segments based on the initial images or the initial video.
In some embodiments, only the initial video may be obtained; the initial video may also be referred to as a video file. For step 810, reference may be made to the related description of step 310 in the process 300, which is not repeated here.
In some embodiments, step 810 may also be omitted from the process 800; for example, the multiple video segments may be obtained directly.
Step 820: obtain one or more candidate video segments from the multiple video segments based on a first preset condition.
The first preset condition may be used to reflect requirements on the content and/or duration of the target video. For example, the first preset condition may include requirements on multiple elements. The multiple elements include, but are not limited to: the target video containing a specific object (F); the target video containing a specific theme (S); the duration of a video segment (tp); the total duration of the target video (ta); the target video containing a specific shot picture (P); the number of shot pictures contained in the target video (Pn); the overlap count of a specific theme in the target video (Fn); and the focus time of a specific theme in the target video (St). The shot pictures counted by Pn may be the same or different. Just as multiple sentences form a paragraph or an article, a combination of multiple shot pictures can form a new video that expresses more detailed content. The focus time (St) of a specific theme refers to the duration occupied by content about that theme in a video segment or the target video. The overlap count (Fn) of a specific theme refers to the number of times content about that theme appears in a video segment or the target video. Fn and St can be related to how prominently the specific theme is highlighted: the larger the overlap count (Fn) and the longer the focus time (St), the more prominent the theme. By constraining Fn and St, better publicity/promotion of the content expressed by the specific theme can be achieved. In some embodiments, the first preset condition may be specified by a user, or determined automatically by the multimedia system 100 (e.g., the processing device 112) based on the video configuration information, the promotional effect the target video needs to produce, and the like.
The first preset condition may be a constraint on at least one of the above elements. The constraints may include qualitative constraints (e.g., whether a specific object (F), a specific theme (S), or a specific shot picture (P) is included) or quantitative constraints (e.g., the total duration of the target video (ta), the number of shot pictures contained in the target video (Pn), the overlap count of a specific theme in the target video (Fn), the focus time of a specific theme in the target video (St)). In some embodiments, screening of video segments may be implemented by constraining the values corresponding to the elements (e.g., comparing them with preset thresholds). For example, when the target video contains the specific object (F), the corresponding value is set to 1; conversely, when it does not, the corresponding value is 0. For a qualitative constraint, the first preset condition may be that the value of the corresponding element is greater than 0. For a quantitative constraint, the first preset condition may be that the value of the corresponding element meets a corresponding threshold (e.g., the video duration is less than 30 seconds, the theme focus time exceeds 15 seconds), so as to screen out video segments that satisfy the needs (e.g., match the interests or requirements of a specific group).
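A toy sketch of such element constraints over clip metadata might read as follows (the metadata schema, thresholds, and values below are invented for illustration; the durations echo the 30-second example given later):

```python
# Illustrative only: the metadata schema and thresholds below are assumptions.
clips = [
    {"id": 1, "duration": 15, "objects": {"model"},     "theme_focus": 10},
    {"id": 2, "duration": 10, "objects": {"teeth"},     "theme_focus": 6},
    {"id": 3, "duration": 5,  "objects": {"mouthwash"}, "theme_focus": 3},
    {"id": 4, "duration": 40, "objects": {"model"},     "theme_focus": 2},
]

def meets_first_condition(clip, required_object=None, max_tp=30):
    qualitative = required_object is None or required_object in clip["objects"]  # F
    quantitative = clip["duration"] <= max_tp                                    # tp
    return qualitative and quantitative

candidates = [c for c in clips if meets_first_condition(c)]   # drops clip 4 (40 s > 30 s)

# Combination-level constraints: ta (total duration) and St (theme focus time).
total = sum(c["duration"] for c in candidates)       # 15 + 10 + 5 = 30 -> satisfies ta = 30 s
focus = sum(c["theme_focus"] for c in candidates)    # 10 + 6 + 3 = 19 -> satisfies St >= 15 s
print(total, focus)
```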
In some embodiments, video segments satisfying the first preset condition may be screened out of the multiple video segments and determined to be candidate video segments. For example, the first preset condition may include constraints on the total duration of the target video (ta) and the number of shot pictures it contains (Pn); that is, candidate video segments may be determined according to the total duration and the number of shot pictures. Illustratively, if the first preset condition is that the target video must contain 3 shot pictures with a total duration of 40 seconds, then 3 video segments each containing a different shot picture and totaling 40 seconds may be screened out, e.g., a 15-second video segment 1, a 15-second video segment 2, and a 10-second video segment 3, as candidate video segments. By constraining ta and Pn, a predetermined degree of exposure can be given to specific shot pictures within a fixed total duration of the target video.
For another example, the first preset condition may include a constraint on a specific shot picture (P) of the video segments, i.e., candidate video segments may be determined according to the shot pictures they contain. Illustratively, if the first preset condition is that the target video must contain a "sea surfing" shot picture, one or more video segments whose shot picture is "sea surfing" may be screened as candidate video segments. By constraining the specific shot picture (P), the target video can be made to contain specific content to satisfy users' interests or needs.
For another example, the first preset condition may simultaneously include constraints on the target video containing a specific object (F), the total duration of the target video (ta), and the number of shot pictures it contains (Pn); that is, candidate video segments may be determined according to the objects contained in the segments and the video duration. Illustratively, suppose the first preset condition is that the target video must contain 3 shot pictures and a logo of "region XX", and that its total duration must not exceed 70 seconds. A video segment 4 containing the "region XX" logo may first be screened as one candidate; then, according to the duration of segment 4 (e.g., 20 seconds), two further video segments may be screened such that the total video duration does not exceed 70 seconds, e.g., a 30-second video segment 5 and a 20-second video segment 6, as candidate video segments. By constraining F, ta, and Pn, a predetermined degree of exposure can be given to the specific shot pictures containing the specific object within a fixed total duration of the target video, so as to highlight that object.
For another example, the first preset condition may include constraints on the overlap count of a specific theme in the target video (Fn) and on the focus time of the specific theme in the target video (St); that is, candidate video segments may be determined according to the number of occurrences of the theme and the focus duration. Illustratively, if the first preset condition is that the target video must contain 2 identical or different specific themes and that the focus time of the specific theme must exceed 1 minute, then either a single video segment containing that number of specific themes with a focus duration exceeding 1 minute, or multiple video segments containing that number of specific themes whose cumulative focus duration exceeds 1 minute, may be screened as candidate video segments. By constraining Fn and St, the specific theme can be highlighted in the target video, which is suitable for advertising or promotion aimed at that theme.
For yet another example, the first preset condition may include constraints on the total duration of the target video (ta), the number of shot pictures it contains (Pn), and the focus time of a specific theme in it (St); that is, candidate video segments may be determined according to the video duration, the number of shot pictures, and the theme focus time. Illustratively, suppose the first preset condition is that the target video has a total duration of 30 seconds, must contain 3 shot pictures, and the focus time of the specific theme is no less than 15 seconds. Three shot pictures may first be screened such that their content on the specific theme totals no less than 15 seconds (e.g., 16, 18, or 20 seconds) and their total duration is 30 seconds; for example, a 15-second video segment 1 (containing 10 seconds of theme-specific content), a 10-second video segment 2 (containing 6 seconds), and a 5-second video segment 3 (containing 3 seconds) may be screened as candidate video segments. By constraining ta, Pn, and St, a predetermined degree of focus can be given to the specific shot pictures of the specific theme within a fixed total duration of the target video, so as to further highlight that theme.
In some embodiments, the process of determining candidate video segments based on the first preset condition to generate the target video can be expressed as a model: f(a, b, c, …, n) → y, where f may be the first preset condition, a, b, c, …, n are the multiple video segments, and y is the target video. By adjusting the relationship between the values corresponding to the elements in the first preset condition and the corresponding thresholds, candidate video segments can be determined, and the target video generated based on them. In some embodiments, the process of generating the target video based on the video configuration information may likewise be expressed as a similar model. The model may be executed by the processing device 112 to automatically generate the target video from the multiple video segments. In some embodiments, f in the model may be learned and trained with a machine learning model to determine a first preset condition or video configuration information that satisfies a specific requirement (e.g., a playback effect).
In some embodiments, there may be one or more candidate video segments satisfying any one requirement of the first preset condition. For example, there may be 3 candidate video segments satisfying a specified shot-picture requirement, 5 satisfying a specific-object requirement, and so on. Correspondingly, the candidate video segments satisfying the first preset condition may form one or more groups.
In some embodiments, the first preset condition may include constraints on the individual shot segments (video segments) of the target video. Correspondingly, according to the constraint for a specific shot segment, the video segments satisfying that constraint may be assigned to that shot segment's group of candidate video segments, so as to obtain a candidate video segment group for each shot segment.
It should be understood that the above is only an example. In some embodiments, the first preset condition is related to at least one of the multiple elements and may be characterized as a constraint on one or more of them (a constraint on a requirement may also be understood as a requirement on the target video). When constraints on multiple elements are involved, the screening of video segments satisfying each requirement may be performed in any reasonable order, which is not limited in this specification.
In some embodiments, the processing device 112 may determine the candidate video segments using a machine learning model. For example, a trained machine learning model may be used to determine whether each of the multiple video segments contains a specific object, and the segments containing the specific object are determined as candidate video segments. In some embodiments, the input of the trained machine learning model may be a video segment, and the output may be the objects contained in the segment, or a binary classification result of whether the segment contains the specific object, which is not limited in this specification. In some alternative embodiments, the candidate video segments may be determined in other feasible ways, which is not limited in this specification.
In some embodiments, the first preset condition further includes element constraint conditions involving two or more specific elements among the multiple elements. The two or more specific elements may be different elements. In some embodiments, an element constraint condition may involve the priority of the two or more specific elements. For example, based on the ease of judgment, a specific object (F) may have a higher priority than a specific theme (S); based on the theme-highlighting effect, the theme focus time (St) may have a higher priority than the theme overlap count (Fn). In some embodiments, an element constraint condition may involve the order in which different elements are considered. For example, when the total duration of the target video (ta), the number of shot pictures (Pn), and the focus time of a specific theme (St) are all present, the number of shot pictures (Pn) is considered first, the theme focus time (St) second, and finally the total duration (ta) can be adjusted by fast or slow playback. In some embodiments, an element constraint condition may also involve whether different elements are mutually compatible. For example, a total video duration (ta) of 15 seconds is incompatible with a theme focus time (St) of 20 seconds.
In some embodiments, the first preset condition may include a binding condition on shot pictures in the target video. The binding condition may reflect an association between at least two specified shot pictures in the target video; for example, the binding condition (which may also be understood as an association) may be that specified shot picture a and specified shot picture b must both appear in the target video, or that in the target video specified shot picture a must appear before specified shot picture b, etc. In some embodiments, the processing device 112 may determine, from the multiple video segments, the segments containing the specified shot pictures, and combine them based on the binding condition into a single candidate video segment. For example, if the binding condition is that shot picture a must appear before shot picture b in the target video, shot pictures a and b may be combined with a placed before b, or their order may be annotated, so as to form one candidate video segment. Combining shot pictures that satisfy the binding condition into one candidate video segment helps treat them as a whole in subsequent processing, improving the efficiency of video synthesis. In some embodiments, shot pictures satisfying the binding condition may instead not be combined into one candidate video segment, but exist separately in consecutive or non-consecutive (e.g., spaced) candidate video segments.
Step 830: group the one or more candidate video segments to determine at least one segment set.
In some embodiments, the process 800 may be used to generate multiple (e.g., a target number of) target videos simultaneously. The grouping may be understood as combining the candidate video segments; correspondingly, step 830 may include combining the candidate video segments to generate a target number of segment sets.
Each segment set in the at least one segment set is a set combined from one or more candidate video segments that simultaneously satisfies the first preset condition and the other conditions in the video configuration information. In some embodiments, the other conditions in the video configuration information may be a second preset condition. The second preset condition is related to the content difference between segment sets/video segments. For example, the second preset condition may specifically include a constraint on the combination difference degree of the segment sets, i.e., the difference degree between the candidate-video combinations of the segment sets in the at least one segment set is greater than a preset threshold. For the evaluation of the second preset condition, reference may be made to FIG. 9 below and its related description.
In some embodiments, the other conditions in the video configuration information may also include, but are not limited to, requirements on one or more combinations of the target video's frame, subtitles, hue, saturation, background music, and the like. For example, the at least one segment set may include segment set 1 combining the aforementioned video segments 1, 2, and 3; segment set 2 combining video segments 4, 5, and 6; segment set 3 combining video segments 1, 2, 3, and 4; and so on.
Step 840: generate one target video based on each segment set in the at least one segment set.
In some embodiments, when the process 800 is used to generate multiple target videos simultaneously, step 840 may include, for each segment set, synthesizing one target video based on that segment set.
For each segment set in the at least one segment set (which may also be a target number of segment sets), one target video may be synthesized based on that set; correspondingly, the at least one segment set can be synthesized into a corresponding number of target videos. In some embodiments, the video segments may be sorted and combined according to sequence information included in the video configuration information to generate the target video. In some embodiments, for video segments without an order requirement, the target video may be synthesized based on the continuity of the shot pictures of the segments in the set, or randomly. For example, a segment showing daytime content may be placed before a segment showing nighttime content. In some embodiments, the target video may be synthesized based on the promotional copy of the product or information to be promoted. For example, for a video promoting garbage sorting, the promotional copy may first show the possible consequences of not sorting, then the benefits of sorting, and finally the sorting method; the target video can then be synthesized according to the content of each segment in the set, following the order in which the copy presents the content.
In some embodiments, when a target number of target videos are synthesized, they may be delivered in batches or simultaneously. In some embodiments, the first preset condition, the second preset condition, and/or the other conditions may be adjusted based on the promotional effect of each target video. The promotional effect may be obtained from user feedback, play counts, ratings, feedback results (e.g., product sales, garbage sorting results), etc., which is not limited in this specification; for details, reference may be made to the related descriptions of FIGS. 17 and 18 of this application.
In some embodiments, the second preset condition may include that the difference degree between any two segment sets in the at least one segment set is greater than a preset difference threshold. In some embodiments, the difference degree between any two segment sets may also be referred to as the candidate-video-segment combination difference degree between the two sets.
FIG. 9 is an exemplary flowchart of determining segment sets according to some embodiments of the present application, specifically involving a method of screening out at least one segment set based on the second preset condition. In some embodiments, one or more steps in the process 900 may be stored in the form of instructions in a storage device (e.g., the database 140) and called and/or executed by a processing device (e.g., the processing device 112 in the server 110).
Step 910: determine the candidate-video-segment combination difference degree between each of two or more segment sets and the other segment sets.
In some embodiments, the multimedia system 100 (e.g., the processing device 112) may randomly select candidate video segments satisfying multiple groups of first preset conditions and randomly combine them into two or more segment sets. In some embodiments, for the multiple candidate video segments determined according to the first preset condition, one or more candidates may be randomly selected from each group of candidate segments corresponding to the same element to form one segment set; by repeating the random selection step, two or more segment sets can be formed. Segments with qualitative restrictions may be randomly selected first, followed by the other segments. For example, suppose that according to the video configuration information the last shot should be a product shot, while the other shots have no order requirement but do have content requirements. When generating the two or more segment sets, a product-shot video segment may first be randomly selected from the candidate segments as the last shot; then, according to the content requirements of the other shots, corresponding segments are randomly selected from the corresponding candidate segments and sorted (e.g., randomly). Here, the candidate shots may be grouped according to the content requirements of the other shots, with one segment determined from each group, thereby obtaining one combination of candidate video segments.
In step 910, the combination difference degree between each of the two or more segment sets and the other sets may be determined by comparing the combinations of the sets, e.g., by comparing the difference rate between the combinations, i.e., the ratio of video segments that differ between the sets. In some embodiments, step 910 may also be implemented by assigning values to different video segments or by using a machine learning algorithm; for details, reference may be made to the related descriptions of FIGS. 10 and 11 below.
In some embodiments, the difference degree between any two segment sets may include the difference in their candidate-video-segment combinations and/or the combined difference of the candidate segments together with other conditions such as frames, subtitles, and hue. For example, segment set 1 may include video segment 1, video segment 2, frame 1, and subtitle 1, while segment set 5 may include video segment 1, video segment 2, frame 2, and subtitle 2; the difference between segment set 1 and segment set 5 then lies in the frames and subtitles, so the difference rate of the candidate-segment combinations alone is 0%, while the difference rate taking the frames and subtitles into account is 50%. The difference rate may thus be used to characterize the difference degree.
Step 920: take the segment sets whose combination difference degree with the other segment sets is higher than a preset threshold as the at least one segment set.
The segment sets may be screened according to the combination difference degrees among the aforementioned two or more segment sets; for example, the screening condition may be that the combination difference degree between any two of the screened-out segment sets is higher than a preset threshold (e.g., a difference rate above 50%). By screening out multiple segment sets satisfying the second preset condition, the target videos generated from the at least one segment set can present different content and thus achieve different promotional effects, while also satisfying video platforms' requirements on delivered videos.
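A greedy sketch of this screening might read as follows (the difference-rate definition and the 50% threshold mirror the example above; the segment IDs and the greedy strategy are assumptions for illustration):

```python
# Illustrative only: greedy screening of segment sets by pairwise difference rate.
def difference_rate(set_a, set_b):
    """Ratio of elements (segment IDs) not shared between two sets."""
    union = set_a | set_b
    return len(set_a ^ set_b) / len(union) if union else 0.0

candidate_sets = [{1, 2, 3}, {4, 5, 6}, {1, 2, 3, 4}, {1, 2, 4}]
kept = []
for s in candidate_sets:
    if all(difference_rate(s, k) > 0.5 for k in kept):   # preset threshold: 50%
        kept.append(s)
print(kept)   # [{1, 2, 3}, {4, 5, 6}] -- the other two overlap too much with {1, 2, 3}
```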
In some embodiments, the screening order of the first preset condition and the second preset condition may be changed. For example, multiple candidate segment sets may first be generated based on the multiple video segments so that they satisfy the second preset condition; then, at least one segment set is screened from the multiple candidate segment sets based on the first preset condition; and a target video is generated based on each segment set in the at least one target segment set. For details, reference may be made to FIG. 12 and its description.
The beneficial effects that the process 900 may bring include, but are not limited to: (1) determining a target number of segment sets based on the differences between sets makes it possible to determine multiple target videos with different content presentation effects, improving the diversity of the generated target videos; (2) the entire target video generation requires no manual operation, improving the efficiency of video synthesis. It should be noted that different embodiments may produce different beneficial effects; in different embodiments, the beneficial effects may be any one or a combination of the above, or any other obtainable beneficial effects.
FIG. 10 is an exemplary flowchart of determining the combination difference degree of segment sets according to some embodiments of the present application. In some embodiments, one or more steps in the process 1000 may be stored in the form of instructions in a storage device (e.g., the database 140) and called and/or executed by a processing device (e.g., the processing device 112 in the server 110).
Step 1010: assign an identification character to each of the one or more candidate video segments.
In some embodiments, a different identification character may be assigned to each candidate video segment. For each segment set in the at least one segment set (or the target number of segment sets), the identification characters may be determined according to the number of candidate video segments; for example, when the number of candidates is small, an English letter may be assigned as the identification character, and when the number is large, an ASCII code may be assigned. In particular, specific candidate video segments with special requirements, or that must appear in the target video, may be given no identification character. For example, the last shot of the target video must be the product display video segment; if there is only one such segment, it may be given no identification character.
Step 1020: determine, based on the identification characters of the one or more candidate video segments, the character strings corresponding to the segment set and the other segment sets.
For each segment set, the character string of the set is the collection of the identification characters of the candidate video segments in the set. For example, if the identification character of video segment 1 is A, that of segment 2 is B, that of segment 3 is C, and that of segment 4 is D, the string corresponding to segment set 1 is ABC. In some embodiments, a segment set may also carry an order requirement; for example, segment set 2 may be a combination with the same content as segment set 1 but in a different order, and its corresponding string may be CAB.
Step 1030: determine the edit distance between the strings corresponding to the segment set and the other segment sets as the candidate-video-segment combination difference degree between the segment set and the other segment sets.
The edit distance can reflect the number of differing characters between two strings. The smaller the edit distance, the fewer the differing characters, and the smaller the difference between the two segment sets. For example, if segment set 1 corresponds to the string ABC and segment set 3 to the string ABCD, the edit distance between them is 1, i.e., the difference degree is 1. For strings with an order requirement, characters appearing in a different order may also be counted toward the edit distance; moreover, to avoid double-counting content, an order difference may be counted once for the whole string. For example, the edit distance between segment set 1 and segment set 2 in step 1020 may be 1.
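A minimal sketch of the string comparison (the standard Levenshtein dynamic program; the example strings come from the text, and this sketch does not implement the whole-string order rule mentioned above):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

print(edit_distance("ABC", "ABCD"))   # 1 -- segment set 1 vs. segment set 3
```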
In some embodiments, the number of segment sets may be any positive integer, for example 1, 3, 5, 8, or 10. In some embodiments, the candidate video segments may be randomly combined into N candidate segment sets, and a target number of segment sets may be selected from the N candidate segment sets based on the second preset condition. By selecting a target number of segment sets that satisfy the second preset condition, a target number of target videos with different content presentation effects can be generated from those segment sets, thereby achieving different promotional effects.
Fig. 11 is an exemplary flowchart for determining the combination difference degree of segment sets according to some embodiments of the present application. In some embodiments, one or more steps of the process 1100 may be stored as instructions in a storage device (e.g., the database 140) and invoked and/or executed by a processing device (e.g., the processing device 112 in the server 110).
Step 1110: Generate a segment feature for each candidate video segment based on a trained feature extraction model and the candidate video segments in the two or more segment sets.
The feature extraction model may also be referred to as a shot feature extraction model. In some embodiments, the candidate video segments in the two or more segment sets may be processed by the trained feature extraction model to generate the corresponding segment features. Specifically, this includes obtaining the multiple video frames contained in each candidate video segment, determining one or more image features for each video frame, and then processing, with the trained feature extraction model, the image features of the multiple video frames and the relationships among them to determine the segment feature of the candidate video segment. Here, feature extraction refers to processing raw information and extracting feature data; it improves the representation of the raw information so as to facilitate subsequent tasks.
The image features of a video frame may include at least one of: the shape information of an object (e.g., a subject) in the frame, the positional relationships among multiple objects in the frame, the color information of an object in the frame, the completeness of an object in the frame, or the brightness of the frame.
In some embodiments, the feature extraction process may use statistical methods (e.g., principal component analysis), dimensionality reduction techniques (e.g., linear discriminant analysis), feature normalization, data bucketing, and the like. Taking the brightness of a video frame as an example, brightness values within 0-80 may be mapped to [1,0,0], values within 80-160 to [0,1,0], and values above 160 to [0,0,1].
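A minimal sketch of this one-hot brightness bucketing (the thresholds follow the example above, with the last bucket read as above 160; the helper name is hypothetical):

```python
def bucket_brightness(value: float) -> list:
    """One-hot encode a brightness value into the three buckets above."""
    if value <= 80:
        return [1, 0, 0]
    elif value <= 160:
        return [0, 1, 0]
    return [0, 0, 1]

assert bucket_brightness(100) == [0, 1, 0]
```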
However, the obtained image features are diverse, and some of them are hard to measure with a fixed function or an explicit rule. Feature extraction may therefore also rely on machine learning (e.g., a feature extraction model), which learns automatically from the collected information to form a predictive model and thus achieves higher accuracy. The feature extraction model may be a generative model, a discriminative model, or a deep learning model, for example a deep learning model based on the YOLO family of algorithms, the Faster R-CNN algorithm, or the EfficientDet algorithm. The machine learning model can detect the preset objects of interest in each video frame. Objects of interest may include living things (people, animals, etc.), commodities (cars, decorations, cosmetics, etc.), backgrounds (mountains, roads, bridges, houses, etc.), and so on. Further, for a video, the objects of interest may be set more specifically, for example as persons or products. Multiple candidate video segments can be input into the machine learning model, which outputs image features such as the position information and brightness of the objects in each candidate video segment.
It should be noted that those skilled in the art may freely substitute the feature extraction model used; this specification places no limitation on it. For example, the feature extraction model may be a GoogLeNet model, a VGG model, a ResNet model, etc. Extracting video frame features with a machine learning model allows the image features to be determined more accurately.
The trained feature extraction model may be a sequence-based machine learning model that converts a variable-length input into a fixed-length vector representation. Understandably, because different shots have different durations, the numbers of their corresponding video frames also differ; after processing by the trained feature extraction model, each shot can be represented by a fixed-length vector, which facilitates subsequent processing.
For example, the sequence-based machine learning model may be a deep neural network (DNN) model, a recurrent neural network model such as an LSTM, a bi-directional long short-term memory (Bi-directional LSTM) model, a gated recurrent unit (GRU) model, or a combination thereof; this specification places no limitation here. Specifically, the image features of the obtained video frames (e.g., features 1, 2, 3, ..., n) and their relationships (e.g., sequential order and/or temporal order) are input into the feature extraction model, which outputs the sequence of encoded hidden states at each time step (e.g., h_1 to h_n), where h_n contains all the information of the shot over that period. In this way, the feature extraction model converts the multiple image features within a period of time (e.g., the candidate video segment corresponding to a shot) into a fixed-length vector representation h_n (i.e., the shot feature). For the training process of the feature extraction model, refer to the corresponding description of Fig. 19, which is not repeated here.
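As an illustrative sketch only (the original discloses no implementation), such a sequence encoder can be built from an off-the-shelf LSTM that consumes per-frame feature vectors and returns the last hidden state h_n as the shot feature; the class name and the dimensions are assumptions:

```python
import torch
import torch.nn as nn

class ShotEncoder(nn.Module):
    """Encode a variable-length sequence of per-frame features
    into a fixed-length shot feature h_n."""
    def __init__(self, frame_dim: int = 64, hidden_dim: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(frame_dim, hidden_dim, batch_first=True)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, frame_dim); num_frames varies per shot
        _, (h_n, _) = self.lstm(frames)
        return h_n[-1]  # (batch, hidden_dim): a fixed-length vector

encoder = ShotEncoder()
short_shot = torch.randn(1, 30, 64)  # a 30-frame shot
long_shot = torch.randn(1, 75, 64)   # a 75-frame shot
# Shots of different lengths map to vectors of the same length.
assert encoder(short_shot).shape == encoder(long_shot).shape == (1, 128)
```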
Understandably, the above steps and method can be applied to each of the multiple candidate video segments to obtain the segment feature of each one. Here, assume the candidate video segments are 1, 2, 3, ..., m and the corresponding segment features are R_{c,1}, R_{c,2}, ..., R_{c,i}, ..., R_{c,m}; the following description keeps this notation.
Step 1120: Generate a set feature vector for each segment set based on the segment features.
The set feature vector of each segment set may be generated based on the acquisition order of the candidate video segments in the set and their segment features. For example, vector splicing or concatenation may be used to obtain the set feature vector. Taking segment set c as an example, its set feature vector is R_c = [R_{c,1}, R_{c,2}, ..., R_{c,i}, ..., R_{c,m}]^T, where the superscript T denotes matrix transposition.
Step 1130: Determine the degree of similarity between each segment set and the other segment sets based on a trained discriminant model and the set feature vector of each segment set.
The multimedia system 100 (e.g., the processing device 112) may use the trained discriminant model to judge the similarity of the set feature vectors of any two segment sets, thereby determining the degree of similarity between each segment set and the other segment sets.
Assume there are k segment sets a, b, c, ..., with corresponding set feature vectors R_a, R_b, R_c, ..., R_k. When performing this step, one of the k segment sets may be selected (e.g., segment set c), and the similarity between its set feature vector R_c and the set feature vectors of the other (k-1) segment sets may be computed, yielding all the similarity comparison results, which are taken as the degrees of similarity.
In some embodiments, step 1130 may also use a vector similarity coefficient to judge the degree of similarity between two segment sets. A similarity coefficient measures, via a formula, how similar samples are: the smaller its value, the less similar the individuals and the greater their difference. When the similarity coefficient between two segment sets is large, the two segment sets can be judged to be highly similar. In some embodiments, usable similarity coefficients include, but are not limited to, the simple matching coefficient, the Jaccard similarity coefficient, cosine similarity, adjusted cosine similarity, and the Pearson correlation coefficient.
Step 1140: Determine the combination difference degree between each segment set and the other segment sets based on the degree of similarity.
As noted for step 1130, the smaller the value of the similarity coefficient, the less similar the individuals and the greater their difference; when the similarity coefficient between two segment sets is large, the difference between them can be judged to be small. In some embodiments, when the similarity coefficient is a real number within [0,1], the combination difference degree may be its reciprocal, its negative, 1 minus the similarity coefficient, etc.
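A minimal sketch of steps 1130-1140, assuming cosine similarity as the similarity coefficient and 1 minus the similarity as the difference degree (one of the conversions listed above); the vectors are made-up examples:

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two set feature vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def combination_difference(u: np.ndarray, v: np.ndarray) -> float:
    """Difference degree as 1 - similarity, for similarity in [0, 1]."""
    return 1.0 - cosine_similarity(u, v)

# Set feature vectors formed by concatenating per-segment features.
r_c = np.concatenate([np.ones(4), np.zeros(4)])  # segment set c
r_d = np.concatenate([np.ones(4), np.ones(4)])   # segment set d
print(combination_difference(r_c, r_d))          # about 0.29
```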
In some embodiments, the at least one segment set may also be generated based on the set feature vectors. Specifically, a clustering algorithm may be applied to the multiple set feature vectors to obtain multiple clusters, and the at least one segment set may be generated from the clustering result.
Specifically, assume the number of video segments required for the at least one segment set (i.e., the preset value of the number of shots in the target video) is P, and the number of clusters actually obtained is Q. If the required number P is less than or equal to the actually obtained number of clusters Q, P clusters are selected and one candidate video is chosen from each of them; if the required number P is greater than Q, several candidate videos far from the cluster center may be chosen from each cluster, and P candidate video segments may be randomly drawn from them to form the segment set.
In some embodiments, a density-based clustering algorithm (e.g., the DBSCAN density clustering algorithm) may be used to obtain the multiple clusters. Specifically, the preset value of the required segment sets, i.e., the number of segment sets (P), is determined; further, the neighborhood parameters (ε, MinPts) of the clustering are determined, where ε corresponds to the radius of a cluster in the vector space and MinPts corresponds to the minimum number of samples needed to form a cluster, yielding Q clusters. The neighborhood parameters (ε, MinPts) are adjusted repeatedly and the video feature vectors re-clustered until the resulting number of clusters Q is greater than or equal to the preset value P. At that point, a number of clusters equal to the preset value P is randomly selected, and the segment sets are determined from the selected clusters.
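A sketch of this parameter-adjustment loop using scikit-learn's DBSCAN (the shrink factor, starting parameters, and synthetic data are assumptions, not values from the original):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

def cluster_until_enough(vectors, p, eps=2.0, min_pts=3):
    """Shrink eps until DBSCAN yields at least p clusters; return labels."""
    while eps > 1e-3:
        labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(vectors)
        q = len(set(labels)) - (1 if -1 in labels else 0)  # -1 marks noise
        if q >= p:
            return labels
        eps *= 0.8  # tighten the neighborhood radius and re-cluster
    raise ValueError("could not reach the preset number of clusters P")

vectors, _ = make_blobs(n_samples=60, centers=5, random_state=0)
labels = cluster_until_enough(vectors, p=4)
```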
In some embodiments, the process 1100 may also be used to judge the similarity of multiple generated target videos (e.g., the multiple target videos of the process 800 or 1200) and to recommend target videos based on that similarity.
Fig. 12 is an exemplary flowchart of generating a video according to some embodiments of the present application. In some embodiments, one or more steps of the process 1200 may be stored as instructions in a storage device (e.g., the database 140) and invoked and/or executed by a processing device (e.g., the processing device 112 in the server 110).
In the process 1200, steps 1210 and 1240 are the same as or similar to steps 810 and 840 of the process 800; see Fig. 8 and its related description, which is not repeated here. Steps 1220 and 1230 of the process 1200 differ from the process 800.
Step 1220: Randomly generate multiple candidate segment sets based on the multiple video segments.
The multimedia system (e.g., the processing device 112) may randomly combine multiple video segments to generate candidate segment sets. In some embodiments, the processing device 112 may randomly combine some of the video segments obtained in step 810 to generate M (M being greater than or equal to the target number) candidate segment sets. In some embodiments, the processing device 112 may combine all the video segments obtained in step 810 and either select M (M being greater than or equal to the target number) of the combinations as candidate segment sets or take all the combinations as candidate segment sets. A candidate segment set may include one or more video segments.
In some embodiments, the multiple candidate segment sets satisfy the second preset condition. The second preset condition includes that the combination difference degree of video segments between any two of the candidate segment sets is greater than a preset difference threshold. The preset difference threshold may be any positive integer, for example 1 or 2. For more on the second preset condition, see Figs. 9 and 10 and their related descriptions, which are not repeated here.
Step 1230: Select at least one segment set from the multiple candidate segment sets based on the first preset condition.
The first preset condition may include, but is not limited to, one or a combination of: the total duration of the target video, the number of shots contained in the target video, the target video containing a specified shot, and the target video containing a specific object. For more on the first preset condition, see Fig. 8 and its related description, which is not repeated here. In some embodiments, the processing device may determine the at least one (e.g., the target number of) segment set(s) by evaluating each candidate segment set as a whole; for example, candidate segment sets whose total segment duration and/or number of shots satisfy the first preset condition may be selected as one or more of the target number of segment sets, as shown in the sketch below. In some embodiments, the processing device 112 may determine the target number of segment sets based on the content of the video segments in each candidate segment set. For example, a trained machine learning model may be used to determine whether the video segments in a candidate segment set contain a specific object, and candidate segment sets containing the specific object may be selected based on the result. The input of the trained machine learning model may be a candidate segment set or the video segments in a candidate segment set; correspondingly, the output may be whether the candidate segment set contains the specific object, whether a video segment contains the specific object, etc., which is not limited in this specification.
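An illustrative sketch of the duration and shot-count screening just described (the data layout, field names, and thresholds are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class Clip:
    duration: float  # seconds

def satisfies_first_condition(candidate, max_total=30.0, max_shots=5):
    """Keep a candidate segment set only if its total duration and
    shot count stay within the target video's constraints."""
    total = sum(c.duration for c in candidate)
    return total <= max_total and len(candidate) <= max_shots

candidates = [[Clip(8), Clip(12)], [Clip(20), Clip(15)], [Clip(5)] * 6]
selected = [c for c in candidates if satisfies_first_condition(c)]
assert len(selected) == 1  # only the 20 s, two-shot set passes
```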
Through the processes 800 and 1200, a target video meeting the requirements can be synthesized by one or a combination of operations such as splitting, screening, combining, cropping, and beautifying the initial videos or initial images. In some embodiments, the server 110 (e.g., the processing device 112) may obtain multiple video segments by splitting the initial videos or initial images; select one or more candidate video segments from the multiple video segments based on constraints such as video duration and video content; obtain, by combining the candidate video segments, at least one segment set or a target number of segment sets whose mutual differences satisfy a preset threshold; and generate a target number of target videos based on the segment sets. In some embodiments, the server 110 may randomly combine the video segments obtained by splitting into multiple candidate segment sets whose mutual differences satisfy a preset threshold, select the target number of segment sets from the candidate segment sets based on constraints such as video duration and video content, and generate the target number of target videos from the target number of segment sets. In some embodiments, the video synthesis method and/or system provided in the embodiments of this specification can be used to synthesize promotional videos; for example, based on pre-shot video files promoting a product, a culture, a public welfare cause, or the like, the video files may be processed by splitting, screening, beautifying, synthesizing, and other operations to generate diverse promotional videos.
In some embodiments, the target video usually carries background music or theme music. As music used to set the atmosphere, background or theme music inserted into a video can strengthen emotional expression and give the audience an immersive feeling. Meanwhile, background or theme music has temporal attributes, so elements such as its duration and rhythm can serve as the time parameters in some embodiments of the present application.
Fig. 13 is an exemplary flowchart of adding audio material according to some embodiments of the present application. In some embodiments, one or more steps of the process 1300 may be stored as instructions in a storage device (e.g., the database 140) and invoked and/or executed by a processing device (e.g., the processing device 112 in the server 110).
Step 1310: Obtain initial audio.
In some embodiments, the initial audio (also referred to as the music to be processed) may be imported by the user or selected by the user from the database 140. The initial audio may be of different types, such as warm, soothing, brisk, focused, impassioned, angry, or frightening. The multimedia system 100 (e.g., the processing device 112) may select different types of initial audio based on the subject, theme, video effect, or video configuration information of the target video. For example, warm and soothing initial audio may be chosen for a public welfare promotional video. In some embodiments, if the target video is long, multiple pieces of initial audio may be selected and joined end to end. In other embodiments, if the target video is short, only part of the initial audio (e.g., the chorus) may be used.
Step 1320: Mark the initial audio based on its rhythm to obtain at least one audio segmentation point.
In some embodiments, marking based on rhythm may follow the overall structure of the song, e.g., marking the intro, verse, and chorus, or it may divide the song more finely, e.g., marking segmentation points by drum beats or by the beat. In some embodiments, the marking granularity of the initial audio may be determined by the number of initial images and/or initial videos. Merely by way of example, suppose the amount of image and video material is moderate: after the initial audio is marked by drum beats, some segmentation points cannot be matched to material, so the initial audio may first be marked into intro, verse, and chorus, and then the chorus may be marked by drum beats, yielding a suitable number of segmentation points.
In some embodiments, marking the initial audio based on rhythm may be implemented with software (e.g., Adobe Audition, FL Studio) or plug-ins (e.g., an audio wave plugin based on Vue.js). In some embodiments, automatic marking of the initial audio may be implemented with an audio rhythm analysis algorithm based on signal analysis. It should be noted that audio marking can be handled in many ways, which are not limited in this embodiment.
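One possible realization of signal-analysis-based beat marking, sketched with the open-source librosa library (the original does not name a specific algorithm or library, so this choice is an assumption):

```python
import librosa

def audio_cut_points(path: str) -> list:
    """Return beat times (in seconds) usable as audio segmentation points."""
    y, sr = librosa.load(path)  # decode the audio file
    tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
    return librosa.frames_to_time(beat_frames, sr=sr).tolist()

# e.g. cut_points = audio_cut_points("initial_audio.mp3")
```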
Step 1330: Determine at least one video segmentation point of the target video based at least in part on the video configuration information.
In some embodiments, the video configuration information may be used to determine the video segmentation points of the target video. Different themes, different shots, splicing methods, and the like may be determined from the video configuration information, and the at least one video segmentation point of the target video may be determined accordingly; the splicing time point of each junction can serve as a video segmentation point. For example, for a tire advertisement it may be determined from the video configuration information that the target video involves shots including racing, tire changing, an award ceremony, and a tire display, and the at least one video segmentation point of the target video may be determined based on these different shots as optional cut points.
In some embodiments, a video segment may or may not be added at a given optional cut point; whether material is added depends on the number of optional cut points and the time interval between two optional cut points. Merely by way of example, if no material is added at an optional cut point, the duration of the preceding or following material may be appropriately extended. Since the optional cut points are associated with the rhythm, adding material at them makes the material easy to arrange while providing a good rhythmic pattern and improving the effect of the target video. In some other embodiments, an optional cut point may also serve as the starting point or ending point of the target video.
Step 1340: Match the at least one audio segmentation point with the at least one video segmentation point.
In some embodiments, video segments may be matched to optional cut points according to the interval between two optional cut points. Merely by way of example, suppose the initial audio has a segmentation point at 30 s and the nearest subsequent segmentation point is at 45 s; a video lasting about 15 s may then be inserted at the 30 s segmentation point. In some embodiments, the interval between two cut points may be only a few seconds; a threshold may then be set, and when the interval between two cut points is smaller than that threshold (e.g., 3 s or 5 s), static material or short material may be inserted.
In some embodiments, the video segments vary in length, and some may fail to match any optional cut point because of timing. In such cases the video may be split or its playback speed changed. For example, a 15 s video segment may be split into a 10 s piece and a 5 s piece, and the split pieces matched to optional cut points. As another example, if a video segment lasts 22 s and the interval between two optional cut points is 20 s, the segment may be played back at a higher speed to shorten its duration to 20 s before it is inserted at the optional cut point. It should be noted that, in some embodiments, to guarantee the effect of the target video, a threshold (e.g., ±5% or ±10%) may be set for the speed change of a video segment, and segments whose required speed change exceeds the threshold may instead be handled by splicing.
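A sketch of this speed-change decision (the ±10% threshold follows one of the examples above; the function and its labels are hypothetical):

```python
def fit_clip_to_gap(clip_len: float, gap: float,
                    max_speed_change: float = 0.10) -> str:
    """Decide how to fit a clip into the gap between two cut points."""
    change = abs(clip_len - gap) / clip_len
    if change <= max_speed_change:
        return f"retime by {gap / clip_len:.2f}x"  # small change: vary speed
    return "split/splice instead"                  # large change: re-cut

print(fit_clip_to_gap(22.0, 20.0))  # retime by 0.91x (a 9% change)
print(fit_clip_to_gap(15.0, 10.0))  # split/splice instead (a 33% change)
```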
In some embodiments, a video segment may include an original audio track (e.g., original sound such as background sound or a monologue). According to actual needs, the original track may be removed from the video segment, or it may be retained and played simultaneously in the target video; this application places no limitation on it.
In some embodiments, the process 1300 may be completed by an audio matching model: by inputting the target video into the audio matching model, audio can be added to the target video. The audio matching model can complete the operations of steps 1310-1340 of the process 1300. The audio matching model may be a machine learning model, for example a neural network model, a deep learning model, etc.
In some embodiments, the generated target videos may be delivered on different playback media. Exemplary playback media may include the landscape video interface of a video website, the portrait video interface of a mobile short-video application, a large outdoor electronic screen, and the like. To facilitate delivery of the target video on different playback media, the method provided in this application may further post-process the target video after it is generated. Specifically, the target video is post-processed to satisfy at least one video output condition, the at least one video output condition being related to the playback medium of the target video. The at least one video output condition may include a video size condition, and the post-processing may include cropping the pictures of the target video according to the video size condition.
Fig. 14 is a schematic diagram of target video post-processing (picture cropping) according to some embodiments of the present application.
Post-processing (picture cropping) of the target video may set a corresponding cropping frame based on the video sizes of the target video's different delivery channels. In the prior art, the center of the cropping frame coincides with the center of the picture being cropped, and the picture is then cropped along the cropping frame. Such an approach may cause the main information of the target video to be lost after cropping (e.g., the picture of the advertised product is cropped away). As shown in Fig. 14, the target video post-processing system of the present application converts the size of the video picture by executing the target video cropping method. By cropping the video picture based on the cropping subject information and a preset picture size, the system makes the main information of the cropped target video unlikely to be lost (e.g., the picture of the advertised product is retained).
It can be understood that the to-be-processed video 14 in Fig. 14 may be the target video, or one or more of an initial video, an initial image, and a video segment. In some embodiments, the process 1400 may be used on its own, and the to-be-processed video 14 may be any video that needs processing (resizing).
In some embodiments, the picture cropping system for the to-be-processed video 14 may be integrated into the video generation system 100, and the post-processing of the target video may be implemented by the processing device 112 (also written as the processing terminal).
The processing device 112 may be used to convert the picture size of videos in various application scenarios. For example, the processing device 112 may convert the picture size of a target video originally delivered on an outdoor electronic screen so that it is suitable for delivery on a subway TV screen. As another example, the processing device 112 may adjust the picture size of a video shot with a mobile phone or camera to the preferred size for playback on a video website. As yet another example, the processing device 112 may convert a landscape video into a portrait video.
In a typical application scenario where a landscape video (e.g., with a picture aspect ratio of 16:9) needs to be converted into a portrait video (e.g., with a picture aspect ratio of 9:16), the processing device 112 may obtain the to-be-processed video 14 (the landscape video) and split it into multiple video segments 16 based on the model 12; the processing device 112 may identify the subject 15 in the to-be-processed video 14; the processing device 112 may then configure a cropping frame 13 for the to-be-processed video 14 according to the subject 15 and the preset picture size of the portrait video, crop the pictures 17 of the multiple video segments 16 according to the cropping frame 13, and re-splice the cropped video segments 16 into a complete video to obtain the portrait video.
The model 12 may be included in the processing device 112. Based on the model 12, the processing device 112 obtains the cropping subject 15 and/or the video segments 16. For example, the model 12 may be a machine learning model that identifies the cropping subject 15 in the video segments 16; the cropping subject 15 may be the aforementioned specific object and/or subject, such as a person, a car, or cosmetics.
The model 12 may be stored in the processing device 112, and when the related functions of the model 12 are needed, the processing device 112 invokes the model 12. The model 12 may refer to a collection of methods executed on the processing device 112. These methods may involve a large number of parameters. When the model 12 is used, its parameters may be preset or dynamically adjusted. Some parameters may be obtained through training, and some may be obtained during execution. For specific descriptions of the models involved in this specification, see the relevant parts of this specification.
The processing device 112 may post-process the target video by setting the cropping frame 13. The cropping frame 13 can be understood as a cropping boundary determined according to the target size into which the to-be-processed video is to be converted. When cropping the video picture based on the cropping frame 13, the picture inside the cropping frame 13 may be retained and the picture outside it deleted, so that the target video can be cropped to the size corresponding to each output requirement.
In some embodiments, the processing device 112 may be the playback device used to play the target video to be cropped; the playback device may therefore obtain the target video, crop its pictures according to the device's own playback size, and automatically play the cropped target video. In other embodiments, the processing device 112 may be a smart device capable of performing the picture cropping operation on the target video (e.g., a computer, a mobile phone, a smart wearable device), and the smart device may send the cropped target video to the playback device used to play it.
The target video picture cropping process shown in Fig. 14 may be executed by the processing device 112. For example, the target video picture cropping method may be stored in a storage apparatus (e.g., the storage device or memory of the processing terminal) in the form of programs or instructions, and the process 1400 of the method may include the following steps:
Step 1410: Obtain the target video whose pictures are to be cropped.
In some embodiments, the target video may serve as an advertising video. An advertising video can be understood as video content that uses flexible creativity to lock onto its associated audience, so as to convey information to that audience, market products, and so on. In some embodiments, the target video may be presented to the audience via television, outdoor advertising screens, or web pages or pop-ups on electronic devices (e.g., mobile phones, computers, smart wearable devices). Picture cropping can be understood as cropping the pictures of a video to a preset picture size. In some embodiments, cropping a picture may involve setting a cropping frame based on the main information in the picture and the preset picture size (which can be understood as the target picture size), and cropping the picture along that frame. The main information in a picture may include scenes, people, products, and the like. Step 1410 may directly obtain the target video generated in the preceding steps. In some embodiments, the process 1400 may run independently, i.e., videos from other sources may be obtained as the to-be-processed video 14.
Step 1420: Determine one or more shot segments based on the target video.
In some embodiments, step 1420 may refer to the related description of the aforementioned process 400.
Step 1430: Obtain the cropping subject information of each video segment contained in the target video, the cropping subject information reflecting the specific cropping subject of the video segment and the position information of that subject.
In some embodiments, the cropping subject may also be one of the aforementioned specific object and subject, and the obtaining method may refer to the corresponding description.
In some embodiments, the cropping subject information may be determined by machine learning; correspondingly, step 1430 may include obtaining the cropping subject information in each shot segment (video segment) with a machine learning model. The machine learning model can identify the cropping subject in each shot segment and obtain the cropping subject information at the same time. The cropping subject information represents information related to the cropping subject and is used at least to reflect the subject's position. In some embodiments, the cropping subject information need only include the position information and the name information of the cropping subject. In other embodiments, it may include the position information, size information, name information, category information, etc. of the cropping subject. The position information can be understood as information about where the subject is located in the picture of the target video, for example the coordinates of a reference point. The size information may include the actual size of the cropping subject and the proportion of the target video's picture it occupies. The category information can be understood as the classification of the cropping subject; for example, it may include whether the subject is a person or an object, and it may further indicate whether the subject is a skin care product, a toiletry, a car, etc. Merely by way of example, when the cropping subject is a shampoo, its name information may be 'shampoo' and its category information may be 'toiletries'.
In some embodiments, the machine learning model may be a generative model, a discriminative model, or a deep learning model, for example a deep learning model based on the YOLO family of algorithms, the Faster R-CNN algorithm, or the EfficientDet algorithm. The machine learning model can detect the preset objects of interest in each video frame. Objects of interest may include living things (people, animals, etc.), commodities (cars, decorations, cosmetics, etc.), backgrounds (mountains, roads, bridges, houses, etc.), and so on. Further, for the target video, the objects of interest may be set more specifically, for example as human faces or products. Multiple shot segments can be input into the machine learning model, which outputs data such as the name information and position information of the cropping subjects in each shot segment.
In some embodiments, the machine learning model may be trained on a large number of labeled training samples. Specifically, the labeled training samples are input into the machine learning model, and training proceeds by common methods (e.g., gradient descent) to update the relevant parameters of the model. In some embodiments, the training samples may be shot segments together with the cropping subjects they contain; they may be obtained by reading data from the memory and the database. In some embodiments, the label of a sample may indicate whether an object in the shot segment is a cropping subject: if so, it is marked '1', otherwise '0'. The labels may be obtained by manual annotation, automatic machine annotation, or other means, which are not limited in this embodiment.
In some embodiments, the cropping theme information of the target video may also be obtained. In some embodiments, the cropping theme information may be the aforementioned specific theme of the target video. In some alternative embodiments, the cropping theme information may be keyword information from the title or synopsis of the target video to be processed, the tag information of the target video, user-defined information, or information already stored in a database.
Step 1440: Crop the pictures of the video segments contained in the target video according to the picture size preset by the video size condition and the cropping subject information.
In some embodiments, step 1440 may include cropping the pictures of the shot segments according to the preset picture size and the cropping subject information.
The picture size preset by the video size condition can be understood as the target size to which the pictures of the target video are cropped. The preset picture size may include the target aspect ratio of the picture, and may also include the target width and/or target height. In some embodiments, during picture cropping, the aspect ratio and the specific size of the cropping frame of each video frame are set according to the preset picture size; based on each video frame's cropping frame, the picture outside the frame is cropped away and the picture inside it retained. The user may manually input the preset picture size into the system according to the display size of the target playback terminal, or the system may automatically obtain the optimal display size of the target playback terminal; the data of that optimal size may be stored in the device on which the target video is to be played.
The cropping frame can be understood as the cropping boundary determined by the target picture size of the cropping. The cropping frame may be a rectangle, a parallelogram, a circle, etc.
In some embodiments, to prevent large jitter in the pictures of each video segment, step 1440 may further include the following steps: determining the sizes and initial positions of the cropping frames of several video frames in the video segment according to the cropping subject information and the preset picture size; processing the initial positions of the cropping frames of the several video frames to determine their final positions; and cropping the picture of each video frame of the video segment according to the final positions of the cropping frames, retaining the picture inside each frame. In this embodiment, the final positions of the cropping frames are determined from their initial positions; while ensuring that the cropping subject lies inside the cropping frame, this also reduces the positional differences between the cropping frames of adjacent video frames, preventing the sudden jumps or jitter that excessive positional differences between adjacent cropping frames would cause. The initial position of a cropping frame can be understood as the position preliminarily determined from the cropping subject information and the preset picture size, and the final position as the new position determined after processing the initial position information. In some embodiments, the initial position information may include the initial coordinates of a reference point, and the final position information may include the final coordinates of the reference point.
In some embodiments, when the sizes and initial positions of the cropping frames of several video frames in a video segment are determined according to the cropping subject information and the preset picture size, the relevance of each cropping subject to the theme information may first be determined from the theme information and the cropping subject information, and the initial position and size of the cropping frame may then be determined from the relevance, the cropping subject information, and the preset picture size. For the specific implementation of this embodiment, refer to the relevant description of Fig. 16.
In other embodiments, taking a rectangular cropping frame as an example, when the sizes and initial positions of the cropping frames of several video frames are determined according to the cropping subject information and the preset picture size, the aspect ratio of the cropping frame may first be determined from the preset picture size; the initial position and size of the cropping frame may then be determined based on the position and size of the cropping subject and the aspect ratio; and the cropping frame may then be scaled proportionally according to the preset picture size. For example, if the preset picture size is 800×800, the aspect ratio of the cropping frame is set to 1:1; if the preset picture size is 720×540, the aspect ratio is set to 4:3. After the aspect ratio is set, multiple cropping frames with the same aspect ratio but different sizes may be generated, and the position and size of the cropping frame may be determined from the previously identified cropping subjects and the aspect ratio, so that every cropping subject lies within the cropping frame; the cropping frame and the picture within it are then scaled proportionally to the preset picture size. Specifically, the initial position and size of the cropping frame may be determined based on the area where a cropping frame of the given aspect ratio overlaps the region of the picture occupied by the cropping subject. In addition, the cropping frame and the picture within it are proportionally reduced or enlarged in width and height so that the size of the cropping frame matches the preset picture size, thereby preventing black borders in the cropped picture. Merely by way of example, if the picture size of a video frame is 1024×768 and the preset picture size is 960×960, a 768×768 cropping frame may first be determined; after cropping along it, the picture of the video frame is proportionally enlarged to 960×960.
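A minimal sketch of this aspect-ratio-driven crop computation, assuming a rectangular frame centered on the subject as far as the picture borders allow (the function and its signature are hypothetical):

```python
def crop_box(frame_w, frame_h, subject_cx, subject_cy, target_w, target_h):
    """Largest crop box with the target aspect ratio that fits the frame,
    centered on the subject and clamped to the frame borders."""
    ar = target_w / target_h
    w = min(frame_w, frame_h * ar)  # largest box of that ratio in the frame
    h = w / ar
    x = min(max(subject_cx - w / 2, 0), frame_w - w)  # clamp horizontally
    y = min(max(subject_cy - h / 2, 0), frame_h - h)  # clamp vertically
    return x, y, w, h  # then scale the cropped picture to the target size

# The 1024x768 frame cropped for a 960x960 (1:1) output: a 768x768 box.
print(crop_box(1024, 768, 512, 384, 960, 960))  # (128.0, 0.0, 768.0, 768.0)
```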
在一些实施例中,确定视频片段中若干视频帧的裁剪框对应的最终位置具体可以包括:从视频片段所包含的所有视频帧中挑选出若干个视频帧,判断每对(每两个)间隔预设帧数的视频帧的裁剪框的参考点(如中心点)之间的距离是否小于预设距离;如果小于预设距离的裁剪框的对数大于预设对数,则可以理解为该视频片段中裁剪主体的位置是相对静止的,此时可以求出视频片段所包含的所有视频帧的裁剪框的参考点的平均位置,并基于该平均位置调节各个视频帧的裁剪框的位置;如果小于预设距离的裁剪框的对数小于预设对数,则可以理解为该视频片段中裁剪主体的位置是动态变化的,此时可以根据视频片段所包含的所有视频帧的裁剪框的参考点的位置确定出平滑的轨迹线,并基于该轨迹线调节各个视频帧的裁剪框的位置(例如,使得各个视频帧的裁剪框的参考点都位于该轨迹线上)。在一些实施例中,预设帧数可以是2帧、3帧或5帧等。在另一些实施例中,间隔预设帧数的一对视频帧也可以是相邻的一对视频帧。需要说明的是,本说明书中参考点可以是裁剪框的中心点、矩形的左上顶角点、矩形的右下角等。参考点优选为裁剪框的中心点,以在移动位置裁剪框的位置时减小裁剪框与该裁剪框中各个裁剪主体的相对位置的变化。In some embodiments, determining the final position corresponding to the crop box of several video frames in the video clip may specifically include: selecting several video frames from all the video frames contained in the video clip, and judging the interval of each pair (every two) Whether the distance between the reference points (such as the center point) of the cropping frame of the preset number of video frames is less than the preset distance; if the logarithm of the cropping frame less than the preset distance is greater than the preset logarithm, it can be understood as this The position of the crop subject in the video clip is relatively static. At this time, the average position of the reference point of the crop box of all video frames contained in the video clip can be obtained, and the position of the crop box of each video frame can be adjusted based on the average position; If the logarithm of the crop frame that is less than the preset distance is less than the preset logarithm, it can be understood that the position of the crop subject in the video clip is dynamically changing. At this time, it can be based on the size of the crop frame of all video frames contained in the video clip. The position of the reference point determines a smooth trajectory line, and the position of the crop box of each video frame is adjusted based on the trajectory line (for example, so that the reference point of the crop box of each video frame is located on the trajectory line). In some embodiments, the preset number of frames may be 2 frames, 3 frames, 5 frames, or the like. In other embodiments, a pair of video frames separated by a preset number of frames may also be a pair of adjacent video frames. It should be noted that the reference point in this specification can be the center point of the cropping frame, the top left corner point of the rectangle, the bottom right corner of the rectangle, and so on. The reference point is preferably the center point of the cropping frame, so as to reduce changes in the relative positions of the cropping frame and each crop subject in the cropping frame when the position of the cropping frame is moved.
In other embodiments, adjusting the position of the cropping frame of each video frame in a video segment may specifically include the following steps: smoothing, over time, the initial coordinate information of the reference points of the cropping frames of several video frames of the segment; determining the final coordinate information of the reference points according to the result of the smoothing; and determining the positions of the reference points based on the final coordinate information. In some embodiments, smoothing the initial coordinate information of the reference points over time may be performed by applying linear regression to the reference-point coordinate values. For the specific method and further details of the linear regression, see the description of FIG. 15.
FIG. 15 is a schematic diagram of the smoothing process according to some embodiments of the present application. As shown in FIG. 15, in some embodiments, smoothing the initial coordinate information of the reference points includes performing linear regression to obtain a linear regression equation and its slope. Specifically, linear regression over time may be applied to the initial coordinate information of the reference point of each cropping frame to obtain the regression equation, a fitted line segment (see FIG. 15), and the slope of the equation; the final coordinate information of each reference point is then obtained from the fitted line segment and the slope. Specifically, the absolute value of the slope may be compared with a slope threshold. If the absolute value of the slope is less than the threshold, the position of the crop subject in the segment is considered relatively static, and the midpoint of the fitted line segment is taken as the final position of the reference point of every frame's cropping frame. If the absolute value of the slope is greater than the threshold, the position of the crop subject is considered to be changing dynamically, and the position on the fitted line segment at each time point is taken as the final position of the reference point of the cropping frame of the corresponding video frame. The slope threshold may be set to 0.1, 0.05, 0.01, or the like; those skilled in the art can set it according to the actual characteristics of the target video. For example, for a target video whose theme is cars, where the camera is more likely to be moving, the threshold may be set higher, e.g., 0.1.
Merely by way of example, consider applying linear regression to a video segment consisting of 12 video frames. In this example a landscape video is converted into a portrait video, so the vertical coordinate of the cropping-frame center can be fixed at the central position 0.5, and only the horizontal coordinate needs smoothing. The linear regression proceeds as follows. For the 12 video frames corresponding to time points 1, 2, 3, ..., 12, the initial relative horizontal coordinates of the reference points of the cropping frames are, in order: 0.91, 0.87, 0.83, 0.74, 0.68, 0.61, 0.55, 0.51, 0.43, 0.39, 0.37, 0.34. As shown in FIG. 15, these time points and horizontal coordinates yield 12 data points: (1, 0.91), (2, 0.87), (3, 0.83), (4, 0.74), (5, 0.68), (6, 0.61), (7, 0.55), (8, 0.51), (9, 0.43), (10, 0.39), (11, 0.37), (12, 0.34).
A linear fit over these 12 data points yields the approximate linear equation x = -0.06t + 0.97, with a slope of about -0.06. Since the absolute value of the slope exceeds 0.01, the shot is regarded as tracking a moving subject. Substituting t = 1, 2, ..., 12 into the equation gives the horizontal coordinates of the final positions of the cropping frames in the respective video frames: 0.91, 0.85, 0.79, 0.73, 0.67, 0.61, 0.55, 0.49, 0.43, 0.37, 0.31, 0.25.
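The worked example can be reproduced with an ordinary least-squares fit; the snippet below, assuming NumPy, recovers a slope of roughly -0.06 and fitted positions close to the rounded sequence listed above. Swapping deg=1 for a higher degree yields the polynomial-regression variant described next.

```python
import numpy as np

t = np.arange(1, 13)  # time points 1..12
x = np.array([0.91, 0.87, 0.83, 0.74, 0.68, 0.61,
              0.55, 0.51, 0.43, 0.39, 0.37, 0.34])

slope, intercept = np.polyfit(t, x, deg=1)  # least-squares line
# slope ≈ -0.056 (≈ -0.06 after rounding), intercept ≈ 0.97
assert abs(slope) > 0.01  # exceeds the slope threshold: a tracking shot

x_final = np.polyval([slope, intercept], t)  # smoothed final abscissas
```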
In other embodiments, smoothing the initial coordinate information of the reference points of the cropping frames of the multiple video frames of each segment over time may be performed by applying polynomial regression to the coordinate values. Specifically, polynomial regression over time may be applied to the reference-point coordinates of the cropping frames to obtain a fitted curve; the position on the curve at each time point is then taken as the final position of the reference point of the cropping frame of the corresponding video frame.
FIG. 16 is a flowchart for determining the size and position of the cropping frame of each video frame according to some embodiments of the present application. In some embodiments, one or more steps of the process 1600 may be stored as instructions in a storage device (e.g., the database 140) and invoked and/or executed by a processing device (e.g., the processing device 112 in the server 110).
Step 1610: determine, according to the theme information of the target video and the crop subject information, the relevance between each of one or more crop subjects in the crop subject information and the theme information.
The theme information may include a specific theme of the target video. In some embodiments, the relevance between a crop subject and the theme information can be used to indicate how closely the two are associated: the stronger the association, the higher the relevance. Merely by way of example, the relevance between "steering wheel cover" and "car interior" is greater than the relevance between "car door" and "car interior", and the relevance between "car door" and "car interior" is greater than the relevance between "hand cream" and "car interior".
In some embodiments, the relevance between a crop subject and the theme information may be obtained based on explanatory texts of the two. An explanatory text is a textual description of the crop subject or of the theme information. For example, the explanatory text of "car interior" might read: "Car interior mainly refers to the automotive products used in interior refitting, covering all aspects of the inside of a car; steering wheel covers, seat cushions, floor mats, car perfumes, pendants, interior ornaments, storage boxes, and so on are all car interior products." As another example, the explanatory text of "steering wheel cover" might read: "A steering wheel cover is a sleeve fitted over the steering wheel; it is highly decorative." The explanatory texts of crop subjects and theme information may be pre-stored in the system, or retrieved from the Internet in real time based on the names of the crop subjects and the theme information.
In some embodiments, representation vectors of the explanatory texts may be obtained using a text embedding model such as word2vec, and the relevance between the crop subject and the theme information obtained based on the distance between the representation vectors: the smaller the distance, the higher the relevance. For example, the relevance between the crop subject "steering wheel cover" and the theme information "car interior" can be obtained by computing the vector distance between their explanatory texts.
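As an illustration of this relevance computation, here is a hedged Python sketch assuming gensim word2vec vectors: text vectors are formed by averaging word vectors (one simple choice among many), and cosine similarity serves as the inverse of vector distance. The tokenization and the model path are placeholders.

```python
import numpy as np
from gensim.models import KeyedVectors  # assumed embedding backend

def text_vector(wv: KeyedVectors, text: str) -> np.ndarray:
    """Average the word vectors of the tokens in an explanatory text
    (whitespace tokenization is a simplification)."""
    vecs = [wv[w] for w in text.split() if w in wv]
    return np.mean(vecs, axis=0)

def relevance(wv: KeyedVectors, subject_text: str, theme_text: str) -> float:
    """Cosine similarity of the two text vectors: smaller vector
    distance (higher similarity) means higher relevance."""
    a, b = text_vector(wv, subject_text), text_vector(wv, theme_text)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# wv = KeyedVectors.load("word2vec.kv")  # pretrained vectors (path assumed)
# relevance(wv, steering_wheel_cover_text, car_interior_text)
```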
In some embodiments, crop subjects may also be confirmed via candidate crop subjects. Candidate crop subjects are analogous to the candidate subjects described above: they may be input by a user, or determined from the video segments by machine learning. The method of determining crop subjects from candidate crop subjects can refer to the method, described in the foregoing process, of determining specific subjects based on candidate subjects; that is, a machine learning model is used to obtain the candidate crop subjects in each video segment, and the one or more specific crop subjects are then selected from the candidate crop subjects according to the theme information of the target video.
Step 1620: determine, according to the preset picture size and the specific crop subjects, multiple candidate cropping frames corresponding to at least one video frame.
In some embodiments, at least one candidate cropping frame may be set in each video frame according to the preset picture size and the specific crop subjects. In a video frame that contains no crop subject, a single candidate cropping frame may be set, centered by default. In a video frame containing at least one crop subject, multiple candidate cropping frames may be set; their reference-point positions and/or sizes differ, while their aspect ratios are the same.
Step 1630: score the at least one candidate cropping frame according to the crop subject information and the relevance.
In some embodiments, based on the relevance between each crop subject within a candidate cropping frame and the theme of the target video, each crop subject may first be scored, and the score of the candidate cropping frame then computed. Specifically, the relevance between each crop subject and the video theme may be used as a weight, multiplied by the score of the corresponding crop subject, and the products summed to obtain the score of each candidate cropping frame. In some embodiments, the score of each crop subject may be the ratio of the area occupied by that subject to the total area of the video frame. Merely by way of example, suppose the theme of a video is "personal care products", and a candidate cropping frame in one frame of the video fully contains the crop subjects shampoo 1, shampoo 2, and face 1. The relevances of shampoo 1, shampoo 2, and face 1 to "personal care products" are 0.86, 0.86, and 0.45, respectively, and their crop subject scores are 0.35, 0.1, and 0.1, respectively. The score of the candidate cropping frame is then 0.86×0.35 + 0.86×0.1 + 0.45×0.1 = 0.432.
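The weighted-sum scoring reduces to a few lines; this sketch simply reproduces the shampoo example, with the relevances and area-ratio scores taken from the text.

```python
def box_score(subjects):
    """subjects: (relevance_to_theme, area_ratio_score) per crop
    subject fully contained in the candidate box."""
    return sum(rel * area for rel, area in subjects)

# Shampoo example: relevances 0.86, 0.86, 0.45; area-ratio scores 0.35, 0.1, 0.1
score = box_score([(0.86, 0.35), (0.86, 0.10), (0.45, 0.10)])
assert round(score, 3) == 0.432
```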
Step 1640: determine the size and position of the cropping frame of the at least one video frame based on the scoring results of the candidate cropping frames.
In some embodiments, the size and position of the cropping frames may be determined from the candidate scores by taking the highest-scoring candidate cropping frame in the video segment as the reference: the position of its reference point is used as the final reference-point position for the cropping frames of all video frames in the segment, and its size is used as the cropping-frame size for all video frames in the segment. In other embodiments, the size and position may instead be determined by selecting, for each video frame, the Y highest-scoring candidate cropping frames, computing the average position of their reference points, using that average position as the position of the frame's cropping frame, and using the size of the highest-scoring candidate as the size of the frame's cropping frame. Y may be 3, 4, 5, 8, or the like; those skilled in the art can choose Y according to the number of candidate cropping frames in each video frame.
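A sketch of the second strategy (top-Y averaging), assuming each candidate is represented as a (score, reference point, size) tuple; the data layout is an assumption for illustration.

```python
import numpy as np

def frame_crop_box(candidates, top_y=3):
    """candidates: list of (score, (cx, cy), (w, h)) per candidate box.
    Average the reference points of the Y best-scoring candidates and
    keep the size of the single best one."""
    best = sorted(candidates, key=lambda c: c[0], reverse=True)[:top_y]
    position = tuple(np.mean([c[1] for c in best], axis=0))
    size = best[0][2]  # size of the highest-scoring candidate
    return position, size
```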
In the embodiment shown in FIG. 16, the size and position of the cropping frame are determined based on the crop subject information and the relevance between the crop subjects and the theme information of the target video, so that the crop subjects are preserved in the cropped picture and the cropped target video loses as little principal information (information related to the theme of the target video) as possible.
In some embodiments, the method 1400 for cropping the pictures of a target video may further include step 1450: splicing the cropped shot segments into a new target video in a predetermined order. The predetermined order may be the order determined by the sequence information in the video configuration information, or a new splicing order set by the user.
In some embodiments, the above cropping of the target video may also be applied to other cropping-related parts of this application; for example, in the normalization of video segments, the initial videos and/or initial images may each be cropped to the same size.
In some embodiments, this application may generate multiple target videos for delivery and optimize the video generation algorithm of this application according to the feedback results of the different videos.
In some embodiments, the multiple target videos may be generated for different audiences and, correspondingly, delivered to specific audience groups, where the audience refers to the group to which a target video is delivered. Specifically, a specific audience group may be a group of a particular age, gender, or with particular behavioral characteristics. The particular age may refer to a younger audience (for example, if 80% of the users of a delivery platform are aged 15-40, the user base is considered young), a middle-aged audience, an older audience, and so on. Gender may be characterized by a male-to-female ratio (e.g., 1:3). Behavioral characteristics may include browsing habits, shopping preferences, and the like, for example, which kinds of videos the users of the delivery platform prefer to browse. The delivery duration may be short (e.g., less than one week) or long (e.g., more than one week). The delivery period may be a peak period (the 618 promotion, the Double Eleven period), an idle period (a non-promotion period), and so on. The platform placement may concern, for example, whether the video is recommended on the home page. The characteristics of the delivery platform may include its type (an online platform (an app) or an offline platform (airports, subways, etc.); the kind of app (video playback, music playback, learning, etc.)) and its traffic characteristics (e.g., high traffic).
FIG. 17 is a flowchart of generating a target video according to the audience group of the target video, as shown in some embodiments of the present application. In some embodiments, one or more steps of the process 1700 may be stored as instructions in a storage device (e.g., the database 140) and invoked and/or executed by a processing device (e.g., the processing device 112 in the server 110).
Step 1710: obtain the audience acceptance of the multiple video segments.
In some embodiments, a video segment may have a clear affinity with a particular audience; for example, a segment featuring Ultraman may be popular with children while being of little interest to middle-aged viewers. Audience acceptance is an indicator describing the affinity between a video segment and a video audience. The audience acceptance of a video segment can be obtained by inputting the segment, or the elements of the segment (e.g., its tag IDs and tag values), into a trained machine learning model. The machine learning model may be determined according to the delivery effect of each video segment for different audiences; for a description of delivery effect, see step 1730 below.
Step 1720: for a specific audience group, determine, from the multiple video segments according to the corresponding audience-characteristic conditions, candidate segments whose audience acceptance is above a threshold, for use in generating the target video.
In some embodiments, audience acceptance may take the concrete form of a specific tag of the video segment and its corresponding tag value. For example, the tag ID for acceptance by female audiences may be #61 and the tag ID for acceptance by male audiences may be #62, with the corresponding tag value holding the actual acceptance.
In some embodiments, audience acceptance may be described quantitatively. For example, the tag value corresponding to audience acceptance may be any real number in [-1, 1], where a positive value indicates liking, a negative value indicates dislike, 0 indicates no interest, and the larger the absolute value, the stronger the attitude. Correspondingly, the threshold in step 1720 may be a value within the tag-value range that indicates liking, such as 0.5.
In some embodiments, audience acceptance may be described qualitatively; for example, the tag value corresponding to audience acceptance may be -3, -2, -1, 0, 1, 2, or 3. In practice this can be expressed as a four-bit binary tag, where the first two bits of the tag value encode dislike, the last two bits encode liking, and a tag value of 0 indicates no interest. Correspondingly, the threshold may be a value indicating liking, for example, the first two bits of the tag value being 00 and the last two bits being greater than 01.
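One possible reading of this four-bit encoding, sketched in Python; the exact bit layout is an assumption inferred from the description (high two bits carry the dislike level, low two bits the like level).

```python
def encode_acceptance(level: int) -> int:
    """level in -3..3; 0 means no interest. High two bits carry the
    dislike level, low two bits the like level."""
    if level == 0:
        return 0b0000
    return (abs(level) << 2) if level < 0 else level

def decode_acceptance(bits: int) -> int:
    dislike, like = bits >> 2, bits & 0b11
    return -dislike if dislike else like

assert encode_acceptance(-3) == 0b1100 and decode_acceptance(0b1100) == -3
# "Last two bits greater than 01" as a liking threshold: like level >= 2
likes_enough = lambda bits: (bits >> 2) == 0 and (bits & 0b11) > 0b01
```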
In some embodiments, the first preset condition described in the process 800 may include the audience acceptance of step 1720 being above the threshold; correspondingly, the implementation of step 1720 can refer to the description of step 820 in the process 800.
Step 1730: obtain delivery-effect feedback for the target video, and adjust at least one of the audience-characteristic conditions or the audience acceptance according to the feedback.
In some embodiments, the delivery effect may be determined from at least one of the click-through rate, completion rate, number of replays, number of viewers, and other related indicators of the target video. For example, a high completion rate suggests that the audience likes the video, and a high number of replays indicates a stronger degree of liking.
In some embodiments, the delivery effect of each video segment of the target video can be determined. For example, how much users like a segment can be determined from the ratio of the segment's switch-away count (the number of users who stop playback of this video at that segment and switch to the next video) to its play count: the lower the ratio, the more users like the segment. As another example, the delivery effect of a segment can be determined from the parts skipped by users who finished or replayed the video, where a skipped segment performs poorly.
In some embodiments, after the delivery effect is obtained, it may be fed back into the machine learning model of step 1710 to re-determine the audience-characteristic conditions or the audience acceptance. In some embodiments, the corresponding tag values of the video segments may be modified directly according to the delivery effect of each segment.
In some embodiments, the delivery effect of a target video may be estimated based on the previously determined delivery effects and the machine learning model, and target videos whose estimated delivery effect exceeds a preset value may be delivered. Specifically, when the type of the target video is a creative advertisement, the estimated effect data of the target video may be determined based on the element effect parameters of at least one advertising element of the creative advertisement.
An advertising element can be understood as a building block of an advertising creative, and may specifically include the main elements of a shot (the aforementioned specific objects, subjects, crop subjects, etc.), decorative elements (background images, models, copy, trademarks, product images, and/or promotional logos, etc.), and element presentation modes (animation actions, AE templates, etc.).
In some embodiments, in addition to the delivery effect of step 1730, the delivery-effect data of an advertising creative may also include the click-through rate, exposure, conversion rate, return on investment, and the like. The click-through rate can be understood as the click-through performance of an online advertising creative, i.e., the actual number of clicks the advertisement receives. Exposure can be understood as the number of times the creative is displayed per unit time; for example, if an online medium has 3000 views per day and an advertisement exclusively occupies a slot, its daily exposure is 3000, whereas if the slot rotates three advertisements, each advertisement's daily exposure is 3000/3. The conversion rate can be understood as the ratio of the number of conversions following a click (e.g., purchases, paid orders) to the number of clicks. The return on investment can be understood as the ratio of the return on the advertising to its cost. The delivery-effect data of placed advertising creatives may cover a preset period, such as one day, one week, one month, or one quarter. In some embodiments, the delivery-effect data may also include how the creative's performance trends with time, season, and the characteristics of the platform's audience.
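For concreteness, the ratio-style indicators can be computed as below. This is a generic sketch of the definitions in the text (conversion rate as conversions over clicks, ROI as return over cost, exposure split across a rotating slot), not a prescribed implementation.

```python
def conversion_rate(conversions: int, clicks: int) -> float:
    # e.g. paid orders attributed to the ad, divided by ad clicks
    return conversions / clicks

def return_on_investment(ad_return: float, ad_cost: float) -> float:
    return ad_return / ad_cost

# Rotation example from the text: a slot with 3000 daily views showing
# three ads in turn gives each ad 3000 / 3 = 1000 daily exposures.
per_ad_daily_exposure = 3000 / 3
```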
FIG. 18 is a flowchart of determining the estimated effect data of a target video according to some embodiments of the present application. In some embodiments, one or more steps of the process 1800 may be stored as instructions in a storage device (e.g., the database 140) and invoked and/or executed by a processing device (e.g., the processing device 112 in the server 110).
Step 1810: obtain an advertising-element effect prediction model.
In some embodiments, the advertising-element effect prediction model may be a model capable of scoring each advertising element. It may be a trained machine learning model: several placed advertising creatives containing the element, together with the delivery-effect data of those placed creatives, are input into the model as features, and the model outputs a score for the advertising element.
In some embodiments, the advertising-element effect prediction model may be chosen according to the aforementioned specific audience group and the specific theme of the target video. For example, if the specific audience group is women and the specific theme of the target video is a mouthwash advertisement, the corresponding model is an advertising-element effect prediction model for cleaning and/or personal care daily products aimed at women.
Step 1820: input the advertising elements, each marked with at least one element tag, into the advertising-element effect prediction model, and determine the element effect parameters of the advertising elements.
The at least one element tag includes the relationship between the advertising element and the creative advertisement, where the element effect parameter of an advertising element may refer to the element's contribution over a certain period.
In some embodiments, the element tags may correspond to the elements of step 230 described above, where the relationship between an advertising element and the creative advertisement may be the relationship between a specific object, subject, or crop subject in a video segment and the specific theme of the target video, which may be expressed as a relevance. In some embodiments, the relationship between the advertising element and the creative advertisement may further include the element's relationship with specific venues (e.g., subway stations, railway stations, giant downtown screens) and specific times (e.g., Double Eleven, Valentine's Day).
In some embodiments, step 1820 may first determine several placed advertising creatives that contain the advertising element. Based on the delivery-effect data of the placed advertisements, the delivery-effect data of the several placed creatives containing the element are determined; the server then takes the mean, median, cumulative sum, or weighted cumulative sum of those data as the element effect parameter of the advertising element. For example, suppose the placed creatives containing advertising element a are M1, M2, M3, and M4, whose average click counts are 1000, 2000, 500, and 3500, respectively. Taking the cumulative sum of the average click counts of M1, M2, M3, and M4 as the element effect parameter of element a gives 7000 clicks. The element effect parameter may thus include data obtained by statistical computation over the delivery-effect data of the placed creatives containing the element, and intuitively reflects the element's contribution over a period.
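The aggregation itself is straightforward; this sketch implements the four variants named above (mean, median, cumulative sum, weighted cumulative sum) and checks the 7000-click example. The function name and interface are illustrative.

```python
import statistics

def element_effect(values, mode="sum", weights=None):
    """Aggregate the delivery-effect data of all placed creatives that
    contain the element; the text allows mean, median, cumulative sum,
    or weighted cumulative sum."""
    if mode == "mean":
        return statistics.mean(values)
    if mode == "median":
        return statistics.median(values)
    if mode == "weighted":
        return sum(w * v for w, v in zip(weights, values))
    return sum(values)

# Element a appears in M1-M4 with average clicks 1000, 2000, 500, 3500:
assert element_effect([1000, 2000, 500, 3500]) == 7000  # cumulative sum
```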
In some embodiments, the element effect parameters of the advertising elements may be determined through delivery experiments. Since some advertising elements are reused across creatives, an orthogonal experimental design can be used to compute the smallest set of creatives that covers the largest number of advertising elements, so that the most element data can be obtained by delivering the fewest creatives. Further, the advertising elements may first be classified, with the constraint that all elements of certain specified categories (e.g., models or copy) must be delivered; the orthogonal design then computes the smallest creative set containing all advertising elements of the specified categories.
Step 1830: based on the element effect parameters of the at least one advertising element, determine which of the advertising elements meet expectations, where an element meets expectations if its element effect parameter is greater than a parameter threshold.
Step 1840: determine the proportion of advertising elements meeting expectations among the at least one advertising element of the creative advertisement.
In some embodiments, the proportion of advertising elements meeting expectations among the at least one advertising element of the creative advertisement may be the proportion of the target video occupied by each expected advertising element of the segment combination.
Step 1850: determine the estimated effect data of the target video based on the proportion.
In some embodiments, the target video may be evaluated based on the delivery-effect data of each advertising element in the target video and the delivery-effect data of the several placed creatives containing those elements, yielding the estimated effect data of the target video.
FIG. 19 is an exemplary flowchart of a model training process according to some embodiments of the present application. In some embodiments, one or more steps of the process 1900 may be stored as instructions in a storage device (e.g., the database 140) and invoked and/or executed by a processing device (e.g., the processing device 112 in the server 110). In some embodiments, the process 1900 may be used to train the first initial model.
Step 1910: obtain a first training set.
The first training set is the set of training samples used to train the first initial model. It includes multiple video pairs, each containing the image features corresponding to a first sample video, the image features corresponding to a second sample video, and the label value corresponding to the two sample videos. The image features of the first and second sample videos may be obtained by feature extraction.
The label value of a video pair reflects the degree of similarity between the first sample video and the second sample video. The label values in the sample set may be annotated manually, or the video pairs may be annotated automatically by a corresponding machine learning model; for example, the similarity of each video pair may be obtained from a trained classifier model.
In some embodiments, the first training set may be obtained from an image collector such as a camera or a smartphone, or from the terminal device 130. In some embodiments, it may be read directly from a storage system holding a large number of images. In some embodiments, the first training set may also be obtained in any other manner, which is not limited in this embodiment.
The first initial model can be understood as an untrained or not-yet-fully-trained neural network model. It may be or include the initial model corresponding to the trained feature extraction model and/or discriminant model described in the process 1100. Each layer of the initial model may be set with initial parameters, which are adjusted continually during training until training is complete.
Step 1920: based on the first training set, train the first initial model through multiple rounds of iteration to generate a trained first neural network model.
The first neural network model may be or include the trained feature extraction model and/or discriminant model described in the process 1100. Each round of iteration further includes the following steps.
Step 1921: process the image features corresponding to the first sample video of a video pair with the updated first feature extraction model to obtain the corresponding first segment feature.
Step 1922: process the image features corresponding to the second sample video of the same video pair with the updated second feature extraction model to obtain the second segment feature.
Step 1923: process the first segment feature and the second segment feature with the updated discriminant model to generate a discrimination result, where the discrimination result reflects the degree of similarity between the first segment feature and the second segment feature.
Step 1924: based on the discrimination result and the label value, decide whether to perform the next round of iteration or to finalize the trained first neural network model.
After the discriminant model obtains the discrimination result through forward propagation, a loss function can be constructed based on the discrimination result and the sample label, and the model parameters updated by backpropagating the loss. In some embodiments, the training-sample label may be denoted y_1, the discrimination result denoted ŷ_1, and the computed loss value denoted Loss_1. In some embodiments, different loss functions may be chosen according to the type of model, e.g., a mean-squared-error loss or a cross-entropy loss, which this specification does not limit. Illustratively, with the mean-squared-error form, Loss_1 = (y_1 - ŷ_1)^2.
In some embodiments, a gradient backpropagation algorithm may be used to update the model parameters. The backpropagation algorithm compares the prediction for a given training sample with its label data to determine the update magnitude for each weight of the model. That is, backpropagation determines how the loss function changes with respect to each weight (also called the gradient or error derivative), written ∂Loss_1/∂w. Further, the gradient backpropagation algorithm propagates the value of the loss function from the output layer back through the hidden layers to the input layer, layer by layer, determining in turn the correction values (or gradients) of the model parameters of each layer. The correction values of each layer's parameters comprise multiple matrix elements (e.g., gradient elements) in one-to-one correspondence with the parameters; each gradient element reflects a parameter's correction direction (increase or decrease) and correction amount. In one or more embodiments of this specification, after the discriminant model finishes backpropagating its gradients, it propagates them onward to the first segment-feature extraction model and the second segment-feature extraction model, layer by layer, to complete one round of iterative update. Compared with training each model separately, jointly training the first segment-feature extraction model, the second segment-feature extraction model, and the discriminant model with a unified loss function is more efficient.
In some embodiments, whether to perform the next round of iteration or to finalize the trained first neural network model can be judged based on the discrimination result and the label value. The criteria may be whether the number of iterations has reached a preset count, whether the updated model meets a preset performance-index threshold, or whether an instruction to terminate training has been received. If the next iteration is needed, it proceeds from the model updated in the current iteration; in other words, the model updated in the current iteration serves as the initial model of the next round. If no further iteration is needed, the model updated in the current iteration is taken as the final trained model.
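The joint update can be sketched as follows, assuming PyTorch and placeholder single-layer networks; the architectures, the 512/128-dimensional features, and the choice of binary cross-entropy (one of the two losses named above) are assumptions for illustration, not a fixed design of this application.

```python
import torch
import torch.nn as nn

# Placeholder single-layer networks; real architectures are not fixed here
extractor_a = nn.Sequential(nn.Linear(512, 128), nn.ReLU())   # 1st segment-feature model
extractor_b = nn.Sequential(nn.Linear(512, 128), nn.ReLU())   # 2nd segment-feature model
discriminator = nn.Sequential(nn.Linear(256, 1), nn.Sigmoid())

params = (list(extractor_a.parameters()) + list(extractor_b.parameters())
          + list(discriminator.parameters()))
optimizer = torch.optim.Adam(params, lr=1e-3)
loss_fn = nn.BCELoss()  # a cross-entropy loss; MSELoss is the other option named

def train_step(feat_a, feat_b, label):
    """One iteration: forward through both extractors and the
    discriminator, then backpropagate one unified loss through all
    three models. `label` is the pair's similarity in [0, 1]."""
    s1 = extractor_a(feat_a)                      # first segment feature
    s2 = extractor_b(feat_b)                      # second segment feature
    y_hat = discriminator(torch.cat([s1, s2], dim=-1)).squeeze(-1)  # ŷ_1
    loss = loss_fn(y_hat, label)                  # Loss_1(y_1, ŷ_1)
    optimizer.zero_grad()
    loss.backward()   # gradients flow from the discriminator back into
    optimizer.step()  # both feature extractors, layer by layer
    return loss.item()
```

In each iteration the single Loss_1 backpropagates through the discriminator and both extractors, matching the unified-loss joint training described above.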
FIGS. 20A-20E are schematic diagrams of a video synthesis system according to some embodiments of the present application.
As shown in FIG. 20A, the multimedia system 2000 may include an acquisition module 2010, a configuration module 2020, and a generation module 2030.
The acquisition module 2010 may be used to obtain multiple video segments.
The configuration module 2020 may be used to obtain video configuration information.
The generation module 2030 may be used to generate a target video based on the at least some of the video segments and the video configuration information. In some embodiments, the generation module 2030 may also be referred to as a target-video generation module.
In some embodiments, step 210 may be implemented by the acquisition module 2010, step 220 by the configuration module 2020, and step 230 by the generation module 2030.
As shown in FIG. 20B, the acquisition module 2010 may further include a media acquisition module 2011, a segmentation module 2013, and a material processing module 2015, where the material processing module 2015 further includes a video processing module 2015a and a picture processing module 2015b. The configuration module 2020 may further include a recognition module 2021, which may also be referred to as a subject acquisition module. The generation module 2030 may further include a screening module 2031, a combination module 2033, and a video synthesis module 2035. In some embodiments, the multimedia system 2000 may further include a post-processing module 2040, which may include a cropping module 2041 and an effect estimation module 2043.
The media acquisition module 2011 may be used to obtain initial videos or initial images so as to implement steps 310, 610, 810, and other steps related to initial videos or initial images. The media acquisition module 2011 may also be used to obtain initial audio to implement step 1310.
The segmentation module 2013 may be used to segment video files by shot, so as to implement step 320, steps 420-440, step 1330, and other steps related to shot segmentation. The segmentation module 2013 may also be used to determine the edit points of audio files, so as to implement step 1320.
The material processing module 2015 may be used to generate video segments, to process segmented video material (e.g., rendering, beautification, applying video templates), and to combine materials of different types, for example, merging an audio file with a video file in step 1340. The material processing module 2015 may specifically include the video processing module 2015a for processing video material corresponding to the initial videos and the picture processing module 2015b for processing image material corresponding to the initial images.
The configuration module 2020 may further include the recognition module 2021 for recognizing subjects. The recognition module 2021 may be combined with other modules, with the subject concerned changed accordingly; for example, when combined with the cropping module 2041, the corresponding subject may be a crop subject.
The screening module 2031 may specifically be used to screen video files according to conditions; for example, in step 820 candidate video segments may be screened out of the segments according to a first preset condition. The screening module 2031 may also determine the initial images, initial videos, and video segments related to the target video according to whether they contain the subject.
The combination module 2033 may specifically be used to generate segment combinations from the candidate video segments, and may also be used to determine, according to a second preset condition, the segment set used to generate the target video.
The video synthesis module 2035 may specifically generate the target video from the segment set.
The post-processing module 2040 may be used to further process the target video after it is generated, for example, to deliver the target video to a specific audience.
The cropping module 2041 may be used to modify the size of a video file, for example, modifying the size of the target video according to the size of the playback medium.
The effect estimation module 2043 is used to estimate the playback performance of the target video.
This application does not restrict the hierarchical or containment relationships among the modules; for example, the media acquisition module 2011 may also be regarded as the acquisition module 2010. It can be understood that the above modules may be combined as needed to implement different methods. For example, as shown in FIG. 20C, the media acquisition module 2011, the subject acquisition module (i.e., the recognition module 2021), the video processing module 2015a, the picture processing module 2015b, and the target-video generation module (i.e., the generation module 2030) may be combined to generate a video from video material and image material. As another example, as shown in FIG. 20D, the acquisition module 2010, the splitting module (i.e., the segmentation module 2013), the screening module 2031, the combination module 2033, and the video synthesis module 2035 may be combined to split and recombine video files. As yet another example, as shown in FIG. 20E, the acquisition module 2010, the segmentation module 2013, the recognition module 2021, and the cropping module 2041 may be combined to achieve precise cropping of a specific video file.
For more detailed descriptions of the functions implemented by the modules, see elsewhere in this specification; they are not repeated here. It should be noted that the above description of the video generation system and its modules is only for convenience of description and does not limit this specification to the scope of the cited embodiments.
The basic concepts have been described above. Obviously, for those skilled in the art, the above detailed disclosure is only an example and does not constitute a limitation of this specification. Although not explicitly stated here, those skilled in the art may make various modifications, improvements, and corrections to this specification. Such modifications, improvements, and corrections are suggested by this specification, and thus still fall within the spirit and scope of the exemplary embodiments of this specification.
Meanwhile, this specification uses specific words to describe its embodiments. Terms such as "one embodiment", "an embodiment", and/or "some embodiments" refer to a feature, structure, or characteristic related to at least one embodiment of this specification. Therefore, it should be emphasized and noted that "an embodiment", "one embodiment", or "an alternative embodiment" mentioned two or more times in different places in this specification does not necessarily refer to the same embodiment. In addition, certain features, structures, or characteristics of one or more embodiments of this specification may be combined as appropriate.
In addition, those skilled in the art will understand that aspects of this specification may be illustrated and described in several patentable categories or situations, including any new and useful process, machine, product, or composition of matter, or any new and useful improvement thereof. Accordingly, aspects of this specification may be executed entirely by hardware, entirely by software (including firmware, resident software, microcode, etc.), or by a combination of hardware and software. The above hardware or software may all be referred to as a "data block", "module", "engine", "unit", "component", or "system". In addition, aspects of this specification may be embodied as a computer product located on one or more computer-readable media, the product including computer-readable program code.
A computer storage medium may contain a propagated data signal containing computer program code, for example on baseband or as part of a carrier wave. The propagated signal may take various forms, including electromagnetic forms, optical forms, etc., or a suitable combination. The computer storage medium may be any computer-readable medium other than a computer-readable storage medium that, by being connected to an instruction execution system, apparatus, or device, can communicate, propagate, or transmit a program for use. Program code on a computer storage medium may be transmitted via any suitable medium, including radio, cable, fiber-optic cable, RF, or similar media, or any combination of the above.
The computer program code required for the operation of the various parts of this specification may be written in any one or more programming languages, including object-oriented programming languages such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB.NET, and Python; conventional procedural programming languages such as C, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP, and ABAP; dynamic programming languages such as Python, Ruby, and Groovy; or other programming languages. The program code may run entirely on the user's computer, as an independent software package on the user's computer, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or processing device. In the latter case, the remote computer may be connected to the user's computer via any form of network, such as a local area network (LAN) or a wide area network (WAN), or to an external computer (e.g., via the Internet), or in a cloud computing environment, or used as a service such as software as a service (SaaS).
此外,除非权利要求中明确说明,本说明书所述处理元素和序列的顺序、数字字母的使用、或其他名称的使用,并非用于限定本说明书流程和方法的顺序。尽管上述披露中通过各种示例讨论了一些目前认为有用的发明实施例,但应当理解的是,该类细节仅起到说明的目的,附加的权利要求并不仅限于披露的实施例,相反,权利要求旨在覆盖所有符合本说明书实施例实质和范围的修正和等价组合。例如,虽然以上所描述的系统组件可以通过硬件设备实现,但是也可以只通过软件的解决方案得以实现,如在现有的处理设备或移动设备上安装所描述的系统。In addition, unless explicitly stated in the claims, the order of processing elements and sequences, the use of numbers and letters, or the use of other names in this specification are not used to limit the order of the processes and methods in this specification. Although the foregoing disclosure uses various examples to discuss some embodiments of the invention that are currently considered useful, it should be understood that such details are only for illustrative purposes, and the appended claims are not limited to the disclosed embodiments. On the contrary, the rights The requirements are intended to cover all modifications and equivalent combinations that conform to the essence and scope of the embodiments of this specification. For example, although the system components described above can be implemented by hardware devices, they can also be implemented only by software solutions, such as installing the described system on existing processing devices or mobile devices.
同理,应当注意的是,为了简化本说明书披露的表述,从而帮助对一个或多个发明实施例的理解,前文对本说明书实施例的描述中,有时会将多种特征归并至一个实施例、附图或对其的描述中。但是,这种披露方 法并不意味着本说明书对象所需要的特征比权利要求中提及的特征多。实际上,实施例的特征要少于上述披露的单个实施例的全部特征。For the same reason, it should be noted that, in order to simplify the expressions disclosed in this specification and help the understanding of one or more embodiments of the invention, in the foregoing description of the embodiments of this specification, multiple features are sometimes combined into one embodiment. In the drawings or its description. However, this method of disclosure does not mean that the subject of this specification requires more features than those mentioned in the claims. In fact, the features of the embodiment are less than all the features of the single embodiment disclosed above.
一些实施例中使用了描述成分、属性数量的数字,应当理解的是,此类用于实施例描述的数字,在一些示例中使用了修饰词“大约”、“近似”或“大体上”来修饰。除非另外说明,“大约”、“近似”或“大体上”表明所述数字允许有±20%的变化。相应地,在一些实施例中,说明书和权利要求中使用的数值参数均为近似值,该近似值根据个别实施例所需特点可以发生改变。在一些实施例中,数值参数应考虑规定的有效数位并采用一般位数保留的方法。尽管本说明书一些实施例中用于确认其范围广度的数值域和参数为近似值,在具体实施例中,此类数值的设定在可行范围内尽可能精确。In some embodiments, numbers describing the number of ingredients and attributes are used. It should be understood that such numbers used in the description of the embodiments use the modifiers "approximately", "approximately" or "substantially" in some examples. Retouch. Unless otherwise stated, "approximately", "approximately" or "substantially" indicates that the number is allowed to vary by ±20%. Correspondingly, in some embodiments, the numerical parameters used in the description and claims are approximate values, and the approximate values can be changed according to the required characteristics of individual embodiments. In some embodiments, the numerical parameter should consider the prescribed effective digits and adopt the method of general digit retention. Although the numerical ranges and parameters used to confirm the breadth of the ranges in some embodiments of this specification are approximate values, in specific embodiments, the setting of such numerical values is as accurate as possible within the feasible range.
针对本说明书引用的每个专利、专利申请、专利申请公开物和其他材料,如文章、书籍、说明书、出版物、文档等,特此将其全部内容并入本说明书作为参考。与本说明书内容不一致或产生冲突的申请历史文件除外,对本说明书权利要求最广范围有限制的文件(当前或之后附加于本说明书中的)也除外。需要说明的是,如果本说明书附属材料中的描述、定义、和/或术语的使用与本说明书所述内容有不一致或冲突的地方,以本说明书的描述、定义和/或术语的使用为准。For each patent, patent application, patent application publication and other materials cited in this specification, such as articles, books, specifications, publications, documents, etc., the entire contents are hereby incorporated into this specification as a reference. Except the application history documents that are inconsistent or conflict with the content of this specification, and the documents that restrict the broadest scope of the claims of this specification (currently or later attached to this specification) are also excluded. It should be noted that if there is any inconsistency or conflict between the description, definition, and/or use of terms in the auxiliary materials of this manual and the content of this manual, the description, definition and/or use of terms in this manual shall prevail. .
最后,应当理解的是,本说明书中所述实施例仅用以说明本说明书实施例的原则。其他的变形也可能属于本说明书的范围。因此,作为示例而非限制,本说明书实施例的替代配置可视为与本说明书的教导一致。相应地,本说明书的实施例不仅限于本说明书明确介绍和描述的实施例。Finally, it should be understood that the embodiments described in this specification are only used to illustrate the principles of the embodiments of this specification. Other variations may also fall within the scope of this specification. Therefore, as an example and not a limitation, the alternative configuration of the embodiment of the present specification can be regarded as consistent with the teaching of the present specification. Accordingly, the embodiments of this specification are not limited to the embodiments explicitly introduced and described in this specification.

Claims (83)

  1. A system for generating a video, comprising:
    at least one storage medium storing a set of instructions; and
    at least one processor configured to communicate with the at least one storage medium, wherein when executing the set of instructions, the at least one processor is configured to perform one or more operations, the operations including:
    obtaining a plurality of video segments;
    obtaining video configuration information, the video configuration information including one or more configuration features of at least some video segments of the plurality of video segments, the configuration features including at least one of a content feature or an arrangement feature; and
    generating a target video based on the at least some video segments and the video configuration information.
  2. The system of claim 1, wherein the obtaining a plurality of video segments includes:
    obtaining at least one of an initial image or an initial video; and
    editing the initial image or the initial video to obtain the plurality of video segments.
  3. The system of claim 2, wherein the editing the initial image or the initial video to obtain the plurality of video segments includes:
    obtaining features of each pair of adjacent images or video frames in the initial image or the initial video;
    determining a similarity of each pair of adjacent images or video frames;
    identifying segment boundaries based on the similarity of each pair of adjacent images or video frames; and
    segmenting the initial image or the initial video based on the segment boundaries to obtain the plurality of video segments.
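The frame-similarity segmentation recited in claim 3 can be pictured with a short sketch. The following Python is a minimal, non-authoritative illustration rather than the claimed method itself: the HSV-histogram correlation measure, the 0.6 threshold, and the function name are assumptions introduced only for illustration.

```python
import cv2

def find_segment_boundaries(video_path, threshold=0.6):
    """Mark a segment boundary wherever adjacent-frame similarity drops.

    Histogram correlation is one possible similarity measure; the claim
    does not prescribe a specific one.
    """
    cap = cv2.VideoCapture(video_path)
    boundaries, prev_hist, index = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [50, 60], [0, 180, 0, 256])
        cv2.normalize(hist, hist)
        if prev_hist is not None:
            similarity = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL)
            if similarity < threshold:  # low similarity suggests a shot change
                boundaries.append(index)
        prev_hist, index = hist, index + 1
    cap.release()
    return boundaries
```

Frames between consecutive boundaries would then form the individual video segments.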
  4. The system of claim 1, wherein each video segment of the plurality of video segments is a shot segment.
  5. The system of claim 2, wherein the editing the initial image or the initial video to obtain the plurality of video segments includes:
    determining subject information of the initial image or the initial video, the subject information including at least a subject and a subject position; and
    editing the initial image or the initial video based on the subject information to obtain the plurality of video segments.
  6. The system of claim 5, wherein the determining the subject information of the initial image or the initial video includes:
    obtaining a subject information determination model; and
    determining the subject information by inputting the initial image or the initial video into the subject information determination model.
  7. The system of claim 5 or claim 6, wherein the editing the initial image or the initial video based on the subject information includes:
    identifying, based on the subject information, an outer contour of the subject in the initial image or the initial video; and
    cropping or scaling the initial image or the initial video according to the outer contour of the subject.
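One way to picture the contour-based cropping of claim 7 is the sketch below, which assumes a binary subject mask is already available (for example, from the subject information determination model of claim 6). The mask source, the padding value, and the function name are illustrative assumptions, not the claimed procedure.

```python
import cv2

def crop_to_subject(frame, subject_mask, padding=10):
    """Crop a frame to the bounding box of the subject's outer contour.

    `subject_mask` is assumed to be a binary (0/255) mask of the subject.
    Uses the OpenCV 4.x findContours signature.
    """
    contours, _ = cv2.findContours(subject_mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return frame  # no subject found; leave the frame unchanged
    largest = max(contours, key=cv2.contourArea)  # outer contour of the subject
    x, y, w, h = cv2.boundingRect(largest)
    top, left = max(y - padding, 0), max(x - padding, 0)
    bottom = min(y + h + padding, frame.shape[0])
    right = min(x + w + padding, frame.shape[1])
    return frame[top:bottom, left:right]
```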
  8. The system of claim 1, wherein the video configuration information includes a first preset condition and a second preset condition.
  9. The system of claim 8, wherein the generating a target video based on the at least some video segments and the video configuration information includes:
    obtaining one or more candidate video segments from the plurality of video segments based on the first preset condition;
    grouping the one or more candidate video segments to determine at least one segment set; and
    generating a target video based on each segment set of the at least one segment set.
  10. The system of claim 9, wherein the first preset condition relates to at least one of a plurality of elements, the plurality of elements including the target video containing a specific object, the target video containing a specific subject, a total duration of the target video, a count of shot frames contained in the target video, specific shot frames contained in the target video, a count of overlaps of a specific subject in the target video, or a focusing time of a specific subject in the target video.
  11. The system of claim 10, wherein the first preset condition includes that a value of the at least one element is greater than a corresponding threshold.
  12. The system of claim 10 or claim 11, wherein the first preset condition further includes an element constraint between two or more specific elements of the plurality of elements.
  13. The system of any one of claims 10-12, wherein the first preset condition includes a binding condition of shot frames in the target video, the binding condition reflecting an association between at least two specific shot frames in the target video, and the obtaining one or more candidate video segments from the plurality of video segments based on the first preset condition includes:
    determining, from the plurality of video segments, video segments containing specified shot frames; and
    combining the video segments containing the specified shot frames based on the binding condition to serve as one candidate video segment.
  14. The system of claim 9, wherein the at least one segment set includes two or more segment sets, the two or more segment sets satisfying the second preset condition, the second preset condition relating to a combination difference degree of candidate video segments between the two or more segment sets.
  15. The system of claim 14, wherein the grouping the one or more candidate video segments to determine the at least one segment set includes:
    determining a combination difference degree of candidate video segments between each segment set of the two or more segment sets and other segment sets; and
    designating, as the at least one segment set, a segment set whose combination difference degree with other segment sets is higher than a preset threshold.
  16. The system of claim 15, wherein the determining the combination difference degree of candidate video segments between each segment set of the two or more segment sets and other segment sets includes:
    assigning an identification character to each of the one or more candidate video segments;
    determining, based on the identification characters of the one or more candidate video segments, character strings corresponding to the segment set and other segment sets; and
    determining an edit distance between the character strings corresponding to the segment set and other segment sets as the combination difference degree of candidate video segments between the segment set and other segment sets.
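The edit-distance comparison of claim 16 reduces to the standard Levenshtein distance once each candidate segment is mapped to a single identification character. The sketch below is a plain dynamic-programming implementation under that assumption; the function name is illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein distance between two strings of segment IDs."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

# Example: segment sets encoded as "ABC" and "ACD" (one character per
# candidate segment) differ by an edit distance of 2, which would serve
# as their combination difference degree.
assert edit_distance("ABC", "ACD") == 2
```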
  17. The system of claim 15, wherein the determining the combination difference degree of candidate video segments between each segment set of the two or more segment sets and other segment sets includes:
    generating a segment feature corresponding to each candidate video segment based on a trained feature extraction model and the candidate video segments in the two or more segment sets;
    generating a set feature vector corresponding to each segment set based on the segment features;
    determining a degree of similarity between each segment set and other segment sets based on a trained discriminant model and the set feature vector corresponding to each segment set; and
    determining the combination difference degree between each segment set and other segment sets based on the degree of similarity.
  18. The system of claim 17, wherein the determining the degree of similarity between each segment set and other segment sets based on the trained discriminant model and the set feature vector corresponding to each segment set includes:
    clustering the set feature vectors based on a clustering algorithm to obtain a plurality of set clusters.
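A minimal sketch of the clustering step in claim 18 follows. K-means and the cluster count are illustrative choices, since the claim only requires some clustering algorithm over the set feature vectors.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_set_vectors(set_vectors: np.ndarray, n_clusters: int = 5) -> np.ndarray:
    """Cluster set feature vectors into set clusters.

    Sets landing in the same cluster are treated as similar, and sets in
    different clusters as more different. `set_vectors` has shape
    (num_sets, feature_dim).
    """
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    return km.fit_predict(set_vectors)  # element i = cluster index of set i
```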
  19. The system of claim 17, wherein the feature extraction model is a sequence-based machine learning model, and the generating the segment feature corresponding to each candidate video segment based on the trained feature extraction model and the candidate video segments in the two or more segment sets includes:
    obtaining a plurality of video frames included in each candidate video segment;
    determining one or more image features corresponding to each video frame; and
    determining the segment feature corresponding to the candidate video segment by processing, based on the trained feature extraction model, the image features in the plurality of video frames and interrelationships among the image features in the plurality of video frames.
  20. The system of claim 19, wherein the image features corresponding to a video frame include at least one of shape information of an object in the video frame, positional relationship information among a plurality of objects in the video frame, color information of an object in the video frame, a degree of completeness of an object in the video frame, or a brightness in the video frame.
  21. The system of claim 8, wherein the generating a target video based on the at least some video segments and the video configuration information includes:
    generating a plurality of candidate segment sets based on the plurality of video segments, the plurality of candidate segment sets satisfying the second preset condition;
    selecting at least one segment set from the plurality of candidate segment sets based on the first preset condition; and
    generating a target video based on each segment set of the at least one segment set.
  22. The system of claim 9 or claim 21, wherein the video configuration information further includes sequence information, and the generating a target video based on each segment set of the at least one segment set includes:
    sorting and combining, based on the sequence information, the candidate video segments in each segment set to generate a target video.
  23. The system of claim 1, wherein the video configuration information further includes beautification parameters, the beautification parameters including at least one of a filter parameter, an animation parameter, or a layout parameter.
  24. The system of claim 1, wherein the operations further include:
    obtaining, based on the video configuration information, a text layer, a background layer, or a decoration layer, and loading parameters; and
    determining a layout of the text layer, the background layer, and the decoration layer in the target video according to the loading parameters.
  25. The system of claim 1, wherein the operations further include:
    normalizing the plurality of video segments.
  26. The system of claim 1, wherein the operations further include:
    obtaining initial audio;
    marking the initial audio based on rhythm to obtain at least one audio cut point;
    determining at least one video cut point of the target video based at least in part on the video configuration information;
    matching the at least one audio cut point with the at least one video cut point; and
    synthesizing the cut audio with the target video based on a matching result.
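The rhythm-based marking and matching of claim 26 could look like the sketch below, which uses beat tracking from the librosa library as one possible rhythm detector. The library choice and the nearest-beat matching rule are assumptions, not the claimed procedure.

```python
import librosa
import numpy as np

def audio_cut_points(audio_path: str) -> np.ndarray:
    """Return beat times (in seconds) usable as candidate audio cut points."""
    y, sr = librosa.load(audio_path)
    _, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
    return librosa.frames_to_time(beat_frames, sr=sr)

def match_cut_points(audio_points: np.ndarray, video_points: list) -> list:
    """Match each video cut point to its nearest audio cut point."""
    return [float(audio_points[np.argmin(np.abs(audio_points - t))])
            for t in video_points]
```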
  27. The system of claim 1, wherein the operations further include:
    post-processing the target video to satisfy at least one video output condition, the at least one video output condition relating to a playback medium of the target video.
  28. The system of claim 27, wherein the at least one video output condition includes a video size condition, and the post-processing the target video includes:
    cropping pictures of the target video according to the video size condition.
  29. The system of claim 28, wherein the cropping pictures of the target video according to the video size condition includes:
    obtaining cropping subject information of each video segment included in the target video, the cropping subject information reflecting a specific cropping subject of the video segment and position information of the specific cropping subject; and
    cropping pictures of each video segment included in the target video according to a preset picture size corresponding to the video size condition and the cropping subject information.
  30. The system of claim 29, wherein the cropping pictures of each video segment included in the target video according to the preset picture size corresponding to the video size condition and the cropping subject information includes:
    for each video segment included in the target video,
    determining a size and an initial position of a crop box of at least one video frame in the video segment according to the cropping subject information and the preset picture size;
    processing the initial position of the crop box of the at least one video frame to determine a final position corresponding to the crop box of the at least one video frame; and
    cropping the picture of each video frame included in the video segment according to the final position of the crop box to retain the picture within the crop box.
  31. The system of claim 30, wherein the processing the initial position of the crop box of the at least one video frame to determine the final position corresponding to the crop box of the at least one video frame includes:
    smoothing over time initial coordinate information of a reference point of the crop box of the at least one video frame of the video segment;
    determining final coordinate information of the reference point according to a result of the smoothing; and
    determining a position of the reference point based on the final coordinate information.
  32. The system of claim 31, wherein the smoothing the initial coordinate information of the reference point includes:
    performing linear regression on the initial coordinates of the reference point to obtain a linear regression equation and a slope thereof.
  33. The system of claim 32, wherein the determining the final coordinate information of the reference point according to the result of the smoothing includes:
    comparing an absolute value of the slope with a slope threshold;
    in response to the absolute value of the slope being less than the slope threshold,
    taking a position of a midpoint of a trend line of the linear regression equation as the final position of the reference point of the crop box of each video frame; and
    in response to the absolute value of the slope being greater than or equal to the slope threshold,
    taking a position corresponding to a time point of each video frame on the trend line of the linear regression equation as the final position of the reference point of the crop box of that video frame.
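Claims 31-33 describe smoothing crop-box reference coordinates by linear regression with a slope test. The sketch below illustrates one coordinate axis under that description; the slope threshold value and the function name are assumptions.

```python
import numpy as np

def smooth_reference_coords(times, coords, slope_threshold=0.5):
    """Smooth per-frame crop-box reference coordinates via linear regression.

    A slope below the threshold is treated as a static subject, so every
    frame receives the trend line's midpoint; otherwise each frame follows
    the trend line at its own time point.
    """
    times = np.asarray(times, dtype=float)
    coords = np.asarray(coords, dtype=float)
    slope, intercept = np.polyfit(times, coords, deg=1)  # trend line fit
    if abs(slope) < slope_threshold:
        mid_t = (times[0] + times[-1]) / 2.0
        return np.full_like(coords, slope * mid_t + intercept)
    return slope * times + intercept
```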
  34. The system of claim 30, wherein the determining the size and the initial position of the crop box of at least one video frame in the video segment according to the cropping subject information and the preset picture size includes:
    determining, according to theme information of the target video and the cropping subject information, a correlation between one or more specific cropping subjects in the cropping subject information and the theme information;
    determining at least one candidate crop box corresponding to the at least one video frame according to the preset picture size and the specific cropping subjects;
    scoring the at least one candidate crop box according to the cropping subject information and the correlation; and
    determining the size and the position of the crop box of the at least one video frame based on a scoring result of the candidate crop boxes.
  35. The system of claim 34, wherein the obtaining the cropping subject information of each video segment included in the target video includes:
    obtaining candidate cropping subjects in each video segment using a machine learning model; and
    selecting the one or more specific cropping subjects from the candidate cropping subjects according to the theme information of the target video.
  36. The system of claim 1, wherein the operations further include:
    delivering the target video to a specific audience group.
  37. The system of claim 36, wherein the specific audience group meets a specific demographic condition, and the generating a target video based on the at least some video segments and the video configuration information includes:
    obtaining audience acceptance of the plurality of video segments; and
    for the specific audience group, determining, from the plurality of video segments according to the corresponding demographic condition, candidate segments whose audience acceptance is higher than a threshold, for generating the target video.
  38. The system of claim 37, wherein the operations further include:
    obtaining delivery effect feedback of the target video, and adjusting at least one of the demographic condition or the audience acceptance according to the delivery effect feedback.
  39. The system of claim 38, wherein the delivery effect feedback relates to at least one of a completion rate, a replay count, or a viewer count of the target video.
  40. The system of any one of claims 36-39, wherein the target video includes a creative advertisement, and the operations further include:
    determining estimated effect data of the target video based on an element effect parameter of at least one advertisement element of the creative advertisement.
  41. The system of claim 40, wherein the determining the estimated effect data of the target video based on the element effect parameter of at least one advertisement element of the creative advertisement includes:
    obtaining an advertisement element effect prediction model;
    inputting the advertisement element marked with at least one element tag into the advertisement element effect prediction model to determine the element effect parameter of the advertisement element, the at least one element tag including a relationship between the advertisement element and the creative advertisement;
    determining, in the at least one advertisement element based on the element effect parameter of the at least one advertisement element, advertisement elements that meet expectations, the element effect parameter of an advertisement element that meets expectations being greater than a parameter threshold;
    determining a proportion of the advertisement elements that meet expectations to the at least one advertisement element in the creative advertisement; and
    determining the estimated effect data of the target video based on the proportion.
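The proportion step of claim 41 is simple arithmetic once per-element effect parameters exist. The sketch below assumes scores in [0, 1] and uses the proportion itself as the estimated effect data, which is one possible mapping, not the one the claim fixes; the threshold and function name are likewise assumptions.

```python
def estimated_effect(element_scores, param_threshold=0.7):
    """Estimate creative-level effect from per-element effect parameters.

    Elements scoring above the threshold "meet expectations"; their
    proportion among all elements serves as the estimated effect data.
    """
    if not element_scores:
        return 0.0
    qualified = [s for s in element_scores if s > param_threshold]
    return len(qualified) / len(element_scores)

# Example: 3 of 4 elements exceed the threshold, so the estimate is 0.75.
print(estimated_effect([0.9, 0.8, 0.75, 0.4]))
```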
  42. A method for generating a video, the method being implemented on a processing device including at least one memory and at least one processor, the method comprising:
    obtaining a plurality of video segments;
    obtaining video configuration information, the video configuration information including one or more configuration features of at least some video segments of the plurality of video segments, the configuration features including at least one of a content feature or an arrangement feature; and
    generating a target video based on the at least some video segments and the video configuration information.
  43. The method of claim 42, wherein the obtaining a plurality of video segments includes:
    obtaining at least one of an initial image or an initial video; and
    editing the initial image or the initial video to obtain the plurality of video segments.
  44. The method of claim 43, wherein the editing the initial image or the initial video to obtain the plurality of video segments includes:
    obtaining features of each pair of adjacent images or video frames in the initial image or the initial video;
    determining a similarity of each pair of adjacent images or video frames;
    identifying segment boundaries based on the similarity of each pair of adjacent images or video frames; and
    segmenting the initial image or the initial video based on the segment boundaries to obtain the plurality of video segments.
  45. The method of claim 42, wherein each video segment of the plurality of video segments is a shot segment.
  46. The method of claim 43, wherein the editing the initial image or the initial video to obtain the plurality of video segments includes:
    determining subject information of the initial image or the initial video, the subject information including at least a subject and a subject position; and
    editing the initial image or the initial video based on the subject information to obtain the plurality of video segments.
  47. The method of claim 46, wherein the determining the subject information of the initial image or the initial video includes:
    obtaining a subject information determination model; and
    determining the subject information by inputting the initial image or the initial video into the subject information determination model.
  48. The method of claim 46 or claim 47, wherein the editing the initial image or the initial video based on the subject information includes:
    identifying, based on the subject information, an outer contour of the subject in the initial image or the initial video; and
    cropping or scaling the initial image or the initial video according to the outer contour of the subject.
  49. The method of claim 42, wherein the video configuration information includes a first preset condition and a second preset condition.
  50. The method of claim 49, wherein the generating a target video based on the at least some video segments and the video configuration information includes:
    obtaining one or more candidate video segments from the plurality of video segments based on the first preset condition;
    grouping the one or more candidate video segments to determine at least one segment set; and
    generating a target video based on each segment set of the at least one segment set.
  51. The method of claim 50, wherein the first preset condition relates to at least one of a plurality of elements, the plurality of elements including the target video containing a specific object, the target video containing a specific subject, a total duration of the target video, a count of shot frames contained in the target video, specific shot frames contained in the target video, a count of overlaps of a specific subject in the target video, or a focusing time of a specific subject in the target video.
  52. The method of claim 51, wherein the first preset condition includes that a value of the at least one element is greater than a corresponding threshold.
  53. The method of claim 51 or claim 52, wherein the first preset condition further includes an element constraint between two or more specific elements of the plurality of elements.
  54. The method of any one of claims 51-53, wherein the first preset condition includes a binding condition of shot frames in the target video, the binding condition reflecting an association between at least two specific shot frames in the target video, and the obtaining one or more candidate video segments from the plurality of video segments based on the first preset condition includes:
    determining, from the plurality of video segments, video segments containing specified shot frames; and
    combining the video segments containing the specified shot frames based on the binding condition to serve as one candidate video segment.
  55. The method of claim 50, wherein the at least one segment set includes two or more segment sets, the two or more segment sets satisfying the second preset condition, the second preset condition relating to a combination difference degree of candidate video segments between the two or more segment sets.
  56. The method of claim 55, wherein the grouping the one or more candidate video segments to determine the at least one segment set includes:
    determining a combination difference degree of candidate video segments between each segment set of the two or more segment sets and other segment sets; and
    designating, as the at least one segment set, a segment set whose combination difference degree with other segment sets is higher than a preset threshold.
  57. The method of claim 56, wherein the determining the combination difference degree of candidate video segments between each segment set of the two or more segment sets and other segment sets includes:
    assigning an identification character to each of the one or more candidate video segments;
    determining, based on the identification characters of the one or more candidate video segments, character strings corresponding to the segment set and other segment sets; and
    determining an edit distance between the character strings corresponding to the segment set and other segment sets as the combination difference degree of candidate video segments between the segment set and other segment sets.
  58. The method of claim 56, wherein the determining the combination difference degree of candidate video segments between each segment set of the two or more segment sets and other segment sets includes:
    generating a segment feature corresponding to each candidate video segment based on a trained feature extraction model and the candidate video segments in the two or more segment sets;
    generating a set feature vector corresponding to each segment set based on the segment features;
    determining a degree of similarity between each segment set and other segment sets based on a trained discriminant model and the set feature vector corresponding to each segment set; and
    determining the combination difference degree between each segment set and other segment sets based on the degree of similarity.
  59. The method of claim 58, wherein the determining the degree of similarity between each segment set and other segment sets based on the trained discriminant model and the set feature vector corresponding to each segment set includes:
    clustering the set feature vectors based on a clustering algorithm to obtain a plurality of set clusters.
  60. The method of claim 58, wherein the feature extraction model is a sequence-based machine learning model, and the generating the segment feature corresponding to each candidate video segment based on the trained feature extraction model and the candidate video segments in the two or more segment sets includes:
    obtaining a plurality of video frames included in each candidate video segment;
    determining one or more image features corresponding to each video frame; and
    determining the segment feature corresponding to the candidate video segment by processing, based on the trained feature extraction model, the image features in the plurality of video frames and interrelationships among the image features in the plurality of video frames.
  61. The method of claim 60, wherein the image features corresponding to a video frame include at least one of shape information of an object in the video frame, positional relationship information among a plurality of objects in the video frame, color information of an object in the video frame, a degree of completeness of an object in the video frame, or a brightness in the video frame.
  62. The method of claim 49, wherein the generating a target video based on the at least some video segments and the video configuration information includes:
    generating a plurality of candidate segment sets based on the plurality of video segments, the plurality of candidate segment sets satisfying the second preset condition;
    selecting at least one segment set from the plurality of candidate segment sets based on the first preset condition; and
    generating a target video based on each segment set of the at least one segment set.
  63. The method of claim 50 or claim 62, wherein the video configuration information further includes sequence information, and the generating a target video based on each segment set of the at least one segment set includes:
    sorting and combining, based on the sequence information, the candidate video segments in each segment set to generate a target video.
  64. The method of claim 42, wherein the video configuration information further includes beautification parameters, the beautification parameters including at least one of a filter parameter, an animation parameter, or a layout parameter.
  65. The method of claim 42, further comprising:
    obtaining, based on the video configuration information, a text layer, a background layer, or a decoration layer, and loading parameters; and
    determining a layout of the text layer, the background layer, and the decoration layer in the target video according to the loading parameters.
  66. The method of claim 42, further comprising:
    normalizing the plurality of video segments.
  67. The method of claim 42, further comprising:
    obtaining initial audio;
    marking the initial audio based on rhythm to obtain at least one audio cut point;
    determining at least one video cut point of the target video based at least in part on the video configuration information;
    matching the at least one audio cut point with the at least one video cut point; and
    synthesizing the cut audio with the target video based on a matching result.
  68. The method of claim 42, further comprising:
    post-processing the target video to satisfy at least one video output condition, the at least one video output condition relating to a playback medium of the target video.
  69. The method of claim 68, wherein the at least one video output condition includes a video size condition, and the post-processing the target video includes:
    cropping pictures of the target video according to the video size condition.
  70. The method of claim 69, wherein the cropping pictures of the target video according to the video size condition includes:
    obtaining cropping subject information of each video segment included in the target video, the cropping subject information reflecting a specific cropping subject of the video segment and position information of the specific cropping subject; and
    cropping pictures of each video segment included in the target video according to a preset picture size corresponding to the video size condition and the cropping subject information.
  71. The method of claim 70, wherein the cropping pictures of each video segment included in the target video according to the preset picture size corresponding to the video size condition and the cropping subject information includes:
    for each video segment included in the target video,
    determining a size and an initial position of a crop box of at least one video frame in the video segment according to the cropping subject information and the preset picture size;
    processing the initial position of the crop box of the at least one video frame to determine a final position corresponding to the crop box of the at least one video frame; and
    cropping the picture of each video frame included in the video segment according to the final position of the crop box to retain the picture within the crop box.
  72. The method of claim 71, wherein the processing the initial position of the crop box of the at least one video frame to determine the final position corresponding to the crop box of the at least one video frame includes:
    smoothing over time initial coordinate information of a reference point of the crop box of the at least one video frame of the video segment;
    determining final coordinate information of the reference point according to a result of the smoothing; and
    determining a position of the reference point based on the final coordinate information.
  73. The method of claim 72, wherein the smoothing the initial coordinate information of the reference point includes:
    performing linear regression on the initial coordinates of the reference point to obtain a linear regression equation and a slope thereof.
  74. The method of claim 73, wherein the determining the final coordinate information of the reference point according to the result of the smoothing includes:
    comparing an absolute value of the slope with a slope threshold;
    in response to the absolute value of the slope being less than the slope threshold,
    taking a position of a midpoint of a trend line of the linear regression equation as the final position of the reference point of the crop box of each video frame; and
    in response to the absolute value of the slope being greater than or equal to the slope threshold,
    taking a position corresponding to a time point of each video frame on the trend line of the linear regression equation as the final position of the reference point of the crop box of that video frame.
  75. The method of claim 71, wherein the determining the size and the initial position of the crop box of at least one video frame in the video segment according to the cropping subject information and the preset picture size includes:
    determining, according to theme information of the target video and the cropping subject information, a correlation between one or more specific cropping subjects in the cropping subject information and the theme information;
    determining at least one candidate crop box corresponding to the at least one video frame according to the preset picture size and the specific cropping subjects;
    scoring the at least one candidate crop box according to the cropping subject information and the correlation; and
    determining the size and the position of the crop box of the at least one video frame based on a scoring result of the candidate crop boxes.
  76. The method of claim 75, wherein the obtaining the cropping subject information of each video segment included in the target video includes:
    obtaining candidate cropping subjects in each video segment using a machine learning model; and
    selecting the one or more specific cropping subjects from the candidate cropping subjects according to the theme information of the target video.
  77. The method of claim 42, further comprising:
    delivering the target video to a specific audience group.
  78. The method of claim 77, wherein the specific audience group meets a specific demographic condition, and the generating a target video based on the at least some video segments and the video configuration information includes:
    obtaining audience acceptance of the plurality of video segments; and
    for the specific audience group, determining, from the plurality of video segments according to the corresponding demographic condition, candidate segments whose audience acceptance is higher than a threshold, for generating the target video.
  79. The method of claim 78, further comprising:
    obtaining delivery effect feedback of the target video, and adjusting at least one of the demographic condition or the audience acceptance according to the delivery effect feedback.
  80. The method of claim 79, wherein the delivery effect feedback relates to at least one of a completion rate, a replay count, or a viewer count of the target video.
  81. The method of any one of claims 77-80, wherein the target video includes a creative advertisement, the method further comprising:
    determining estimated effect data of the target video based on an element effect parameter of at least one advertisement element of the creative advertisement.
  82. The method of claim 81, wherein the determining the estimated effect data of the target video based on the element effect parameter of at least one advertisement element of the creative advertisement includes:
    obtaining an advertisement element effect prediction model;
    inputting the advertisement element marked with at least one element tag into the advertisement element effect prediction model to determine the element effect parameter of the advertisement element, the at least one element tag including a relationship between the advertisement element and the creative advertisement;
    determining, in the at least one advertisement element based on the element effect parameter of the at least one advertisement element, advertisement elements that meet expectations, the element effect parameter of an advertisement element that meets expectations being greater than a parameter threshold;
    determining a proportion of the advertisement elements that meet expectations to the at least one advertisement element in the creative advertisement; and
    determining the estimated effect data of the target video based on the proportion.
  83. A non-transitory computer-readable medium including at least one set of instructions, wherein when executed by one or more processors of a computing device, the at least one set of instructions causes the computing device to perform a method for generating a video, the method comprising:
    obtaining a plurality of video segments;
    obtaining video configuration information, the video configuration information including one or more configuration features of at least some video segments of the plurality of video segments, the configuration features including at least one of a content feature or an arrangement feature; and
    generating a target video based on the at least some video segments and the video configuration information.
PCT/CN2021/101816 2020-06-23 2021-06-23 System and method for generating video WO2021259322A1 (en)

Applications Claiming Priority (8)

Application Number Priority Date Filing Date Title
CN202010578632.1 2020-06-23
CN202010578632.1A CN111815645B (en) 2020-06-23 2020-06-23 Method and system for cutting advertisement video picture
CN202010738213.X 2020-07-28
CN202010738213.XA CN111918146B (en) 2020-07-28 2020-07-28 Video synthesis method and system
CN202010741962.8 2020-07-29
CN202010741962.8A CN111739128B (en) 2020-07-29 2020-07-29 Target video generation method and system
CN202110503297.3A CN112989116B (en) 2021-05-10 2021-05-10 Video recommendation method, system and device
CN202110503297.3 2021-05-10

Publications (1)

Publication Number Publication Date
WO2021259322A1 true WO2021259322A1 (en) 2021-12-30

Family ID: 79282010

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/101816 WO2021259322A1 (en) 2020-06-23 2021-06-23 System and method for generating video

Country Status (1)

Country Link
WO (1) WO2021259322A1 (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110392281A (en) * 2018-04-20 2019-10-29 腾讯科技(深圳)有限公司 Image synthesizing method, device, computer equipment and storage medium
CN110913271A (en) * 2019-11-29 2020-03-24 Oppo广东移动通信有限公司 Video processing method, mobile terminal and non-volatile computer-readable storage medium
CN111815645A (en) * 2020-06-23 2020-10-23 广州筷子信息科技有限公司 Method and system for cutting advertisement video picture
CN111918146A (en) * 2020-07-28 2020-11-10 广州筷子信息科技有限公司 Video synthesis method and system
CN111739128A (en) * 2020-07-29 2020-10-02 广州筷子信息科技有限公司 Target video generation method and system
CN112989116A (en) * 2021-05-10 2021-06-18 广州筷子信息科技有限公司 Video recommendation method, system and device

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114520931A (en) * 2021-12-31 2022-05-20 脸萌有限公司 Video generation method and device, electronic equipment and readable storage medium
CN114520931B (en) * 2021-12-31 2024-01-23 脸萌有限公司 Video generation method, device, electronic equipment and readable storage medium
CN114466145A (en) * 2022-01-30 2022-05-10 北京字跳网络技术有限公司 Video processing method, device, equipment and storage medium
CN114466145B (en) * 2022-01-30 2024-04-12 北京字跳网络技术有限公司 Video processing method, device, equipment and storage medium
CN114449346A (en) * 2022-02-14 2022-05-06 腾讯科技(深圳)有限公司 Video processing method, device, equipment and storage medium
CN114449346B (en) * 2022-02-14 2023-08-15 腾讯科技(深圳)有限公司 Video processing method, device, equipment and storage medium
CN114615513B (en) * 2022-03-08 2023-10-20 北京字跳网络技术有限公司 Video data generation method and device, electronic equipment and storage medium
CN114615513A (en) * 2022-03-08 2022-06-10 北京字跳网络技术有限公司 Video data generation method and device, electronic equipment and storage medium
US20230412887A1 (en) * 2022-05-21 2023-12-21 Vmware, Inc. Personalized informational user experiences using visual content
CN115134646A (en) * 2022-08-25 2022-09-30 荣耀终端有限公司 Video editing method and electronic equipment
CN116307218A (en) * 2023-03-27 2023-06-23 松原市邹佳网络科技有限公司 Meta-universe experience user behavior prediction method and system based on artificial intelligence
CN116634233A (en) * 2023-04-12 2023-08-22 北京优贝卡科技有限公司 Media editing method, device, equipment and storage medium
CN116634233B (en) * 2023-04-12 2024-02-09 北京七彩行云数字技术有限公司 Media editing method, device, equipment and storage medium
CN116567350A (en) * 2023-05-19 2023-08-08 上海国威互娱文化科技有限公司 Panoramic video data processing method and system
CN116567350B (en) * 2023-05-19 2024-04-19 上海国威互娱文化科技有限公司 Panoramic video data processing method and system
CN116612060B (en) * 2023-07-19 2023-09-22 腾讯科技(深圳)有限公司 Video information processing method, device and storage medium
CN116612060A (en) * 2023-07-19 2023-08-18 腾讯科技(深圳)有限公司 Video information processing method, device and storage medium
CN117557689A (en) * 2024-01-11 2024-02-13 腾讯科技(深圳)有限公司 Image processing method, device, electronic equipment and storage medium
CN117557689B (en) * 2024-01-11 2024-03-29 腾讯科技(深圳)有限公司 Image processing method, device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
WO2021259322A1 (en) System and method for generating video
US9570107B2 (en) System and method for semi-automatic video editing
US10679063B2 (en) Recognizing salient video events through learning-based multimodal analysis of visual features and audio-based analytics
US20180330152A1 (en) Method for identifying, ordering, and presenting images according to expressions
US9554111B2 (en) System and method for semi-automatic video editing
Amato et al. AI in the media and creative industries
JP5507386B2 (en) Generating video content from image sets
US8948515B2 (en) Method and system for classifying one or more images
KR101348521B1 (en) Personalizing a video
WO2017190639A1 (en) Media information display method, client and server
US8126763B2 (en) Automatic generation of trailers containing product placements
KR102119868B1 (en) System and method for producting promotional media contents
US20080172293A1 (en) Optimization framework for association of advertisements with sequential media
JP2006510113A (en) Method for generating customized photo album pages and prints based on person and gender profiles
CN111739128A (en) Target video generation method and system
US20210117471A1 (en) Method and system for automatically generating a video from an online product representation
JP2020005309A (en) Moving image editing server and program
Zhang et al. A comprehensive survey on computational aesthetic evaluation of visual art images: Metrics and challenges
US11942116B1 (en) Method and system for generating synthetic video advertisements
JP6730757B2 (en) Server and program, video distribution system
WO2023045635A1 (en) Multimedia file subtitle processing method and apparatus, electronic device, computer-readable storage medium, and computer program product
Colombo et al. Retrieval of commercials by semantic content: The semiotic perspective
JP6730760B2 (en) Server and program, video distribution system
JP2019220098A (en) Moving image editing server and program
CN112235516B (en) Video generation method, device, server and storage medium

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 21829313

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the EP bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 23.05.2023 DATED 23.05.2023)

122 Ep: PCT application non-entry in European phase

Ref document number: 21829313

Country of ref document: EP

Kind code of ref document: A1