CN114501058A - Video generation method and device, electronic equipment and storage medium


Info

Publication number
CN114501058A
Authority
CN
China
Prior art keywords: information, segment, object description, video, fragment
Prior art date
Legal status
Pending
Application number
CN202111599915.5A
Other languages
Chinese (zh)
Inventor
顾廷飞
冯雪菲
刘旭东
段训瑞
李�杰
李武波
赵媛媛
雷刚
耿旺
张宏伟
莫晓
赵俊
徐青国
蒋惠康
Current Assignee
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202111599915.5A
Publication of CN114501058A


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
        • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
            • H04N21/21 Server components or server architectures
                • H04N21/218 Source of audio or video content, e.g. local disk arrays
                    • H04N21/2187 Live feed
            • H04N21/23 Processing of content or additional data; Elementary server operations; Server middleware
                • H04N21/234 Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs
                    • H04N21/23424 Processing of video elementary streams involving splicing one content stream with another content stream, e.g. for inserting or substituting an advertisement
        • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
            • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
                • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
                    • H04N21/44016 Processing of video elementary streams involving splicing one content stream with another content stream, e.g. for substituting a video clip
        • H04N21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
            • H04N21/83 Generation or processing of protective or descriptive data associated with content; Content structuring
                • H04N21/845 Structuring of content, e.g. decomposing content into time segments
                    • H04N21/8456 Structuring of content by decomposing the content in the time domain, e.g. in time segments

Abstract

The present disclosure relates to a video generation method and apparatus, an electronic device, and a storage medium. The method includes: processing live video stream data to obtain a video to be processed; performing content understanding processing on the video to be processed to obtain an object description fragment set corresponding to preset category information; determining live broadcast feature information of each object description fragment, where the live broadcast feature information includes sharing degree information and the sharing degree information represents the degree to which the object in the object description fragment is shared; determining target fragments from the object description fragment set based on video duration information, fragment duration information, and the sharing degree information; and combining the target fragments to obtain a target video. By processing the video to be processed into a plurality of object description fragments and selecting, based on the sharing degree information and the preset category information of each fragment, the fragments that are of higher quality and better match user requirements, the finally obtained target video has higher quality and more coherent content.

Description

Video generation method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of internet technologies, and in particular, to a video generation method and apparatus, an electronic device, and a storage medium.
Background
With the rapid development of the mobile internet, information dissemination based on mobile terminals has become increasingly mature. Information is generally distributed through the various browsers or applications installed on a terminal, such as social applications, video playback applications, and game applications.
However, such information is usually produced simply by shooting a video of a given object. This approach is monotonous and is not tailored to promotion on different channels, so the quality of the resulting information often fails to meet requirements.
Disclosure of Invention
The present disclosure provides a video generation method and apparatus, an electronic device, and a storage medium. The technical solutions of the present disclosure are as follows:
according to a first aspect of the embodiments of the present disclosure, there is provided a video generation method, including:
processing live video stream data to obtain a video to be processed;
performing content understanding processing on a video to be processed to obtain an object description fragment set corresponding to preset category information; each object description fragment in the object description fragment set comprises fragment duration information;
determining live broadcast characteristic information of each object description fragment; the live broadcast characteristic information comprises sharing degree information; the sharing degree information represents the sharing degree of the objects in the object description fragment;
determining a target segment from the object description segment set based on the video duration information, the segment duration information of the object description segment and the sharing degree information;
and combining the target segments to obtain a target video.
In some possible embodiments, determining the target segment from the object description segment set based on the video duration information, the segment duration information of the object description segment, and the sharing degree information includes:
determining a first set of object description fragments from a set of object description fragments; sharing degree information of each first object description fragment in the first object description fragment set is a first preset sharing degree;
determining segment total duration information based on the segment duration information of each first object description segment;
and if the total segment duration information is greater than or equal to the video duration information, determining the target segment based on the first object description segment set.
In some possible embodiments, the live broadcast feature information includes live broadcast attribute information, and the live broadcast attribute information includes live broadcast room traffic, object operation amount, and behavior interaction amount; the method further comprises the following steps:
if the total duration information of the segments is less than the video duration information, determining a second object description segment set from the object description segment set; sharing degree information of each second object description fragment in the second object description fragment set is a second preset sharing degree;
determining supplementary duration information according to the total duration information of the segments and the video duration information;
determining supplementary segments from the second object description segment set based on the supplementary duration information, the segment duration information of the second object description segment, and the live broadcast attribute information;
a target segment is determined based on the first set of object description segments and the supplemental segment.
In some possible embodiments, determining live feature information for each object description segment includes:
dividing each object description fragment into a plurality of sub-fragments according to a preset time interval;
determining the live broadcast room traffic, the object operation amount, and the behavior interaction amount of each sub-segment in the plurality of sub-segments;
determining the live broadcast room flow, the object operation amount and the behavior interaction amount of each object description segment according to the live broadcast room flow, the object operation amount and the behavior interaction amount corresponding to each sub-segment;
and determining the sharing degree information of each object description fragment.
In some possible embodiments, determining the sharing degree information of each object description fragment includes:
inputting each object description fragment into a sharing information recognition model to obtain time sharing identification information, quality sharing identification information, resource sharing identification information and/or positioning sharing identification information of each object description fragment;
and determining sharing degree information of each object description fragment based on the time sharing identification information, the quality sharing identification information, the resource sharing identification information and/or the positioning sharing identification information of each object description fragment.
In some possible embodiments, combining the target segments into the target video includes:
if the target segment carries the initial background music, deleting the background music of the target segment to obtain a transition segment;
combining the transition segments to obtain a transition video;
and carrying out music matching on the transition video to obtain a target video.
In some possible embodiments, the method further comprises:
carrying out keyword recognition processing on the target video based on preset keywords, and determining a key frame in the target video; the key frame is an image frame corresponding to a preset keyword;
and adding subtitles, stickers or expression packages corresponding to preset keywords on the key frames.
In some possible embodiments, before processing the live video stream data to obtain the video to be processed, the method further includes:
and receiving a video generation instruction, wherein the video generation instruction comprises an identifier corresponding to live video stream data, video duration information and preset category information.
According to a second aspect of the embodiments of the present disclosure, there is provided a video generating apparatus including:
the video acquisition module is configured to execute the processing of the live video stream data to obtain a video to be processed;
the description fragment acquisition module is configured to execute content understanding processing on a video to be processed to obtain an object description fragment set corresponding to preset category information; each object description fragment in the object description fragment set comprises fragment duration information;
a characteristic information determination module configured to perform determining live characteristic information of each object description segment; the live broadcast characteristic information comprises sharing degree information; the sharing degree information represents the sharing degree of the objects in the object description fragment;
a target segment determination module configured to perform determining a target segment from the set of object description segments based on the video duration information, the segment duration information of the object description segments, and the sharing degree information;
and the combining module is configured to perform the combination of the target segments to obtain the target video.
In some possible embodiments, the target segment determination module is configured to perform:
determining a first set of object description fragments from a set of object description fragments; sharing degree information of each first object description fragment in the first object description fragment set is a first preset sharing degree;
determining segment total duration information based on the segment duration information of each first object description segment;
and if the total segment duration information is greater than or equal to the video duration information, determining the target segment based on the first object description segment set.
In some possible embodiments, the live broadcast feature information includes live broadcast attribute information, and the live broadcast attribute information includes live broadcast room traffic, object operation amount, and behavior interaction amount; a target segment determination module configured to perform:
if the total duration information of the segments is less than the video duration information, determining a second object description segment set from the object description segment set; sharing degree information of each second object description fragment in the second object description fragment set is a second preset sharing degree;
determining supplementary duration information according to the total duration information of the segments and the video duration information;
determining supplementary segments from the second object description segment set based on the supplementary duration information, the segment duration information of the second object description segment, and the live broadcast attribute information;
a target segment is determined based on the first set of object description segments and the supplemental segment.
In some possible embodiments, the feature information determination module is configured to perform:
dividing each object description fragment into a plurality of sub-fragments according to a preset time interval;
determining the live broadcast room traffic, the object operation amount, and the behavior interaction amount of each of the plurality of sub-segments;
determining the live broadcast room flow, the object operation amount and the behavior interaction amount of each object description segment according to the live broadcast room flow, the object operation amount and the behavior interaction amount corresponding to each sub-segment;
and determining the sharing degree information of each object description fragment.
In some possible embodiments, the feature information determination module is configured to perform:
inputting each object description fragment into a sharing information recognition model to obtain time sharing identification information, quality sharing identification information, resource sharing identification information and/or positioning sharing identification information of each object description fragment;
and determining sharing degree information of each object description fragment based on the time sharing identification information, the quality sharing identification information, the resource sharing identification information and/or the positioning sharing identification information of each object description fragment.
In some possible embodiments, the combining module is configured to perform:
if the target segment carries the initial background music, deleting the background music of the target segment to obtain a transition segment;
combining the transition segments to obtain a transition video;
and carrying out music matching on the transition video to obtain a target video.
In some possible embodiments, the apparatus further comprises:
the keyword determining module is configured to perform keyword recognition processing on the target video based on preset keywords and determine a key frame in the target video; the key frame is an image frame corresponding to a preset keyword; and adding subtitles, stickers or expression packages corresponding to preset keywords on the key frames.
In some possible embodiments, the apparatus further comprises:
the receiving module is configured to execute a video generation receiving instruction, and the video generation instruction comprises an identifier corresponding to live video stream data, video duration information and preset category information.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to execute the instructions to implement the method of any one of the first aspect as described above.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium, wherein instructions, when executed by a processor of an electronic device, enable the electronic device to perform the method of any one of the first aspects of the embodiments of the present disclosure.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product comprising a computer program, the computer program being stored in a readable storage medium, from which at least one processor of a computer device reads and executes the computer program, causing the computer device to perform the method of any one of the first aspects of embodiments of the present disclosure.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
Live video stream data is processed to obtain a video to be processed; content understanding processing is performed on the video to be processed to obtain an object description fragment set corresponding to preset category information, where each object description fragment in the set includes fragment duration information; live broadcast feature information of each object description fragment is determined, where the live broadcast feature information includes sharing degree information and the sharing degree information represents the degree to which the object in the object description fragment is shared; target fragments are determined from the object description fragment set based on the video duration information, the fragment duration information of the object description fragments, and the sharing degree information; and the target fragments are combined to obtain a target video. By processing the video to be processed into a plurality of object description fragments and selecting, based on the sharing degree information and the preset category information of each fragment, the fragments that are of higher quality and better match user requirements, the finally obtained target video has higher quality and more coherent content.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 is a schematic diagram illustrating an application environment in accordance with an illustrative embodiment;
FIG. 2 is a flow diagram illustrating a video generation method in accordance with an exemplary embodiment;
fig. 3 is a flow diagram illustrating a method for live feature information determination in accordance with an example embodiment;
FIG. 4 is a flow diagram illustrating a method for target segment determination in accordance with an exemplary embodiment;
FIG. 5 is a flow diagram illustrating a method for target segment determination in accordance with an exemplary embodiment;
FIG. 6 is a block diagram illustrating a video generation apparatus in accordance with an exemplary embodiment;
FIG. 7 is a block diagram illustrating an electronic device for video generation in accordance with an exemplary embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
It should be noted that, the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data for presentation, analyzed data, etc.) referred to in the present disclosure are information and data authorized by the user or sufficiently authorized by each party.
Referring to fig. 1, fig. 1 is a schematic diagram illustrating an application environment of a video generation method according to an exemplary embodiment, and as shown in fig. 1, the application environment may include a server 01 and a client 02.
In some possible embodiments, the client 02 may include, but is not limited to, a smartphone, a desktop computer, a tablet computer, a laptop computer, a smart speaker, a digital assistant, an Augmented Reality (AR)/Virtual Reality (VR) device, a smart wearable device, and the like. The software running on the client may be an application program, an applet, or the like. Optionally, the operating system running on the client may include, but is not limited to, Android, iOS, Linux, Windows, Unix, and the like.
In some possible embodiments, the server 01 processes live video stream data to obtain a to-be-processed video, performs content understanding processing on the to-be-processed video to obtain an object description fragment set corresponding to preset category information, wherein each object description fragment in the object description fragment set includes fragment duration information, determines live broadcast feature information of each object description fragment, the live broadcast feature information includes sharing degree information, the sharing degree information represents sharing degree of an object in the object description fragment, determines a target fragment from the object description fragment set based on the video duration information, the fragment duration information of the object description fragment, and the sharing degree information, and combines the target fragment to obtain a target video.
Optionally, the server 01 may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), and big data and artificial intelligence platforms. The operating system running on the server may include, but is not limited to, Android, iOS, Linux, Windows, Unix, and the like.
Fig. 2 is a flowchart illustrating a video generation method according to an exemplary embodiment, where as shown in fig. 2, the video generation method may be applied to a server and may also be applied to other node devices, and includes the following steps:
in step S201, the live video stream data is processed to obtain a video to be processed.
In this embodiment, the live video stream data may be the live video stream data of a certain live broadcast room. The live broadcast room may belong to a live broadcast platform, for example a short-video live broadcast platform, a music live broadcast platform, a game live broadcast platform, an item-sharing live broadcast platform, and the like.
In some possible embodiments, the purpose of processing the live video stream data to obtain the target video is to push the target video to the live broadcast platform to which the live broadcast room belongs, or to other platforms, so that more users can enter the live broadcast room through the target video. On this basis, the server may process the live video stream data according to different application scenarios to obtain the video to be processed.
In an alternative embodiment, the server may process live video stream data of a live broadcast room to obtain a target video for the purpose of promoting the live broadcast room.
In another alternative embodiment, the live broadcast platform is taken to be an item-sharing live broadcast platform as an example. The anchor of the live broadcast room, or the provider of an item shared in the live broadcast room, may request the server to process the live video stream data to obtain a video to be processed and further a target video, with the goal of promoting the live broadcast room or the item shared in it.
Optionally, the server may receive a video generation instruction sent by the client, where the video generation instruction may include an identifier corresponding to live video stream data and video duration information. After the server receives the video generation instruction, the server can analyze the video generation instruction to obtain an identifier and video duration information corresponding to the live video stream data, and obtain the live video stream data based on the identifier corresponding to the live video stream data.
Items shared in a live broadcast room may include a wide variety of objects, for example different types of cosmetics such as lipstick, eyebrow pencil, and eye shadow. If no requirement is placed on the objects contained in the final target video, different kinds of objects may appear in one target video. In that case, description segments of multiple types of objects are spliced together into a single target video, and the resulting information is relatively disorderly.
For the above reasons, each received video generation instruction may include one piece of preset category information, such as a lipstick category. Thus, the target video corresponding to the video generation instruction contains segments describing lipstick. Optionally, the number of target videos corresponding to the video generation instruction may be one or more.
Alternatively, each received video generation instruction may include a plurality of pieces of preset category information, such as a lipstick category and an eye shadow category. In this case, the video generation instruction may correspond to two kinds of target videos: one containing segments describing lipstick and the other containing segments describing eye shadow. Optionally, the number of target videos for each piece of preset category information may be one or more.
In the embodiment of the application, the video duration information may be used to limit the duration of the finally obtained target video.
In some possible embodiments for obtaining the video to be processed, the server may automatically intercept the live video stream data as the live broadcast plays to obtain the video to be processed. For example, the server may intercept the live video stream data every ten minutes during the playing of the live broadcast room to obtain a video to be processed. Each time a video to be processed is obtained, the server may obtain the target video corresponding to that video according to the method in steps S203-S209.
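As an illustration only (not part of the disclosure), the periodic interception described above can be approximated with ffmpeg's segment muxer; the stream URL, interval, and output naming below are hypothetical assumptions.

    import subprocess

    def intercept_live_stream(stream_url: str, interval_s: int = 600,
                              out_pattern: str = "to_process_%03d.mp4") -> None:
        """Cut the incoming live stream into consecutive videos to be processed,
        one every `interval_s` seconds (ten minutes by default)."""
        subprocess.run([
            "ffmpeg", "-i", stream_url,
            "-c", "copy",                      # no re-encoding, just remux
            "-f", "segment",                   # write consecutive fixed-length files
            "-segment_time", str(interval_s),
            "-reset_timestamps", "1",
            out_pattern,
        ], check=True)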
In step S203, performing content understanding processing on the video to be processed to obtain an object description fragment set corresponding to the preset category information; each object description fragment in the object description fragment set includes fragment duration information.
As mentioned above, the present application intends to push the target video to the live broadcast platform to which the live broadcast room belongs, or to other platforms, so that more users can enter the live broadcast room through the target video. The content of the target video therefore needs to be able to attract users, which means that, when determining the target video, the server needs to identify relevant content from the video to be processed, such as content describing an object. That is, irrelevant segments need to be removed from the video to be processed, such as segments in which the anchor is not speaking or is talking about something unrelated to the subject.
In an alternative embodiment, the server may perform content understanding processing on the video to be processed by using a content understanding algorithm to obtain a plurality of object description segments. Alternatively, each object description fragment may be a description fragment corresponding to one object.
After the server obtains the plurality of object description fragments, the plurality of object description fragments may be classified based on the object identifier carried by each object description fragment to obtain a plurality of object description fragment sets corresponding to the plurality of object identifiers. For example, assume that the plurality of object identifications include an object identification corresponding to a lipstick, an object identification corresponding to an eyebrow pencil, and an object identification corresponding to an eye shadow.
In this embodiment of the application, if the preset category information covers all object identifiers, the obtained object description fragment sets corresponding to the preset category information may include the object description fragment set corresponding to lipstick, the object description fragment set corresponding to eyebrow pencil, and the object description fragment set corresponding to eye shadow. If the preset category information refers to one or several specific object identifiers, such as the object identifier corresponding to lipstick, the obtained object description fragment set corresponding to the preset category information includes only the object description fragment set corresponding to lipstick.
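The grouping and filtering described in this step can be sketched as follows. This is a minimal illustrative Python sketch (not part of the disclosure); the ObjectDescriptionSegment type and field names are assumptions.

    from collections import defaultdict
    from dataclasses import dataclass

    @dataclass
    class ObjectDescriptionSegment:
        object_id: str   # e.g. "lipstick"
        start: float     # start time (seconds) within the video to be processed
        end: float       # end time (seconds)

        @property
        def duration(self) -> float:
            return self.end - self.start

    def group_by_object(segments):
        """Group content-understanding output by the object identifier each segment carries."""
        grouped = defaultdict(list)
        for seg in segments:
            grouped[seg.object_id].append(seg)
        return grouped

    def select_by_category(grouped, preset_categories=None):
        """Keep only the sets named by the preset category information; keep all sets if None."""
        if preset_categories is None:
            return dict(grouped)
        return {obj_id: segs for obj_id, segs in grouped.items() if obj_id in preset_categories}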
Optionally, the object described in each object description segment in the object description segment set corresponding to one preset category information may be an object under the same brand, for example, a lipstick with a different color number under the same brand, or a lipstick with a different color number under different brands. These can all be set flexibly based on the specific objects shared by the live rooms.
Optionally, each object description fragment in the object description fragment set includes fragment duration information.
In step S205, determining live feature information of each object description segment; the live broadcast characteristic information comprises sharing degree information; the sharing degree information characterizes the sharing degree of the objects in the object description fragment.
Take as an example an object description fragment set corresponding to the preset category information for lipstick, which includes the following 10 segments:
segment 1, segment duration information is 20 seconds;
segment 2, the segment duration information is 10 seconds;
segment 3, the segment duration information is 15 seconds;
segment 4, the segment duration information is 16 seconds;
segment 5, the segment duration information is 14 seconds;
segment 6, the segment duration information is 8 seconds;
segment 7, the segment duration information is 5 seconds;
segment 8, the segment duration information is 12 seconds;
segment 9, the segment duration information is 12 seconds;
segment 10, segment duration information 21 seconds.
In the embodiment of the application, the live broadcast feature information may include live broadcast attribute information and sharing degree information.
Optionally, the live broadcast attribute information may include live broadcast room traffic, an object operation amount, and a behavior interaction amount. The live broadcast room traffic refers to the magnitude of users entering the live broadcast room during the corresponding object description segment. The object operation amount may refer to the magnitude of user clicks to view the object, or of user purchases of the object, during the corresponding object description segment. The behavior interaction amount may refer to the magnitude of likes given by users, or of comments uploaded by users, during the corresponding object description segment.
Alternatively, different magnitudes may represent different numbers, or ranges of numbers.
Optionally, the sharing degree information indicates the degree to which the object description fragment shares the object to be shared. In other words, it reflects how strongly the anchor recommends the object to be shared in the object description fragment and how much the fragment contributes to sharing the object.
Fig. 3 is a flowchart illustrating a live feature information determining method according to an exemplary embodiment, where as shown in fig. 3, the method includes:
in step S2051, each object description fragment is divided into a plurality of sub-fragments according to a preset time interval.
In this embodiment of the application, the server may divide each object description fragment into a plurality of sub-fragments according to a preset time interval. Assuming a preset time interval of 1 second, the server may divide segment 1 into 20 sub-segments, segment 2 into 10 sub-segments, segment 3 into 15 sub-segments, and so on.
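A minimal sketch of this division, using the 1-second interval from the example (illustrative only, not part of the disclosure):

    def split_into_subsegments(start: float, end: float, interval: float = 1.0):
        """Divide one object description segment into sub-segments of `interval` seconds;
        the last sub-segment may be shorter if the duration is not an exact multiple."""
        subsegments = []
        t = start
        while t < end:
            subsegments.append((t, min(t + interval, end)))
            t += interval
        return subsegments

    # A 20-second segment (segment 1 above) yields 20 one-second sub-segments.
    assert len(split_into_subsegments(0.0, 20.0, 1.0)) == 20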
In step S2053, live view traffic, object operation amount, and behavior interaction amount of each of the plurality of sub-segments are determined.
Optionally, taking 20 sub-segments of the segment 1 as an example to explain further, the server may determine live broadcast room traffic, object operation amount, and behavior interaction amount of each of the 20 sub-segments.
In step S2055, the live broadcast room traffic, the object operation amount, and the behavior interaction amount of each object description segment are determined according to the live broadcast room traffic, the object operation amount, and the behavior interaction amount corresponding to each sub segment.
Optionally, the server may determine, from the live broadcast room traffic corresponding to each sub-segment included in one object description segment, live broadcast room traffic with a maximum magnitude or meeting other preset conditions, and determine the live broadcast room traffic with the maximum magnitude or meeting other preset conditions as the live broadcast room traffic of the object description segment. The server may determine, from the object operation amounts corresponding to the sub-segments included in one object description segment, an object operation amount with the maximum magnitude or meeting other preset conditions, and determine the object operation amount with the maximum magnitude or meeting other preset conditions as the object operation amount of the object description segment. The server may determine the maximum magnitude or other behavior interaction amount meeting the preset condition from the behavior interaction amounts corresponding to the sub-segments included in one object description segment, and determine the maximum magnitude or other behavior interaction amount meeting the preset condition as the behavior interaction amount of the object description segment.
Optionally, the server may determine an average of live broadcast room traffic according to live broadcast room traffic corresponding to each sub-segment included in one object description segment, and determine the average of the live broadcast room traffic as the live broadcast room traffic of the object description segment. The server may determine an average of the object operation amounts according to the object operation amounts corresponding to the sub-segments included in one object description segment, and determine the average of the object operation amounts as the object operation amount of the object description segment. The server may determine an average of the behavior interaction amounts according to the behavior interaction amounts corresponding to the sub-segments included in one object description segment, and determine the average of the behavior interaction amounts as the behavior interaction amount of the object description segment.
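The two aggregation options described in steps S2053-S2055 (taking the maximum, or a value meeting another preset condition, or the average of the per-sub-segment values) can be sketched as follows; the metric names are assumptions used only for illustration:

    def aggregate_metric(values, mode="max"):
        """Roll per-sub-segment statistics up to the segment level: either the maximum
        or the average of the sub-segment values."""
        if mode == "max":
            return max(values)
        if mode == "mean":
            return sum(values) / len(values)
        raise ValueError(f"unsupported mode: {mode}")

    def live_attributes_for_segment(sub_stats, mode="max"):
        """sub_stats: one dict per sub-segment with 'traffic', 'object_ops' and 'interactions'."""
        return {
            key: aggregate_metric([s[key] for s in sub_stats], mode)
            for key in ("traffic", "object_ops", "interactions")
        }

    # Example: three sub-segments of one object description segment.
    stats = [{"traffic": 80, "object_ops": 10, "interactions": 3},
             {"traffic": 150, "object_ops": 70, "interactions": 10},
             {"traffic": 120, "object_ops": 40, "interactions": 6}]
    print(live_attributes_for_segment(stats))          # maximum per metric
    print(live_attributes_for_segment(stats, "mean"))  # average per metric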
In step S2057, sharing degree information of each object description fragment is determined.
In this embodiment of the application, determining the sharing degree information of each object description fragment may refer to determining, by the server, whether the sharing degree information exists in each object description fragment. In the embodiment of the application, the sharing degree information can be determined based on time sharing identification information, quality sharing identification information, resource sharing identification information and/or positioning sharing identification information.
Optionally, the time sharing identification information may be determined based on whether the time-related information appears in the object description fragment. For example, if information indicating time limitation, such as "time-limited first-time purchase", "countdown XX seconds", and the like, occurs, time sharing identification information exists in the object description fragment.
Optionally, the quality sharing identification information may be determined based on whether information related to quality appears in the object description fragment. For example, if information such as "good quality", "good material", and the like indicating that an article needs to satisfy a certain quality appears, quality sharing identification information exists in the object description fragment.
Optionally, the resource sharing identification information may be determined based on whether information related to the resource appears in the object description fragment. For example, if information such as "sales promotion", "offer", "buy X give X", etc. is present to indicate that the provider or platform of the article provides some resources, the object description fragment has resource sharing identification information.
Optionally, the location sharing identification information may be determined based on whether information related to object location appears in the object description fragment. For example, if there is information indicating that the article is rare, such as "a small number" or "can be customized", the object description fragment has location sharing identification information.
In an alternative embodiment, the server may use an automatic speech recognition technique to recognize the speech in each object description segment, and obtain the speech text information. Subsequently, the server can compare the voice text information with the words corresponding to the time sharing identification information, the words corresponding to the quality sharing identification information, the words corresponding to the resource sharing identification information and the words corresponding to the positioning sharing identification information, so as to determine whether each object description fragment has the time sharing identification information, the quality sharing identification information, the resource sharing identification information and/or the positioning sharing identification information.
Optionally, if the server determines that any one of the time sharing identification information, the quality sharing identification information, the resource sharing identification information, and the positioning sharing identification information exists, it may be determined that the sharing degree information exists in the object description fragment.
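As an illustration of the keyword-comparison approach above (a sketch only, not part of the disclosure; the keyword lists are hypothetical examples drawn from the phrases mentioned earlier):

    # Hypothetical keyword lists; in practice these would come from configuration.
    SHARING_KEYWORDS = {
        "time":     ["time-limited", "countdown"],
        "quality":  ["good quality", "good material"],
        "resource": ["sales promotion", "offer", "buy one get one"],
        "location": ["limited stock", "can be customized"],
    }

    def sharing_flags(speech_text: str) -> dict:
        """Compare recognized speech text against the keyword lists to decide which kinds of
        sharing identification information are present in an object description fragment."""
        text = speech_text.lower()
        return {kind: any(word in text for word in words)
                for kind, words in SHARING_KEYWORDS.items()}

    def has_sharing_degree_info(speech_text: str) -> bool:
        """Sharing degree information exists if any one kind of identification is present."""
        return any(sharing_flags(speech_text).values())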
In another optional embodiment, the server may input each object description fragment into the sharing information recognition model, obtain time sharing identification information, quality sharing identification information, resource sharing identification information, and/or location sharing identification information of each object description fragment, and determine sharing degree information of each object description fragment based on the time sharing identification information, the quality sharing identification information, the resource sharing identification information, and/or the location sharing identification information of each object description fragment.
Specifically, the time sharing identification information, the quality sharing identification information, the resource sharing identification information, and the positioning sharing identification information may each be represented by an identifier "0" or "1". Taking the time sharing identification information as an example, if the time sharing identification information of an object description fragment is the identifier "0", the time sharing identification information exists in the object description fragment; if it is the identifier "1", the time sharing identification information does not exist in the object description fragment. The quality sharing identification information, the resource sharing identification information, and the positioning sharing identification information are used in the same way, and details are not repeated here.
Optionally, if the server determines that any one of the time sharing identification information, the quality sharing identification information, the resource sharing identification information, and the positioning sharing identification information exists, it may be determined that the sharing degree information exists in the object description fragment.
In this way, the server obtains the live broadcast attribute information and the sharing degree information of each object description fragment, which lays the groundwork for subsequently selecting the object description fragments that meet the requirements to form the target video.
After the above steps, the 10 segments in the object description segment set can be annotated as follows:
segment 1, segment duration information is 20 seconds; live room traffic 150, object operand 70, behavior interaction 10; sharing degree information exists;
segment 2, the segment duration information is 10 seconds; live room traffic 100, object operand 55, behavior interaction amount 30; sharing degree information does not exist;
segment 3, the segment duration information is 15 seconds; live room traffic 120, object operand 300, behavior interaction amount 60; sharing degree information does not exist;
segment 4, the segment duration information is 16 seconds; live room traffic 50, object operand 20, behavior interaction 120; sharing degree information exists;
segment 5, the segment duration information is 14 seconds; live room traffic 85, object operand 80, behavior interaction 20; sharing degree information does not exist;
segment 6, the segment duration information is 8 seconds; live room traffic 60, object operand 55, behavior interaction 60; sharing degree information exists;
segment 7, the segment duration information is 5 seconds; live room traffic 300, object operand 70, behavior interaction amount 20; sharing degree information exists;
segment 8, the segment duration information is 12 seconds; live room traffic 200, object operand 200, behavior interaction 150; sharing degree information does not exist;
segment 9, segment duration information is 12 seconds; live room traffic 150, object operand 20, behavior interaction amount 90; sharing degree information exists;
segment 10, segment duration information is 21 seconds; live room traffic 90, object operand 50, behavior interaction 180; sharing degree information does not exist.
In step S207, a target segment is determined from the object description segment set based on the video duration information, the segment duration information of the object description segment, and the sharing degree information.
As stated above, the sharing degree information reflects how strongly the anchor recommends the object to be shared in the object description fragment and how much the fragment contributes to sharing the object. Therefore, when selecting target segments, object description segments that have sharing degree information should be chosen as far as possible. However, there may be requirements on the duration of a target video promoted to a live broadcast platform, and not every duration is feasible. On this basis, the server may determine the target segments from the object description segment set based on the video duration information, the segment duration information of each object description segment, and the sharing degree information.
Fig. 4 is a flowchart illustrating a target segment determining method according to an exemplary embodiment, as shown in fig. 4, including:
in step S2071, a first object description fragment set is determined from the object description fragment set; the sharing degree information of each first object description fragment in the first object description fragment set is a first preset sharing degree.
In the embodiment of the present application, the first preset sharing degree indicates that sharing degree information exists, and the second preset sharing degree indicates that no sharing degree information exists. In this way, the server may take the object description fragments in which sharing degree information exists as the first object description fragments, forming the first object description fragment set (segment 1, segment 4, segment 6, segment 7, and segment 9), and take the object description fragments without sharing degree information as the second object description fragments, forming the second object description fragment set (segment 2, segment 3, segment 5, segment 8, and segment 10).
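A minimal sketch of this split and of the duration check used in the next steps (illustrative only; the dictionary field names are assumptions):

    def split_by_sharing_degree(segments):
        """First set: segments whose sharing degree is the first preset value (information present);
        second set: segments at the second preset value (information absent)."""
        first = [s for s in segments if s["has_sharing_info"]]
        second = [s for s in segments if not s["has_sharing_info"]]
        return first, second

    def total_duration(segments):
        return sum(s["duration"] for s in segments)

    # With the example above, segments 1, 4, 6, 7 and 9 form the first set:
    # 20 + 16 + 8 + 5 + 12 = 61 seconds of total segment duration.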
In step S2072, the total segment duration information is determined based on the segment duration information of each first object description segment.
The server may determine the total segment duration information from the sum of the segment duration information of each first object description segment, where the total segment duration information is 61 seconds based on the above example.
In step S2073, if the total segment duration information is greater than or equal to the video duration information, the target segments are determined based on the first object description segment set.
Assuming that the video duration information of the final target video is 60 seconds, the total segment duration information (61 seconds) is greater than or equal to the video duration information (60 seconds), so the server may determine the target segments based on the first object description segment set.
In step S209, the target segments are combined to obtain a target video.
In an alternative embodiment, if the total segment duration information (61 seconds) is greater than or equal to the video duration information (60 seconds), and the difference between the total segment duration information (61 seconds) and the video duration information (60 seconds) is within a preset range (for example, 0-5 seconds), the first object description segments may be used directly as target segments without processing, and the target segments (segment 1, segment 4, segment 6, segment 7, and segment 9) are combined to obtain the target video. Optionally, the combination may be performed in any order, which is not limited here.
In another alternative embodiment, if the total segment duration information (61 seconds) is greater than or equal to the video duration information (60 seconds), one segment may be selected from the first object description segments (segment 1, segment 4, segment 6, segment 7, and segment 9) and trimmed by 1 second; for example, segment 1 may be trimmed by 1 second to obtain a 19-second segment 1. The server may then take the new segment 1 together with segments 4, 6, 7, and 9 as target segments and combine them to obtain the target video.
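The two alternatives above (use the first object description segments as-is when the excess is within the preset range, or trim one segment by the excess) can be sketched as follows; the 0-5 second range and field names are assumptions taken from the example:

    def fit_to_duration(target_segments, video_duration, tolerance=5.0, always_trim=False):
        """If the total duration exceeds the requested video duration, either use the segments
        directly (when the excess is within the preset range) or trim one segment by the
        excess (e.g. the 20-second segment 1 becomes 19 seconds)."""
        excess = sum(s["duration"] for s in target_segments) - video_duration
        if excess <= 0 or (excess <= tolerance and not always_trim):
            return target_segments
        trimmed = [dict(s) for s in target_segments]
        for seg in trimmed:
            if seg["duration"] > excess:   # trim the first segment long enough to absorb the excess
                seg["duration"] -= excess
                break
        return trimmed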
The present application also provides a way to determine the target segments when the total segment duration information is less than the video duration information. FIG. 5 is a flowchart illustrating a target segment determination method according to an exemplary embodiment, and as shown in FIG. 5, the method includes:
in step S2074, if the total segment duration information is less than the video duration information, determining a second object description segment set from the object description segment set; the sharing degree information of each second object description fragment in the second object description fragment set is a second preset sharing degree.
Assuming that the video duration information of the final target video is 90 seconds, the total segment duration information (61 seconds) obtained from the segment duration of each first object description segment is clearly less than the video duration information (90 seconds), so the server may determine the second object description segment set from the object description segment set.
In step S2075, supplemental duration information is determined from the clip total duration information and the video duration information.
Optionally, the server may determine the supplementary duration information to be 29 seconds from the difference between the total segment duration information and the video duration information. That is, about 29 seconds of second object description segments may be added as target segments.
In step S2076, supplementary segments are determined from the second object description segment set based on the supplementary duration information, the segment duration information of the second object description segments, and the live broadcast attribute information.
In an alternative embodiment, the server may sort the second object description segments in descending order of live broadcast room traffic, obtaining segment 8, segment 3, segment 2, segment 10, and segment 5, and determine the supplementary segments from the second object description segment set based on the supplementary duration information (29 seconds), the segment duration information of each second object description segment, and the live broadcast room traffic.
Specifically, the server first selects segment 8, the segment with the highest live broadcast room traffic. At this point the accumulated duration of 12 seconds does not reach the 29 seconds of supplementary duration information, and the difference between 12 seconds and 29 seconds is not within the preset range (e.g., 0-5 seconds). The server then selects the second-ranked segment 3; the combined 27 seconds of segments 8 and 3 still falls short of the 29-second supplementary duration, but the difference between 27 seconds and 29 seconds is now within the preset range (e.g., 0-5 seconds). Thus, the server may determine segment 8 and segment 3 as the supplementary segments.
In an alternative embodiment, the server may sort the second object description segments in descending order of object operation amount, obtaining segment 3, segment 8, segment 5, segment 2, and segment 10, and determine the supplementary segments from the second object description segment set based on the supplementary duration information (29 seconds), the segment duration information of each second object description segment, and the object operation amount. The supplementary segments are then determined in the same way as described in the previous paragraph.
In an alternative embodiment, the server may sort the second object description segments in descending order of behavior interaction amount, obtaining segment 10, segment 8, segment 3, segment 2, and segment 5, and determine the supplementary segments from the second object description segment set based on the supplementary duration information (29 seconds), the segment duration information of each second object description segment, and the behavior interaction amount. The supplementary segments are then determined in the same way as described above.
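A greedy sketch of steps S2075-S2076 as illustrated by the examples above (illustrative only; the sort keys and the 0-5 second range are assumptions taken from the example):

    def pick_supplementary(second_set, supplement_needed, sort_key="traffic", tolerance=5.0):
        """Sort the second object description segments by one live broadcast attribute and keep
        adding segments until the accumulated duration reaches, or comes within the preset
        range of, the supplementary duration information."""
        ranked = sorted(second_set, key=lambda s: s[sort_key], reverse=True)
        chosen, accumulated = [], 0.0
        for seg in ranked:
            chosen.append(seg)
            accumulated += seg["duration"]
            if accumulated >= supplement_needed or supplement_needed - accumulated <= tolerance:
                break
        return chosen

    # With the example values, sorting by live broadcast room traffic ranks segment 8 (12 s) first
    # and segment 3 (15 s) second; 12 + 15 = 27 s is within 5 s of the 29 s needed, so both are chosen.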
In step S2077, a target segment is determined based on the first object description segment set and the supplementary segment.
The server may determine the target segment from the first set of object description segments and the supplemental segment.
Optionally, when the sum of the durations of the first object description segments in the first object description segment set and the durations of the supplementary segments is equal to the video duration information, the server may directly combine all the first object description segments and the supplementary segments to obtain the target video. Optionally, the segments may be combined in any order, which is not limited herein.
Optionally, when that sum is not equal to the video duration information but the difference between the sum and the video duration information is within a preset range, the server may likewise directly combine all the first object description segments and the supplementary segments to obtain the target video.
Optionally, when that sum is greater than the video duration information and the difference between the sum and the video duration information is not within the preset range, the server may clip the supplementary segments so that the difference between the sum of the durations of the first object description segments and the clipped supplementary segments and the video duration information falls within the preset range. The server may then directly combine all the first object description segments and the clipped supplementary segments to obtain the target video. As above, the combination mode is not limited herein. A possible implementation of this duration fitting is sketched below.
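As a non-limiting sketch of the duration cases just described, the following Python function combines the segments directly when the total duration is within the preset range of the video duration and otherwise clips the supplementary segments. The field names, the way excess time is trimmed from the supplementary segments, and the fallback for the under-length case are assumptions, not part of the disclosure.

def fit_segments_to_duration(first_segments, supplementary_segments,
                             video_seconds, tolerance=5):
    """Decide whether the selected segments can be combined directly or
    whether the supplementary segments must be clipped first.
    Each segment is a dict like {"id": 3, "duration": 15} (seconds)."""
    first_total = sum(seg["duration"] for seg in first_segments)
    supp_total = sum(seg["duration"] for seg in supplementary_segments)
    total = first_total + supp_total

    # Equal to the video duration, or within the preset range of it:
    # combine everything as-is.
    if abs(total - video_seconds) <= tolerance:
        return first_segments + supplementary_segments

    # Longer than the video duration and outside the preset range:
    # clip the supplementary segments until the difference is removed.
    if total > video_seconds:
        excess = total - video_seconds
        clipped = []
        for seg in supplementary_segments:
            cut = min(excess, seg["duration"])
            clipped.append({**seg, "duration": seg["duration"] - cut})
            excess -= cut
        return first_segments + [seg for seg in clipped if seg["duration"] > 0]

    # Shorter than required and outside the range: combine what is available
    # (an assumption; this case is not described in the embodiment).
    return first_segments + supplementary_segments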
In the embodiment of the application, the server can determine whether the target segments carry initial background music. If not, the server can perform music matching on the combined target video to obtain the final target video. If so, the server may leave the target video unprocessed, and the video obtained by combining the target segments is the final target video.
Alternatively, when the target segments carry original background music, the background music of the combined target video would be incoherent, because the target video is composed of a plurality of object description segments. Based on this, the server can delete the background music of the target segments to obtain transition segments, and combine the transition segments to obtain a transition video. The transition video can then be dubbed with music to obtain the target video; one possible realization of this pipeline is sketched below.
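One purely illustrative realization of this pipeline uses the ffmpeg command-line tool invoked from Python: the original audio track is dropped from each target segment, the muted transition segments are concatenated into the transition video, and a new background track is muxed in. The use of ffmpeg, the file names and the container format are assumptions, not part of the disclosure.

import os
import subprocess
import tempfile

def combine_with_new_music(segment_files, music_file, output_file):
    """Strip original background audio from each target segment, concatenate
    the transition segments into a transition video, and add new music."""
    workdir = tempfile.mkdtemp()

    # 1. Delete the original background music: keep the video stream only.
    muted = []
    for i, path in enumerate(segment_files):
        out = os.path.join(workdir, f"transition_{i}.mp4")
        subprocess.run(["ffmpeg", "-y", "-i", path, "-an", "-c:v", "copy", out],
                       check=True)
        muted.append(out)

    # 2. Combine the transition segments with ffmpeg's concat demuxer.
    list_file = os.path.join(workdir, "segments.txt")
    with open(list_file, "w") as f:
        for path in muted:
            f.write(f"file '{path}'\n")
    transition = os.path.join(workdir, "transition.mp4")
    subprocess.run(["ffmpeg", "-y", "-f", "concat", "-safe", "0",
                    "-i", list_file, "-c", "copy", transition], check=True)

    # 3. Music matching: mux the transition video with the new background
    #    track, truncating to the shorter of the two streams.
    subprocess.run(["ffmpeg", "-y", "-i", transition, "-i", music_file,
                    "-map", "0:v", "-map", "1:a", "-c:v", "copy",
                    "-shortest", output_file], check=True)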
In the embodiment of the application, in order to increase the watchability of the video and its attractiveness to users, the server can further process the target video after it is obtained. Optionally, the server may perform keyword recognition processing on the target video based on preset keywords and determine key frames in the target video. A key frame is an image frame corresponding to a preset keyword, and the server can add a subtitle, a sticker or an emoticon corresponding to the preset keyword on the key frame. Optionally, the subtitles may be highlighted. The preset keywords may be, for example, words expressing exclamation or words describing the function of the object.
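The mapping from preset keywords to key frames is not spelled out in the disclosure. As one assumed approach, a timed transcript of the target video (for example from a speech recognition step) could be scanned for the preset keywords and each match converted to a frame index, as in the following sketch; the transcript format, the frame rate and the example keywords are assumptions.

def find_key_frames(transcript, preset_keywords, fps=25):
    """Locate key frames whose spoken content matches a preset keyword.
    transcript: list of (start_second, end_second, text) items.
    Returns (frame_index, keyword) pairs that a later step can use to
    overlay a subtitle, sticker or emoticon at that frame."""
    key_frames = []
    for start, end, text in transcript:
        for keyword in preset_keywords:
            if keyword in text:
                # Use the first frame of the matching utterance as the key frame.
                key_frames.append((int(start * fps), keyword))
    return key_frames

# Example with assumed keywords expressing exclamation or the object's use.
transcript = [(3.2, 5.0, "this jacket is amazing"), (12.4, 14.0, "great for winter")]
print(find_key_frames(transcript, ["amazing", "winter"]))
# -> [(80, 'amazing'), (310, 'winter')]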
Optionally, the server may further add a preset leader and a preset trailer before and after the target video.
In the embodiment of the application, if the target video is generated based on a video generation instruction received from the client, the server can feed the target video back to the client after generating it.
Optionally, when multiple target videos are fed back to the client, the target videos can be presented on the client interface according to a preset rule.
Optionally, the preset rule includes sorting by the generation time point of each target video, for example, the later a target video is generated, the earlier it is ranked. Optionally, the preset rule includes sorting by the live broadcast attribute information (live broadcast room traffic, object operation amount and behavior interaction amount), for example, the higher the live broadcast attribute data, the earlier the ranking. A possible ordering is sketched below.
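A minimal sketch of the two preset rules, assuming each target video is represented as a dictionary with a generation timestamp and live broadcast attribute counters (all field names are assumptions):

def order_target_videos(videos, rule="generation_time"):
    """Order target videos for presentation on the client interface.
    rule: "generation_time" -> later generated videos rank earlier;
          "live_attributes" -> higher live attribute data ranks earlier."""
    if rule == "generation_time":
        return sorted(videos, key=lambda v: v["generated_at"], reverse=True)
    if rule == "live_attributes":
        return sorted(videos,
                      key=lambda v: (v["traffic"], v["operations"], v["interactions"]),
                      reverse=True)
    return videos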
Optionally, when the multiple target videos are presented on the client interface, a deletion control may also be displayed, so that the anchor can remove unneeded target videos and keep the required ones. Optionally, the client interface may further include a preview control, so that the anchor can preview a target video through the preview control.
Optionally, the client interface of the owner of the article may further display the target videos corresponding to multiple live broadcast rooms, and the owner of the article may delete the target videos corresponding to a target live broadcast room through the deletion control.
In summary, according to the embodiment of the application, a plurality of object description segments are obtained by processing the video to be processed, and segments of higher quality that better match user requirements are determined from them based on the sharing degree information and the preset category information corresponding to each object description segment, so that the finally obtained target video has higher quality and more coherent content.
Fig. 6 is a block diagram illustrating a video generation apparatus according to an example embodiment. Referring to fig. 6, the apparatus includes a video acquisition module 601, a description section acquisition module 602, a feature information determination module 603, a target section determination module 604, and a combination module 605.
The video acquisition module 601 is configured to perform processing on live video stream data to obtain a video to be processed;
the description fragment acquisition module 602 is configured to perform content understanding processing on a video to be processed to obtain an object description fragment set corresponding to preset category information; each object description fragment in the object description fragment set comprises fragment duration information;
a characteristic information determination module 603 configured to perform determining live characteristic information of each object description segment; the live broadcast characteristic information comprises sharing degree information; the sharing degree information represents the sharing degree of the objects in the object description fragment;
a target segment determining module 604 configured to perform determining a target segment from the set of object description segments based on the video duration information, the segment duration information of the object description segments, and the sharing degree information;
a combining module 605 configured to perform combining the target segments into a target video.
In some possible embodiments, the target segment determination module is configured to perform:
determining a first set of object description fragments from a set of object description fragments; sharing degree information of each first object description fragment in the first object description fragment set is a first preset sharing degree;
determining segment total duration information based on the segment duration information of each first object description segment;
and if the total segment duration information is greater than or equal to the video duration information, determining the target segment based on the first object description segment set.
In some possible embodiments, the live broadcast feature information includes live broadcast attribute information, and the live broadcast attribute information includes live broadcast room traffic, object operation amount, and behavior interaction amount; a target segment determination module configured to perform:
if the total duration information of the segments is less than the video duration information, determining a second object description segment set from the object description segment set; sharing degree information of each second object description fragment in the second object description fragment set is a second preset sharing degree;
determining supplementary duration information according to the total duration information of the segments and the video duration information;
determining supplementary segments from the second object description segment set based on the supplementary duration information, the segment duration information of the second object description segment, and the live broadcast attribute information;
a target segment is determined based on the first set of object description segments and the supplemental segment.
In some possible embodiments, the feature information determination module is configured to perform:
dividing each object description fragment into a plurality of sub-fragments according to a preset time interval;
determining the live broadcast room traffic, the object operation amount and the behavior interaction amount of each of the plurality of sub-segments;
determining the live broadcast room traffic, the object operation amount and the behavior interaction amount of each object description segment according to the live broadcast room traffic, the object operation amount and the behavior interaction amount corresponding to each sub-segment;
and determining the sharing degree information of each object description fragment.
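The sub-segment division and aggregation performed by the feature information determination module could, purely for illustration, look like the following sketch; the event representation, the field names and the choice of summing sub-segment values are assumptions, not part of the disclosure.

def segment_live_attributes(events, segment_start, segment_end, interval=10):
    """Split an object description segment into sub-segments of a preset
    time interval, aggregate live attribute data per sub-segment, then
    derive the segment-level values.
    events: list of dicts like {"t": 125.3, "type": "view" | "operation" | "interaction"}."""
    num_subs = max(1, int((segment_end - segment_start + interval - 1) // interval))
    per_sub = [{"traffic": 0, "operations": 0, "interactions": 0}
               for _ in range(num_subs)]
    kind_to_field = {"view": "traffic",
                     "operation": "operations",
                     "interaction": "interactions"}

    for event in events:
        if not (segment_start <= event["t"] < segment_end):
            continue
        idx = min(int((event["t"] - segment_start) // interval), num_subs - 1)
        per_sub[idx][kind_to_field[event["type"]]] += 1

    # Segment-level values derived from the sub-segment values, here by summing.
    per_segment = {field: sum(sub[field] for sub in per_sub)
                   for field in ("traffic", "operations", "interactions")}
    return per_sub, per_segment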
In some possible embodiments, the feature information determination module is configured to perform:
inputting each object description fragment into a sharing identification information recognition model to obtain time sharing identification information, quality sharing identification information, resource sharing identification information and/or positioning sharing identification information of each object description fragment;
and determining the sharing degree information of each object description fragment based on the time sharing identification information, the quality sharing identification information, the resource sharing identification information and/or the positioning sharing identification information of each object description fragment.
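How the four kinds of sharing identification information are combined into sharing degree information is not specified in the disclosure; the weighted combination below is only an assumed placeholder showing one way the model outputs could be mapped to the first or second preset sharing degree.

def sharing_degree(time_id, quality_id, resource_id, positioning_id,
                   threshold=0.5):
    """Combine the sharing identification information predicted for a segment
    into a single sharing degree label. Each input is assumed to be a score
    in [0, 1]; the weights and threshold are illustrative only."""
    score = (0.4 * quality_id + 0.3 * time_id
             + 0.2 * resource_id + 0.1 * positioning_id)
    return "first" if score >= threshold else "second"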
In some possible embodiments, the combining module is configured to perform:
if the target segment carries the initial background music, deleting the background music of the target segment to obtain a transition segment;
combining the transition segments to obtain a transition video;
and carrying out music matching on the transition video to obtain a target video.
In some possible embodiments, the apparatus further comprises:
the keyword determining module is configured to perform keyword recognition processing on the target video based on preset keywords and determine a key frame in the target video; the key frame is an image frame corresponding to a preset keyword; and adding subtitles, stickers or expression packages corresponding to preset keywords on the key frames.
In some possible embodiments, the apparatus further comprises:
the receiving module is configured to execute a video generation receiving instruction, and the video generation instruction comprises an identifier corresponding to live video stream data, video duration information and preset category information.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 7 is a block diagram illustrating an apparatus 700 for video generation according to an example embodiment. For example, the apparatus 700 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 7, apparatus 700 may include one or more of the following components: a processing component 702, a memory 704, a power component 706, a multimedia component 708, an audio component 710, an input/output (I/O) interface 712, a sensor component 714, and a communication component 716.
The processing component 702 generally controls overall operation of the device 700, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing components 702 may include one or more processors 720 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 702 may include one or more modules that facilitate interaction between the processing component 702 and other components. For example, the processing component 702 may include a multimedia module to facilitate interaction between the multimedia component 708 and the processing component 702.
The memory 704 is configured to store various types of data to support operation at the device 700. Examples of such data include instructions for any application or method operating on device 700, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 704 may be implemented by any type or combination of volatile or non-volatile storage devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power supply component 706 provides power to the various components of the device 700. The power components 706 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the apparatus 700.
The multimedia component 708 includes a screen that provides an output interface between the device 700 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 708 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the device 700 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 710 is configured to output and/or input audio signals. For example, audio component 710 includes a Microphone (MIC) configured to receive external audio signals when apparatus 700 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signal may further be stored in the memory 704 or transmitted via the communication component 716. In some embodiments, audio component 710 also includes a speaker for outputting audio signals.
The I/O interface 712 provides an interface between the processing component 702 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 714 includes one or more sensors for providing status assessment of various aspects of the apparatus 700. For example, the sensor assembly 714 may detect an open/closed state of the apparatus 700 and the relative positioning of components, such as the display and keypad of the apparatus 700. The sensor assembly 714 may also detect a change in position of the apparatus 700 or of a component of the apparatus 700, the presence or absence of user contact with the apparatus 700, the orientation or acceleration/deceleration of the apparatus 700, and a change in temperature of the apparatus 700. The sensor assembly 714 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 714 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 714 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 716 is configured to facilitate wired or wireless communication between the apparatus 700 and other devices. The apparatus 700 may access a wireless network based on a communication standard, such as WiFi, an operator network (such as 2G, 3G, 4G, or 5G), or a combination thereof. In an exemplary embodiment, the communication component 716 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 716 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 700 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a storage medium comprising instructions, such as the memory 704 comprising instructions, executable by the processor 720 of the apparatus 700 to perform the method described above is also provided. Alternatively, the storage medium may be a non-transitory computer readable storage medium, which may be, for example, a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.

Claims (10)

1. A method of video generation, comprising:
processing live video stream data to obtain a video to be processed;
performing content understanding processing on the video to be processed to obtain an object description fragment set corresponding to preset category information; each object description fragment in the object description fragment set comprises fragment duration information;
determining live characteristic information of each object description fragment; the live broadcast characteristic information comprises sharing degree information; the sharing degree information represents the sharing degree of the objects in the object description fragment;
determining a target segment from the object description segment set based on video duration information, segment duration information of the object description segment and sharing degree information;
and combining the target segments to obtain a target video.
2. The video generation method according to claim 1, wherein the determining a target segment from the set of object description segments based on video duration information, segment duration information of the object description segments, and sharing degree information comprises:
determining a first set of object description fragments from the set of object description fragments; sharing degree information of each first object description fragment in the first object description fragment set is a first preset sharing degree;
determining segment total duration information based on the segment duration information of each first object description segment;
and if the total segment duration information is greater than or equal to the video duration information, determining the target segment based on the first object description segment set.
3. The video generation method according to claim 2, wherein the live feature information includes live attribute information including live room traffic, an object operation amount, and a behavior interaction amount; the method further comprises the following steps:
if the total segment duration information is less than the video duration information, determining a second object description segment set from the object description segment set; sharing degree information of each second object description fragment in the second object description fragment set is a second preset sharing degree;
determining supplementary duration information according to the total duration information of the segments and the video duration information;
determining a supplementary segment from the second object description segment set based on the supplementary duration information, the segment duration information of the second object description segment, and the live broadcast attribute information;
determining a target segment based on the first object description segment set and the supplemental segment.
4. The video generation method of claim 1, wherein the determining live feature information of each object description segment comprises:
dividing each object description fragment into a plurality of sub-fragments according to a preset time interval;
determining live broadcast room flow, object operation amount and behavior interaction amount of each sub-segment in the plurality of sub-segments;
determining the live broadcast room flow, the object operation amount and the behavior interaction amount of each object description segment according to the live broadcast room flow, the object operation amount and the behavior interaction amount corresponding to each sub-segment;
and determining the sharing degree information of each object description fragment.
5. The video generation method according to claim 4, wherein the determining the sharing degree information of each object description fragment includes:
inputting each object description fragment into a sharing information recognition model to obtain time sharing identification information, quality sharing identification information, resource sharing identification information and/or positioning sharing identification information of each object description fragment;
and determining sharing degree information of each object description fragment based on time sharing identification information, quality sharing identification information, resource sharing identification information and/or positioning sharing identification information of each object description fragment.
6. The video generation method of any of claims 1 to 5, wherein the combining the target segments to obtain the target video comprises:
if the target segment carries initial background music, deleting the background music of the target segment to obtain a transition segment;
combining the transition segments to obtain a transition video;
and carrying out music matching on the transition video to obtain the target video.
7. A video generation apparatus, comprising:
the video acquisition module is configured to execute the processing of the live video stream data to obtain a video to be processed;
the description fragment acquisition module is configured to execute content understanding processing on the video to be processed to obtain an object description fragment set corresponding to preset category information; each object description fragment in the object description fragment set comprises fragment duration information;
a characteristic information determination module configured to perform determining live characteristic information of each object description segment; the live broadcast characteristic information comprises sharing degree information; the sharing degree information represents the sharing degree of the objects in the object description fragment;
a target segment determination module configured to perform determining a target segment from the set of object description segments based on video duration information, segment duration information of the object description segments, and sharing degree information;
and the combining module is configured to combine the target segments to obtain the target video.
8. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the video generation method of any of claims 1 to 6.
9. A computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the video generation method of any of claims 1 to 6.
10. A computer program product, characterized in that the computer program product comprises a computer program stored in a readable storage medium, from which at least one processor of a computer device reads and executes the computer program, causing the computer device to perform the video generation method according to any one of claims 1 to 6.
CN202111599915.5A 2021-12-24 2021-12-24 Video generation method and device, electronic equipment and storage medium Pending CN114501058A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111599915.5A CN114501058A (en) 2021-12-24 2021-12-24 Video generation method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111599915.5A CN114501058A (en) 2021-12-24 2021-12-24 Video generation method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114501058A true CN114501058A (en) 2022-05-13

Family

ID=81496205

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111599915.5A Pending CN114501058A (en) 2021-12-24 2021-12-24 Video generation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114501058A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115086760A (en) * 2022-05-18 2022-09-20 阿里巴巴(中国)有限公司 Live video editing method, device and equipment
WO2023246395A1 (en) * 2022-06-21 2023-12-28 北京字跳网络技术有限公司 Method and apparatus for audio-visual content sharing, device, and storage medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109862388A (en) * 2019-04-02 2019-06-07 网宿科技股份有限公司 Generation method, device, server and the storage medium of the live video collection of choice specimens
CN111586474A (en) * 2020-05-21 2020-08-25 口碑(上海)信息技术有限公司 Live video processing method and device
CN111683209A (en) * 2020-06-10 2020-09-18 北京奇艺世纪科技有限公司 Mixed-cut video generation method and device, electronic equipment and computer-readable storage medium
US20200322647A1 (en) * 2019-04-02 2020-10-08 Wangsu Science & Technology Co., Ltd. Method, apparatus, server, and storage medium for generating live broadcast video of highlight collection
CN111866585A (en) * 2020-06-22 2020-10-30 北京美摄网络科技有限公司 Video processing method and device
CN112055225A (en) * 2019-06-06 2020-12-08 阿里巴巴集团控股有限公司 Live broadcast video interception, commodity information generation and object information generation methods and devices
CN113286173A (en) * 2021-05-19 2021-08-20 北京沃东天骏信息技术有限公司 Video editing method and device
CN113297422A (en) * 2021-01-19 2021-08-24 阿里巴巴集团控股有限公司 Data processing method, data processing apparatus, electronic device, medium, and program product
CN113824972A (en) * 2021-05-31 2021-12-21 腾讯科技(深圳)有限公司 Live video processing method, device and equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination