CN115665505A - Video generation method and device, electronic equipment and storage medium


Info

Publication number
CN115665505A
CN115665505A
Authority
CN
China
Prior art keywords
video
live
event
category
video segment
Legal status
Pending
Application number
CN202211228786.3A
Other languages
Chinese (zh)
Inventor
王俊
邓峰
李欣阳
王通
李�杰
Current Assignee
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Application filed by Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202211228786.3A
Publication of CN115665505A


Abstract

The disclosure provides a video generation method, a video generation device, electronic equipment and a storage medium, and belongs to the technical field of multimedia. The method comprises the following steps: classifying a live video through a multi-modal technique to obtain a live category of the live video; extracting at least one first video segment from the live video based on the live category, wherein the first video segment is used for presenting at least one live event belonging to the live category in the live video; for any first video segment, deleting video frames whose event association degree is smaller than an association degree threshold from the first video segment to obtain a second video segment corresponding to the first video segment, wherein the event association degree represents the degree of association between the content in a video frame and a live event; and generating a target video based on the at least one second video segment. The method makes the target video a video composed of video frames highly associated with the live events, improving the quality of the target video.

Description

Video generation method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of multimedia technologies, and in particular, to a video generation method and apparatus, an electronic device, and a storage medium.
Background
With the development of internet technology, watching live video has become a common form of entertainment for many users. However, video segments containing highlight content usually make up only a small portion of a live video, and a user who does not want to miss the highlights has to watch the live video attentively for a long time, so the efficiency of watching live video is low. How to improve this efficiency is therefore a research direction.
In the prior art, artificial intelligence is generally used to identify the start and end times of the highlight content in a live video, and the video segments containing the highlights are then automatically clipped to generate a highlight video. By watching this video, a user can obtain the highlight content of the live broadcast without watching it for a long time.
However, with the above scheme, the start and end times of the highlights identified by the terminal may be inaccurate, so the generated video may include non-highlight content, which reduces the quality of the generated video.
Disclosure of Invention
The present disclosure provides a video generation method, apparatus, electronic device, and storage medium, which can improve the quality of a generated target video. The technical scheme of the disclosure is as follows:
according to an aspect of the embodiments of the present disclosure, there is provided a video generation method, including:
classifying live videos through a multi-modal technology to obtain live categories of the live videos;
extracting at least one first video clip from the live video based on the live category, wherein the first video clip is used for presenting at least one live event belonging to the live category in the live video;
for any first video clip, deleting a video frame of which the event association degree is smaller than an association degree threshold value from the first video clip to obtain a second video clip corresponding to the first video clip, wherein the event association degree is used for expressing the association degree of content in the video frame and a live event;
and generating the target video based on the at least one second video segment.
In some embodiments, for any first video segment, deleting a video frame of which the event association degree is smaller than the association degree threshold from the first video segment to obtain a second video segment corresponding to the first video segment, including:
for any first video clip, determining the event association degree of each video frame in the first video clip and a corresponding live event through at least one of a multi-modal technology, an action detection technology and a scene detection technology;
and deleting the video frames with the event correlation degree smaller than the correlation degree threshold value from the first video clip to obtain a second video clip corresponding to the first video clip.
In some embodiments, the classifying the live video through the multi-modal technique to obtain the live category of the live video includes:
slicing the live video to obtain at least one third video segment;
for any third video clip, classifying the third video clip through the multi-modal technology to obtain the live broadcast category of the third video clip;
determining a live category of the live video based on the live category of the at least one third video segment.
In some embodiments, the determining the live category of the live video based on the live category of the at least one third video segment includes:
determining a target live broadcast category with the highest proportion based on the live broadcast category of the at least one third video clip;
and determining the target live broadcast category as the live broadcast category of the live broadcast video.
In some embodiments, the method further comprises:
acquiring at least one newly acquired fourth video clip;
updating the live category of the live video based on the live category of the at least one fourth video segment.
In some embodiments, said extracting at least one first video segment from said live video based on said live category comprises:
for any third video segment, determining at least one live event belonging to the live category based on the live category;
for any live event, determining a first time instant of earliest occurrence and a second time instant of latest occurrence of the live event in the third video segment;
based on the first time and the second time, extracting a first video segment comprising the live event from the third video segment.
In some embodiments, the number of the first video segments is plural;
the method further comprises the following steps:
and for a plurality of first video clips with the duration less than the preset duration, splicing the adjacent first video clips belonging to the same live event according to the time sequence to obtain at least one first video clip.
According to another aspect of the embodiments of the present disclosure, there is provided a video generating apparatus including:
the classification unit is configured to classify live videos through a multi-modal technology to obtain live categories of the live videos;
an extracting unit configured to extract at least one first video segment from the live video based on the live category, wherein the first video segment is used for presenting at least one live event belonging to the live category in the live video;
the deleting unit is configured to delete a video frame with an event association degree smaller than an association degree threshold value from any first video segment to obtain a second video segment corresponding to the first video segment, wherein the event association degree is used for expressing the association degree of content in the video frame and a live event;
a generating unit configured to generate a target video based on the at least one second video segment.
In some embodiments, the deleting unit is configured to determine, for any first video segment, an event association degree of each video frame in the first video segment with a corresponding live event through at least one of a multi-modal technique, an action detection technique, and a scene detection technique; and deleting the video frames with the event relevance smaller than the relevance threshold value from the first video clip to obtain a second video clip corresponding to the first video clip.
In some embodiments, the classification unit includes:
a slicing subunit configured to slice the live video to obtain at least one third video segment;
the classification subunit is configured to classify any third video segment through the multi-modal technology to obtain a live broadcast category of the third video segment;
a determining subunit configured to determine a live category of the live video based on the live category of the at least one third video segment.
In some embodiments, the determining subunit is configured to determine, based on the live category of the at least one third video segment, a target live category with the highest proportion; and determine the target live category as the live category of the live video.
In some embodiments, the apparatus further comprises:
an acquisition unit configured to acquire at least one newly acquired fourth video segment;
an updating unit configured to update a live category of the live video based on a live category of the at least one fourth video segment.
In some embodiments, the extracting unit is configured to determine, for any third video segment, based on the live category, at least one live event belonging to the live category; for any live event, determining a first time instant of earliest occurrence and a second time instant of latest occurrence of the live event in the third video segment; based on the first time and the second time, extracting a first video segment comprising the live event from the third video segment.
In some embodiments, the number of the first video segments is plural; the device further comprises:
the splicing unit is configured to, for a plurality of first video segments whose duration is less than a preset duration, splice adjacent first video segments belonging to the same live event in time order to obtain at least one first video segment.
According to another aspect of the embodiments of the present disclosure, there is provided an electronic device including:
one or more processors;
a memory for storing the processor executable program code;
wherein the processor is configured to execute the program code to implement the video generation method described above.
According to another aspect of the embodiments of the present disclosure, there is provided a computer-readable storage medium in which program codes, when executed by a processor of an electronic device, enable the electronic device to perform the above-described video generation method.
According to another aspect of embodiments of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the above-described video generation method.
The embodiment of the disclosure provides a video generation method, which classifies a live video through a multi-modal technique to obtain a live category of the live video, extracts at least one first video segment from the live video based on the live category, deletes, for any first video segment, the video frames whose event association degree is smaller than an association degree threshold from the first video segment to obtain a second video segment corresponding to the first video segment, and generates a target video based on the at least one second video segment, so that the target video is composed of video frames highly associated with the live events, and the quality of the target video is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
Fig. 1 is a schematic diagram illustrating an implementation environment of a video generation method according to an exemplary embodiment.
Fig. 2 is a flow chart illustrating a video generation method according to an exemplary embodiment.
Fig. 3 is a flow diagram illustrating another video generation method according to an example embodiment.
FIG. 4 is a schematic diagram illustrating one manner of processing according to an exemplary embodiment.
FIG. 5 is a flow diagram illustrating a method for generating a target video according to an exemplary embodiment.
FIG. 6 is another flow diagram illustrating the generation of a target video according to an exemplary embodiment.
FIG. 7 is a flow diagram illustrating another generation of a target video according to an example embodiment.
Fig. 8 is a block diagram illustrating a video generation apparatus according to an example embodiment.
Fig. 9 is a block diagram illustrating another video generation apparatus according to an example embodiment.
Fig. 10 is a block diagram illustrating an electronic device 1000 in accordance with an example embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
It should be noted that the information (including but not limited to user equipment information and user personal information), data (including but not limited to data for analysis, stored data, and presented data), and signals referred to in this application are all authorized by the user or fully authorized by all parties, and the collection, use, and processing of the relevant data must comply with the relevant laws, regulations, and standards of the relevant countries and regions. For example, the live video data referred to in this application is obtained with sufficient authorization.
The electronic device may be provided as a terminal or a server. When provided as a terminal, the operations performed by the video generation method may be implemented by the terminal; when provided as a server, they may be implemented by the server, which generates a target video based on a live video. The server and the terminal may also interact to implement the operations performed by the video generation method: for example, the terminal sends a video generation request to the server, the server generates the target video and feeds it back to the terminal, and the terminal outputs the target video.
Fig. 1 is a schematic diagram illustrating an implementation environment of a video generation method according to an exemplary embodiment. Taking the electronic device as an example provided as a server, referring to fig. 1, the implementation environment specifically includes: a terminal 101 and a server 102.
The terminal 101 may be at least one of a smart phone, a smart watch, a desktop computer, a laptop computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, and the like. An application may be installed and run on the terminal 101, and a user may log in to the application through the terminal 101 to obtain the services it provides. The terminal 101 may be connected to the server 102 through a wireless or wired network.
The terminal 101 generally refers to one of a plurality of terminals, and this embodiment is illustrated with the terminal 101 only. Those skilled in the art will appreciate that the number of terminals may be greater or fewer: there may be only a few terminals, or tens or hundreds of them, or more. The number of terminals and the device types are not limited in the embodiments of the present disclosure.
The server 102 may be an independent physical server, a server cluster or distributed system composed of a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, web services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), and big data and artificial intelligence platforms. In some embodiments, the server 102 undertakes the primary computing work and the terminal 101 the secondary computing work; or the server 102 undertakes the secondary computing work and the terminal 101 the primary computing work; or the server 102 and the terminal 101 perform cooperative computing using a distributed computing architecture. The server 102 may be connected to the terminal 101 and other terminals through a wireless or wired network, and the number of servers may be greater or fewer, which is not limited in the embodiments of the present disclosure. Of course, the server 102 may also include other functional servers to provide more comprehensive and diverse services.
Fig. 2 is a flow chart illustrating a video generation method according to an exemplary embodiment. As illustrated in fig. 2, the method is performed by an electronic device and includes the following steps:
in step S201, the live video is classified by a multi-modal technique to obtain a live category of the live video.
In the embodiment of the present disclosure, the electronic device may be installed and run with a target application program, and the target application program can provide a function of viewing live video. Optionally, the live video is a video recorded through a target virtual space, and the target virtual space may be a live broadcast room, a three-dimensional virtual space, or a two-dimensional virtual space provided by a target application program, which is not limited in this disclosure.
The electronic equipment acquires live videos in the target virtual space and classifies the live videos in the target virtual space through a multi-modal technology so as to obtain live categories of the live videos.
In the embodiments of the present disclosure, different live videos have different live categories, and content in different time periods within the same live video may also belong to different live categories. Live categories of live video include, but are not limited to: talent performance, language performance, daily life, interest sharing, science popularization education, multi-person performance, and the like, which are not limited by the embodiments of the present disclosure. Optionally, each live category includes a plurality of live events belonging to that category; the live events belonging to each live category are shown in Table 1.
TABLE 1
(Table 1 is reproduced as an image in the original publication; it lists the live events belonging to each live category.)
In the embodiments of the present disclosure, the electronic device may classify the live video from the perspectives of speech, vision, text, and the like through a multi-modal technique to determine the live category of the live video. Correspondingly, for any live video, the electronic device can extract features of multiple modalities from the live video through the multi-modal technique. The electronic device then fuses the features of the multiple modalities and determines the live category of the live video based on the fused features. The features of the various modalities include, but are not limited to: audio features, video features, text features, image features, and the like. Optionally, the electronic device may extract the audio, video, text, and image features of the live video by inputting it to an audio processing model, a video processing model, a text processing model, and an image processing model, respectively. Determining the live category based on features of multiple modalities allows the live video to be classified from multiple angles, so the live category can be determined more accurately.
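By way of illustration only, the fusion-then-classify step described above might be sketched as follows in Python. The extractor and classifier callables are hypothetical stand-ins, and concatenation as the fusion operator is an assumption; the disclosure does not fix concrete models.

```python
import numpy as np

# Category set taken from the disclosure; the extractors and classifier
# below are hypothetical stand-ins, not models named in the patent.
LIVE_CATEGORIES = ["talent performance", "language performance", "daily life",
                   "interest sharing", "science popularization education",
                   "multi-person performance"]

def classify_live_video(audio, frames, text, extractors, classifier):
    """Fuse per-modality features and predict a live category."""
    feats = [
        extractors["audio"](audio),      # audio features
        extractors["video"](frames),     # temporal video features
        extractors["text"](text),        # text features (e.g. ASR output)
        extractors["image"](frames[0]),  # key-frame image features
    ]
    fused = np.concatenate(feats)  # simple early fusion by concatenation
    scores = classifier(fused)     # one confidence score per category
    return LIVE_CATEGORIES[int(np.argmax(scores))]
```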
In step S202, at least one first video segment is extracted from the live video based on the live category, where the first video segment is used to present at least one live event belonging to the live category in the live video.
In the embodiment of the present disclosure, the electronic device performs event detection on the live video based on the live category to obtain at least one live event belonging to the live category in the live video. The electronic device then extracts the video frames used for presenting the at least one live event from the live video, thereby obtaining at least one first video segment.
In step S203, for any first video segment, deleting a video frame whose event association degree is smaller than the association degree threshold from the first video segment, and obtaining a second video segment corresponding to the first video segment, where the event association degree is used to represent the association degree between the content in the video frame and the live event.
In an embodiment of the present disclosure, for any first video segment, the electronic device determines an event association degree for each video frame in the first video segment. The event association degree represents the degree of association between the content in a video frame and the live event: the higher the event association degree, the more strongly the content in the video frame is associated with the live event, and the higher the probability that the video frame contains the live event. The electronic device deletes the video frames whose event association degree is smaller than the association degree threshold from the first video segment to obtain a second video segment corresponding to the first video segment. The second video segment comprises video frames whose association with the live event is not less than the association degree threshold; relative to the first video segment, the video frames in the second video segment are more strongly associated with the live event, so the quality of the second video segment is improved.
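A minimal sketch of this frame-filtering step, assuming a hypothetical per-frame relevance scorer in [0, 1]; the 0.5 threshold is an illustrative default, not a value disclosed in the patent.

```python
def trim_to_second_segment(first_segment, relevance, threshold=0.5):
    """Delete frames whose event association degree is below the threshold.

    `first_segment` is a list of video frames; `relevance(frame)` is a
    hypothetical scorer standing in for the multi-modal, action, and
    scene detection models described in the disclosure.
    """
    return [frame for frame in first_segment if relevance(frame) >= threshold]
```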
In step S204, a target video is generated based on the at least one second video segment.
In the embodiment of the present disclosure, after obtaining the at least one second video segment, the electronic device performs editing operations on the at least one second video segment to generate a target video; the target video is composed of live events, so its quality is improved. The editing operations include, but are not limited to: aggregating the at least one second video segment, generating a title for the target video, adding background music, adding a filter, adding stickers, adding subtitles, performing visual creation, rendering special effects, and the like.
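A rough sketch of the aggregation part of these editing operations only; the returned structure is purely illustrative, and real filters, stickers, or special effects would require a video-editing library.

```python
def generate_target_video(second_segments, title, background_music=None):
    """Concatenate second video segments in time order into a target video.

    The returned dict is a placeholder container, not a disclosed format.
    """
    frames = [frame for segment in second_segments for frame in segment]
    return {"title": title,
            "frames": frames,
            "background_music": background_music}
```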
The embodiment of the disclosure provides a video generation method, which classifies a live video through a multi-modal technique to obtain a live category of the live video, extracts at least one first video segment from the live video based on the live category, deletes, for any first video segment, the video frames whose event association degree is smaller than the association degree threshold from the first video segment to obtain a second video segment corresponding to the first video segment, and generates a target video based on the at least one second video segment, so that the target video is composed of video frames highly associated with the live events, and the quality of the target video is improved.
In some embodiments, for any first video segment, deleting a video frame of which the event relevance is smaller than the relevance threshold from the first video segment to obtain a second video segment corresponding to the first video segment, including:
for any first video clip, determining the event association degree of each video frame in the first video clip and the corresponding live event through at least one of a multi-modal technology, an action detection technology and a scene detection technology;
and deleting the video frames with the event relevance smaller than the relevance threshold value from the first video clip to obtain a second video clip corresponding to the first video clip.
In the embodiment of the disclosure, the video frames with the event relevance smaller than the relevance threshold are deleted from the first video segment to obtain the second video segment corresponding to the first video segment, so that the relevance degree of the video frames in the second video segment and the live broadcast event is higher, and the quality of the second video segment is improved.
In some embodiments, classifying the live video through a multi-modal technique to obtain a live category of the live video includes:
slicing the live video to obtain at least one third video segment;
for any third video clip, classifying the third video clip through a multi-modal technology to obtain a live broadcast category of the third video clip;
and determining the live broadcast category of the live broadcast video based on the live broadcast category of the at least one third video segment.
In the embodiments of the present disclosure, the live video can be sliced to obtain at least one third video segment, and each third video segment is classified through the multi-modal technique to obtain its live category, which facilitates parallel processing of the live video and improves processing efficiency.
In some embodiments, determining the live category of the live video based on the live category of the at least one third video segment includes:
determining a target live broadcast category with the highest proportion based on the live broadcast category of the at least one third video clip;
and determining the target live broadcast category as the live broadcast category of the live broadcast video.
In the embodiments of the present disclosure, the target live category with the highest proportion is used as the live category of the live video, which improves the accuracy of determining the live category.
In some embodiments, the method further comprises:
acquiring at least one newly acquired fourth video segment;
and updating the live category of the live video based on the live category of the at least one fourth video segment.
In the embodiment of the disclosure, the live category of the live video is updated, so that the real-time performance of determining the live category of the live video is improved.
In some embodiments, extracting at least one first video segment from the live video based on the event category corresponding to the live category includes:
for any third video segment, determining at least one live event belonging to a live category based on the live category;
for any live event, determining a first moment of earliest occurrence and a second moment of latest occurrence of the live event in a third video segment;
based on the first time and the second time, a first video segment including a live event is extracted from the third video segment.
In the embodiments of the present disclosure, the first moment at which a live event appears earliest and the second moment at which it appears latest are determined in the third video segment, and the video frames containing the live event are intercepted from the third video segment to obtain a first video segment that includes the live event, which improves the quality of the first video segment.
In some embodiments, the number of the first video segments is plural;
the method further comprises the following steps:
for a plurality of first video clips with the duration less than the preset duration, splicing the adjacent first video clips belonging to the same live event according to the time sequence to obtain at least one first video clip.
In the embodiments of the present disclosure, splicing the first video segments of shorter duration in time order reduces the number of first video segments, which improves processing efficiency; it also increases the length of the first video segments, which can improve processing precision.
Fig. 2 shows the basic flow of the present disclosure; the scheme provided by the present disclosure is further explained below based on an application scenario. Fig. 3 is a flow chart of another video generation method according to an exemplary embodiment, executed by an electronic device. Referring to fig. 3, the method includes:
in step S301, the live video is sliced to obtain at least one third video segment.
In some embodiments, the electronic device may capture the live video in real time during the live broadcast. After acquiring the live video, the electronic device can slice it at certain time intervals and segment it into at least one third video segment, where each third video segment contains at least one video frame.
In some embodiments, when the electronic device slices the live video by time interval, the interval may be fixed or random; the embodiments of the present disclosure do not limit the time interval. The time interval is the length of a third video segment and may be, for example, 10 seconds, 1 minute, or 5 minutes. Slicing the live video into at least one third video segment facilitates subsequent parallel processing of the segments and improves processing efficiency.
For example, if the time interval is set to be one minute by the electronic device, the live video is sliced at the time interval of one minute, and the duration of the obtained third video segment is one minute. Or, if the time interval set by the electronic device is a random time length, the time length of the obtained third video segment is random.
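A minimal sketch of fixed-interval slicing, assuming the live video is available as an in-memory frame list at a known frame rate; the one-minute interval mirrors the example above.

```python
def slice_live_video(frames, fps=30, interval_seconds=60):
    """Split a live video into third video segments of fixed duration."""
    step = fps * interval_seconds  # frames per third video segment
    return [frames[i:i + step] for i in range(0, len(frames), step)]
```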
In step S302, for any third video segment, the third video segment is classified through a multi-modal technique to obtain a live category of the third video segment.
In some embodiments, the electronic device can classify the third video segments from the speech, visual, and textual perspectives via a multi-modal technique to determine the live category of each third video segment. Correspondingly, for any third video segment, the electronic device can extract features of multiple modalities from the segment through the multi-modal technique, fuse the features of the multiple modalities, and determine the live category of the third video segment based on the fused features. The features of the various modalities include, but are not limited to: audio features, video features, text features, image features, and the like. Optionally, the electronic device may extract the audio, video, text, and image features of the third video segment by inputting it to an audio processing model, a video processing model, a text processing model, and an image processing model, respectively. Determining the live category of the third video segment based on features of multiple modalities allows the segment to be classified from multiple angles, so its live category can be determined more accurately.
In step S303, a live category of the live video is determined based on the live category of the at least one third video segment.
In the embodiment of the present disclosure, in the live broadcasting process, the electronic device may fuse the live broadcasting category of the at least one third video segment to obtain the live broadcasting category of the live video.
In some embodiments, the electronic device determines the target live category with the highest proportion based on the live category of the at least one third video segment, and determines the target live category as the live category of the live video. That is, the electronic device can determine the live category of the live video based on the proportions of the live categories: for any live category, if the third video segments belonging to that category account for the highest proportion of all third video segments, the electronic device takes that category as the live category of the live video.
For example, if the third video segments whose live category is talent performance account for the highest proportion of all the third video segments participating in the fusion, the electronic device determines that the live category of the live video is talent performance.
In some embodiments, the electronic device obtains at least one newly captured fourth video segment and updates the live category of the live video based on the live category of the at least one fourth video segment. During the live broadcast, the electronic device can continuously fuse the live categories of the video segments between the start of the broadcast and the current time, updating the live category of the live video until the broadcast ends, thereby obtaining the final live category of the live video.
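The highest-proportion fusion and its incremental update might be sketched as follows; majority voting over per-segment labels is an assumption consistent with the highest-proportion rule described above.

```python
from collections import Counter

def vote_live_category(segment_categories):
    """Return the live category with the highest share among segments."""
    return Counter(segment_categories).most_common(1)[0][0]

def update_live_category(history, new_segment_categories):
    """Fold newly collected fourth video segments into the running vote."""
    history.extend(new_segment_categories)
    return vote_live_category(history)
```

For example, `vote_live_category(["talent performance", "talent performance", "daily life"])` returns `"talent performance"`.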
In step S304, for any third video segment, based on the live category, at least one live event belonging to the live category is determined.
In the embodiments of the present disclosure, the electronic device may input any third video segment into an event detection model, which performs event detection on the third video segment and determines the video frames in it that contain at least one live event. The event detection model may be one or more of a voice detection model, a text detection model, an image detection model, and the like; the embodiments of the present disclosure do not limit the type or number of event detection models. A live event refers to content with viewing value, such as singing, dancing, and instrument playing related to talent performance, or spoken output related to language performance. Correspondingly, content other than live events refers to content without viewing value in the live video, such as invalid content in talent performances, frames without the streamer present, arranging the stage before dancing, or selecting songs before singing.
In some embodiments, for any third video segment, the electronic device may obtain, based on a live category of the third video segment, at least one event detection model related to the live category. The electronic device can then input the third video segment into the at least one event detection model to obtain at least one live event belonging to the live category. Optionally, for any third video segment, the electronic device may further determine a live event belonging to a live category of the third video segment, and obtain at least one event detection model related to the live event. Wherein, the event detection model related to the live event includes but is not limited to: musical instrument performance models, music models, singing models, dance models, and the like.
For example, after the electronic device determines that the live category of a third video segment is a talent performance, the electronic device determines that a live event related to the talent performance is musical instrument performance. The electronic device acquires an instrument performance model, and then inputs the third video segment to the instrument performance model, and the instrument performance model detects a plurality of video frames containing at least one instrument performance content.
In step S305, for any live event, a first time instant of earliest occurrence and a second time instant of latest occurrence of the live event are determined in a third video segment.
In this disclosure, for any live event, the electronic device may determine, in the third video segment, the earliest occurring video frame and the latest occurring video frame of the live event, and determine, based on the video frames, the earliest occurring first time and the latest occurring second time of the live event.
For example, the live category of the third video segment is talent performance, and the third video segment includes a singing live event. The electronic equipment determines the video frame of the accompaniment music starting to play and the video frame of the accompaniment music stopping to play in the third video clip. The electronic equipment takes the video frame of the accompaniment music which starts playing as the video frame which appears earliest in the live event, and takes the video frame of the accompaniment music which stops playing as the video frame which appears latest.
In step S306, a first video segment including the live event is extracted from the third video segment based on the first time and the second time.
In the embodiment of the disclosure, the electronic device determines, based on the first time and the second time, a start-stop time of a video frame containing a live event in the third video segment, and captures the video frame containing the live event from the third video segment to obtain a first video segment, where the first video segment includes the at least one live event.
For example, for a third video segment of 60 seconds, if the electronic device determines that the start and stop times of the video frames containing the live event are the 10th to the 60th second, it intercepts the video frames from the 10th to the 60th second of the third video segment to obtain the first video segment. Or, if the electronic device determines that the start and stop times of the video frames containing the live event are the 20th to the 30th second and the 50th to the 60th second, it intercepts the video frames from the 20th to the 60th second of the third video segment to obtain the first video segment.
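A sketch of this extraction step, assuming a hypothetical event detector that returns occurrence timestamps in seconds relative to the start of the third video segment.

```python
def extract_first_segment(third_segment, event_timestamps, fps=30):
    """Cut the span from the earliest to the latest event occurrence."""
    first_time = min(event_timestamps)   # first moment, in seconds
    second_time = max(event_timestamps)  # second moment, in seconds
    start = int(first_time * fps)
    stop = int(second_time * fps)
    return third_segment[start:stop + 1]  # frames containing the live event
```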
In some embodiments, there are a plurality of first video segments, and for the first video segments whose duration is less than a preset duration, the electronic device splices adjacent first video segments belonging to the same live event in time order to obtain at least one first video segment. Because the time interval between two adjacent first video segments is small, splicing them in time order does not produce a jarring jump or harm the continuity of the resulting segment; therefore, two first video segments can be spliced even if they are not strictly adjacent, while the video frames presenting the live event are retained. Splicing the short first video segments belonging to the same live event reduces the number of first video segments, which improves processing efficiency; it also increases the length of the first video segments, which can improve processing precision.
For example, for a third video segment of 60 seconds, if the electronic device determines that the start and stop times of the video frames belonging to the same live event are the 20th to the 30th second and the 50th to the 60th second, it intercepts those video frames from the third video segment to obtain two first video segments with a duration of 10 seconds each, and splices them in time order to obtain a first video segment with a duration of 20 seconds. As another example, for two adjacent third video segments, the electronic device determines that the start and stop times of the video frames of the live event are the 50th to the 60th second of the first segment and the 5th to the 25th second of the second segment. The electronic device intercepts the corresponding video frames from each segment. Since both intercepted first video segments are shorter than 30 seconds, the electronic device splices them in time order to obtain a first video segment with a duration of 30 seconds.
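A sketch of the splicing rule, assuming the input clips all present the same live event and are already time-ordered; the 30-second preset mirrors the example above and is not a disclosed constant.

```python
def splice_short_clips(clips, fps=30, preset_seconds=30):
    """Concatenate adjacent short clips of one live event in time order."""
    min_frames = fps * preset_seconds
    spliced, buffer = [], []
    for clip in clips:  # each clip is a list of frames
        buffer.extend(clip)
        if len(buffer) >= min_frames:
            spliced.append(buffer)
            buffer = []
    if buffer:  # keep a trailing clip even if it is still short
        spliced.append(buffer)
    return spliced
```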
In step S307, for any first video segment, an event association degree between each video frame in the first video segment and the corresponding live event is determined through at least one of a multi-modal technique, an action detection technique, and a scene detection technique.
In the embodiments of the present disclosure, a first video segment includes at least one live event, but it may also include video frames with a low event association degree with the corresponding live event. The video frames with a higher event association degree, such as those showing the singing process or the dancing process, need to be retained. The electronic device may determine the event association degree of each video frame in the first video segment with the corresponding live event through at least one of a multi-modal technique, an action detection technique, and a scene detection technique.
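One plausible way to combine these detectors into a single per-frame score; uniform weighting is an assumption, as the disclosure states only that at least one technique is used.

```python
def event_relevance(frame, detectors, weights=None):
    """Blend multi-modal, action, and scene detector scores for one frame.

    Each detector maps a frame to a score in [0, 1]; uniform weights are
    an illustrative default, not a disclosed choice.
    """
    scores = [detect(frame) for detect in detectors]
    if weights is None:
        weights = [1.0 / len(scores)] * len(scores)
    return sum(w * s for w, s in zip(weights, scores))
```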
For example, a first video segment containing singing content may include video frames in which the original recording of a song is played; a first video segment containing dancing may include video frames with talking, background music being set up, a low level of performance, or monotonous movements; a first video segment containing chat interaction may include video frames of gift acknowledgements, scripted talk, and illegal content. All of these are video frames with a low event association degree. Table 2 lists some video frames with low event association.
TABLE 2
(Table 2 is reproduced as images in the original publication; it lists examples of video frames with low event association for each live category.)
In step S308, the video frames with the event relevance smaller than the relevance threshold are deleted from the first video segment, so as to obtain a second video segment corresponding to the first video segment.
In the embodiment of the disclosure, for first video segments of different live broadcast categories, after determining the event association degree of each video frame in the first video segment and a corresponding live broadcast event through at least one of a multi-modal technology, an action detection technology, and a scene detection technology, the electronic device deletes the video frame of which the event association degree is smaller than an association degree threshold value from the first video segment to obtain a second video segment corresponding to the first video segment, so as to improve the precision of the second video segment. See the following six modes.
In a first mode, when the first video segment presents a live event related to talent performance, the electronic device performs at least one of non-performance detection, blank detection, and original-singing detection on the first video segment, and determines the event association degree of each video frame in the first video segment with the talent performance. The electronic device cuts the first video segment based on the event association degree and deletes the video frames whose event association degree is smaller than the association degree threshold to obtain a second video segment. Non-performance detection is used to detect video frames in which no talent performance takes place, blank detection is used to detect video frames with blank (idle) content, and original-singing detection is used to detect video frames in which the original recording is played; the event association degree of such video frames is small. Performing at least one of these detections on the first video segment removes the video frames whose event association degree is smaller than the association degree threshold, improving the accuracy of the generated second video segment.
In a second mode, when the first video segment presents a live event related to language performance, the electronic device performs at least one of scripted-talk detection, voice content detection, and voice security detection on the first video segment, and determines the event association degree of each video frame in the first video segment with the language performance. The electronic device cuts the first video segment based on the event association degree and deletes the video frames whose event association degree is smaller than the association degree threshold to obtain a second video segment. Scripted-talk detection is used to detect video frames of scripted content, voice content detection is used to detect video frames of spoken content, and voice security detection is used to detect video frames with security problems. Performing at least one of these detections on the first video segment removes the video frames whose event association degree is smaller than the association degree threshold, improving the accuracy of the generated second video segment.
In a third mode, when the first video segment presents a live event related to daily life, the electronic device performs at least one of scene detection, object detection, and action detection on the first video segment, and determines the event association degree of each video frame in the first video segment with the daily-life event. The electronic device cuts the first video segment based on the event association degree and deletes the video frames whose event association degree is smaller than the association degree threshold to obtain the corresponding second video segment. Scene detection is used to detect the shooting scene of the first video segment, object detection is used to detect the subjects in the first video segment, and action detection is used to detect the subjects' actions. Performing at least one of these detections removes the video frames whose event association degree is smaller than the association degree threshold, improving the accuracy of the generated second video segment.
In a fourth mode, when the first video segment presents a live event related to interest sharing, the electronic device performs at least one of object detection, action detection, voice detection, and audio detection on the first video segment, and determines the event association degree of each video frame with the interest-sharing event. The electronic device cuts the first video segment based on the event association degree and deletes the video frames whose event association degree is smaller than the association degree threshold to obtain the corresponding second video segment. Voice detection is used to detect video frames of the target voice content in the first video segment, and audio detection is used to detect video frames of the target audio content. Performing at least one of these detections removes the video frames whose event association degree is smaller than the association degree threshold, improving the accuracy of the generated second video segment.
In a fifth mode, when the first video segment presents a live event related to science popularization education, the electronic device performs at least one of object detection, voice detection, and text detection on the first video segment, and determines the event association degree of each video frame with the science popularization education event. The electronic device cuts the first video segment based on the event association degree and deletes the video frames whose event association degree is smaller than the association degree threshold to obtain the corresponding second video segment. Text detection is used to detect video frames of text content in the first video segment. Performing at least one of these detections removes the video frames whose event association degree is smaller than the association degree threshold, improving the accuracy of the generated second video segment.
In a sixth mode, when the first video segment presents a live event related to multi-person performance, the electronic device performs at least one of object detection, voice detection, action detection, music detection, and dance detection on the first video segment, and determines the event association degree of each video frame with the multi-person performance. The electronic device cuts the first video segment based on the event association degree and deletes the video frames whose event association degree is smaller than the association degree threshold to obtain the corresponding second video segment. Music detection is used to detect video frames of music content in the first video segment, and dance detection is used to detect video frames of dance content. Performing at least one of these detections removes the video frames whose event association degree is smaller than the association degree threshold, improving the accuracy of the generated second video segment.
Fig. 4 is a schematic diagram illustrating a processing manner according to an exemplary embodiment. As shown in fig. 4, the processing of the first video segment includes: audio event detection, audio fingerprinting, cross-modal retrieval, face recognition, lip motion detection, and the like. These processing methods can effectively remove content such as original-recording singing, low-level performance, security problems, long blanks, non-performance, and low-quality content. Audio event detection is used to detect the audio content corresponding to a live event in the first video segment; audio fingerprinting is used to identify target sounds in the first video segment; cross-modal retrieval can retrieve the live event based on features of multiple modalities in the first video segment; face recognition is used to identify the target object in the first video segment; and lip motion detection is used to identify the lip movements of the target object. The server can also address problems such as illicit speech and insubstantial or insufficiently engaging content based on an ASR (Automatic Speech Recognition) model, an NLP (Natural Language Processing) model, speaker diarization, and related techniques combined with vision, effectively ensuring the quality and security of the generated second video segment.
It should be noted that the association degree of the video frames in the second video segment with the live event is not less than the association degree threshold; the second video segment may also be referred to as a highlight video segment, a highlight content segment, or the like.
In step S309, a target video is generated based on the at least one second video segment.
In the embodiment of the present disclosure, the electronic device may adopt an intelligent creation module to perform processing such as title generation, audio generation, visual creation, and special-effect rendering on the at least one second video segment to obtain the target video.
In some embodiments, the electronic device may further acquire a recorded live video after the live broadcast is finished, and then perform event detection, cutting, editing and creation on the live video through the scheme provided by the present disclosure to generate a target video.
The following describes the process of generating a target video when the live event presented in the first video segment is related to talent performance, as shown in fig. 5. Live events related to talent performance include instrument playing, dance/fitness, singing, and the like. First, the electronic device acquires a live video, slices it at a certain time interval, and segments it into at least one third video segment. The electronic device then classifies the third video segment and determines that its live category label is instrument playing. Next, the electronic device inputs the third video segment into an instrument performance model, which detects the timestamps containing the instrument-playing content. Based on the timestamps, the electronic device intercepts the video frames containing the instrument playing from the third video segment to obtain at least one first video segment. The electronic device then performs instrument-playing detection and blank detection on the first video segment and determines the event association degree of each video frame with the instrument-playing event. The electronic device cuts the first video segment based on the event association degree and deletes the video frames whose event association degree is smaller than the association degree threshold to obtain a second video segment. Finally, the electronic device performs editing operations on the second video segment through the intelligent creation module to generate the target video. When the live category label of the third video segment is dance, fitness, or singing, the third video segment is input into the model related to that live category for event detection, the timestamps of the live events belonging to the category are determined, and the cutting and intelligent creation described above are performed to generate the target video, which is not repeated here.
A description will be given below of the flow of generating a target video in the case where the live event presented by the first video segment is a live event related to language performance, as shown in fig. 6. Live events related to language performance include inspirational talk, emotional interaction, chat interaction, and the like. For live events related to language performance, generating the target video means cutting out the video segments that are highly interactive or interesting and generating the target video from them. First, the electronic device acquires a live video, slices the live video at a certain time interval, and divides the live video into at least one third video segment. The electronic device then classifies the third video segment and determines that the live category label of the third video segment is chat interaction. The electronic device inputs the third video segment into a speech recognition model, which detects the timestamps containing the voice interaction. The electronic device intercepts the video frames containing the voice interaction from the third video segment based on the timestamps to obtain at least one first video segment. The electronic device then performs dialect detection, highlight degree detection, insubstantial content detection, and security problem detection on the first video segment, and determines the event association degree of each video frame in the first video segment with the live event related to chat interaction. The electronic device cuts the first video segment based on the event association degree, deleting the video frames whose event association degree is smaller than the association degree threshold, to obtain a second video segment. The electronic device performs editing operations on the second video segment through the intelligent creation module to generate the target video. Highlight degree detection performs voice detection, motion detection, and text detection on the first video segment to judge the speaking intensity and speaking emotion of the target object: video frames with high speaking intensity and more positive emotion can be considered to have a higher highlight degree, while video frames with lower speaking intensity and flatter emotion can be considered to have a lower highlight degree. Security problem detection is used to detect the video frames in the first video segment that present security problems.
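The highlight degree judgement described above could be sketched as a simple weighted fusion of the two signals; the [0, 1] normalisation and the weights are assumptions, not values from the disclosure.

def highlight_degree(speech_intensity: float, emotion_valence: float) -> float:
    """Combine speaking intensity and emotion positivity, both assumed to be
    normalised to [0, 1]: intense, positive speech scores high; flat,
    low-energy speech scores low, matching the description above."""
    return 0.6 * speech_intensity + 0.4 * emotion_valence

# e.g. an animated, upbeat frame: highlight_degree(0.9, 0.8) -> 0.86
#      a flat, monotone frame:   highlight_degree(0.2, 0.3) -> 0.24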
In some embodiments, the electronic device may further acquire a recorded live video after the live broadcast is finished, and then perform event detection, clipping, editing, and authoring on the live video through the scheme provided by the present disclosure to generate a target video, as shown in fig. 7. The electronic device divides the live video into at least one third video segment and classifies the at least one third video segment to obtain its live category; it then performs event detection on the audio information and video information in any third video segment to obtain a first video segment including a live event, cuts the first video segment by removing the video frames whose event association degree is smaller than the association degree threshold to obtain a second video segment, and edits and creates based on the second video segment to generate the target video.
Fig. 7 shows the overall process of generating a target video based on a live video. The electronic device slices the live video at a certain time interval and divides the live video into at least one third video segment. The electronic device classifies the at least one third video segment to obtain the live category of the at least one third video segment, where the live category of a third video segment includes but is not limited to: talent performance, language, daily life, interest sharing, science popularization education, multi-person performance, and the like. The electronic device then performs event detection on the audio information and video information in any third video segment to obtain a first video segment including a live event. The event detection manner includes but is not limited to: singing detection, dance detection, musical instrument detection, laughter detection, face detection, and the like. The electronic device identifies the occurrence moments of the live event in the third video segment to obtain the timestamps of the live event, and extracts a first video segment including the live event from the third video segment based on those timestamps. The electronic device can also delete the video frames whose event association degree is smaller than the association degree threshold from the first video segment to obtain a second video segment corresponding to the first video segment. The manner of determining the event association degree includes but is not limited to: original singing detection, non-performance detection, blank detection, dialect detection, and object detection. The electronic device edits and creates based on the second video segment to generate the target video, and can also share the generated target video to other terminals for display, thereby further improving the influence of the live video.
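Putting the fig. 7 flow together, a highly simplified end-to-end sketch in Python might read as follows; every helper is a hypothetical stub standing in for a model or module named above but not specified in the disclosure, and the slice length and association degree threshold are illustrative assumptions.

def slice_video(frames, chunk=900):          # step 1: fixed-interval slicing (e.g. 30 s at 30 fps)
    return [frames[i:i + chunk] for i in range(0, len(frames), chunk)]

def classify_segment(segment):               # step 2: stand-in for the multi-modal classifier
    return "talent_performance"

def detect_event_frames(segment, category):  # step 3: stand-in for the event detector
    return [f for f in segment if f["event_conf"] > 0.7]

def association_degree(frame, category):     # step 4: stand-in for the relevance scorer
    return frame["event_conf"]

def edit_and_create(clip):                   # step 5: stand-in for the creation module
    return {"frames": clip, "title": "highlight"}

def generate_target_videos(frames, threshold=0.75):
    videos = []
    for segment in slice_video(frames):
        category = classify_segment(segment)
        first = detect_event_frames(segment, category)   # first video segment
        second = [f for f in first
                  if association_degree(f, category) >= threshold]  # second video segment
        if second:
            videos.append(edit_and_create(second))       # target video
    return videos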
All of the above optional technical solutions may be combined arbitrarily to form optional embodiments of the present disclosure, which are not described in detail herein.
Fig. 8 is a block diagram illustrating a video generation apparatus according to an example embodiment. As shown in fig. 8, the apparatus includes: a classification unit 801, an extraction unit 802, a deletion unit 803, and a generation unit 804.
The classification unit 801 is configured to classify the live video through a multi-modal technology, so as to obtain a live category of the live video;
an extracting unit 802 configured to extract at least one first video clip from the live video based on the live category, the first video clip being used for presenting at least one live event belonging to the live category in the live video;
a deleting unit 803, configured to, for any first video segment, delete a video frame whose event association degree is smaller than the association degree threshold from the first video segment, to obtain a second video segment corresponding to the first video segment, where the event association degree is used to indicate the association degree between content in the video frame and a live event;
a generating unit 804 configured to generate the target video based on the at least one second video segment.
In some embodiments, the deleting unit 803 is configured to determine, for any first video segment, an event association degree of each video frame in the first video segment with a corresponding live event through at least one of a multi-modal technique, an action detection technique, and a scene detection technique; and deleting the video frames with the event correlation degree smaller than the correlation degree threshold value from the first video clip to obtain a second video clip corresponding to the first video clip.
In some embodiments, fig. 9 is a block diagram illustrating another video generation apparatus according to an example embodiment. Referring to fig. 9, the classification unit 801 includes:
a slicing subunit 8011 configured to slice the live video into at least one third video segment;
The classifying subunit 8012 is configured to classify any third video segment through a multi-modal technology to obtain the live category of the third video segment;
a determining subunit 8013 configured to determine a live category of the live video based on the live category of the at least one third video segment.
In some embodiments, the determining subunit 8013 is configured to determine, based on the live category of the at least one third video segment, a target live category with the highest proportion; and determine the target live category as the live category of the live video.
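As a minimal sketch, the proportion-based determination performed by this subunit can be expressed with a counter; the tie-breaking behaviour (first-seen label wins) is an assumption.

from collections import Counter

def live_category(segment_categories: list[str]) -> str:
    """Return the live category with the highest proportion among the
    per-segment labels; ties resolve to the first label encountered."""
    return Counter(segment_categories).most_common(1)[0][0]

# e.g. live_category(["singing", "chat", "singing"]) -> "singing"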
In some embodiments, referring to fig. 9, the apparatus further comprises:
an obtaining unit 805 configured to obtain at least one newly acquired fourth video segment;
an updating unit 806 configured to update the live category of the live video based on the live category of the at least one fourth video segment.
In some embodiments, the extracting unit 802 is configured to determine, for any third video segment, based on the live category, at least one live event belonging to the live category; for any live event, determining a first moment of earliest occurrence and a second moment of latest occurrence of the live event in a third video segment; based on the first time and the second time, a first video segment including a live event is extracted from a third video segment.
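A simplified reading of this extraction step, sketched in Python; the per-frame hit representation is an assumption, since the disclosure does not fix how occurrence moments are reported.

def event_span(frame_times: list[float], hits: list[bool]):
    """Earliest (first) and latest (second) occurrence moments of a live event
    in one third video segment, given per-frame detector hits; returns None
    when the event never occurs."""
    times = [t for t, hit in zip(frame_times, hits) if hit]
    if not times:
        return None
    return min(times), max(times)

# e.g. event_span([0.0, 0.5, 1.0, 1.5], [False, True, True, False]) -> (0.5, 1.0)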
In some embodiments, the number of the first video segments is plural; referring to fig. 9, the apparatus further comprises:
the splicing unit 807 is configured to splice, for a plurality of first video segments with duration less than a preset duration, the first video segments that are adjacent and belong to the same live event according to a time sequence order, so as to obtain at least one first video segment.
The embodiment of the disclosure provides a video generation apparatus, which classifies a live video through a multi-modal technology to obtain the live category of the live video, extracts at least one first video segment from the live video based on the live category, deletes, for any first video segment, the video frames whose event association degree is smaller than the association degree threshold from the first video segment to obtain a second video segment corresponding to the first video segment, and generates a target video based on the at least one second video segment, so that the target video is composed of video frames highly associated with the live event, thereby improving the quality of the target video.
It should be noted that, when the apparatus provided in the foregoing embodiment performs video generation, only the division of each functional unit is illustrated, and in practical applications, the function distribution may be completed by different functional units as needed, that is, the internal structure of the electronic device may be divided into different functional units to complete all or part of the functions described above. In addition, the video generation apparatus and the video generation method provided by the above embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments in detail and are not described herein again.
With regard to the apparatus in the above-described embodiment, the specific manner in which each unit performs the operation has been described in detail in the embodiment related to the method, and will not be described in detail here.
Fig. 10 is a block diagram illustrating an electronic device 1000 in accordance with an example embodiment. In general, the electronic device 1000 includes: a processor 1001 and a memory 1002.
Processor 1001 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so forth. The processor 1001 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 1001 may also include a main processor and a coprocessor, where the main processor is a processor for Processing data in a wake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 1001 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed on the display screen. In some embodiments, the processor 1001 may further include an AI (Artificial Intelligence) processor for processing a calculation operation related to machine learning.
Memory 1002 may include one or more computer-readable storage media, which may be non-transitory. The memory 1002 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in the memory 1002 is used to store at least one program code for execution by the processor 1001 to implement the video generation methods provided by the method embodiments in the present disclosure.
In some embodiments, the electronic device 1000 may further optionally include: a peripheral interface 1003 and at least one peripheral. The processor 1001, memory 1002 and peripheral interface 1003 may be connected by a bus or signal line. Various peripheral devices may be connected to peripheral interface 1003 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of a radio frequency circuit 1004, a display screen 1005, a camera assembly 1006, an audio circuit 1007, and a power supply 1008.
The peripheral interface 1003 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 1001 and the memory 1002. In some embodiments, processor 1001, memory 1002, and peripheral interface 1003 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 1001, the memory 1002, and the peripheral interface 1003 may be implemented on separate chips or circuit boards, which are not limited by this embodiment.
The Radio Frequency circuit 1004 is used to receive and transmit RF (Radio Frequency) signals, also known as electromagnetic signals. The radio frequency circuitry 1004 communicates with communication networks and other communication devices via electromagnetic signals. The radio frequency circuit 1004 converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 1004 comprises: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuitry 1004 may communicate with other electronic devices via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: metropolitan area networks, various generation mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 1004 may also include NFC (Near Field Communication) related circuits, which are not limited by this disclosure.
A display screen 1005 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 1005 is a touch display screen, the display screen 1005 also has the ability to capture touch signals on or over the surface of the display screen 1005. The touch signal may be input to the processor 1001 as a control signal for processing. At this point, the display screen 1005 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, the display screen 1005 may be one, providing a front panel of the electronic device 1000; in other embodiments, the display screens 1005 may be at least two, which are respectively disposed on different surfaces of the electronic device 1000 or in a foldable design; in still other embodiments, the display screen 1005 may be a flexible display screen disposed on a curved surface or a folded surface of the electronic device 1000. Even more, the display screen 1005 may be arranged in a non-rectangular irregular figure, i.e., a shaped screen. The Display screen 1005 may be made of LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode), or other materials.
The camera assembly 1006 is used to capture images or video. Optionally, the camera assembly 1006 includes a front camera and a rear camera. Generally, a front camera is disposed on a front panel of an electronic apparatus, and a rear camera is disposed on a rear surface of the electronic apparatus. In some embodiments, the number of the rear cameras is at least two, and each rear camera is any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fusion shooting functions. In some embodiments, the camera assembly 1006 may also include a flash. The flash lamp can be a monochrome temperature flash lamp or a bicolor temperature flash lamp. The double-color-temperature flash lamp is a combination of a warm-light flash lamp and a cold-light flash lamp, and can be used for light compensation at different color temperatures.
The audio circuit 1007 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals into the processor 1001 for processing or inputting the electric signals into the radio frequency circuit 1004 for realizing voice communication. For the purpose of stereo sound collection or noise reduction, a plurality of microphones may be provided at different portions of the electronic device 1000. The microphone may also be an array microphone or an omni-directional acquisition microphone. The speaker is used to convert electrical signals from the processor 1001 or the radio frequency circuit 1004 into sound waves. The loudspeaker can be a traditional film loudspeaker or a piezoelectric ceramic loudspeaker. When the speaker is a piezoelectric ceramic speaker, the speaker can be used for purposes such as converting an electric signal into a sound wave audible to a human being, or converting an electric signal into a sound wave inaudible to a human being to measure a distance. In some embodiments, the audio circuit 1007 may also include a headphone jack.
The power supply 1008 is used to power the various components in the electronic device 1000. The power source 1008 may be alternating current, direct current, disposable batteries, or rechargeable batteries. When power source 1008 comprises a rechargeable battery, the rechargeable battery may support wired or wireless charging. The rechargeable battery can also be used to support fast charge technology.
Those skilled in the art will appreciate that the configuration shown in fig. 10 is not limiting of the electronic device 1000 and may include more or fewer components than shown, or combine certain components, or employ a different arrangement of components.
In an exemplary embodiment, there is also provided a computer-readable storage medium including instructions, such as the memory 1002 including instructions executable by the processor 1001 of the electronic device 1000 to perform the above-described video generation method. Alternatively, the computer-readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
A computer program product comprising a computer program which, when executed by a processor, implements the video generation method described above.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A method of video generation, comprising:
classifying live videos through a multi-mode technology to obtain live categories of the live videos;
extracting at least one first video clip from the live video based on the live category, wherein the first video clip is used for presenting at least one live event belonging to the live category in the live video;
for any first video segment, deleting a video frame with an event association degree smaller than an association degree threshold value from the first video segment to obtain a second video segment corresponding to the first video segment, wherein the event association degree is used for expressing the association degree of content in the video frame and a live event;
and generating the target video based on the at least one second video segment.
2. The video generation method according to claim 1, wherein for any first video segment, deleting a video frame whose event relevance is smaller than a relevance threshold from the first video segment to obtain a second video segment corresponding to the first video segment, includes:
for any first video clip, determining the event association degree of each video frame in the first video clip and a corresponding live event through at least one of a multi-modal technology, an action detection technology and a scene detection technology;
and deleting the video frames with the event correlation degree smaller than the correlation degree threshold value from the first video clip to obtain a second video clip corresponding to the first video clip.
3. The method according to claim 1, wherein the classifying the live video through a multi-modal technique to obtain a live category of the live video comprises:
slicing the live video to obtain at least one third video segment;
for any third video clip, classifying the third video clip through the multi-modal technology to obtain the live broadcast category of the third video clip;
determining a live category of the live video based on the live category of the at least one third video segment.
4. The video generation method of claim 3, wherein the determining the live category of the live video based on the live category of the at least one third video segment comprises:
determining a target live broadcast category with the highest proportion based on the live broadcast category of the at least one third video clip;
and determining the target live broadcast category as the live broadcast category of the live broadcast video.
5. The video generation method of claim 4, wherein the method further comprises:
acquiring at least one newly acquired fourth video clip;
updating the live category of the live video based on the live category of the at least one fourth video segment.
6. The video generation method according to claim 3, wherein the extracting at least one first video segment from the live video based on the live category comprises:
for any third video segment, determining at least one live event belonging to the live category based on the live category;
for any live event, determining a first time instant of earliest occurrence and a second time instant of latest occurrence of the live event in the third video segment;
based on the first time and the second time, extracting a first video segment comprising the live event from the third video segment.
7. The video generation method according to claim 6, wherein the number of the first video segments is plural;
the method further comprises the following steps:
for a plurality of first video clips with the duration less than the preset duration, splicing the adjacent first video clips belonging to the same live event according to the time sequence to obtain at least one first video clip.
8. A video generation apparatus, characterized in that the apparatus comprises:
the classification unit is configured to classify live videos through a multi-modal technology to obtain live categories of the live videos;
an extracting unit configured to extract at least one first video segment from the live video based on the live category, wherein the first video segment is used for presenting at least one live event belonging to the live category in the live video;
the deleting unit is configured to delete a video frame with an event association degree smaller than an association degree threshold value from any first video segment to obtain a second video segment corresponding to the first video segment, wherein the event association degree is used for expressing the association degree of content in the video frame and a live event;
a generating unit configured to generate a target video based on the at least one second video segment.
9. An electronic device, characterized in that the electronic device comprises:
one or more processors;
a memory for storing the processor executable program code;
wherein the processor is configured to execute the program code to implement the video generation method of any of claims 1 to 7.
10. A computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the video generation method of any of claims 1 to 7.
CN202211228786.3A 2022-10-09 2022-10-09 Video generation method and device, electronic equipment and storage medium Pending CN115665505A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211228786.3A CN115665505A (en) 2022-10-09 2022-10-09 Video generation method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115665505A true CN115665505A (en) 2023-01-31

Family

ID=84987122


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination