CN117641073A - Video cover generation method, device, equipment and storage medium - Google Patents
Video cover generation method, device, equipment and storage medium
- Publication number
- CN117641073A (Application No. CN202311705431.3A)
- Authority
- CN
- China
- Prior art keywords
- video
- video frame
- content
- detection result
- picture
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/85—Assembly of content; Generation of multimedia applications
- H04N21/854—Content authoring
- H04N21/8549—Creating video summaries, e.g. movie trailer
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
- H04N21/4402—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/81—Monomedia components thereof
- H04N21/8146—Monomedia components thereof involving graphical data, e.g. 3D object, 2D graphics
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/81—Monomedia components thereof
- H04N21/816—Monomedia components thereof involving special video data, e.g. 3D video
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Computer Graphics (AREA)
- Databases & Information Systems (AREA)
- Computer Security & Cryptography (AREA)
- Television Signal Processing For Recording (AREA)
Abstract
The application discloses a video cover generation method, device, equipment and storage medium, belonging to the technical field of image processing. The method comprises the following steps: obtaining m video frames from a material video, where m is a positive integer; for each of the m video frames, performing picture element detection on the video frame to obtain an element detection result of the video frame, the element detection result being used to characterize the area occupied by picture elements in the video frame, the picture elements being related to the picture content of the video frame; cropping the video frame according to the element detection result to obtain a cropped image corresponding to the video frame, the size of the cropped image being smaller than or equal to that of the video frame; and generating a video cover of the material video according to the cropped images corresponding to the m video frames. By detecting and cropping the video frames, negative content in the video frames is removed while the effective picture content of the video is retained, improving the quality of the generated video cover.
Description
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for generating a video cover.
Background
With the development of multimedia technology, more and more users are taking part in video creation and sharing. Before a material video is uploaded to the network, a video cover is assigned to it so that other users can quickly grasp the content of the material video.
In the related art, when generating a video cover for a material video, the user needs to designate an initial cover in advance, and a computer device refines the initial cover according to the video frames included in the material video to obtain the video cover of the material video. Specifically, the computer device finds a target video frame similar to the initial cover among the video frames of the material video, and corrects the pixel values in the initial cover according to the pixel values in the target video frame to obtain a video cover with higher definition.
However, the related art requires the user to participate in the cover generation process, and the quality of the video cover depends on the initial cover uploaded by the user, which caps the achievable quality of the video cover.
Disclosure of Invention
The application provides a video cover generation method, device, equipment and storage medium. The technical scheme is as follows:
according to an aspect of an embodiment of the present application, there is provided a video cover generation method, including:
Obtaining m video frames from a material video, wherein m is a positive integer;
for each video frame in the m video frames, performing picture element detection on the video frame to obtain an element detection result of the video frame, wherein the element detection result is used for representing the area occupied by picture elements in the video frame, and the picture elements are related to picture content in the video frame;
cropping the video frame according to the element detection result to obtain a cropped image corresponding to the video frame, wherein the size of the cropped image is smaller than or equal to that of the video frame;
and generating a video cover of the material video according to the cropped images respectively corresponding to the m video frames.
According to an aspect of an embodiment of the present application, there is provided a video cover generating apparatus, including:
the video frame acquisition module is used for acquiring m video frames from the material video, wherein m is a positive integer;
the element detection module is used for carrying out picture element detection on each video frame in the m video frames to obtain an element detection result of the video frames, wherein the element detection result is used for representing the area occupied by picture elements in the video frames, and the picture elements are related to picture content in the video frames;
the video frame cropping module is used for cropping the video frame according to the element detection result to obtain a cropped image corresponding to the video frame, wherein the size of the cropped image is smaller than or equal to that of the video frame;
and the cover generation module is used for generating a video cover of the material video according to the cropped images respectively corresponding to the m video frames.
According to an aspect of the embodiments of the present application, there is provided a computer device, including a processor and a memory, the memory storing computer instructions that the processor loads and executes to implement the video cover generation method described above.
According to an aspect of the embodiments of the present application, there is provided a computer readable storage medium having stored therein computer instructions that are loaded and executed by a processor from the storage medium to implement the video cover generation method described above.
According to one aspect of embodiments of the present application, there is provided a computer program product comprising computer instructions stored in a computer readable storage medium, the computer instructions being loaded and executed by a processor from the computer readable storage medium to implement the video cover generation method described above.
The technical solutions provided in the embodiments of the present application bring at least the following beneficial effects:
According to the embodiments of the present application, a video cover of a material video can be generated automatically based on the video frames in the material video, which makes automatic and batch generation of video covers possible. In the process of generating the video cover, picture element detection is performed on each video frame, and the video frame is cropped according to the element detection result produced by the detection, so as to obtain a cropped image. In this way, part of the picture content of the video frame is selectively removed, so that the picture content retained in the cropped image is suitable for serving as the picture content of the video cover, which in turn improves the quality of the video cover obtained from the cropped images. This facilitates automatic, high-quality generation of video covers for material videos.
Drawings
FIG. 1 is a schematic illustration of an implementation environment for an approach provided by an exemplary embodiment of the present application;
FIG. 2 is a schematic diagram of the inventive concept of the present application;
FIG. 3 is a flowchart of a video cover generation method provided by an exemplary embodiment of the present application;
FIG. 4 is a schematic diagram of a video frame acquisition process provided in one exemplary embodiment of the present application;
FIG. 5 is a schematic illustration of a second scoring model provided in an exemplary embodiment of the present application;
FIG. 6 is a schematic diagram of visual element detection provided by an exemplary embodiment of the present application;
FIG. 7 is a schematic diagram of invalid region detection provided by an exemplary embodiment of the present application;
FIG. 8 is a schematic diagram of invalid region detection provided by another exemplary embodiment of the present application;
FIG. 9 is a schematic diagram of invalid region detection provided by another exemplary embodiment of the present application;
FIG. 10 is a schematic diagram of invalid region detection provided by another exemplary embodiment of the present application;
FIG. 11 is a schematic representation of the generation of a video cover provided in an exemplary embodiment of the present application;
FIG. 12 is a diagram of the effect of generating a video cover provided by an exemplary embodiment of the present application;
FIG. 13 is a block diagram of a video cover generation apparatus provided in one exemplary embodiment of the present application;
FIG. 14 is a block diagram of a computer device according to an exemplary embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
FIG. 1 is a schematic diagram of an implementation environment for an approach provided by an exemplary embodiment of the present application. The implementation environment of the scheme can comprise: a terminal device 10 and a server 20.
The terminal device 10 may be an electronic device such as a PC (Personal Computer), a tablet computer, a mobile phone, a wearable device, a smart home appliance, a vehicle-mounted terminal, an aircraft, or the like. The terminal device 10 has at least computing and storage capabilities. The terminal device 10 is configured to implement the video cover generation function. For example, a target application runs in the terminal device 10. The target application is used to generate a video cover of a material video. The target application may be a multimedia playing application, a social application, a music playing application, or the like, which is not limited in the present application.
The server 20 is used to provide background services for the client of the target application in the terminal device 10. For example, the server 20 may be an independent physical server or a server cluster composed of a plurality of servers. The server 20 has at least data receiving and processing capabilities. The terminal device 10 and the server 20 can communicate with each other via a network, which may be a wired network or a wireless network.
In one possible implementation, the terminal device 10 is capable of independently generating a video cover of the material video. After receiving the material video uploaded by the user, the target application of the terminal device 10 generates cropped images by performing picture element detection and cropping on the video frames of the material video, and obtains a video cover of the material video based on the cropped images.
In another possible embodiment, the video cover of the material video is generated jointly by the terminal device 10 and the server 20. Illustratively, the terminal device 10 transmits the material video to the server 20 after receiving it; the server 20 receives the material video and executes the video cover generation method provided in the embodiments of the present application to obtain the video cover of the material video. The server 20 then transmits the video cover of the material video to the terminal device 10.
In another possible embodiment, the video cover of the material video is also generated jointly by the terminal device 10 and the server 20. Illustratively, after receiving the material video, the terminal device 10 acquires a plurality of candidate video frames from the material video and uploads them to the server 20; the server 20 receives the candidate video frames and executes the video cover generation method provided in the embodiments of the present application to obtain the video cover of the material video. The server 20 then transmits the video cover of the material video to the terminal device 10. In this way, the transmission resources consumed by sending the entire material video from the terminal device 10 to the server 20 are reduced.
Fig. 2 is a schematic diagram of the inventive concept provided in an exemplary embodiment of the present application.
As shown in fig. 2, the method performs picture element detection on each of the m video frames obtained from the material video to obtain element detection results of the video frames; the m video frames are then cropped according to their element detection results to obtain cropped images corresponding to the m video frames. The picture content in the cropped images is concentrated, which improves the picture quality of the video cover generated from the cropped images corresponding to the m video frames and thus the viewing experience of the user.
FIG. 3 is a flowchart of a video cover generation method provided in an exemplary embodiment of the present application. The method may be executed by the terminal device 10 or by the server 20 in the implementation environment shown in fig. 1. For convenience of description, the following describes the execution steps with a computer device as the execution subject. As shown in fig. 3, the method may include at least one of the following steps (310-340):
in step 310, m video frames are obtained from the material video, where m is a positive integer.
In some embodiments, the material video refers to a video for which a video cover needs to be generated. For example, the material video is a video to be uploaded to a video sharing website; after the user obtains the material video by editing or the like, the user uploads the material video through a terminal device. The computer device obtains the material video and generates a video cover based on it. Optionally, the material video is recorded by the user or downloaded from a shared resource on the internet.
In some embodiments, the material video includes at least one video frame. Playing the material video means switching through the video frames at high speed in playing order; in other words, the display effect of the material video is achieved by playing its video frames in sequence. The picture content of the video cover generated in the embodiments of the present application comes from all or part of the video frames in the material video.
Optionally, m is a positive integer less than or equal to the total number of video frames included in the material video. Illustratively, m is positively correlated with the total number of video frames M in the material video. For example, m = k × M, where k is a real number in (0, 1). Assuming that the total number of video frames in the material video is 6000 and k is 0.01, m is equal to 60. As another example, m is equal to the integer part of the ratio of the total number of video frames M to the frame rate of the material video, where the frame rate refers to the number of video frames played per unit time.
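A minimal sketch of the two frame-budget rules above (the helper name and the use of OpenCV to read the frame count and frame rate are assumptions, not part of the original disclosure):

```python
import cv2  # assumed helper for reading video metadata; any demuxer would work

def frame_budget(video_path: str, k: float = 0.01, use_ratio_rule: bool = True) -> int:
    """Return m, the number of frames to sample from the material video."""
    cap = cv2.VideoCapture(video_path)
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))  # M, total frame count
    fps = cap.get(cv2.CAP_PROP_FPS)                        # frames played per second
    cap.release()

    if use_ratio_rule:
        m = int(k * total_frames)        # m = k x M, with k in (0, 1)
    else:
        m = int(total_frames / fps)      # integer part of M / frame rate
    return max(1, m)                     # m must stay a positive integer
```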
The computer device obtains m video frames from the material video and generates a video cover of the material video based on one or more video frames of the m video frames. Optionally, the computer device selects m video frames from the material video and generates a video cover of the material video based on the m video frames. For details of this step, please refer to the following examples.
In step 320, for each of the m video frames, picture element detection is performed on the video frame to obtain an element detection result of the video frame, where the element detection result is used to characterize the area occupied by picture elements in the video frame, and the picture elements are related to the picture content of the video frame.
In some embodiments, a picture element describes picture content of the same category. Optionally, the picture elements include, but are not limited to, at least one of: character content, saliency content, identification content, and text content.
Optionally, in the process of generating the video cover of the material video, the picture elements may further be divided into positive content elements and negative content elements; a positive content element refers to picture content in the video frame that can be used to generate the video cover of the material video, and a negative content element refers to picture content in the video frame that cannot be used to generate the video cover. That is, the purpose of picture element detection on a video frame is to extract, from the video frame, the picture content that can be used to generate the cover. For details of picture element detection, reference is made to the following embodiments.
In some embodiments, the element detection result is used to characterize the area occupied by each picture element in the video frame. For example, if the picture element detection needs to detect a classes of picture elements, the element detection result includes a sub-results, where a is a positive integer. The ith sub-result among the a sub-results is used to represent the display area occupied by picture element i in the video frame, where i is a positive integer less than or equal to a.
In some embodiments, the area occupied by a picture element in a video frame refers to the smallest shape area in the video frame that includes the picture element.
Optionally, the shape of the region occupied by a picture element in the video frame includes, but is not limited to, at least one of: a rectangular area, a trapezoidal area, a circular area, an elliptical area, and so on. The area occupied by a picture element in the video frame is set according to actual needs and is not limited here.
Taking a rectangular area as an example, in the element detection result the ith sub-result for picture element i includes the center point (x, y) of the rectangle, the rectangle length h and the rectangle width w. The rectangular region containing picture element i can be located in the video frame from the center point (x, y), the length h and the width w. Of course, the ith sub-result of picture element i may instead include the upper-left and lower-right vertices of the rectangle, or the lower-left and upper-right vertices. The representation of the ith sub-result is set according to actual needs and is not limited here.
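For illustration only, a possible in-memory representation of such a sub-result, showing the equivalence between the center-point form and the corner form (the field names are assumptions):

```python
from dataclasses import dataclass

@dataclass
class ElementSubResult:
    """One sub-result of the element detection result (names are illustrative)."""
    element_type: str   # e.g. "character", "saliency", "identification", "text"
    cx: float           # center point x
    cy: float           # center point y
    w: float            # rectangle width
    h: float            # rectangle length (height)

    def as_corners(self) -> tuple[float, float, float, float]:
        """Equivalent upper-left / lower-right representation of the same box."""
        return (self.cx - self.w / 2, self.cy - self.h / 2,
                self.cx + self.w / 2, self.cy + self.h / 2)
```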
In one example, the process of picture element detection for m video frames is done serially, i.e., the computer device determines the element detection results for m video frames in turn.
In another example, the process of detecting the picture elements of the m video frames is completed in parallel: when the computer device has sufficient computing capability, it determines the element detection results of several of the m video frames at the same time, which increases the speed of determining the element detection results and thus the efficiency of generating the video cover of the material video.
In step 330, the video frame is cropped according to the element detection result to obtain a cropped image corresponding to the video frame, wherein the size of the cropped image is smaller than or equal to that of the video frame.
After the element detection result is determined, the computer device crops the video frame according to the element detection result, removes the picture elements in the video frame that cannot be used for generating the video cover of the material video, and takes the remaining part of the video frame as the cropped image. Optionally, the cropped image contains fewer picture elements than the video frame.
Through step 320, picture element detection is performed on the video frame, so that the computer device can accurately locate the area occupied by each picture element in the video frame according to the element detection result. Useless picture elements can therefore be removed according to the element detection result when cropping the video frame, which improves the display effect of the video cover generated from the cropped image. For details of how the computer device crops the video frame to generate the cropped image, please refer to the following embodiments.
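The exact cropping rule is left to the later embodiments; purely as a hedged illustration, one simple realization is to keep the union of the positive-element boxes and pull the crop edges inward past negative elements that sit on the frame border (this whole strategy is an assumption, not the disclosed method):

```python
import numpy as np

def crop_by_elements(frame: np.ndarray,
                     positive_boxes: list[tuple[int, int, int, int]],
                     negative_boxes: list[tuple[int, int, int, int]]) -> np.ndarray:
    """Crop a frame (H x W x 3) to the union of positive-element boxes,
    shrinking the crop so border-hugging negative regions are left out.
    Boxes are (x1, y1, x2, y2) in pixel coordinates."""
    h, w = frame.shape[:2]
    if not positive_boxes:
        return frame  # nothing to anchor the crop on; keep the full frame
    x1 = min(b[0] for b in positive_boxes)
    y1 = min(b[1] for b in positive_boxes)
    x2 = max(b[2] for b in positive_boxes)
    y2 = max(b[3] for b in positive_boxes)
    # Pull the crop edges inward past negative boxes on the frame border,
    # e.g. a caption strip at the bottom or a logo banner at the top.
    for nx1, ny1, nx2, ny2 in negative_boxes:
        if ny1 <= 0 and ny2 < y2:      # banner along the top edge
            y1 = max(y1, ny2)
        if ny2 >= h and ny1 > y1:      # banner along the bottom edge
            y2 = min(y2, ny1)
    return frame[int(y1):int(y2), int(x1):int(x2)]
```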
In step 340, a video cover of the material video is generated according to the cropped images corresponding to the m video frames.
In some embodiments, the computer device selects n cropped images from the cropped images corresponding to the m video frames, and generates the video cover of the material video from the n cropped images.
Optionally, the computer device selects the n cropped images with better picture quality from the cropped images corresponding to the m video frames, and generates the video cover of the material video based on the n cropped images. For details of this process, reference is made to the following embodiments.
In summary, the embodiments of the present application can automatically generate the video cover of a material video based on the video frames in the material video, which facilitates automatic and batch generation of video covers. In the process of generating the video cover, picture element detection is performed on each video frame, and the video frame is cropped according to the element detection result produced by the detection to obtain a cropped image. In this way, part of the picture content of the video frame is selectively removed, so that the picture content retained in the cropped image is suitable for serving as the picture content of the video cover, which in turn improves the quality of the video cover obtained from the cropped images. This facilitates automatic, high-quality generation of video covers for material videos.
Next, a process of acquiring a video frame from a material video will be described by way of several embodiments, and the execution subject of this embodiment is a computer device. Fig. 4 is a schematic diagram of a video frame acquisition process provided in an exemplary embodiment of the present application, and step 310 may include the following sub-steps.
In step 313, plot analysis is performed on the material video to obtain video information of the material video, where the video information is used to divide the material video into at least one plot segment, and a single plot segment contains multiple video frames with similar picture content.
In some embodiments, the video information of the material video is used to characterize the plot development in the material video. Optionally, the video information includes the demarcation points between the plot segments of the material video. For example, if the video information of a certain material video is [10, 30, 60, 120, 150], then [0, 10 seconds) belongs to one plot segment, [10, 30 seconds) belongs to another plot segment, and so on.
In one example, the video information is carried by the material video itself. That is, the computer device can acquire the video information directly from the material video. For example, the material video includes a plot progress bar edited by the user, and the progress bar marks the playing time corresponding to each plot segment; the computer device obtains the progress bar from a video frame of the material video and calculates the demarcation points between plot segments according to positions on the progress bar and the playing duration of the material video, thereby obtaining the video information of the material video. For example, if a certain demarcation point lies at 1/5 of the length of the progress bar and the playing duration of the material video is 60 seconds, the playing time corresponding to that demarcation point is 12 seconds.
In another example, the computer device performs plot analysis on the material video to obtain its video information. Considering that analyzing audio data consumes far less computation than analyzing video data, and in order to improve the efficiency of acquiring the video information, the computer device can perform the plot analysis on the playback audio of the material video. The process is described below by way of several examples.
In some embodiments, sub-step 313, performing plot analysis on the material video to obtain the video information of the material video, may include the following sub-steps:
sub-step 313a, obtaining the playing audio of the material video, wherein the playing audio is used for playing synchronously with the video frame.
Optionally, the playing duration of the playback audio equals the playing duration of the material video, and the audio and the video frames are played synchronously, achieving audio-visual synchronization. This means that the emotion expressed by the playback audio is synchronized with the emotion expressed by the video frames switched in playing order, so performing plot analysis on the playback audio can likewise determine the plot segments included in the material video.
Illustratively, the material video is obtained by compressing a video stream and an audio stream; the video stream comprises frame data arranged in playing time order, where the frame data represents the video frames, and the audio stream represents the playback audio of the material video. In one example, the computer device extracts the audio stream from the material video to obtain the playback audio of the material video.
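As an illustrative sketch of this extraction step, assuming the ffmpeg command-line tool is available (the sample rate and mono downmix are arbitrary choices, not part of the disclosure):

```python
import subprocess

def extract_playback_audio(video_path: str, wav_path: str, sample_rate: int = 16000) -> None:
    """Demux and decode the audio stream of the material video into a mono WAV file."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path,
         "-vn",                      # drop the video stream
         "-acodec", "pcm_s16le",     # decode to raw 16-bit PCM
         "-ac", "1",                 # mono is enough for plot analysis
         "-ar", str(sample_rate),    # resample to a fixed rate
         wav_path],
        check=True,
    )
```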
Sub-step 313b, performing segment sampling on the playback audio to obtain a plurality of sampling segments, where each sampling segment corresponds to a first playing time period and contains the signal strength of the playback audio within that period.
Alternatively, a sample segment may be understood as an audio segment of the playback audio during a first playback time period.
In some embodiments, the playing periods of the sampling segments within the playback audio do not overlap. For example, the plurality of sampling segments includes, in playing order: sampling segment 1, sampling segment 2 and sampling segment 3, where the playing period of sampling segment 1 is 10-20 seconds and the playing period of sampling segment 2 is 25-35 seconds. Optionally, the sampling segments have equal durations.
Optionally, the computer device divides the playback audio using a time window to obtain the plurality of sampling segments. The first playing time period refers to the playing time period occupied by the sampling segment within the playback audio.
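A minimal sketch of this time-window segmentation, assuming the audio has already been decoded into a NumPy array; the 10-second window is an assumed value:

```python
import numpy as np

def split_into_segments(audio: np.ndarray, sample_rate: int,
                        window_s: float = 10.0, hop_s: float = 10.0):
    """Cut the playback audio into non-overlapping sampling segments.
    Returns a list of (start_time, end_time, samples) tuples."""
    window = int(window_s * sample_rate)
    hop = int(hop_s * sample_rate)
    segments = []
    for start in range(0, len(audio) - window + 1, hop):
        samples = audio[start:start + window]
        segments.append((start / sample_rate, (start + window) / sample_rate, samples))
    return segments
```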
Sub-step 313c, obtaining video information of the material video according to the plurality of sampling segments.
In some embodiments, the computer device obtains the video information of the material video from the plurality of sampling segments as follows: for each sampling segment, identifying the audio attributes of the sampling segment to obtain an attribute analysis result of the sampling segment, where the audio attributes include at least one of the following: number of audio tracks, audio loudness; and merging the plurality of sampling segments according to their attribute analysis results to obtain at least one plot segment, where the attribute analysis results of the sampling segments included in the same plot segment fall within the same result interval.
Optionally, the audio attributes are used to characterize the audio composition of the sampling segment. Illustratively, the audio attributes include: the number of tracks and the loudness value, where the number of tracks characterizes the number of sound sources in the sampling segment, and the audio loudness characterizes the volume of the sound in the sampling segment.
Optionally, the computer device performs audio track separation on the sampling segment to determine its number of audio tracks. For example, the computer device performs a time-frequency transform on the sampling segment to obtain spectrum data of the sampling segment, where the spectrum data characterizes the mapping between frequency and amplitude in the sampling segment; the computer device inputs the spectrum data into an audio track separation model, which performs feature extraction on the spectrum data to obtain audio features; the audio features are then classified by a decoder of the track separation model to obtain classification results, and the computer device determines the number of tracks included in the sampling segment according to the classification results.
The decoder comprises a plurality of classifiers, and each classifier determines, based on the audio features, whether a certain type of audio track is present in the sampling segment. Each classification result falls into one of two categories: the sampling segment includes track j, or the sampling segment does not include track j. The computer device determines which tracks are present in the sampling segment from the individual classification results, thereby obtaining the number of tracks included in the sampling segment. Exemplary track types include, but are not limited to: human voice, instrument sound, environmental sound, and the like. Optionally, instrument sounds may be further subdivided. The track types are set according to actual needs and are not limited here.
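Purely as an illustration of the shared-encoder / per-track-classifier structure described above (the spectrogram parameters, layer sizes and track names are assumptions, and the model shown is untrained):

```python
import torch
import torch.nn as nn

TRACK_TYPES = ["vocals", "instruments", "ambience"]  # illustrative track types

class TrackPresenceModel(nn.Module):
    """Shared encoder plus one binary classifier ("decoder" head) per track type."""
    def __init__(self, hidden: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, hidden), nn.ReLU(),
        )
        self.heads = nn.ModuleList(nn.Linear(hidden, 1) for _ in TRACK_TYPES)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # Time-frequency transform: magnitude spectrogram of the sampling segment.
        spec = torch.stft(waveform, n_fft=512,
                          window=torch.hann_window(512, device=waveform.device),
                          return_complex=True).abs()
        feats = self.encoder(spec.unsqueeze(1))          # (batch, hidden)
        logits = torch.cat([h(feats) for h in self.heads], dim=1)
        return torch.sigmoid(logits)                     # per-track presence scores

def track_count(model: TrackPresenceModel, waveform: torch.Tensor, thr: float = 0.5) -> int:
    """Number of track types judged present in a single sampling segment."""
    return int((model(waveform) > thr).sum().item())
```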
In some embodiments, the number of tracks and loudness values are included in the attribute analysis results. The computer device determines the loudness value of each sample segment based on the playback audio of the material video.
Optionally, after obtaining the attribute analysis result of each sampling segment, the computer device merges the plurality of sampling segments according to these results to obtain at least one plot segment.
For example, for a first sampling segment and a second sampling segment among the plurality of sampling segments, if the difference between the number of tracks of the first sampling segment and that of the second sampling segment is less than or equal to a first threshold, and the difference between the loudness value of the first sampling segment and that of the second sampling segment is less than or equal to a second threshold, the first sampling segment and the second sampling segment are merged into a third sampling segment; the judgment is then repeated to determine whether the third sampling segment can be merged with the remaining sampling segments. Here, the time interval between the playing time periods of the first sampling segment and the second sampling segment is less than or equal to a third threshold, and the remaining sampling segments are the sampling segments other than the first and second sampling segments.
Optionally, the number of tracks of the third sample segment is equal to the maximum of the number of tracks of the first sample segment and the second sample segment, and the loudness value of the third sample segment is equal to the maximum of the loudness values of the first sample segment and the second sample segment.
It will be appreciated that for any two adjacent plot segments among the plot segments resulting from the above merging process, the difference in the number of tracks between them is greater than the first threshold, or the difference in their loudness values is greater than the second threshold.
The first threshold, the second threshold and the third threshold are illustratively positive numbers, and their specific values are set according to actual needs and are not limited here.
In some embodiments, after obtaining the plot segments, the computer device takes the critical point between every two adjacent plot segments in the playing progress as a demarcation point, thereby obtaining a plurality of demarcation points; the computer device then assembles the playing times corresponding to these demarcation points into the video information of the material video. Optionally, if the playing time periods of two adjacent plot segments are contiguous, the ending time of the earlier plot segment is taken as the critical point; if they are not contiguous, the midpoint between the ending time of the earlier plot segment and the starting time of the later plot segment is taken as the critical point.
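A hedged sketch of the merging and demarcation logic described above, using the difference-based thresholds; the threshold values and the Segment representation are assumptions:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: float        # playing period start, seconds
    end: float          # playing period end, seconds
    tracks: int         # number of audio tracks
    loudness: float     # loudness value

def merge_segments(segments: list[Segment],
                   track_thr: int = 1, loud_thr: float = 3.0, gap_thr: float = 5.0):
    """Greedily merge consecutive sampling segments whose attribute analysis
    results fall in the same result interval, yielding plot segments."""
    if not segments:
        return []
    segments = sorted(segments, key=lambda s: s.start)
    plots = [segments[0]]
    for seg in segments[1:]:
        prev = plots[-1]
        close_in_time = seg.start - prev.end <= gap_thr
        similar = (abs(seg.tracks - prev.tracks) <= track_thr
                   and abs(seg.loudness - prev.loudness) <= loud_thr)
        if close_in_time and similar:
            # The merged segment keeps the maxima, as described above.
            plots[-1] = Segment(prev.start, seg.end,
                                max(prev.tracks, seg.tracks),
                                max(prev.loudness, seg.loudness))
        else:
            plots.append(seg)
    return plots

def demarcation_points(plots: list[Segment]) -> list[float]:
    """Critical points between adjacent plot segments form the video information."""
    points = []
    for a, b in zip(plots, plots[1:]):
        points.append(a.end if b.start <= a.end else (a.end + b.start) / 2)
    return points
```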
In step 316, at least one video frame is acquired from each plot segment of the material video according to the video information, so as to obtain a plurality of candidate video frames.
Optionally, for each plot segment, the computer device obtains at least one candidate video frame from that segment. Illustratively, the number of video frames acquired from a plot segment is positively correlated with its playing duration. That is, the longer the playing duration of a plot segment, the more candidate video frames are acquired from it; the shorter the playing duration, the fewer candidate video frames are acquired.
Illustratively, the computer device selects at least one key frame from the plot segment as a candidate video frame. A key frame refers to a video frame at which the picture content changes. For example, key frames include: the first video frame after the picture content changes, and the last video frame before the picture content changes. As another example, the computer device selects one candidate video frame out of every s video frames, where s is a positive integer equal to the frame rate of the material video.
In one example, the computer device determines a first number based on the frame rate of the material video and its total playing duration, where the frame rate characterizes the number of video frames played per unit time. The computer device then selects at least one candidate video frame from each plot segment so as to obtain the first number of candidate video frames in total.
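As an illustrative sketch, one way to spread the first number of candidate frames over the plot segments in proportion to their durations (the proportional allocation and the mid-segment timestamps are assumptions):

```python
def candidate_frame_indices(plot_boundaries: list[float], fps: float,
                            total_frames: int, first_number: int) -> list[int]:
    """Spread a budget of `first_number` candidate frames over the plot segments,
    proportionally to each segment's playing duration (one frame per segment minimum)."""
    # Close the segment list with the start and end of the video.
    times = [0.0] + sorted(plot_boundaries) + [total_frames / fps]
    durations = [b - a for a, b in zip(times, times[1:])]
    total = sum(durations)
    indices: list[int] = []
    for (start, end), dur in zip(zip(times, times[1:]), durations):
        n = max(1, round(first_number * dur / total))
        for j in range(n):
            t = start + (j + 0.5) * (end - start) / n       # evenly spaced timestamps
            indices.append(min(total_frames - 1, int(t * fps)))
    return indices
```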
In step 319, a layer-by-layer selection is performed on the plurality of candidate video frames, so as to obtain m video frames.
Optionally, the computer device performs layer-by-layer screening on the plurality of candidate video frames according to their picture quality and aesthetic degree, so as to obtain the m video frames. That is, the m video frames are the video frames with better picture quality among the plurality of candidate video frames.
Compared with randomly extracting m video frames from the material video, acquiring the video information of the material video and selecting candidate video frames according to it helps improve the picture quality of the m acquired video frames. Because effects such as transitions usually exist in the material video and the video frames within a transition have lower picture quality, acquiring the video information prevents such transition frames from being chosen as candidate video frames. On the one hand, this reduces the unnecessary processing of low-quality video frames and the resulting waste of computing resources; on the other hand, it ensures the picture quality of the selected video frames and thus improves the quality of the video cover generated from them.
In the following, a process of selecting m video frames from the candidate video frames will be described by several embodiments. The execution subject of the present embodiment is a computer device.
In some embodiments, sub-step 319, performing layer-by-layer screening on the plurality of candidate video frames to obtain the m video frames, may include the following steps: scoring the plurality of candidate video frames using a first scoring model to obtain first scoring results corresponding to the candidate video frames; selecting a plurality of intermediate video frames from the candidate video frames according to the first scoring results; scoring the intermediate video frames using a second scoring model to obtain second scoring results corresponding to the intermediate video frames; and selecting the m video frames from the intermediate video frames according to the second scoring results. One of the first scoring model and the second scoring model is used for scoring the aesthetic degree of a video frame, and the other is used for scoring the picture quality of a video frame.
In one example, a first scoring model is used to score the aesthetic appeal of a video frame, a first scoring result is used to characterize the aesthetic appeal of a video frame, a second scoring model is used to score the picture quality of a video frame, and a second scoring result is used to characterize the picture sharpness of a video frame.
In another example, a first scoring model is used to score the picture quality of a video frame, a first scoring result is used to characterize the picture sharpness of the video frame, a second scoring model is used to score the aesthetic value of the video frame, and a second scoring result is used to characterize the aesthetic value of the video frame.
That is, in the embodiments provided in the present application, when selecting m video frames from the plurality of candidate video frames, the video frames may be scored first by aesthetic degree and then by picture quality, or first by picture quality and then by aesthetic degree. The order of the two scoring passes is not limited in the present application.
The process of selecting m video frames will be described below by taking the example that the first scoring model is used for scoring the aesthetic degree of the video frames and the second scoring model is used for scoring the picture quality of the video frames.
In some implementations, the first scoring model includes a text encoder and an image encoder, where the image encoder is configured to perform feature extraction on the video frame to obtain aesthetic features of the video frame; the text encoder learns the association between the feature information of the video frames and the aesthetic score during the pre-training of the first scoring model, i.e. the text encoder is configured to provide text features of at least one aesthetic score.
In the process of determining the aesthetic degree of a candidate video frame, the computer device inputs the candidate video frame into the first scoring model; the candidate video frame is processed by the image encoder in the first scoring model to obtain aesthetic features of the candidate video frame, and a neural network classification model determines the aesthetic score of the candidate video frame based on these aesthetic features.
Optionally, the image encoder in the first scoring model comes from a text-visual transfer model. For example, the image encoder of the first scoring model is the image encoder of CLIP (Contrastive Language-Image Pre-training), a contrastive text-image model. Optionally, the neural network classification model includes at least one activation layer and a fully connected layer for predicting the aesthetic score of a candidate video frame based on its aesthetic features.
Illustratively, in pre-training the first scoring model, LAION-5B is used as the dataset, and the aesthetic score is divided into a 1-10 range, with one training subset (bucket) for every 0.25 points.
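A minimal sketch of a CLIP-plus-classification-head aesthetic scorer of the kind described above, using the Hugging Face transformers CLIP implementation; the checkpoint name and head architecture are assumptions, and the head shown is untrained and for illustration only:

```python
import torch
import torch.nn as nn
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Frozen CLIP image encoder providing the aesthetic features.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Small classification head mapping CLIP image features to an aesthetic score.
head = nn.Sequential(
    nn.Linear(clip.config.projection_dim, 256), nn.ReLU(),
    nn.Linear(256, 1),
)

@torch.no_grad()
def aesthetic_score(frame: Image.Image) -> float:
    inputs = processor(images=frame, return_tensors="pt")
    feats = clip.get_image_features(**inputs)        # (1, projection_dim) image embedding
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return float(head(feats).squeeze())              # untrained head: illustration only
```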
In some embodiments, the second scoring model includes a backbone network, global average pooling layers, at least one encoder, a fully connected layer, and a linear mapping layer. The backbone network is used to determine the picture quality features of the candidate video frame, the global average pooling layers pool the picture quality features to obtain a first processing result, the encoder performs encoding based on the first processing result to obtain an encoding result, the fully connected layer processes the encoding result to obtain a second processing result, and the linear mapping layer performs a linear mapping based on the second processing result to generate the second scoring result.
FIG. 5 is a schematic diagram of the second scoring model provided in an exemplary embodiment of the present application. As shown in FIG. 5, the second scoring model is pre-trained with a Norm-in-Norm loss. Illustratively, the backbone network is a ResNeXt (a next-generation residual network), global average pooling is abbreviated GAP, and the encoder contains at least one BN (Batch Normalization) layer and one ReLU (Rectified Linear Unit) activation function.
The picture quality features obtained by the backbone network are processed by two global average pooling layers and passed to a first encoder and a second encoder respectively, so that each encoder produces its own encoding result; the encoding results of the first and second encoders are then concatenated, and the concatenated encoding result is processed by a fully connected layer to obtain a model prediction result. The second scoring result is obtained by linearly mapping the model prediction result.
Because the encoding results of different encoders in the second scoring model are concatenated, the concatenated result can represent features of the candidate video frame at different depths, which improves the accuracy of the second scoring result and therefore allows video frames with higher picture quality to be selected according to it.
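Purely illustrative sketch of such a two-branch quality scorer: a ResNeXt backbone whose features at two different depths are pooled, encoded, concatenated and mapped to a score. Taking the two branches from two backbone stages, the layer sizes and the linear-mapping parameters are assumptions, since FIG. 5 is not reproduced here:

```python
import torch
import torch.nn as nn
from torchvision.models import resnext50_32x4d

class Encoder(nn.Module):
    """BN + ReLU encoder block, as described for the second scoring model."""
    def __init__(self, dim: int):
        super().__init__()
        self.block = nn.Sequential(nn.BatchNorm1d(dim), nn.ReLU(), nn.Linear(dim, 256))
    def forward(self, x):
        return self.block(x)

class QualityScorer(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = resnext50_32x4d(weights=None)          # ResNeXt backbone, untrained here
        layers = list(backbone.children())[:-2]           # drop avgpool and fc
        self.stem = nn.Sequential(*layers[:-1])           # up to the penultimate stage
        self.last = layers[-1]                            # deepest stage
        self.gap = nn.AdaptiveAvgPool2d(1)                # global average pooling (GAP)
        self.enc_shallow = Encoder(1024)                  # features at a shallower depth
        self.enc_deep = Encoder(2048)                     # features at the deepest depth
        self.fc = nn.Linear(256 * 2, 1)                   # fully connected layer
        self.scale = nn.Parameter(torch.tensor(1.0))      # linear mapping to the score
        self.shift = nn.Parameter(torch.tensor(0.0))

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        mid = self.stem(frames)                           # (B, 1024, h, w)
        deep = self.last(mid)                             # (B, 2048, h/2, w/2)
        z1 = self.enc_shallow(self.gap(mid).flatten(1))
        z2 = self.enc_deep(self.gap(deep).flatten(1))
        pred = self.fc(torch.cat([z1, z2], dim=1))        # concatenated encoding results
        return pred * self.scale + self.shift             # second scoring result
```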
In one example, the computer device determines the first scoring results of the candidate video frames using the first scoring model, selects a second number of intermediate video frames in descending order of the first scoring results, determines the second scoring results of these intermediate video frames using the second scoring model, and finally selects the m video frames from the intermediate video frames in descending order of the second scoring results.
Optionally, the second number is preset. For example, the second number is equal to one quarter of the total number of candidate video frames.
For example, the computer device acquires 120 candidate video frames from the material video, uses the first scoring model to determine their aesthetic scores, and takes the top 20 frames by aesthetic score as intermediate video frames; it then determines a quality score for the 20 intermediate video frames using the second scoring model and takes the top 10 frames by quality score as the m video frames.
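The same two-stage screening, written as a small helper (the function names and the use of plain Python sorting are assumptions):

```python
from typing import Callable, Sequence

def layer_by_layer_screening(frames: Sequence,               # candidate video frames
                             aesthetic_fn: Callable,          # first scoring model
                             quality_fn: Callable,            # second scoring model
                             second_number: int, m: int):
    """Two-stage selection: keep the top `second_number` frames by aesthetic score,
    then the top `m` of those by picture-quality score."""
    by_aesthetic = sorted(frames, key=aesthetic_fn, reverse=True)
    intermediate = by_aesthetic[:second_number]
    by_quality = sorted(intermediate, key=quality_fn, reverse=True)
    return by_quality[:m]

# Example matching the numbers above (the scoring functions are placeholders):
# selected = layer_by_layer_screening(candidates, aesthetic_score, quality_score, 20, 10)
```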
By selecting the m video frames from the candidate video frames layer by layer with the aesthetic and picture-quality scoring models, rather than computing both the aesthetic scores and the quality scores of all candidate video frames, the amount of computation needed to select the m video frames is reduced and the time taken to generate the video cover of the material video is shortened.
The following describes a picture element detection process for a video frame by several embodiments, the execution subject of which is a computer device.
In some embodiments, the element detection result includes a first detection result and a second detection result, where the first detection result is used to characterize the area occupied by positive content elements in the video frame, the second detection result is used to characterize the area occupied by negative content elements in the video frame, a positive content element represents picture content to be retained, and a negative content element represents picture content to be cropped away.
Illustratively, the positive content elements include, but are not limited to: character content and saliency content; the negative content elements include, but are not limited to: identification content and text content. Illustratively, the first detection result characterizes the sub-results of the content elements included among the positive content elements, and the second detection result characterizes the sub-results of the content elements included among the negative content elements.
Fig. 6 is a schematic diagram of picture element detection provided in an exemplary embodiment of the present application, and step 320 may include the following sub-steps.
In step 323, the video frame is processed using the first detection model, and a first detection result is generated.
In some embodiments, the first detection model is a machine learning model or a deep learning model. The computer device inputs the video frame into the first detection model and obtains the first detection result output by the first detection model. Illustratively, the first detection model generates the first detection result by performing feature extraction on the video frame and decoding the feature information obtained by the feature extraction.
Optionally, in the case that the positive content elements include a plurality of picture elements, the first detection model includes a plurality of sub-models, each used to determine the sub-result of the corresponding picture element. For details of this process, please refer to the following embodiments.
In step 326, the video frame is processed using the second detection model to generate a second detection result.
In some embodiments, the second detection model is a machine learning model or a deep learning model. The computer device inputs the video frame into the second detection model and obtains the second detection result output by the second detection model.
Optionally, in the case that the negative content elements include a plurality of picture elements, the second detection model includes a plurality of sub-models, each used to determine the sub-result of the corresponding picture element. For details of this process, please refer to the following embodiments.
In this way, the positive content elements and the negative content elements in the video frame are detected before the video frame is used to generate the video cover of the material video, and the display areas of the picture content in the video frame are identified, so that useless picture content can be cropped out of the video frame, the quality of the generated video cover is improved, and efficient automatic generation of video covers is achieved.
The following describes a determination process of the first detection result by several embodiments, and the execution subject of the present embodiment is a computer device.
In some embodiments, the positive content elements include at least one of: character content and saliency content, where the character content refers to persons displayed in the video frame, and the saliency content refers to the central content that attracts the viewer's gaze in the video frame.
Alternatively, the character content refers to picture content related to a character in a video frame. For example, the character content includes a face, a character pose, and the like in a material video recorded by a first person. The salient content refers to a content element displayed in a line-of-sight focus area in a video frame. That is, the salient content refers to an initial region of interest when the human eye observes a video frame. For example, the salient content refers to content displayed in an area other than the background in the video frame.
In some embodiments, processing the video frame using the first detection model to generate a first detection result may include the following sub-steps:
sub-step 323a, in the case that the forward content element comprises persona content, determining at least one first feature map of the video frame using a backbone network of a first sub-model comprised by the first detection model.
Optionally, the backbone network of the first sub-model is used to extract the first feature map from the video frame. Illustratively, the backbone network includes a plurality of hidden layers, and the plurality of hidden layers are connected in series, and the output of the previous hidden layer can be used as the input of the next hidden layer. The plurality of first feature maps are output of each hidden layer respectively.
Illustratively, each hidden layer is used to upsample the feature map. For example, a hidden layer is composed of Bilinear (bilinear interpolation), Conv (convolutional layer), BN (batch normalization), and an activation layer. This Bilinear + Conv + BN + activation stack upsamples the input data of the hidden layer to obtain the first feature map generated by that hidden layer.
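As a concrete illustration of such a hidden layer, the following is a minimal sketch assuming a PyTorch implementation; the channel counts, kernel size, and choice of ReLU are illustrative assumptions rather than details taken from this application.

```python
import torch
import torch.nn as nn

class UpsampleBlock(nn.Module):
    """One hidden layer as described above: bilinear upsampling followed by Conv + BN + activation.
    in_ch/out_ch, the kernel size, and ReLU are assumptions for illustration."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Upsample the incoming feature map, then refine it into a first feature map.
        return self.act(self.bn(self.conv(self.up(x))))
```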
Sub-step 323b, using the feature pyramid network of the first sub-model to fuse at least one first feature map, generating a fused feature.
The FPN (Feature Pyramid Networks, feature pyramid) network is used for performing multi-scale fusion on at least one first feature map to obtain fusion features. Through fusion of the feature pyramids, character content can be effectively predicted based on the fusion features.
Optionally, the FPN starts from the k-th feature map of the plurality of feature maps, splices the k-th feature map with the (k-1)-th feature map to obtain a spliced feature map, then splices that result with the (k-2)-th feature map, and repeats this step until all feature maps have been spliced. Here, k is the total number of feature maps, and the k-th feature map is the most deeply convolved feature map among them.
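A minimal sketch of this splicing loop is shown below, assuming PyTorch tensors; the nearest-neighbour resizing used to align spatial sizes and the omission of any projection convolutions are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def fuse_feature_maps(feature_maps: list[torch.Tensor]) -> torch.Tensor:
    """feature_maps are ordered shallow -> deep; start from the k-th (deepest) map,
    splice it with the (k-1)-th, then the (k-2)-th, until all maps are merged."""
    fused = feature_maps[-1]                      # the k-th, most deeply convolved map
    for fmap in reversed(feature_maps[:-1]):      # (k-1)-th, (k-2)-th, ..., 1st
        fused = F.interpolate(fused, size=fmap.shape[-2:], mode="nearest")
        fused = torch.cat([fused, fmap], dim=1)   # channel-wise splice
    return fused
```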
Sub-step 323c, using the person detection network of the first sub-model to identify the fusion feature, generates a first sub-result, and the first sub-result is used to characterize the area occupied by the person content in the video frame.
Sub-step 323d, in the case that the forward content element comprises salient content, determining at least one second feature map of the video frame using a plurality of feature extraction networks in series in a second sub-model comprised by the first detection model.
Optionally, the composition of the feature extraction network is: downsampling +Conv +BN +ReLU. Illustratively, the second submodel includes mutually nested RSUs (ReSidual U-blocks).
An RSU module is an example of such a feature extraction network. For two adjacent feature extraction networks, the output of the preceding network serves as the input of the following network. The output of any feature extraction network is a second feature map.
Sub-step 323e, using a decoder of the second sub-model to decode the at least one second feature map, generates a second sub-result, the second sub-result being used to characterize the region occupied by the salient content in the video frame.
The number of feature extraction networks in the second sub-model equals the number of decoders. For example, the second sub-model includes p serially connected feature extraction networks and p serially connected decoders. The decoders and the feature extraction networks correspond crosswise: a higher-layer decoder corresponds to a lower-layer feature extraction network, and a lower-layer decoder corresponds to a higher-layer feature extraction network. For example, with 3 feature extraction networks E1, E2, E3 from the bottom layer to the top layer and 3 decoders D1, D2, D3 from the bottom layer to the top layer, E1 corresponds to D3, E2 corresponds to D2, and E3 corresponds to D1. Illustratively, each decoder is composed of an upsampling layer + Conv + BN + ReLU.
For any decoder, the input of the decoder includes the second feature map generated by the feature extraction network corresponding to the decoder, and the output of the adjacent decoder. For a lowest layer decoder of the plurality of decoders, the input of the decoder includes a second feature map generated by a highest layer feature extraction network and an intermediate result of convolving the second feature map generated by the highest layer feature extraction network.
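The wiring described above can be sketched as follows, again assuming PyTorch; the encoder, decoder, and bridge modules are placeholders supplied by the caller, and the resolution handling is an assumption, so this shows only the encoder/decoder pairing and the flow of decoder inputs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SaliencySubModel(nn.Module):
    """Illustrative wiring of the second sub-model: p encoders E1..Ep and p decoders D1..Dp,
    where Ei pairs with D(p+1-i). Each decoder receives its paired encoder's second feature
    map together with the output of the previous (deeper) decoder, as described above."""
    def __init__(self, encoders: nn.ModuleList, decoders: nn.ModuleList, bridge: nn.Module):
        super().__init__()
        self.encoders, self.decoders, self.bridge = encoders, decoders, bridge

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = []
        for enc in self.encoders:                 # E1 ... Ep, serially connected
            x = enc(x)
            feats.append(x)
        prev = self.bridge(feats[-1])             # intermediate result fed to the lowest-layer decoder
        for dec, skip in zip(self.decoders, reversed(feats)):   # D1 pairs with Ep, D2 with E(p-1), ...
            prev = F.interpolate(prev, size=skip.shape[-2:], mode="bilinear", align_corners=False)
            prev = dec(torch.cat([prev, skip], dim=1))
        return prev                               # decoded map from which the second sub-result is taken
```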
In this way, by mixing receptive fields of different scales, contextual information from more scales can be captured, which improves the accuracy of the second sub-result generated by decoding the second feature maps.
The determination process of the second detection result will be described by several embodiments, and the execution subject of the embodiment is a computer device.
In some embodiments, the negative content element includes at least one of: identification content and text content. Optionally, the identification content refers to a brand identification, a product identification, etc. that may exist in the video frame. Text content refers to text that appears in the video frame; for example, the text content includes subtitles in video frames.
In some embodiments, the sub-step 326 of processing the video frame using the second detection model to generate a second detection result may include the following sub-steps:
Sub-step 326a: in the case that the negative content element includes the identification content, the video frame is divided into a plurality of sub-regions using a third sub-model included in the second detection model, the probability that identification content is present in each sub-region is predicted, and a third sub-result is determined according to the probabilities of the plurality of sub-regions; the third sub-result is used to characterize the region occupied by the identification content in the video frame.
Illustratively, the third sub-model is trained in accordance with YOLO (You Only Look Once), and the specific structure of the third sub-model is not described herein.
Sub-step 326b: in the case that the negative content element includes text content, text recognition is performed in a first region of the video frame using a fourth sub-model included in the second detection model to obtain a fourth sub-result, where the fourth sub-result is used to characterize the area occupied by the text content in the video frame.
In some embodiments, the first region refers to a region of the video frame where text content is more likely to appear, for example a lower region of the video frame, such as the bottom quarter of the frame.
Optionally, the computer device performs connected-domain recognition on the first region of the video frame through the fourth sub-model to obtain the fourth sub-result. For the specific text recognition process, please refer to the related art; a detailed description is omitted here.
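A minimal sketch of restricting recognition to the first region is given below; `run_ocr` is a hypothetical stand-in for the fourth sub-model, and the (x, y, w, h) box format is an assumption.

```python
import numpy as np

def detect_text_in_first_region(frame: np.ndarray, run_ocr) -> list[tuple[int, int, int, int]]:
    """Run text detection only on the bottom quarter of the frame, where subtitles usually
    appear, then map the boxes back to full-frame coordinates. `run_ocr` is hypothetical."""
    h = frame.shape[0]
    y0 = h * 3 // 4                            # first region: bottom quarter of the video frame
    boxes = run_ocr(frame[y0:])                # boxes as (x, y, w, h) in cropped coordinates (assumed)
    return [(x, y + y0, w, bh) for (x, y, w, bh) in boxes]
```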
The following describes steps of generating a clip image corresponding to a video frame based on the element detection result by several embodiments. The execution subject of the present embodiment is a computer device.
In some embodiments, the element detection result includes: the method comprises the steps of a first detection result and a second detection result, wherein the first detection result is used for representing the area occupied by a positive content element in a video frame, the second detection result is used for representing the area occupied by a negative content element in the video frame, the positive content element is used for representing the picture content to be reserved, and the negative content element is used for representing the picture content to be cut.
For the description of the positive content element and the negative content element, reference is made to the above embodiments, and details are not repeated here.
In step 333, the second detection result is used to correct the first detection result to obtain a target cropping area, where the target cropping area is used to indicate the region where the cropping image is located in the video frame.
In some embodiments, the target crop area is less than or equal to the area occupied by the forward content element in the video frame. The purpose of correcting the first detection result with the second detection result is to trim the area occupied by the positive content element according to the area occupied by the negative content element, obtaining the target crop area.
Optionally, the target cropping area is a rectangular area. Illustratively, the size of the target crop area is adapted to the display size of the video cover of the material video. For example, if the display size of the video cover of the material video is 16:9, the display size of the target clipping region is also 16:9. In the case where the target crop area is a rectangle, the target crop area may be located in the video frame by the coordinates of its center point, the length of the rectangle, and the width of the rectangle, where the center point is the intersection of the rectangle's two diagonals.
Optionally, the target cutout region is irregularly shaped. For example, in the case where the computer device splices out the video cover using clip images corresponding to m video frames, respectively, the target clip region may be irregularly shaped, and the shapes of the target clip regions corresponding to m video frames, respectively, are different. In this case, the target cropping areas corresponding to the m video frames, respectively, may be stitched into the display size of the video cover. For example, the shape of the video cover is circular, m is equal to 4, and the target crop area is a quarter sector. In the case where the target cropping zone is a sector, the area occupied by the target cropping zone in the video frame may be determined by the center of the sector, the radius of the sector, and the angle of deflection of the initial side of the sector relative to the positive direction.
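The two positioning schemes above could be represented, for example, by the following data structures; the field names are illustrative and not taken from this application.

```python
from dataclasses import dataclass

@dataclass
class RectCropArea:
    """Rectangular target crop area, located by the center point (the intersection of the
    rectangle's two diagonals) plus the rectangle's length and width."""
    cx: float
    cy: float
    length: float   # horizontal extent (assumed)
    width: float    # vertical extent (assumed)

@dataclass
class SectorCropArea:
    """Sector-shaped target crop area (e.g. a quarter sector when m == 4), located by the
    sector's center, its radius, and the deflection angle of its initial side."""
    cx: float
    cy: float
    radius: float
    start_angle_deg: float
```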
The positioning method of the target clipping region in the region occupied in the video frame is set according to actual needs, and the present application is not limited herein. For a specific step of correcting the first detection result by using the second detection result, please refer to the following embodiments.
In a substep 336, the video frame is cropped according to the target cropping zone to obtain a cropped image.
In some embodiments, after determining the target crop area, the computer device determines the area occupied by the target crop area in the video frame and removes from the video frame everything outside the target crop area to obtain the cropped image.
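For the rectangular case, sub-step 336 reduces to a slice of the frame array; the sketch below reuses the hypothetical RectCropArea above and assumes an H×W×3 image.

```python
import numpy as np

def crop_to_target(frame: np.ndarray, area: "RectCropArea") -> np.ndarray:
    """Keep only the pixels inside the rectangular target crop area and discard everything else.
    Coordinates are clamped to the frame so the slice is always valid."""
    h, w = frame.shape[:2]
    x0 = max(0, int(area.cx - area.length / 2))
    x1 = min(w, int(area.cx + area.length / 2))
    y0 = max(0, int(area.cy - area.width / 2))
    y1 = min(h, int(area.cy + area.width / 2))
    return frame[y0:y1, x0:x1].copy()
```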
Since text content or identification content may exist in the video frame, these negative content elements, if not removed, may affect the display effect of the finally generated video cover. With the above method, the negative content elements can be removed while the positive content elements are retained, which improves the quality of the video cover generated from the cropped image and avoids the poor visual effect caused by unnecessary negative content elements appearing in the cover. This also reduces the user's need to manually rework the automatically generated video cover and makes cover generation more convenient.
The method for determining the target clipping region will be described in the following by several embodiments.
In some embodiments, the forward content element includes at least one of: character content and saliency content, wherein the character content refers to characters displayed in a video frame, and the saliency content refers to center content attracting vision in the video frame; the negative content element includes at least one of: identification content and text content. For the description of the positive content element and the negative content element, reference is made to the above embodiments, and details are not repeated here.
In some embodiments, a clipping priority exists between the positive content elements and the negative content elements. The clipping priority indicates how the target clipping region is determined when the regions occupied by different content elements overlap. In other words, the clipping priority characterizes the priority with which each type of content element is included in the target region.
Optionally, the computer device determines the target crop area based on the crop priority, the first detection result, and the second detection result. The clipping priority is preset.
In some embodiments, the substep 333, using the second detection result to correct the first detection result to obtain the target clipping region, may include the following steps:
Sub-step 333a determines the first region occupied by the salient content in the video frame as the initial target region. In some embodiments, the clipping priorities are arranged from high to low as: character content, saliency content, text content, and identification content.
Optionally, the first area occupied by the salient content in the video frame is determined from the second sub-result included in the first detection result; the second area occupied by the identification content in the video frame is determined from the third sub-result included in the second detection result; the third area occupied by the character content in the video frame is determined from the first sub-result included in the first detection result; and the fourth area occupied by the text content in the video frame is determined from the fourth sub-result included in the second detection result.
In sub-step 333b, when the second area occupied by the identification content in the video frame overlaps the initial target area, the initial target area is clipped according to the second area to obtain a first target area.
Optionally, the cases in which the second region overlaps the initial target region include, but are not limited to, at least one of the following: the lower part of the second area overlaps the upper part of the initial target area; the upper part of the second area overlaps the lower part of the initial target area; the left side of the second area overlaps the right side of the initial target area; the right side of the second area overlaps the left side of the initial target area.
Optionally, the area occupied by the first target area in the video frame does not include tag content. The shape of the first target area is the same as the shape of the initial target area. For example, the first target area and the initial target area are both rectangular.
Sub-step 333c: when the third area occupied by the character content in the video frame does not overlap the fourth area occupied by the text content in the video frame, the first target area is cropped according to the fourth area to obtain a second target area.
In the case that the second target region does not completely include the third region, the sub-step 333d supplements the second target region with the third region to obtain the target clipping region.
Optionally, the target crop region entirely includes the third region. The shape of the target cropped area is the same as the shape of the initial target area.
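The correction order of sub-steps 333a-333d can be sketched with axis-aligned boxes as follows; `subtract_box` and `union_box` are hypothetical helpers that shrink or grow a rectangle while keeping it rectangular, and are not defined in this application.

```python
def overlaps(a, b):
    """Axis-aligned boxes given as (x0, y0, x1, y1)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def contains(outer, inner):
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])

def correct_crop_area(saliency_box, person_box, logo_box, text_box, subtract_box, union_box):
    """Sketch of sub-steps 333a-333d. subtract_box/union_box are hypothetical caller-supplied helpers."""
    target = saliency_box                                   # 333a: start from the saliency region
    if logo_box is not None and overlaps(target, logo_box):
        target = subtract_box(target, logo_box)             # 333b: trim away the identification content
    if text_box is not None and person_box is not None and not overlaps(person_box, text_box):
        target = subtract_box(target, text_box)             # 333c: trim text that does not touch the characters
    if person_box is not None and not contains(target, person_box):
        target = union_box(target, person_box)              # 333d: expand so the characters are fully included
    return target
```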
By correcting the first detection result with the second detection result in this way, the target clipping region removes negative content elements as much as possible while retaining the positive content elements, which improves the quality of the cropped image obtained by clipping according to the target clipping region.
The process of generating a video cover from a cropped image is described in several embodiments below. The execution subject of the present embodiment is a computer device.
In some embodiments, before step 340 of generating the video cover of the material video according to the clip images corresponding to the m video frames, the method further includes: the computer device determines the display size of the video cover according to the playing format of the material video, where the playing format is horizontal or vertical; the computer device stretches or shrinks the cropped image based on the display size to obtain a resized cropped image, and the resized cropped image is used to generate the video cover of the material video.
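A minimal sketch of this size adaptation is shown below, assuming OpenCV; the concrete pixel dimensions are assumptions, since the application only distinguishes horizontal from vertical covers.

```python
import cv2
import numpy as np

def fit_to_cover_size(crop: np.ndarray, orientation: str) -> np.ndarray:
    """Stretch/shrink the cropped image to the cover's display size.
    The 1280x720 / 720x1280 sizes are illustrative assumptions."""
    target = (1280, 720) if orientation == "horizontal" else (720, 1280)  # (width, height)
    return cv2.resize(crop, target, interpolation=cv2.INTER_AREA)
```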
In some embodiments, step 340, generating a video cover of the material video according to the clip images corresponding to the m video frames, includes the following sub-steps:
In sub-step 343, the cropped images corresponding to the m video frames are comprehensively scored to obtain comprehensive scoring results corresponding to the m cropped images, where a comprehensive scoring result is used to evaluate a cropped image comprehensively in terms of picture quality and picture content.
Optionally, the computer device weights the picture quality score and the aesthetic score to obtain the comprehensive scoring result, where the weights corresponding to the picture quality score and the aesthetic score may be equal or unequal; this application does not limit the weights.
For example, the picture quality score and the aesthetic score may be determined by the first score model and the second score model in the above embodiments, and details of the first score model and the second score model are referred to the above embodiments, and are not described herein.
Optionally, the composite scoring result further comprises at least one of: face quality score, open eye quality score, safety quality score, etc. The face quality score is used for representing the proportion of the face area in the clipping image. Illustratively, the higher the proportion of face regions, the higher the face quality score; the lower the proportion of face regions, the lower the face quality score. The open eye quality score is used to characterize the open eye condition of the person in the cropped image. Illustratively, when the person in the cropped image is in an open-eye state, the open-eye quality score is high; when the person in the clipping image is in the closed-eye state, the open-eye quality score is low. Illustratively, the security quality score is used to characterize whether the screen content in the cropped image meets the content requirements of the platform on which the target application is located. If the cut image has a picture content which does not meet the content requirement of the target application program, the cut image cannot be used for generating the video cover.
In one example, the computer device performs a weighted sum based on the picture quality score, the aesthetic score, the face quality score, the open eye quality score, and the safety quality score to obtain a composite score result. The higher the composite score result, the more suitable the cropped image is as a video cover. Illustratively, the weight of the aesthetic score is highest, and the weight of the picture quality score is only lower than the aesthetic score.
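A minimal sketch of such a weighted sum is shown below; the weight values are illustrative assumptions that only respect the ordering suggested above (aesthetics highest, picture quality second). The subsequent selection of the n best crops is then just a sort over these scores.

```python
def composite_score(scores: dict[str, float]) -> float:
    """Weighted sum of the per-dimension scores for one cropped image.
    The weights are assumptions, not values given by this application."""
    weights = {
        "aesthetic": 0.35,        # highest weight
        "picture_quality": 0.25,  # second highest
        "face_quality": 0.15,
        "open_eye": 0.10,
        "safety": 0.15,
    }
    return sum(weights[k] * scores.get(k, 0.0) for k in weights)
```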
Step 346, selecting n clipping images from the m clipping images according to the comprehensive scoring result, wherein n is a positive integer less than or equal to m.
Optionally, the computer device selects the top n cropping images with the highest comprehensive scoring result from the m cropping images. Optionally, the computer device selects n cropped images with the comprehensive score result greater than the fourth threshold from the m cropped images. Illustratively, the computer device selects 1 crop image from the m crop images. Illustratively, the computer device selects a plurality of clip images from the m clip images, the selected plurality of clip images being used to collectively generate the video cover of the material video.
Illustratively, to increase the efficiency of generating video covers, the computer device generates n video covers from the n selected cropped images, respectively. For example, for any one of n cropped images, the computer device generates 1 video cover from the cropped image.
In step 349, fusion processing is performed based on the n clipping images, so as to generate a video cover of the material video, where the video cover includes picture contents of the n clipping images.
When n is equal to 1, the computer device fuses the selected cropped image with the video name of the material video to obtain the video cover of the material video. Optionally, the video name of the material video is set by the user, and the display style of the video name may be specified by the user from among candidate styles. The display style includes, but is not limited to, at least one of: font, font size, display color, etc.
When n is greater than 1, the computer device fuses the selected cropped images to obtain a fused image, and combines the fused image with the video name of the material video to obtain the video cover of the material video.
Illustratively, fusing the plurality of selected cropped images includes: splicing the selected cropped images according to the display size of the video cover to obtain a spliced image, where the minimum distance between the edges of two adjacent selected cropped images in the spliced image is c, and c is a real number; and blurring the boundaries between adjacent selected cropped images in the spliced image to obtain the video cover of the material video.
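A minimal sketch of the splicing-and-blurring step is given below, assuming OpenCV and equally sized crops laid out side by side; the seam width and blur parameters are assumptions.

```python
import cv2
import numpy as np

def stitch_crops(crops: list[np.ndarray], seam_width: int = 12) -> np.ndarray:
    """Splice same-height crops horizontally and blur a narrow band around each boundary
    so the transition between adjacent crops is soft."""
    cover = np.hstack(crops)
    x = 0
    for crop in crops[:-1]:
        x += crop.shape[1]                                   # boundary between this crop and the next
        lo, hi = max(0, x - seam_width), x + seam_width
        band = cover[:, lo:hi].copy()
        cover[:, lo:hi] = cv2.GaussianBlur(band, (21, 21), 7)
    return cover
```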
The video cover of the material video is generated by selecting the clipping image with a good comprehensive scoring result from the plurality of clipping images, so that the display quality of the generated video cover is improved.
In order to improve the quality of the generated cropped image, the black edges of the video frame need to be removed before element detection is performed, and the process is described in several embodiments below.
In some embodiments, before the computer device performs picture element detection on the video frame to obtain the element detection result of the video frame, the method further includes:
the computer device performs invalid region detection on the video frame and determines a region detection result, where the region detection result is used to characterize the area occupied in the video frame by the non-content region filled in to adapt to the picture size;
and the computer equipment cuts out a black edge region included in the video frame according to the region detection result to obtain a black edge-free image, wherein the black edge-free image is used for detecting picture elements to obtain an element detection result of the video frame.
In some embodiments, the computer device performs invalid region detection on the video frame and determines the region detection result as follows: the computer device performs edge detection on the video frame and determines an edge detection result of the video frame; the computer device performs Hough line detection on the edge detection result to generate a straight-line detection result; and the computer device performs binarization processing on the straight-line detection result to determine the region detection result.
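The sketch below shows a simplified variant of this invalid-region detection using OpenCV; it keeps only a binarization-based bound and omits the Hough line step for brevity, and the threshold value is an assumption.

```python
import cv2
import numpy as np

def crop_black_borders(frame_bgr: np.ndarray) -> np.ndarray:
    """Simplified black-edge removal: binarize the luminance, treat near-black rows/columns
    at the frame edges as filler, and crop to the bounding box of the remaining content."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    _, mask = cv2.threshold(gray, 10, 255, cv2.THRESH_BINARY)   # non-black content -> 255
    ys, xs = np.nonzero(mask)
    if xs.size == 0:                                            # an entirely black frame is kept as-is
        return frame_bgr
    return frame_bgr[ys.min(): ys.max() + 1, xs.min(): xs.max() + 1].copy()
```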
Cropping away the black edges detected in the video frame reduces the number of pixels included in the video frame, which relieves the load of subsequent processing of the video frame.
Fig. 7 is a schematic diagram of invalid region detection provided in an exemplary embodiment of the present application. As shown in fig. 7, the black edges 710 and 720 in the video frame do not contain the effective picture content, and the black edge-free image 730 can be obtained by clipping the black edges 710 and 720.
Fig. 8 is a schematic diagram of invalid region detection provided in another exemplary embodiment of the present application.
As shown in fig. 8, the black border 810 in the video frame is a playback-related display area that does not contain effective picture content, and a black-border-free image 820 can be obtained by cropping out the black border 810.
Fig. 9 is a schematic diagram of invalid region detection provided in another exemplary embodiment of the present application. When invalid region detection is performed on the video frame 910, the text 920 in the video frame 910 is preserved, and the image obtained after cropping the black edges is shown as 930.
Fig. 10 is a schematic diagram of invalid region detection provided in another exemplary embodiment of the present application. For a vertical video frame, as in sub-picture 10-a, invalid region detection can obtain the black-edge-free image 1020 from the video frame 1010; as in sub-picture 10-b, invalid region detection can obtain the black-edge-free image 1040 from the video frame 1030.
The following describes the process of generating a video cover of a material video by way of an example.
FIG. 11 is a schematic diagram of the generation of a video cover provided in an exemplary embodiment of the present application. The present embodiment is executed by a computer device, and includes the following steps:
and step A10, analyzing the episodes of the material video to obtain video information of the material video.
The computer device obtains the material video, for example a material video encoded in mvid. Then, the computer device acquires the playing audio of the material video, performs segmented sampling on the playing audio to obtain a plurality of sampling segments, and acquires the video information of the material video according to the plurality of sampling segments.
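The segmented sampling of the playing audio could look like the sketch below; the two-second segment length and the use of RMS energy as the per-segment signal strength are assumptions.

```python
import numpy as np

def sample_audio_segments(audio: np.ndarray, sr: int, seg_seconds: float = 2.0) -> list[dict]:
    """Split mono audio into fixed-length sampling segments; each segment records its playing
    time period and its signal strength (RMS energy) within that period."""
    seg_len = int(sr * seg_seconds)
    segments = []
    for start in range(0, len(audio), seg_len):
        chunk = audio[start: start + seg_len].astype(np.float64)
        segments.append({
            "start_s": start / sr,
            "end_s": min(len(audio), start + seg_len) / sr,
            "strength": float(np.sqrt(np.mean(chunk ** 2))),
        })
    return segments
```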
And step A20, respectively acquiring at least one video frame from each plot of the material video according to the video information to obtain a plurality of candidate video frames. Illustratively, the computer device selects at least one key frame from the episode as a candidate video frame.
In step A30, the computer device optionally performs layer-by-layer screening on the plurality of candidate video frames according to the picture quality and the aesthetic degree of the candidate video frames to obtain m video frames.
Illustratively, the computer device scores the plurality of candidate video frames using a first scoring model to obtain first scoring results corresponding to the plurality of candidate video frames respectively; selecting a plurality of intermediate video frames from the plurality of candidate video frames according to the first scoring result; scoring the plurality of intermediate video frames by using a second scoring model to obtain second scoring results respectively corresponding to the plurality of intermediate video frames; and selecting m frames of video frames from the plurality of intermediate video frames according to the second scoring result.
For the structure of the first scoring model and the second scoring model, please refer to the above embodiments, and details are not described herein. Optionally, the computer device performs ineffective area detection on m video frames, and determines an area detection result; and cutting out a black edge region included in the video frame according to the region detection result to obtain a black edge-free image, wherein the black edge-free image is used for detecting picture elements to obtain an element detection result of the video frame.
Step a40, in the case that the forward content element includes character content, determining at least one first feature map of the video frame using a backbone network of a first sub-model included in the first detection model; using a feature pyramid network of the first sub-model to perform fusion processing on at least one first feature map to generate fusion features; and identifying the fusion characteristic by using the character detection network of the first sub-model to generate a first sub-result.
Step A50, when the forward content element includes the salient content, determining at least one second feature map of the video frame by using a plurality of feature extraction networks in series in a second sub-model included in the first detection model; and decoding the at least one second feature map by using a decoder of the second sub-model to generate a second sub-result.
Step A60: when the negative content element includes the identification content, the video frame is divided into a plurality of sub-regions using a third sub-model included in the second detection model, the probability that identification content is present in each sub-region is predicted, and a third sub-result is determined according to the probabilities of the plurality of sub-regions; when the negative content element includes text content, text recognition is performed in the first region of the video frame using a fourth sub-model included in the second detection model to obtain a fourth sub-result.
Step A70, correcting the first detection result by using the second detection result to obtain a target cutting area; and cutting the video frame according to the target cutting area to obtain a cutting image.
Step A80, comprehensively scoring the cut images corresponding to the m video frames respectively to obtain comprehensive scoring results corresponding to the m cut images respectively; selecting n clipping images from the m clipping images according to the comprehensive scoring result; and carrying out fusion processing based on the n clipping images to generate a video cover of the material video.
Optionally, step A80 may generate a plurality of video covers, for example a horizontal cover and a vertical cover. After a video cover is generated, the computer device determines whether the sharpness of the video cover is less than or equal to a sharpness threshold; if so, the computer device enhances the image quality of the video cover to obtain an enhanced video cover and outputs the enhanced video cover. The enhanced video cover has a higher sharpness than the original video cover.
For a detailed description of this embodiment, reference should be made to the above embodiments.
FIG. 12 is a diagram of an effect of generating a video cover according to an exemplary embodiment of the present application. By performing element detection on the video frames to obtain the cropped images and generating the video cover of the material video from those cropped images, a video cover with higher picture quality can be produced, improving the user's viewing experience.
FIG. 13 illustrates a block diagram of a video cover generation apparatus provided in an exemplary embodiment of the present application. The apparatus 1300 may include: a video frame acquisition module 1310, an element detection module 1320, a video frame cropping module 1330, and a cover generation module 1340.
The video frame acquisition module 1310 is configured to acquire m video frames from the material video, where m is a positive integer.
The element detection module 1320 is configured to perform, for each of the m video frames, picture element detection on the video frame to obtain an element detection result of the video frame, where the element detection result is used to characterize the area occupied by picture elements in the video frame, and the picture elements are related to the picture content in the video frame.
The video frame cropping module 1330 is configured to crop the video frame according to the element detection result, so as to obtain a cropped image corresponding to the video frame, where the size of the cropped image is smaller than or equal to the size of the video frame.
The cover generation module 1340 is configured to generate a video cover of the material video according to the clip images corresponding to the m video frames.
In some embodiments, the element detection result includes: the video frame display device comprises a first detection result and a second detection result, wherein the first detection result is used for representing the area occupied by a positive content element in the video frame, the second detection result is used for representing the area occupied by a negative content element in the video frame, the positive content element is used for representing the picture content to be reserved, and the negative content element is used for representing the picture content to be cut; the element detection module 1320 includes: the first detection unit is used for processing the video frame by using a first detection model to generate a first detection result; and the second detection unit is used for processing the video frame by using a second detection model and generating a second detection result.
In some embodiments, the forward content element includes at least one of: character content and saliency content, wherein the character content refers to characters displayed in the video frame, and the saliency content refers to center content which attracts sight in the video frame; the first detection unit is used for: determining at least one first feature map of the video frame using a backbone network of a first sub-model comprised by the first detection model, in case the forward content element comprises the persona content; using the feature pyramid network of the first sub-model to perform fusion processing on the at least one first feature map to generate fusion features; identifying the fusion characteristics by using a character detection network of the first sub-model to generate a first sub-result, wherein the first sub-result is used for representing the area occupied by the character content in the video frame; determining at least one second feature map of the video frame using a plurality of feature extraction networks in series in a second sub-model comprised by the first detection model, in case the forward content element comprises the salient content; and decoding the at least one second feature map by using a decoder of the second sub-model to generate a second sub-result, wherein the second sub-result is used for representing the area occupied by the salient content in the video frame.
In some embodiments, the negative content element includes at least one of: identification content and text content; the second detection unit is configured to, when the negative content element includes the identification content, divide the video frame into a plurality of sub-regions by using a third sub-model included in the second detection model, predict a probability of existence of the identification content in each sub-region, and determine a third sub-result according to the probabilities of the plurality of sub-regions, where the third sub-result is used to characterize the region occupied by the identification content in the video frame; and, when the negative content element includes the text content, perform text recognition in a first area of the video frame by using a fourth sub-model included in the second detection model to obtain a fourth sub-result, where the fourth sub-result is used to characterize the area occupied by the text content in the video frame.
In some embodiments, the element detection result includes: the video frame display device comprises a first detection result and a second detection result, wherein the first detection result is used for representing the area occupied by a positive content element in the video frame, the second detection result is used for representing the area occupied by a negative content element in the video frame, the positive content element is used for representing the picture content to be reserved, and the negative content element is used for representing the picture content to be cut;
The video frame cropping module 1330 includes: the region determining unit is used for correcting the first detection result by using the second detection result to obtain a target clipping region, wherein the target clipping region is used for indicating the region where the clipping image is located in the video frame; and the image clipping unit is used for clipping the video frame according to the target clipping region to obtain the clipping image.
In some embodiments, the forward content element includes at least one of: character content and saliency content, wherein the character content refers to characters displayed in the video frame, and the saliency content refers to center content which attracts sight in the video frame; the negative content element includes at least one of: identification content and text content; the area determining unit is used for: determining a first area occupied by the salient content in the video frame as an initial target area; in the case that a second area occupied by the identification content in the video frame overlaps the initial target area, cutting the initial target area according to the second area to obtain a first target area; in the case that a third area occupied by the character content in the video frame does not overlap a fourth area occupied by the text content in the video frame, cutting the first target area according to the fourth area to obtain a second target area; and in the case that the second target area does not completely contain the third area, supplementing the second target area with the third area to obtain the target clipping area.
In some embodiments, the cover generation module 1340 includes: the scoring calculation unit is used for comprehensively scoring the cut images corresponding to the m video frames respectively to obtain m comprehensive scoring results corresponding to the cut images respectively, and the comprehensive scoring results are used for comprehensively evaluating the cut images from the picture quality and the picture content; the image selection unit is used for selecting n clipping images from m clipping images according to the comprehensive scoring result, wherein n is a positive integer less than or equal to m; the cover generation unit is used for carrying out fusion processing on the basis of the n clipping images to generate a video cover of the material video, wherein the video cover comprises picture contents of the n clipping images.
In some embodiments, the apparatus 1300 further includes a black edge clipping unit configured to perform ineffective region detection on the video frame, and determine a region detection result, where the region detection result is used to characterize a region occupied by a content-free region filled to fit in a picture size in the video frame; and cutting out a black edge region included in the video frame according to the region detection result to obtain a black edge-free image, wherein the black edge-free image is used for detecting picture elements to obtain an element detection result of the video frame.
In some embodiments, the video frame acquisition module 1310 includes: the information acquisition unit is used for carrying out plot analysis on the material video to acquire video information of the material video, wherein the video information is used for dividing the material video into at least one plot segment, and the same plot segment comprises a plurality of video frames with similar picture contents; the video frame acquisition unit is used for respectively acquiring at least one video frame from each plot segment of the material video according to the video information to obtain a plurality of candidate video frames; and the video frame screening unit is used for carrying out layer-by-layer screening on the plurality of candidate video frames to obtain the m video frames.
In some embodiments, the information obtaining unit is configured to obtain a playing audio of the material video, where the playing audio is used for playing in synchronization with the video frame; the method comprises the steps that the playing audio is sampled in a segmented mode, a plurality of sampling segments are obtained, the sampling segments correspond to a first playing time period, and the sampling segments comprise signal intensity of the playing audio in the first playing time period; and acquiring video information of the material video according to the plurality of sampling fragments.
In some embodiments, the video frame screening unit is configured to score the plurality of candidate video frames using a first scoring model, to obtain first scoring results corresponding to the plurality of candidate video frames respectively; selecting a plurality of intermediate video frames from the plurality of candidate video frames according to the first scoring result; scoring the plurality of intermediate video frames by using a second scoring model to obtain second scoring results respectively corresponding to the plurality of intermediate video frames; selecting the m-frame video frames from the plurality of intermediate video frames according to the second scoring result; one of the first scoring model and the second scoring model is used for scoring the attractive degree of the video frame, and the other is used for scoring the picture quality of the video frame.
It should be noted that the division of functional modules in the apparatus provided in the foregoing embodiment is only an example; in practical application, the above functions may be allocated to different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to implement all or part of the functions described above. In addition, the apparatus embodiments and the method embodiments provided above belong to the same concept; the specific implementation process is detailed in the method-side embodiments and is not repeated here. The beneficial effects of the apparatus provided in the foregoing embodiments are described in the method-side embodiments and are not repeated here.
Fig. 14 shows a block diagram of a computer device according to an exemplary embodiment of the present application.
In general, the computer device 1400 includes: a processor 1401 and a memory 1402.
Processor 1401 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like. The processor 1401 may be implemented in at least one hardware form of a DSP (Digital Signal Processor), an FPGA (Field Programmable Gate Array), or a PLA (Programmable Logic Array). The processor 1401 may also include a main processor and a coprocessor; the main processor, also called a CPU (Central Processing Unit), is a processor for processing data in the awake state, and the coprocessor is a low-power processor for processing data in the standby state. In some embodiments, the processor 1401 may be integrated with a GPU (Graphics Processing Unit) for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 1401 may also include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Memory 1402 may include one or more computer-readable storage media, which may be tangible and non-transitory. Memory 1402 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 1402 stores a computer program that is loaded and executed by processor 1401 to implement the video cover generation method provided by the method embodiments described above.
The embodiments of the present application also provide a computer readable storage medium having a computer program stored therein, where the computer program is loaded and executed by a processor to implement the video cover generating method provided in the above embodiments of the methods.
The computer readable medium may include computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM (Random Access Memory ), ROM (Read-Only Memory), EPROM (Erasable Programmable Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash Memory or other solid state Memory technology, CD-ROM, DVD (Digital Video Disc, high density digital video disc) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will recognize that the computer storage medium is not limited to the ones described above.
Embodiments of the present application also provide a computer program product, where the computer program product includes computer instructions stored in a computer readable storage medium, and a processor reads and executes the computer instructions from the computer readable storage medium to implement the video cover generating method provided in the foregoing method embodiments.
It should be understood that references herein to "a plurality" are to two or more. "and/or", describes an association relationship of an association object, and indicates that there may be three relationships, for example, a and/or B, and may indicate: a exists alone, A and B exist together, and B exists alone. The character "/" generally indicates that the context-dependent object is an "or" relationship.
The foregoing description of the preferred embodiments is merely illustrative of the present application and is not intended to limit the invention to the particular embodiments shown, but on the contrary, the intention is to cover all modifications, equivalents, alternatives, and alternatives falling within the spirit and principles of the invention.
Claims (14)
1. A method of generating a video cover, the method comprising:
obtaining m video frames from a material video, wherein m is a positive integer;
for each video frame in the m video frames, performing picture element detection on the video frame to obtain an element detection result of the video frame, wherein the element detection result is used for representing the area occupied by picture elements in the video frame, and the picture elements are related to picture content in the video frame;
Cutting the video frame according to the element detection result to obtain a cutting image corresponding to the video frame, wherein the size of the cutting image is smaller than or equal to that of the video frame;
and generating a video cover of the material video according to the clipping images respectively corresponding to the m video frames.
2. The method of claim 1, wherein the element detection result includes: the video frame display device comprises a first detection result and a second detection result, wherein the first detection result is used for representing the area occupied by a positive content element in the video frame, the second detection result is used for representing the area occupied by a negative content element in the video frame, the positive content element is used for representing the picture content to be reserved, and the negative content element is used for representing the picture content to be cut;
the step of detecting the picture elements of the video frame to obtain the element detection result of the video frame comprises the following steps:
processing the video frame by using a first detection model to generate a first detection result;
and processing the video frame by using a second detection model to generate a second detection result.
3. The method of claim 2, wherein the forward content element comprises at least one of: character content and saliency content, wherein the character content refers to characters displayed in the video frame, and the saliency content refers to center content which attracts sight in the video frame;
The processing the video frame by using a first detection model to generate the first detection result includes:
determining at least one first feature map of the video frame using a backbone network of a first sub-model comprised by the first detection model, in case the forward content element comprises the persona content;
using the feature pyramid network of the first sub-model to perform fusion processing on the at least one first feature map to generate fusion features;
identifying the fusion characteristics by using a character detection network of the first sub-model to generate a first sub-result, wherein the first sub-result is used for representing the area occupied by the character content in the video frame;
determining at least one second feature map of the video frame using a plurality of feature extraction networks in series in a second sub-model comprised by the first detection model, in case the forward content element comprises the salient content;
and decoding the at least one second feature map by using a decoder of the second sub-model to generate a second sub-result, wherein the second sub-result is used for representing the area occupied by the salient content in the video frame.
4. The method of claim 2, wherein the negative content element comprises at least one of: identifying content and text content;
the processing the video frame by using a second detection model to generate the second detection result, including:
when the negative content element comprises the identification content, dividing the video frame into a plurality of subareas by using a third sub-model included in the second detection model, predicting the probability of the identification content in each subarea, and determining a third sub-result according to the probabilities of the plurality of subareas, wherein the third sub-result is used for representing the area occupied by the identification content in the video frame;
and under the condition that the negative content element comprises the text content, performing text recognition in a first area of the video frame by using a fourth sub-model included in the second detection model to obtain a fourth sub-result, wherein the fourth sub-result is used for representing the area occupied by the text content in the video frame.
5. The method of claim 1, wherein the element detection result includes: the video frame display device comprises a first detection result and a second detection result, wherein the first detection result is used for representing the area occupied by a positive content element in the video frame, the second detection result is used for representing the area occupied by a negative content element in the video frame, the positive content element is used for representing the picture content to be reserved, and the negative content element is used for representing the picture content to be cut;
Clipping the video frame according to the element detection result to obtain a clipping image corresponding to the video frame, including:
correcting the first detection result by using the second detection result to obtain a target clipping region, wherein the target clipping region is used for indicating the region of the clipping image in the video frame;
and clipping the video frame according to the target clipping region to obtain the clipping image.
6. The method of claim 5, wherein the forward content element comprises at least one of: character content and saliency content, wherein the character content refers to characters displayed in the video frame, and the saliency content refers to center content which attracts sight in the video frame; the negative content element includes at least one of: identifying content and text content;
the correcting the first detection result by using the second detection result to obtain a target clipping region includes:
determining a first area occupied by the salient content in the video frame as an initial target area;
cutting the initial target area according to the second area under the condition that a second area occupied by the identification content in the video frame overlaps the initial target area, so as to obtain a first target area;
cutting the first target area according to the fourth area under the condition that a third area occupied by the character content in the video frame does not overlap a fourth area occupied by the text content in the video frame, so as to obtain a second target area;
and when the second target area does not completely contain the third area, supplementing the second target area by using the third area to obtain the target clipping area.
7. The method according to claim 1, wherein the generating the video cover of the material video according to the clip images corresponding to the m video frames respectively includes:
comprehensively scoring the cut images corresponding to the m video frames respectively to obtain m comprehensive scoring results corresponding to the cut images respectively, wherein the comprehensive scoring results are used for comprehensively evaluating the cut images from the picture quality and the picture content;
selecting n clipping images from m clipping images according to the comprehensive scoring result, wherein n is a positive integer less than or equal to m;
and carrying out fusion processing on the n clipping images to generate a video cover of the material video, wherein the video cover comprises picture contents of the n clipping images.
8. The method according to claim 1, wherein before performing picture element detection on the video frame to obtain an element detection result of the video frame, the method further comprises:
performing invalid region detection on the video frame, and determining a region detection result, wherein the region detection result is used for characterizing the region occupied by a content-free region filled for adapting to the picture size in the video frame;
and cutting out a black edge region included in the video frame according to the region detection result to obtain a black edge-free image, wherein the black edge-free image is used for detecting picture elements to obtain an element detection result of the video frame.
9. The method of claim 1, wherein the obtaining m video frames from the material video comprises:
carrying out scenario analysis on the material video to obtain video information of the material video, wherein the video information is used for dividing the material video into at least one scenario segment, and the same scenario segment comprises a plurality of video frames with similar picture contents;
respectively acquiring at least one video frame from each plot of the material video according to the video information to obtain a plurality of candidate video frames;
And carrying out layer-by-layer screening on the plurality of candidate video frames to obtain the m video frames.
10. The method of claim 9, wherein the performing scenario analysis on the material video to obtain the video information of the material video comprises:
acquiring playing audio of the material video, wherein the playing audio is played synchronously with the video frames;
performing segmented sampling on the playing audio to obtain a plurality of sampling segments, wherein each sampling segment corresponds to a first playing time period and comprises the signal intensity of the playing audio within the first playing time period;
and acquiring the video information of the material video according to the plurality of sampling segments.
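One plausible reading of claim 10 is to window the soundtrack, measure the signal strength of each window, and treat quiet windows as candidate scenario boundaries. The sketch below follows that reading; the window length and threshold are assumptions, not values from the patent:

```python
# One plausible reading of claim 10: window the soundtrack, measure per-window
# signal strength (RMS), and treat quiet windows as candidate scenario cuts.
# Window length and threshold are assumptions, not values from the patent.
import numpy as np

def audio_segment_strengths(samples, sample_rate, window_s=1.0):
    """samples: 1-D float array of mono audio; returns (start_time, rms) pairs."""
    window = int(sample_rate * window_s)
    strengths = []
    for start in range(0, len(samples) - window + 1, window):
        chunk = samples[start:start + window]
        rms = float(np.sqrt(np.mean(chunk ** 2)))   # signal intensity in this playing period
        strengths.append((start / sample_rate, rms))
    return strengths

def quiet_boundaries(strengths, threshold=0.02):
    # Time points whose RMS falls below the threshold: likely scenario boundaries.
    return [t for t, rms in strengths if rms < threshold]
```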
11. The method of claim 9, wherein the layer-by-layer screening of the plurality of candidate video frames to obtain the m video frames comprises:
scoring the plurality of candidate video frames by using a first scoring model to obtain first scoring results respectively corresponding to the plurality of candidate video frames;
selecting a plurality of intermediate video frames from the plurality of candidate video frames according to the first scoring results;
scoring the plurality of intermediate video frames by using a second scoring model to obtain second scoring results respectively corresponding to the plurality of intermediate video frames;
selecting the m video frames from the plurality of intermediate video frames according to the second scoring results;
wherein one of the first scoring model and the second scoring model is used for scoring the attractiveness of a video frame, and the other is used for scoring the picture quality of a video frame.
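The two-stage screening of claim 11 amounts to a scoring cascade. The sketch below is illustrative only; the two models are stand-ins for an attractiveness scorer and a picture-quality scorer, and the keep counts are arbitrary:

```python
# Sketch of the two-stage ("layer-by-layer") screening in claim 11. The two
# models are stand-ins for an attractiveness scorer and a picture-quality
# scorer; the keep counts are arbitrary.
def cascade_select(frames, first_model, second_model, keep_first=20, keep_final=5):
    # Stage 1: coarse ranking of all candidates with the first scoring model.
    stage1 = sorted(frames, key=first_model, reverse=True)[:keep_first]
    # Stage 2: re-rank the survivors with the second scoring model.
    return sorted(stage1, key=second_model, reverse=True)[:keep_final]
```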
12. A video cover generation apparatus, the apparatus comprising:
the video frame acquisition module is used for acquiring m video frames from the material video, wherein m is a positive integer;
the element detection module is used for performing picture element detection on each video frame of the m video frames to obtain an element detection result of the video frame, wherein the element detection result is used for representing the area occupied by a picture element in the video frame, and the picture element is related to the picture content of the video frame;
the video frame clipping module is used for clipping the video frame according to the element detection result to obtain a clipping image corresponding to the video frame, wherein the size of the clipping image is smaller than or equal to that of the video frame;
and the cover generation module is used for generating a video cover of the material video according to the clipping images respectively corresponding to the m video frames.
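As a non-normative illustration, the module layout of claim 12 could be wired roughly as below; the class and parameter names are invented here, and the four modules are injected as plain callables:

```python
# Non-normative sketch of how the four modules of claim 12 could be wired
# together; class and parameter names are invented, modules are plain callables.
class VideoCoverGenerator:
    def __init__(self, frame_sampler, element_detector, cropper, cover_builder):
        self.frame_sampler = frame_sampler          # video frame acquisition module
        self.element_detector = element_detector    # element detection module
        self.cropper = cropper                      # video frame clipping module
        self.cover_builder = cover_builder          # cover generation module

    def generate(self, material_video, m):
        frames = self.frame_sampler(material_video, m)
        crops = [self.cropper(f, self.element_detector(f)) for f in frames]
        return self.cover_builder(crops)
```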
13. A computer device, comprising a processor and a memory, wherein the memory stores computer instructions, and the processor loads and executes the computer instructions from the memory to implement the video cover generation method of any one of claims 1 to 11.
14. A computer readable storage medium having stored therein computer instructions that are loaded and executed by a processor from the storage medium to implement the video cover generation method of any one of claims 1 to 11.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202311705431.3A (CN117641073A) | 2023-12-12 | 2023-12-12 | Video cover generation method, device, equipment and storage medium
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202311705431.3A (CN117641073A) | 2023-12-12 | 2023-12-12 | Video cover generation method, device, equipment and storage medium
Publications (1)
Publication Number | Publication Date |
---|---|
CN117641073A (en) | 2024-03-01
Family
ID=90037511
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311705431.3A (CN117641073A, pending) | Video cover generation method, device, equipment and storage medium | 2023-12-12 | 2023-12-12
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117641073A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118474482A (en) * | 2024-07-12 | 2024-08-09 | 银河互联网电视(浙江)有限公司 | Video cover determining method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |