CN115119071A - Video cover generation method and device, electronic equipment and storage medium

Publication number: CN115119071A
Application number: CN202210657734.1A
Authority: CN (China)
Original language: Chinese (zh)
Inventors: 朱允全, 刘文然, 钟立耿, 文伟
Assignee: Tencent Technology (Shenzhen) Co., Ltd.
Filing/priority date: 2022-06-10
Publication date: 2022-09-27
Legal status: Pending
Prior art keywords: image, video, preset, target, processed

Classifications

    • H04N 21/8549 Creating video summaries, e.g. movie trailer (under H04N 21/854 Content authoring; H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD])
    • H04N 21/8456 Structuring of content by decomposing the content in the time domain, e.g. in time segments (under H04N 21/845 Structuring of content; H04N 21/83 Generation or processing of protective or descriptive data associated with content)


Abstract

The application relates to a video cover generation method and device, an electronic device, and a storage medium. The method comprises the following steps: acquiring a first image of a video to be processed, where the first image is a frame image within a first time period of the video to be processed, the starting time of the first time period is the starting time of the video to be processed, and the duration of the first time period is less than a first duration threshold; performing detection of a first preset target on the first image, and if the first image includes the first preset target, taking the first image as a video cover of the video to be processed; if the first image does not include the first preset target, acquiring a plurality of second images within a second time period of the video to be processed, where the starting time of the second time period is the ending time of the first time period; and sequentially performing detection of a second preset target on the plurality of second images, and taking a second image that includes the second preset target as the video cover. According to the technical scheme of the application, the generation efficiency and the visual attention of the video cover can be improved.

Description

Video cover generation method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of computer application technologies, and in particular, to a method and an apparatus for generating a video cover, an electronic device, and a storage medium.
Background
With the development of video interaction, the selection of video covers has also drawn attention. In the related art, the cover image of a video is selected based on the highlight degree of a video frame or of the corresponding video clip, selected according to the quality of the video frame, or different cover images are presented according to different user interests. However, these methods generally need to detect and compare all frames of the video, and because they rely mostly on low-level information of the video frames, cover generation is inefficient, the resulting cover attracts no effective attention, and such methods cannot cope with cover generation scenarios involving large numbers of videos.
Disclosure of Invention
In view of the above technical problems, the present application provides a video cover generation method, apparatus, electronic device and storage medium.
According to an aspect of the present application, there is provided a video cover generation method, the method including:
acquiring a first image of a video to be processed; the first image is a frame image within a first time period of the video to be processed, the starting time of the first time period is the starting time of the video to be processed, and the duration of the first time period is less than a first duration threshold;
performing detection of a first preset target on the first image, and if the first image includes the first preset target, taking the first image as a video cover of the video to be processed;
if the first image does not include the first preset target, acquiring a plurality of second images within a second time period of the video to be processed; the starting time of the second time period is the ending time of the first time period;
and sequentially performing detection of a second preset target on the plurality of second images, and taking a second image that includes the second preset target as the video cover.
According to another aspect of the present application, there is provided a video cover generation apparatus including:
the first acquisition module is used for acquiring a first image of a video to be processed; the first image is a frame image within a first time period of the video to be processed, the starting time of the first time period is the starting time of the video to be processed, and the duration of the first time period is less than a first duration threshold;
the first detection module is used for performing detection of a first preset target on the first image and, if the first image includes the first preset target, taking the first image as a video cover of the video to be processed;
the second acquisition module is used for acquiring, if the first image does not include the first preset target, a plurality of second images within a second time period of the video to be processed; the starting time of the second time period is the ending time of the first time period;
and the second detection module is used for sequentially performing detection of a second preset target on the plurality of second images and taking a second image that includes the second preset target as the video cover.
According to another aspect of the present application, there is provided an electronic device including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to perform the above method.
According to another aspect of the application, a non-transitory computer-readable storage medium is provided, having computer program instructions stored thereon, wherein the computer program instructions, when executed by a processor, implement the above-described method.
By setting the first time period and the second time period, cover mining can be performed on images from different time periods in sequence, so that the number of images to be mined in each period is controllable and not all frames need to be detected. This improves detection efficiency and hence the generation efficiency of the video cover, allows the method to be applied effectively to cover generation tasks over large numbers of videos with good applicability, and saves processing resources. Moreover, the period-by-period detection order matches the observation that frames near the beginning of the video to be processed generally carry comprehensive video content and a better visual effect; it therefore ensures the generation efficiency of the video cover while also ensuring the visual display effect of the cover and the accuracy with which the cover reflects the content, improving the attention the video cover receives.
Other features and aspects of the present application will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features, and aspects of the application and, together with the description, serve to explain the principles of the application.
Fig. 1 is a schematic diagram illustrating an application system according to an embodiment of the present application.
Fig. 2 shows a flowchart of a video cover generation method according to an embodiment of the present application.
Fig. 3 is a schematic diagram illustrating a detection process of a preset sub-target according to an embodiment of the present application.
Fig. 4 is a schematic diagram illustrating a detection process of a second preset object according to an embodiment of the present application.
Fig. 5 is a schematic diagram illustrating a detection process of text information according to an embodiment of the present application.
Fig. 6 is a flowchart illustrating a video cover generation method according to an embodiment of the present application.
Fig. 7 is a flowchart illustrating a video cover generation method based on preset object clustering according to an embodiment of the present application.
Fig. 8 is a schematic diagram illustrating a process of clustering objects according to an embodiment of the present application.
Fig. 9 shows a block diagram of a video cover generation apparatus according to an embodiment of the present application.
FIG. 10 is a block diagram illustrating an electronic device for video cover generation in accordance with an exemplary embodiment.
FIG. 11 shows a block diagram of another electronic device for video cover generation provided in accordance with an embodiment of the present application.
Detailed Description
Various exemplary embodiments, features and aspects of the present application will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present application. It will be understood by those skilled in the art that the present application may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present application.
Referring to fig. 1, fig. 1 is a schematic diagram illustrating an application system according to an embodiment of the present application. The application system can be used for the video cover generation method of the application. As shown in fig. 1, the application system may include at least a server 01 and a terminal 02.
In this embodiment of the application, the server 01 may be used for generating and processing video covers, and may be the server of a video platform, a video website, or a video application. The server 01 may be an independent physical server, a server cluster or distributed system formed by multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a CDN (Content Delivery Network), and big data and artificial intelligence platforms.
In this embodiment, the terminal 02 may be configured to provide the video to be processed, for example by uploading it to the server 01. The terminal 02 may be a smart phone, desktop computer, tablet computer, notebook computer, smart speaker, digital assistant, Augmented Reality (AR)/Virtual Reality (VR) device, smart wearable device, or another type of physical device, and may also include software running on the physical device, such as an application program. The operating system running on the terminal 02 in this embodiment of the present application may include, but is not limited to, Android, iOS, Linux, Windows, and the like.
In the embodiment of the present disclosure, the terminal 02 and the server 01 may be directly or indirectly connected by a wired or wireless communication method, and the present disclosure is not limited thereto.
It should be noted that, as an example application scenario, the terminal 02 may execute the video cover generation process. Specifically, a client of a video editing application on the terminal 02 may run the generation process to automatically create a cover for the video to be processed, producing a video with a video cover for the video producer. The video producer may then publish the video with its cover to a video platform or the like for sharing and presentation.
In a specific embodiment, when the server 01 is a distributed system, the distributed system may be a blockchain system. In that case, the system may be formed by a plurality of nodes (computing devices of any form in the access network, such as servers and user terminals) that form a Peer-to-Peer (P2P) network between them; the P2P protocol is an application-layer protocol running on top of the Transmission Control Protocol (TCP). In a distributed system, any machine, such as a server or a terminal, can join and become a node, and a node comprises a hardware layer, a middle layer, an operating system layer, and an application layer. Specifically, the functions of each node in the blockchain system may include:
1) Routing: a basic function of a node, used to support communication between nodes.
Besides routing, a node can also have the following function:
2) Application: deployed in the blockchain to implement specific services according to actual service requirements, record data related to those functions to form recorded data, carry a digital signature in the recorded data to indicate the source of the task data, and send the recorded data to other nodes in the blockchain system, so that the other nodes add the recorded data to a temporary block when the source and integrity of the recorded data are verified successfully.
It should be noted that specific implementations of the present application involve user-related data; when the following embodiments are applied to specific products or technologies, user permission or consent needs to be obtained, and the collection, use, and processing of the related data need to comply with the relevant laws, regulations, and standards of the relevant countries and regions.
Fig. 2 shows a flowchart of a video cover generation method according to an embodiment of the present application.
As shown in fig. 2, the method may include:
s201, acquiring a first image of a video to be processed.
In this embodiment of the present specification, the first image may be a frame of image within a first time period of the video to be processed, where the start time of the first time period may be the start time of the video to be processed, and the duration of the first time period may be less than a first duration threshold. As an example, the first duration threshold may be 1 second, which is not limited by this disclosure.
In practical applications, the video to be processed may be a recorded original video, or a preprocessed video obtained by preprocessing the recorded original video. As an example, in the case of a video editing application, the original video may be fed into the application as the video to be processed, so that the method in this specification is executed to automatically generate a video cover for it; the cover may then be inserted as the first frame of the original video to implement automatic editing. In the case of a video platform or a video website, the video to be processed may be either an original video or a preprocessed video, where the preprocessing may include inserting a first frame and so on, which the present disclosure does not limit. On this basis, when an original or preprocessed video is received, it can be taken as the video to be processed for video cover generation.
In an alternative mode, consider that a video producer generally inserts video description content, or other information capable of raising attention to the video, into the video as a first frame, and publishes the video with the inserted first frame to a video platform (or a video website, or the server side of a video application). The video platform may then generate a video cover for the video published to the platform (the video to be processed) and present the video with it. On this basis, the first frame image in the first period may be taken as the first image; that is, the first frame of the video to be processed may be taken as the first image. This improves accurate and efficient mining of the content of the video to be processed, provides a basis for accurate generation of the subsequent video cover, and can improve both generation efficiency and the attention the cover receives.
Alternatively, taking the first frame image in the first period as the first image may include: when it is detected that the video segment within the first period of the video to be processed has been preprocessed, taking the first frame image in the first period as the first image. The condition that the video clip in the first period has been preprocessed is imposed because the first frame of a preprocessed video is generally rich in content and its visual effect can be ensured; with this preprocessing check in place, the probability that the first image ends up as the video cover is higher, which improves the generation efficiency of the video cover and further ensures the effectiveness of detecting the first image.
Accordingly, the method may further comprise: when the video clip within the first period of the video to be processed has not been preprocessed, taking a target frame that meets a preset image condition within the first period as the first image. The preset image condition may be an image quality condition such as sharpness or the greatest variety of objects in the image, and the disclosure is not limited thereto. Checking for preprocessing diversifies how the first image is determined, making the method more flexible and adaptable.
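To make the S201 flow concrete, here is a minimal Python sketch of first-image acquisition. The helpers is_preprocessed (decides whether the first-period segment was preprocessed) and image_quality_score (scores a frame against the preset image condition) are hypothetical, and the OpenCV-based decoding is an assumption, not something prescribed by the application.

```python
import cv2

FIRST_DURATION_THRESHOLD = 1.0  # seconds; the example value given above

def acquire_first_image(video_path, is_preprocessed, image_quality_score):
    """Sketch of S201: pick the first image from the first time period."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    frames = []
    for _ in range(int(FIRST_DURATION_THRESHOLD * fps)):
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
    cap.release()
    if not frames:
        return None
    if is_preprocessed(frames):
        # Preprocessed segment: the inserted first frame is rich and visually safe.
        return frames[0]
    # Not preprocessed: fall back to the frame best meeting the preset image condition.
    return max(frames, key=image_quality_score)
```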
S203, performing detection of a first preset target on the first image, and if the first image includes the first preset target, taking the first image as the video cover of the video to be processed.
in one possible implementation, the first preset goal may include a preset sub-goal. Based on the method, the first image can be subjected to detection processing of the preset sub-target, and if the first image comprises the preset sub-target, the first image is used as a video cover of the video to be processed. The preset sub-target may be a first preset object visually satisfying the interaction preference, and the first preset object may be a face, a pet, a food, or the like having a preset object attribute. The preset object attribute may be a preset gender, a preset age, a preset expression, and the like. Based on this, the face with the preset object attribute may be a face of a woman, a face of a smile, or the like, which are not limited in the present disclosure. For a human face, visually satisfying the interaction preference may refer to face element proportions satisfying a preset proportion; for pets, the interaction preference can be obtained by statistics according to the visual attention of a large number of accounts to the pets; for a gourmet, visually satisfying the interaction preference may mean that the visual effects of both the food and the container holding the food satisfy visual attention. The interaction preference is not limited by the disclosure, and can be obtained according to long-term data statistics or determined according to real-time heat. Through detecting the first preset object meeting the interaction preference visually earlier, the attention of the video cover in the vision can be improved, and therefore the video display effect is improved.
As an example, as shown in fig. 3, the preset sub-target may be detected with a preset target detection model, which may be a Resnet-50 network. The first image may be input into the preset target detection model for preset sub-target detection, and a 2-dimensional vector is output, corresponding to the confidences of the categories [preset sub-target, non-preset sub-target]. If the confidence of "preset sub-target" is greater than that of "non-preset sub-target", the model predicts "preset sub-target"; otherwise it predicts "non-preset sub-target". As shown in fig. 3, two images are input into the preset target detection model: the upper image is predicted as a preset sub-target with confidence 0.999 and as a non-preset sub-target with confidence 0.1, so it can be determined that the upper image includes the preset sub-target; the lower image is predicted as a non-preset sub-target with confidence 1.0 and as a preset sub-target with confidence 0.05, so it can be determined that the lower image does not include the preset sub-target. Detecting preset sub-targets with a model makes it possible to handle target recognition over large numbers of images effectively, improving the generation efficiency of the video cover.
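As a sketch of the two-class detector just described, the following uses torchvision's ResNet-50 with a 2-way head. The checkpoint name, the preprocessing, and the softmax over the two logits are assumptions; the example confidences in fig. 3 do not sum to one, so independent per-class sigmoids would be an equally plausible reading.

```python
import torch
import torch.nn.functional as F
from torchvision import models, transforms

# ResNet-50 with a 2-way head emitting confidences for
# [preset sub-target, non-preset sub-target]; the weights file is hypothetical.
model = models.resnet50()
model.fc = torch.nn.Linear(model.fc.in_features, 2)
model.load_state_dict(torch.load("preset_subtarget_resnet50.pt"))
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

def has_preset_subtarget(pil_image):
    """Return True if the image is predicted to contain the preset sub-target."""
    x = preprocess(pil_image).unsqueeze(0)
    with torch.no_grad():
        conf = F.softmax(model(x), dim=1)[0]  # [preset, non-preset]
    return bool(conf[0] > conf[1])  # predict the more confident category
```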
In an optional manner, the first preset target may further include a second preset object, and the first preset object may belong to the second preset object; that is, the second preset object may be a human face, a pet, food, or the like. Accordingly, as shown in fig. 6, the method may further include: if the first image does not include the preset sub-target, detection of the second preset object may further be performed on the first image, and if the first image includes the second preset object, the first image is taken as the video cover of the video to be processed. If the first image does not include the second preset object, the following step S205 may be performed.
Optionally, the first preset target may further include text information. Accordingly, as shown in fig. 6, the method may further include: if the first image does not include the second preset object, detection of text information may be performed on the first image, and if the first image includes text information, the first image is taken as the video cover of the video to be processed. If the first image does not include text information, the following step S205 may be performed. Alternatively, the detection of text information may be performed without detecting the preset sub-target, which the present disclosure does not limit.
As shown in fig. 6, the detection of the preset sub-target, the detection of the second preset object, and the detection of text information may be performed in sequence. By setting the first preset target to comprise the preset sub-target, the second preset object, and text information, the cover can be mined in order from what draws the most attention to what best reflects the actual video content, which ensures the attention drawn by the video cover, improves its generation efficiency, and effectively represents the video content. As one example, an object detection model may be constructed as shown in fig. 4, using the open-source yolo-v5 as the base algorithm. The first image may be input into the object detection model; if a second preset object such as a human face exists in the first image, the model may output an upper-left corner point (e.g., upper-left coordinates (x1, y1)), a lower-right corner point (e.g., lower-right coordinates (x2, y2)), and a face confidence. If no second preset object exists in the first image, the model returns empty. As an example, the detection of text information may likewise be performed with a text model, which may be constructed using the open-source yolo-v5 as the base algorithm, as shown in fig. 5. The first image may be input into the text model; if text information exists in the first image, the model may output an upper-left corner point (e.g., upper-left coordinates), a lower-right corner point (e.g., lower-right coordinates), and a confidence for the text box. If no text information exists in the first image, the model returns empty.
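The following is a sketch of how the fig. 4 and fig. 5 detectors could be wired up on top of the open-source yolo-v5 named above; the custom weight files face.pt and textbox.pt are hypothetical stand-ins for the trained face and text-box models, while the torch.hub entry point and results.xyxy layout are the real ultralytics/yolov5 interface.

```python
import torch

# Load yolo-v5 with custom weights via torch.hub (open-source entry point).
face_model = torch.hub.load("ultralytics/yolov5", "custom", path="face.pt")
text_model = torch.hub.load("ultralytics/yolov5", "custom", path="textbox.pt")

def detect_box(model, frame):
    """Return ((x1, y1), (x2, y2), confidence) for the best detection,
    or None ("return empty") when no target exists in the frame."""
    det = model(frame).xyxy[0]  # rows of [x1, y1, x2, y2, conf, class]
    if det.shape[0] == 0:
        return None
    x1, y1, x2, y2, conf = det[det[:, 4].argmax()][:5].tolist()
    return (x1, y1), (x2, y2), conf

# Usage: detect_box(face_model, first_image) for the second preset object,
# detect_box(text_model, first_image) for text information.
```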
It should be noted that the above models are merely examples and do not limit the present disclosure. These models may be pre-trained; the present disclosure also does not limit the training process of these models.
S205, if the first image does not comprise the first preset target, acquiring a plurality of second images in a second time period in the video to be processed; the start time of the second period is the end time of the first period.
As shown in fig. 6, when none of the several kinds of first preset targets is detected in the first image, the second period following the first period may be examined. The duration of the second period may be less than a second duration threshold, which may be greater than or equal to the first duration threshold, for example 3 seconds. As an example, the second period may start at the time corresponding to the first frame of the video to be processed and last 3 seconds. In this way, the number of images to be inspected is controlled while the probability of mining a cover is preserved.
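A sketch of gathering the second images under these period settings follows; the 3-second value is the example from the text, while dense per-frame sampling is an assumption (the embodiment only requires that the number of images stay controllable).

```python
import cv2

SECOND_DURATION_THRESHOLD = 3.0  # seconds; example value from the embodiment

def acquire_second_images(video_path, first_period_end_s):
    """Sketch of S205: collect frames from the second period, which starts
    where the first period ends."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    cap.set(cv2.CAP_PROP_POS_FRAMES, int(first_period_end_s * fps))
    frames = []
    for _ in range(int(SECOND_DURATION_THRESHOLD * fps)):
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
    cap.release()
    return frames
```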
S207, sequentially performing detection of a second preset target on the plurality of second images, and taking a second image that includes the second preset target as the video cover.
In this embodiment of the specification, the second preset target may be at least one of the first preset targets, which the present disclosure does not limit. For the specific detection process, reference may be made to the corresponding content above, which is not repeated here.
In one possible implementation, to improve how truthfully the video cover expresses the video content, the second preset target may be set to include text information, so that detection of text information is performed on the plurality of second images in sequence and a second image that includes text information is taken as the video cover.
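Combining the pieces above, the sequential scan of S207 reduces to a short loop; text_detector is assumed to behave like detect_box(text_model, frame) from the earlier sketch.

```python
def pick_cover_from_second_images(second_images, text_detector):
    """Sketch of S207: return the first second image containing text
    information, or None to fall through to the fig. 7 mining of the
    third period."""
    for frame in second_images:
        if text_detector(frame) is not None:
            return frame
    return None
```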
By setting the first time period and the second time period, cover mining can be performed on images from different time periods in sequence, so that the number of images to be mined in each period is controllable and not all frames need to be detected. This improves detection efficiency and hence the generation efficiency of the video cover, allows the method to be applied effectively to cover generation tasks over large numbers of videos with good applicability, and saves processing resources. Moreover, the period-by-period detection order matches the observation that frames near the beginning of the video to be processed generally carry comprehensive video content and a better visual effect; it therefore ensures the generation efficiency of the video cover while also ensuring the visual display effect of the cover and the accuracy with which the cover reflects the content, improving the attention the video cover receives.
Fig. 7 is a flowchart illustrating a video cover generation method based on preset object clustering according to an embodiment of the present application. As shown in fig. 7, the method may further include:
s701, if the plurality of second images do not comprise a second preset target, extracting a plurality of third images from a third time period of the video to be processed; the third period may be a period other than the first period and the second period among the video periods of the video to be processed, such as other frames shown in fig. 6.
And S703, detecting a second preset object for the plurality of third images to obtain an image set comprising the second preset object.
As an example, the detection processing of the second preset object may be performed on the plurality of third images based on the object detection model, so as to obtain an image set including the second preset object. For details, reference may be made to the above-mentioned corresponding detection content, which is not described herein again.
S705, clustering the second preset objects included in the images of the image set to obtain the target object with the most occurrences; the target object is a second preset object with the same object attribute.
In this specification, the second preset objects included in the image set may be clustered in a statistical manner to obtain the target object that occurs most often. Second preset objects having the same object attribute may mean that the similarity between their object features exceeds a similarity threshold (equivalently, that the distance between the features is below a distance threshold). For example, faces of the same account: the shared object attribute is the facial features of that account.
As an example, the images in the image set may be input into Resnet-50 for feature extraction, outputting feature vectors. The images may also pass through a face detection model to obtain face boxes. On this basis, the image region corresponding to each face box can be cropped and input into the object feature extraction model to obtain the extracted features. Similarity between every two extracted features may then be computed using the cosine distance; if the similarity is greater than 0.8, the two features may be judged to belong to the same target object. Further, the number of occurrences of each target object can be counted to obtain the target object that occurs the most.
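The pairwise cosine rule above can be realized as a simple greedy grouping; the 0.8 threshold is from the text, while the greedy assignment order and L2 normalization are assumptions of this sketch.

```python
import numpy as np

def cluster_and_count(features, sim_threshold=0.8):
    """Sketch of S705: group face features whose cosine similarity exceeds
    the threshold, then report the most frequent group."""
    feats = [np.asarray(f) / np.linalg.norm(f) for f in features]
    labels, centers = [], []
    for f in feats:
        for c, center in enumerate(centers):
            if float(np.dot(f, center)) > sim_threshold:
                labels.append(c)             # same target object as cluster c
                break
        else:
            labels.append(len(centers))      # start a new cluster
            centers.append(f)
    counts = np.bincount(np.asarray(labels))
    # Per-feature cluster labels, and the id of the most frequent cluster.
    return labels, int(np.argmax(counts))
```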
Alternatively, as shown in figs. 6 and 8, the images in the image set may be input into the object detection model for detection of the second preset object, yielding a recognition result, for example an object box such as a face box. The image region corresponding to each face box can then be cropped to obtain a plurality of sub-images that include the second preset object. The sub-images are input into an object clustering model, and the second preset objects in the sub-images are clustered, where each resulting cluster contains at least one sub-image and the second preset objects in sub-images of the same cluster have the same object attribute. The target object with the most occurrences can then be obtained from the number of sub-images in each cluster: for example, the target cluster with the largest count may be determined, and the second preset object in its sub-images taken as the target object.
S707, determining a target image in the image set as the video cover, where the area ratio of the target object in the target image is larger than in the other images, the other images being the images in the image set except the target image.
In this embodiment of the specification, the target object with the largest area ratio may be found within the cluster corresponding to the target object that occurs most often, and the frame in which that largest-area target object appears, i.e. the target image, may be determined as the video cover.
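A sketch of the S707 selection: within the most frequent cluster, keep the frame where the target object covers the largest fraction of the image. The parallel images/boxes/labels lists are assumptions about how the earlier detection and clustering results are carried along.

```python
def pick_target_image(images, boxes, labels, target_cluster):
    """Sketch of S707: return the frame whose target-object area ratio is
    largest among frames belonging to the most frequent cluster."""
    best_frame, best_ratio = None, -1.0
    for img, ((x1, y1), (x2, y2)), label in zip(images, boxes, labels):
        if label != target_cluster:
            continue
        h, w = img.shape[:2]
        ratio = (x2 - x1) * (y2 - y1) / float(w * h)  # area ratio of the object box
        if ratio > best_ratio:
            best_frame, best_ratio = img, ratio
    return best_frame
```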
Optionally, as shown in fig. 6, if none of the plurality of third images includes the second preset object, the first frame may be directly used as a video cover.
Fig. 9 shows a block diagram of a video cover generation apparatus according to an embodiment of the present application. As shown in fig. 9, the apparatus may include:
a first obtaining module 901, configured to obtain a first image of a video to be processed; the first image is a frame image within a first time period of the video to be processed, the starting time of the first time period is the starting time of the video to be processed, and the duration of the first time period is less than a first duration threshold;
a first detection module 903, configured to perform detection processing on a first preset target on the first image, and if the first image includes the first preset target, use the first image as a video cover of the video to be processed;
a second obtaining module 905, configured to obtain multiple second images in a second time period in the video to be processed if the first image does not include the first preset target; the starting time of the second time period is the ending time of the first time period;
a second detecting module 907, configured to perform detection processing on a second preset target sequentially on the multiple second images, and use the second image including the second preset target as the video cover.
In a possible implementation manner, the first preset target includes a preset sub-target; the first detection module 903 may include:
the first detection unit is used for detecting and processing a preset sub-target on the first image, and if the first image comprises the preset sub-target, the first image is used as a video cover of the video to be processed; wherein the preset sub-target is a first preset object visually satisfying the interaction preference.
In a possible implementation manner, the first preset target further includes a second preset object, and the first preset object belongs to the second preset object; the first detecting module 903 may further include:
and the second detection unit is used for detecting and processing a second preset object on the first image if the first image does not comprise the preset sub-target, and taking the first image as a video cover of the video to be processed if the first image comprises the second preset object.
In a possible implementation manner, the first preset target may further include text information; correspondingly, the first detecting module 903 may further include:
and the third detection unit is used for detecting and processing the text information of the first image if the first image does not comprise the second preset object, and taking the first image as a video cover of the video to be processed if the first image comprises the text information.
In one possible implementation, the apparatus may further include:
a third image extraction module, configured to extract a plurality of third images from a third time period of the to-be-processed video if none of the plurality of second images includes the second preset target; the third time interval is a time interval except the first time interval and the second time interval in the video time interval of the video to be processed;
a third detection module, configured to perform detection processing on a second preset object on the multiple third images to obtain an image set including the second preset object;
the clustering module is used for clustering second preset objects included in the images in the image set to obtain a target object with the largest occurrence frequency; the target object is a second preset object with the same object attribute;
and the video cover determining module is used for determining a target image in the image set as the video cover, and the area ratio of the target object in the target image is larger than that in other images.
In a possible implementation manner, the second preset target may include text information, and the second detecting module 907 may include:
and the fourth detection unit is used for sequentially carrying out detection processing on the text information on the plurality of second images and taking the second image comprising the text information as the video cover.
In a possible implementation manner, the first obtaining module 901 is further configured to use a first frame image in the first time period as the first image.
In a possible implementation manner, the first obtaining module 901 is further configured to, when it is detected that a video segment in the to-be-processed video within the first time period is preprocessed, take a first frame image in the first time period as the first image;
correspondingly, the first obtaining module 901 is further configured to, when it is detected that a video segment in the to-be-processed video within the first time period is not preprocessed, take a target frame meeting a preset image condition within the first time period as the first image.
In a possible implementation manner, the clustering module may include:
a fifth detection unit, configured to input the images in the image set into an object detection model, and perform detection processing on the second preset object to obtain an identification result;
a cropping unit configured to crop a plurality of sub-images including the second preset object from images in the image set based on the recognition result;
the clustering unit is used for inputting the sub-images into an object clustering model, clustering the second preset objects in the sub-images to obtain clustering clusters, wherein each clustering cluster comprises at least one sub-image, and the second preset objects in the sub-images in the same clustering cluster have the same object attribute;
and the target object obtaining unit is used for obtaining the target object with the largest occurrence frequency based on the number of the sub-images in each cluster.
With regard to the apparatus in the above-described embodiment, the specific manner in which the respective modules and units perform operations has been described in detail in the embodiment related to the method, and will not be elaborated upon here.
Fig. 10 is a block diagram illustrating an electronic device for video cover generation according to an exemplary embodiment; the device may be a terminal, and its internal structure may be as shown in fig. 10. The electronic device comprises a processor, a memory, a network interface, a display screen, and an input device connected through a system bus. The processor of the electronic device provides computing and control capabilities. The memory of the electronic device comprises a nonvolatile storage medium and an internal memory. The nonvolatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and the computer program in the nonvolatile storage medium. The network interface of the electronic device is used for connecting and communicating with an external terminal through a network. The computer program, when executed by the processor, implements a method of video cover generation. The display screen of the electronic device may be a liquid crystal display or an electronic ink display, and the input device may be a touch layer covering the display screen, a key, a trackball, or a touch pad arranged on the housing of the electronic device, or an external keyboard, touch pad, or mouse.
Those skilled in the art will appreciate that the architecture shown in fig. 10 is merely a block diagram of some of the structures associated with the disclosed aspects and does not constitute a limitation on the electronic devices to which the disclosed aspects apply; a particular electronic device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
FIG. 11 shows a block diagram of another electronic device for video cover generation provided in accordance with an embodiment of the present application. The electronic device may be a server, and its internal structure may be as shown in fig. 11. The electronic device includes a processor, a memory, and a network interface connected by a system bus. The processor of the electronic device provides computing and control capabilities. The memory of the electronic device comprises a nonvolatile storage medium and an internal memory. The nonvolatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and the computer program in the nonvolatile storage medium. The network interface of the electronic device is used for connecting and communicating with an external terminal through a network. The computer program, when executed by a processor, implements a method of video cover generation.
Those skilled in the art will appreciate that the architecture shown in fig. 11 is a block diagram of only a portion of the architecture associated with the present application and does not constitute a limitation on the electronic devices to which the present application may be applied; a particular electronic device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In an exemplary embodiment, there is also provided an electronic device including: a processor; a memory for storing the processor-executable instructions; wherein the processor is configured to execute the instructions to implement the video cover generation method as in the embodiments of the present application.
In an exemplary embodiment, there is also provided a storage medium in which instructions, when executed by a processor of an electronic device, enable the electronic device to perform a video cover generation method in an embodiment of the present application.
In an exemplary embodiment, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform the video cover generation method in the embodiments of the present application.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware; the program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other media used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus DRAM (RDRAM), and direct Rambus DRAM (DRDRAM).
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (12)

1. A method for video cover generation, the method comprising:
acquiring a first image of a video to be processed; the first image is a frame image within a first time period of the video to be processed, the starting time of the first time period is the starting time of the video to be processed, and the duration of the first time period is less than a first duration threshold;
detecting a first preset target of the first image, and if the first image comprises the first preset target, using the first image as a video cover of the video to be processed;
if the first image does not comprise the first preset target, acquiring a plurality of second images in a second time period in the video to be processed; the starting time of the second time period is the ending time of the first time period;
and sequentially detecting a second preset target on the plurality of second images, and taking the second image comprising the second preset target as the video cover.
2. The method of claim 1, wherein the first preset target comprises a preset sub-target; the detecting and processing of the first preset target on the first image, and if the first image includes the first preset target, taking the first image as a video cover of the video to be processed, includes:
detecting and processing a preset sub-target on the first image, and if the first image comprises the preset sub-target, taking the first image as a video cover of the video to be processed;
wherein the preset sub-target is a first preset object visually satisfying the interaction preference.
3. The method of claim 2, wherein the first preset target further comprises a second preset object, and the first preset object belongs to the second preset object; the method further comprises the following steps:
and if the first image does not comprise the preset sub-target, detecting and processing a second preset object on the first image, and if the first image comprises the second preset object, taking the first image as a video cover of the video to be processed.
4. The method of claim 3, wherein the first preset target further comprises text information; the method further comprises the following steps:
and if the first image does not comprise the second preset object, performing text information detection processing on the first image, and if the first image comprises the text information, taking the first image as a video cover of the video to be processed.
5. The method according to any one of claims 1-4, further comprising:
if the plurality of second images do not comprise the second preset target, extracting a plurality of third images from a third time period of the video to be processed; the third time interval is a time interval except the first time interval and the second time interval in the video time interval of the video to be processed;
detecting a second preset object on the plurality of third images to obtain an image set comprising the second preset object;
clustering second preset objects included in the images in the image set to obtain a target object with the largest occurrence frequency; the target object is a second preset object with the same object attribute;
and determining a target image in the image set as the video cover, wherein the area ratio of the target object in the target image is larger than that in other images.
6. The method according to claim 1, wherein the second preset target comprises text information, and the sequentially performing detection processing of the second preset target on the plurality of second images, and using the second image including the second preset target as the video cover comprises:
and sequentially detecting the text information of the plurality of second images, and taking the second images comprising the text information as the video covers.
7. The method according to any one of claims 1-4 and 6, wherein the acquiring the first image of the video to be processed comprises:
and taking a first frame image in the first period as the first image.
8. The method according to claim 7, wherein the taking the first frame image in the first period as the first image comprises:
taking a first frame image in the first time interval as the first image under the condition that the video segment in the first time interval in the video to be processed is detected to be preprocessed;
accordingly, the method further comprises:
and under the condition that the video clips in the to-be-processed video within the first time interval are not pre-processed, taking the target frames meeting preset image conditions within the first time interval as the first images.
9. The method according to claim 5, wherein the clustering the second preset object included in the images in the image set to obtain the target object with the largest occurrence number comprises:
inputting the images in the image set into an object detection model, and performing detection processing on the second preset object to obtain a recognition result;
cropping a plurality of sub-images including the second preset object from the images in the image set based on the recognition result;
inputting the sub-images into an object clustering model, and clustering the second preset objects in the sub-images to obtain clustering clusters, wherein each clustering cluster comprises at least one sub-image, and the second preset objects in the sub-images in the same clustering cluster have the same object attribute;
and obtaining the target object with the most occurrence times based on the number of the sub-images in each cluster.
10. A video cover creation device, comprising:
the first acquisition module is used for acquiring a first image of a video to be processed; the first image is a frame image within a first time period of the video to be processed, the starting time of the first time period is the starting time of the video to be processed, and the duration of the first time period is less than a first duration threshold;
the first detection module is used for detecting and processing a first preset target on the first image, and if the first image comprises the first preset target, the first image is used as a video cover of the video to be processed;
a second obtaining module, configured to obtain, if the first image does not include the first preset target, a plurality of second images in a second time period in the video to be processed; the starting time of the second time period is the ending time of the first time period;
and the second detection module is used for sequentially detecting a second preset target for the plurality of second images and taking the second image comprising the second preset target as the video cover.
11. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to execute the executable instructions to implement the method of any of claims 1 to 9.
12. A non-transitory computer readable storage medium having stored thereon computer program instructions, wherein the computer program instructions, when executed by a processor, implement the method of any one of claims 1 to 9.
Priority Applications (1)

Application Number: CN202210657734.1A; Priority/Filing Date: 2022-06-10; Title: Video cover generation method and device, electronic equipment and storage medium

Publications (1)

Publication Number: CN115119071A; Publication Date: 2022-09-27; Status: Pending

Family: ID=83326516; Country: CN


Patent Citations (4)

* Cited by examiner, † Cited by third party

Publication number | Priority date | Publication date | Assignee | Title
CN109257645A * | 2018-09-11 | 2019-01-22 | 传线网络科技(上海)有限公司 | Video cover generation method and device
CN111626075A * | 2019-02-27 | 2020-09-04 | 北京奇虎科技有限公司 | Target identification method and device
CN110119711A * | 2019-05-14 | 2019-08-13 | 北京奇艺世纪科技有限公司 | Method, apparatus and electronic device for obtaining person segments from video data
WO2022048129A1 * | 2020-09-04 | 2022-03-10 | 华为技术有限公司 | Object recognition method, apparatus, and system


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination