CN111581435A - Video cover image generation method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN111581435A
Authority
CN
China
Prior art keywords
video
image
images
training
standard
Prior art date
Legal status
Granted
Application number
CN202010449182.6A
Other languages
Chinese (zh)
Other versions
CN111581435B (en)
Inventor
刘畅
李岩
Current Assignee
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202010449182.6A
Publication of CN111581435A
Application granted
Publication of CN111581435B
Legal status: Active


Classifications

    • G06F 16/7867: Information retrieval of video data; retrieval characterised by using metadata, e.g. manually generated tags, keywords, comments, title and artist information, time, location and usage information, user ratings
    • G06F 16/732: Information retrieval of video data; querying; query formulation
    • G06F 16/75: Information retrieval of video data; clustering; classification
    • G06F 16/783: Information retrieval of video data; retrieval characterised by using metadata automatically derived from the content
    • G06F 18/214: Pattern recognition; analysing; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/24: Pattern recognition; analysing; classification techniques
    • G06N 20/00: Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Library & Information Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure relates to a video cover image generation method and apparatus, an electronic device, and a storage medium, which are used to solve the prior-art problem that taking the first frame of a video as the video cover leads to a low click-through rate for short videos. The method determines at least two frames of candidate images from a target video and determines a screening parameter for each candidate image according to the pixel features of the candidate image and the search keyword corresponding to a video search instruction; it then screens a target image from the at least two frames of candidate images according to the screening parameters and generates the cover of the target video from the target image. Compared with taking the first frame of the video as its cover, the video cover determined by the video cover image generation method of the embodiments of the disclosure is more appealing and better matched to the search keyword, and can attract more users to click and watch the video, thereby improving the click-through rate of the video.

Description

Video cover image generation method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method and an apparatus for generating a video cover image, an electronic device, and a storage medium.
Background
With the rapid progress of modern information transmission technology and the popularization of video shooting equipment such as intelligent terminals, more and more people share their lives by creating videos. Short videos of all kinds have gradually become a main carrier through which people receive information in daily life, and applications for sharing short videos are increasingly common.
Usually, a user can search for keywords of interest in a short video application to obtain related short videos, get a preliminary understanding of a short video's content from its cover, and judge whether the short video is of interest; an appealing cover can therefore greatly increase a short video's click-through rate. At present, however, the cover of a short video is usually the first frame of the video, and the content of the first frame often fails to show the highlights of the video, which results in a low click-through rate for the short video.
Disclosure of Invention
The disclosure provides a video cover image generation method and apparatus, an electronic device, and a storage medium, which are used to solve the prior-art problem that taking the first frame of a video as the short video cover leads to a low click-through rate. The technical scheme of the disclosure is as follows:
according to a first aspect of the embodiments of the present disclosure, there is provided a video cover image generation method, including:
responding to a video searching instruction, and determining a target video corresponding to the video searching instruction;
determining at least two frames of candidate images from the target video, and determining, according to pixel features of the candidate images and a search keyword corresponding to the video search instruction, screening parameters that correspond to the candidate images and indicate their degree of matching with the search keyword;
and screening a target image from the at least two frames of alternative images according to the screening parameters corresponding to the alternative images, and generating a cover of the target video according to the target image.
In a possible implementation manner, the at least two frames of candidate images are images separated by a preset time length in the target video.
In a possible implementation manner, the screening parameters corresponding to the candidate images are determined as follows:
inputting the candidate images and the search keyword into a trained deep learning network, and acquiring the screening parameters corresponding to the candidate images output by the trained deep learning network;
wherein the deep learning network determines the screening parameters corresponding to the candidate images according to the pixel features of the candidate images and the search keyword.
In one possible implementation, the deep learning network is trained as follows:
for a video sample in a video sample training set, inputting at least two training images determined from the video sample and a search keyword corresponding to the video sample into the deep learning network, and acquiring first sample screening parameters corresponding to the training images output by the deep learning network; and
inputting the at least two training images and the search keyword corresponding to the video sample into a trained image binary classifier, and acquiring second sample screening parameters corresponding to the training images output by the trained image binary classifier;
for any two frames of training images, determining a loss value corresponding to the two frames of training images according to the first sample screening parameters and the second sample screening parameters respectively corresponding to the two frames of training images;
and adjusting parameters of the deep learning network according to the determined loss values until the determined loss values are not greater than a preset threshold, to obtain the trained deep learning network.
In a possible implementation manner, the determining the loss value corresponding to the two frames of training images according to the first sample screening parameters and the second sample screening parameters respectively corresponding to the two frames of training images includes:
determining, according to the second sample screening parameters respectively corresponding to the two frames of training images, a pseudo-binary label indicating which of the two second sample screening parameters is larger, and determining the difference between the first sample screening parameters respectively corresponding to the two frames of training images;
and determining the loss value corresponding to the two frames of training images according to the pseudo-binary label and the difference.
In one possible implementation, the video sample training set includes standard video samples and non-standard video samples;
the video sample training set is obtained as follows:
determining videos corresponding to a plurality of preset search keywords;
ranking the videos corresponding to each preset search keyword by click-through rate, taking the top-ranked N videos as standard video samples, and taking the bottom-ranked M videos as non-standard video samples;
wherein N and M are positive integers.
In one possible implementation, the image binary classifier is trained as follows:
selecting some or all of the standard images from a standard image set, and selecting some or all of the non-standard images from a non-standard image set; wherein the standard image set consists of the cover images of the standard video samples and the non-standard image set consists of the cover images of the non-standard video samples;
and taking the selected standard images, the selected non-standard images, the search keyword corresponding to the video sample to which each image belongs, and the classification label corresponding to each image as the input of the image binary classifier, taking a first probability value that the classification label of each image is the standard label and a second probability value that the classification label of each image is the non-standard label as the output of the image binary classifier, and training the image binary classifier.
In a possible implementation manner, the acquiring the second sample screening parameter corresponding to the training image output by the trained image binary classifier includes:
taking the first probability value, output by the image binary classifier, that the classification label of the training image is the standard label as the second sample screening parameter corresponding to the training image.
In one possible implementation, the target image is screened from the at least two frames of candidate images according to the following manner:
and taking the candidate image with the largest screening parameter in the at least two frames of candidate images as the target image.
According to a second aspect of the embodiments of the present disclosure, there is provided a video-cover-image generating apparatus including:
the searching unit is configured to execute the steps of responding to a video searching instruction, and determining a target video corresponding to the video searching instruction;
the determining unit is configured to determine at least two frames of candidate images from the target video, and determine, according to pixel features of the candidate images and a search keyword corresponding to the video search instruction, screening parameters that correspond to the candidate images and indicate their degree of matching with the search keyword;
and the generating unit is configured to screen a target image from the at least two frames of candidate images according to the screening parameters corresponding to the candidate images, and generate a cover of the target video according to the target image.
In a possible implementation manner, the at least two frames of candidate images are images separated by a preset time length in the target video.
In a possible implementation manner, the determining unit is specifically configured to:
input the candidate images and the search keyword into a trained deep learning network, and acquire the screening parameters corresponding to the candidate images output by the trained deep learning network;
wherein the deep learning network determines the screening parameters corresponding to the candidate images according to the pixel features of the candidate images and the search keyword.
In a possible implementation, the determining unit is further configured to train the deep learning network as follows:
for a video sample in a video sample training set, inputting at least two training images determined from the video sample and a search keyword corresponding to the video sample into the deep learning network, and acquiring first sample screening parameters corresponding to the training images output by the deep learning network; and
inputting the at least two training images and the search keyword corresponding to the video sample into a trained image binary classifier, and acquiring second sample screening parameters corresponding to the training images output by the trained image binary classifier;
for any two frames of training images, determining a loss value corresponding to the two frames of training images according to the first sample screening parameters and the second sample screening parameters respectively corresponding to the two frames of training images;
and adjusting parameters of the deep learning network according to the determined loss values until the determined loss values are not greater than a preset threshold, to obtain the trained deep learning network.
In a possible implementation, the determining unit is specifically configured to perform:
determining, according to the second sample screening parameters respectively corresponding to the two frames of training images, a pseudo-binary label indicating which of the two second sample screening parameters is larger, and determining the difference between the first sample screening parameters respectively corresponding to the two frames of training images;
and determining the loss value corresponding to the two frames of training images according to the pseudo-binary label and the difference.
In one possible implementation, the video sample training set includes standard video samples and non-standard video samples;
the determination unit is further configured to perform the obtaining of the training set of video samples according to:
determining videos corresponding to a plurality of preset search keywords;
sequencing videos corresponding to each preset search keyword according to the click rate, taking N videos in the front of the sequence as standard video samples, and taking M videos in the back of the sequence as non-standard video samples;
wherein N, M is a positive integer.
In a possible implementation, the determining unit is further configured to train the image binary classifier as follows:
selecting some or all of the standard images from a standard image set, and selecting some or all of the non-standard images from a non-standard image set; wherein the standard image set consists of the cover images of the standard video samples and the non-standard image set consists of the cover images of the non-standard video samples;
and taking the selected standard images, the selected non-standard images, the search keyword corresponding to the video sample to which each image belongs, and the classification label corresponding to each image as the input of the image binary classifier, taking a first probability value that the classification label of each image is the standard label and a second probability value that the classification label of each image is the non-standard label as the output of the image binary classifier, and training the image binary classifier.
In a possible implementation manner, the determining unit is specifically configured to perform:
taking the first probability value, output by the image binary classifier, that the classification label of the training image is the standard label as the second sample screening parameter corresponding to the training image.
In one possible implementation, the generating unit is specifically configured to perform:
and taking the candidate image with the largest screening parameter in the at least two frames of candidate images as the target image.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including: a memory for storing executable instructions;
a processor configured to read and execute the executable instructions stored in the memory to implement the method for generating a video cover image according to any one of the first aspect of the embodiments of the present disclosure.
According to a fourth aspect of the embodiments of the present disclosure, there is provided a non-volatile storage medium storing instructions that, when executed by a processor of a video cover image generation apparatus, enable the apparatus to perform the video cover image generation method described in the first aspect of the embodiments of the present disclosure.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
the method for generating the video cover image comprises the steps of determining a target video corresponding to a video search instruction, determining at least two frames of alternative images from the target video, determining a screening parameter of the matching degree of each frame of alternative images and a keyword according to the pixel characteristics of each frame of alternative images and the keyword corresponding to the video search instruction, and selecting the target image according to the screening parameter to generate the cover of the target video. The screening parameters are determined according to the pixel characteristics of the alternative images and the search keywords corresponding to the search instructions, so that the highlight degree of the alternative images and the matching degree with the search keywords can be determined through the cover image generation method provided by the embodiment of the invention, the highlight degree of the video cover image generated according to the target image determined by the screening parameters of the alternative images is higher, the matching degree with the search keywords is higher, more users can be attracted to click the video for watching, and the click rate of the video is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 is a schematic flow diagram illustrating a method for generating a video cover image in accordance with one exemplary embodiment;
FIG. 2 is a schematic diagram illustrating an application scenario in accordance with an illustrative embodiment;
FIG. 3 is an overall flow diagram illustrating a method of generating a video cover image in accordance with one illustrative embodiment;
FIG. 4 is an overall flow diagram illustrating a deep learning network training method in accordance with an exemplary embodiment;
FIG. 5 is a block diagram illustrating a video cover image generation apparatus according to one exemplary embodiment;
FIG. 6 is a block diagram illustrating an electronic device in accordance with an example embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
Hereinafter, some terms in the embodiments of the present disclosure are explained to facilitate understanding by those skilled in the art.
(1) The term "plurality" in the embodiments of the present disclosure means two or more, and other terms are similar thereto.
(2) The term "electronic device" in the embodiments of the present disclosure may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, or the like.
(3) The term "short video" in the embodiments of the present disclosure refers to video content, lasting from a few seconds to a few minutes, that is pushed frequently on various new-media platforms and is suitable for viewing on the move or during short breaks. Such content covers topics such as skill sharing, humour, fashion trends, social hotspots, street interviews, public education, advertising creativity, and business customization. Because the content is short, it can stand alone as a single clip or form part of a series.
(4) The term "click-through rate" (CTR) in the embodiments of the present disclosure refers to the ratio of the number of times a piece of content on a web page is clicked to the number of times it is displayed, i.e. clicks divided by impressions, expressed as a percentage. It reflects how much interest a piece of content on a web page attracts and is often used to measure the attractiveness of an advertisement.
(5) The term "loss function" in the embodiments of the present disclosure refers to a function that maps the value of a random event or its associated random variable to a non-negative real number to represent the "risk" or "loss" of the random event. In application, the loss function is usually associated with the optimization problem as a learning criterion, i.e. the model is solved and evaluated by minimizing the loss function.
The click-through rate is an important index for measuring the effectiveness of internet advertising. Generally, after a user enters keywords in a search engine and performs a search, related advertisements are listed in order and the user clicks on the advertisements of interest. Taking the number of times an advertisement is shown in search results as the total, the ratio of the number of times users click through to the website to that total is called the click-through rate.
For example, if an advertisement is shown in search results 1000 times and clicked 10 times by users, then its click-through rate is 1%.
The click-through rate can also be used to measure the popularity of a short video. A user enters a search keyword in a short video application; for example, entering the keyword "funny" retrieves related funny short videos, from which the user clicks into a short video of interest. Taking the number of times the short video is retrieved for the keyword as the total, the ratio of the number of times users click into the short video to that total is the click-through rate of the short video.
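As a plain-arithmetic illustration of this definition (not part of the claimed method; the function name is illustrative only), the click-through rate can be computed as follows:

```python
def click_through_rate(click_count: int, impression_count: int) -> float:
    """Click-through rate: clicks divided by impressions, as a percentage."""
    if impression_count == 0:
        return 0.0
    return 100.0 * click_count / impression_count

# The example from the text: shown 1000 times in search results, clicked 10 times.
print(click_through_rate(10, 1000))  # 1.0 (percent)
```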
When choosing a short video of interest, a user generally gets a preliminary impression of its content from its cover; the more appealing the cover of the short video, the higher its click-through rate. However, many existing short video applications automatically take the first frame of a short video as its cover when the user uploads it, and the first frame often fails to show the highlights of the video, which can result in a low click-through rate.

The embodiments of the present disclosure provide a video cover image generation method for generating an appealing cover for a video, thereby improving the click-through rate of the video.
To make the objects, technical solutions and advantages of the present disclosure clearer, the present disclosure will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present disclosure, rather than all embodiments. All other embodiments, which can be derived by one of ordinary skill in the art from the embodiments disclosed herein without making any creative effort, shall fall within the scope of protection of the present disclosure.
Embodiments of the present disclosure are described in further detail below.
FIG. 1 is a flow diagram illustrating a method for generating a video cover image according to an exemplary embodiment, as shown in FIG. 1, including the steps of:
in step S11, in response to the video search instruction, determining a target video corresponding to the video search instruction;
in step S12, at least two frames of candidate images are determined from the target video, and screening parameters that correspond to the candidate images and indicate their degree of matching with the search keyword are determined according to the pixel features of the candidate images and the search keyword corresponding to the video search instruction;
in step S13, a target image is screened from the at least two frames of candidate images according to the screening parameters corresponding to the candidate images, and a cover of the target video is generated according to the target image.
As can be seen from the above, in the video cover image generation method disclosed by the embodiments of the present disclosure, a target video corresponding to a video search instruction is determined, at least two frames of candidate images are determined from the target video, a screening parameter indicating the degree of matching between each candidate image and the search keyword is determined according to the pixel features of that candidate image and the keyword corresponding to the video search instruction, and a target image is selected according to the screening parameters to generate the cover of the target video. Because the screening parameters are determined from the pixel features of the candidate images and the search keyword corresponding to the search instruction, the method can assess both how appealing each candidate image is and how well it matches the search keyword. The video cover generated from the target image determined by these screening parameters is therefore more appealing and better matched to the search keyword, which can attract more users to click and watch the video and thus improves the click-through rate of the video.
It should be noted that, in the embodiments of the present disclosure, the screening parameter indicates the degree of matching between a candidate image and the search keyword: the larger the screening parameter of a candidate image, the more likely the video is to obtain a high click-through rate after its cover is generated from that image.
An optional application scenario is shown in fig. 2. A short video application is installed on a terminal device 21. When a user 20 enters a search keyword on the search page of the short video application and confirms the search, a video search instruction is triggered, and the terminal device 21 responds to the instruction to obtain the target short videos corresponding to it. In an optional implementation manner, the terminal device 21 responds to the video search instruction by sending a request to a server 22, and the server 22 returns the target videos corresponding to the video search instruction to the terminal device 21. When returning a target video, the server 22 also determines configuration information of the target video, such as its title, cover, and author information, and returns this configuration information to the terminal device 21. When determining the cover of any target video, the server determines at least two frames of candidate images from that target video, determines, according to the pixel features of the candidate images and the keyword corresponding to the video search instruction, screening parameters that correspond to the candidate images and indicate their degree of matching with the search keyword, and screens a target image from the at least two frames of candidate images according to those screening parameters. The server 22 sends the target image to the short video application on the terminal device 21, and the short video application generates the cover of the target video from the target image and recommends it to the user 20.
It should be noted that applying the video cover image generation method to a video search scenario is only an example and does not limit the scope of the embodiments of the present disclosure. The video cover image generation method disclosed by the embodiments of the present disclosure can also be applied when a video is uploaded and published after recording is finished, or when a video has been online for some time but its click-through rate is low, or in any other application scenario in which a cover needs to be generated for a video; the embodiments of the present disclosure are not particularly limited in this respect.
For a target video for which a cover image needs to be generated, at least two frames of candidate images are determined from the target video, and one frame is selected from the determined candidate images as the cover of the target video.

Specifically, the candidate images may be determined by taking frames spaced a preset time interval apart in the target video as the at least two frames of candidate images.
Because adjacent frames of a playing video are essentially the same, selecting one frame every preset time interval, as in the method for determining candidate images provided by the embodiments of the present disclosure, simplifies image processing while still essentially covering the complete target video.

For example, the preset time interval may be 1 second, and frames taken at 1-second intervals in the target video serve as the candidate images.

The above method for determining the candidate images is only an example; the candidate images may also be at least two frames selected arbitrarily from the target video, which is not limited by the present disclosure.
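A minimal sketch of this frame-sampling step, assuming OpenCV is used for decoding and the 1-second interval from the example above; the patent does not prescribe any particular library, so the function below is purely illustrative:

```python
import cv2  # OpenCV, assumed available for frame extraction


def sample_candidate_images(video_path: str, interval_seconds: float = 1.0):
    """Take one frame every `interval_seconds` from the target video as candidate images."""
    capture = cv2.VideoCapture(video_path)
    fps = capture.get(cv2.CAP_PROP_FPS) or 25.0  # fall back to 25 fps if metadata is missing
    step = max(1, int(round(fps * interval_seconds)))
    candidates = []
    frame_index = 0
    while True:
        success, frame = capture.read()
        if not success:
            break
        if frame_index % step == 0:
            candidates.append(frame)  # BGR image array for this candidate frame
        frame_index += 1
    capture.release()
    return candidates
```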
After the candidate images are determined, the screening parameters corresponding to the candidate images are determined. A screening parameter represents the degree of matching with the search keyword and is determined in part from the pixel features of the candidate image: an image with good pixel features tends to be well composed and to show its subjects well, so the larger the screening parameter, the better the pixel features and the more appealing the image.
The embodiment of the disclosure provides a method for determining screening parameters corresponding to alternative images, which determines the screening parameters corresponding to the alternative images through a trained deep learning network.
An optional implementation manner is that the candidate images are input into the trained deep learning network, and the screening parameters corresponding to the candidate images output by the trained deep learning network are obtained.
The deep learning network determines the screening parameters corresponding to the candidate images according to the pixel characteristics of the candidate images and the search keywords.
An efficient deep learning network is trained, and the screening parameters corresponding to the candidate images are determined by the trained deep learning network. This is efficient to execute and allows multiple candidate images to be processed at the same time; and because the deep learning network is trained on a large number of training samples, the screening parameters it determines for the candidate images are more accurate.
Specifically, the trained deep learning network can extract the pixel features of the candidate images; for example, the deep learning network may be a trained ResNet-18. The deep learning network determines the screening parameters corresponding to the candidate images according to the pixel features of the candidate images and the search keyword.

The pixel features of a candidate image may specifically be the pixel values of the pixels in the candidate image.

Suppose 50 frames of candidate images are determined from the target video. The 50 candidate images are input into the deep learning network, which determines the screening parameter corresponding to each candidate image according to the pixel features of the 50 candidate images and the search keyword corresponding to the video search instruction; the target image is then determined according to the screening parameters corresponding to the 50 candidate images.
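The disclosure does not fix a network architecture beyond mentioning a trained ResNet-18 as a possible pixel-feature extractor, so the following PyTorch sketch shows only one plausible way to fuse the pixel features with a search-keyword embedding into a single screening parameter per candidate image; the class name ScreeningNet, the keyword-embedding scheme, and all layer sizes are assumptions, not details from the patent.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18


class ScreeningNet(nn.Module):
    """Fuses image pixel features with a search-keyword embedding and outputs
    one screening parameter in [0, 1] per candidate image (illustrative design)."""

    def __init__(self, keyword_vocab_size: int, keyword_dim: int = 128):
        super().__init__()
        backbone = resnet18(weights=None)   # pixel-feature extractor, e.g. ResNet-18 (torchvision >= 0.13 API)
        backbone.fc = nn.Identity()         # keep the 512-d pooled feature
        self.backbone = backbone
        self.keyword_embedding = nn.Embedding(keyword_vocab_size, keyword_dim)
        self.head = nn.Sequential(
            nn.Linear(512 + keyword_dim, 256), nn.ReLU(),
            nn.Linear(256, 1), nn.Sigmoid(),  # screening parameter in [0, 1]
        )

    def forward(self, images: torch.Tensor, keyword_ids: torch.Tensor) -> torch.Tensor:
        img_feat = self.backbone(images)                 # (B, 512) pixel features
        kw_feat = self.keyword_embedding(keyword_ids)    # (B, keyword_dim) keyword features
        fused = torch.cat([img_feat, kw_feat], dim=1)
        return self.head(fused).squeeze(1)               # (B,) screening parameters
```

With such a model, the 50 candidate frames in the example above could be stacked into one batch and scored in a single forward pass.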
An optional implementation manner is that the candidate image with the largest screening parameter in the at least two frame candidate images is taken as the target image.
In the embodiments of the present disclosure, the screening parameter represents the degree of matching between a candidate image and the search keyword, and the candidate image with the largest screening parameter matches the search keyword best. Since the at least two frames of candidate images are determined from the same target video, the candidate image with the largest screening parameter is the most appealing frame of the target video and the one that best matches the search keyword. Taking this candidate image as the target image and using it as the cover of the target video improves both the appeal of the cover and its match with the search keyword, attracting more users to click and watch the video and thereby improving the click-through rate of the video.
For example, if the largest screening parameter among the determined 50 candidate images is 0.89, that candidate image is determined as the target image and the cover of the target video is generated from it;

or, among the determined 50 candidate images, the candidate images whose screening parameters are larger than a preset threshold are determined as target images. In this case there may be multiple target images, which are recommended to the user so that the user selects one of them as the cover of the target video. For example, with a preset threshold of 0.7, the 5 candidate images among the 50 whose screening parameters are larger than 0.7 are recommended to the user as target images, and the user selects one of the 5 as the cover of the target video. When multiple target images are recommended to the user, they may be displayed in descending order of their screening parameters.
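Continuing the illustrative sketch above, the two selection strategies described here (taking the single highest-scoring frame, or taking all frames above a threshold sorted from high to low) reduce to a few lines; the function name and signature are assumptions:

```python
def select_target_images(candidates, scores, threshold=None):
    """Return either the single best candidate, or all candidates whose
    screening parameter exceeds `threshold`, sorted from high to low."""
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    if threshold is None:
        return [ranked[0][1]]                                       # e.g. the 0.89 frame in the example
    return [img for score, img in ranked if score > threshold]      # e.g. threshold = 0.7
```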
FIG. 3 is a complete flow diagram illustrating a method for generating a video cover image according to an exemplary embodiment, as shown in FIG. 3, including the steps of:
step S31, responding to the video searching instruction, and determining a target video corresponding to the video searching instruction;
step S32, determining at least two frame alternative images from the target video;
step S33, inputting the alternative images into the trained deep learning network, and obtaining screening parameters corresponding to the alternative images output by the trained deep learning network;
step S34, selecting a target image from at least two frames of alternative images according to the screening parameters corresponding to the alternative images;
in step S35, a cover of the target video is generated from the target image.
The following describes a training method of a deep learning network in the embodiment of the present disclosure.
In an alternative implementation, the embodiment of the present disclosure trains the deep learning network according to the following ways:
For a video sample in a video sample training set, at least two training images determined from the video sample and the search keyword corresponding to the video sample are input into the deep learning network, and the first sample screening parameters corresponding to the training images output by the deep learning network are acquired. The at least two training images are also input into the trained image binary classifier, and the second sample screening parameters corresponding to the training images output by the trained image binary classifier are acquired. For any two frames of training images, a loss value corresponding to the two frames is determined according to the first sample screening parameters and the second sample screening parameters respectively corresponding to them. The parameters of the deep learning network are adjusted according to the determined loss values until the determined loss values are not greater than a preset threshold, yielding the trained deep learning network.
In the training method for the deep learning network provided by the embodiments of the present disclosure, the loss value corresponding to a pair of training images is determined from the first sample screening parameters output by the deep learning network and the second sample screening parameters output by the image binary classifier, and the deep learning network is trained according to this loss value. Because training is based on the second sample screening parameters output by the image binary classifier, the training images do not need to be labelled manually, which speeds up training. Moreover, the image binary classifier has been trained on a large number of samples and can therefore output accurate second sample screening parameters for the training images. As a result, the deep learning network trained by the method provided by the embodiments of the present disclosure can accurately determine the screening parameters of candidate images acquired from the same target video, and executes efficiently.
Specifically, the training images used when training the deep learning network are determined from the same video sample, and multiple video samples can be collected so that the deep learning network is trained over multiple groups. During training, the training images are input into the deep learning network to obtain the first sample screening parameters corresponding to the training images, and into the trained image binary classifier to obtain the second sample screening parameters corresponding to the training images. For any two frames of training images, a loss value is determined from the first and second sample screening parameters respectively corresponding to the two frames; since there are at least two training images, multiple loss values can be determined from the same video sample. The parameters of the deep learning network are adjusted according to the determined loss values until training is complete.
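The disclosure describes the loss only qualitatively: a pseudo-binary label recording which of two frames the image binary classifier scores higher, combined with the difference of the deep learning network's own scores for those frames. The sketch below shows one pairwise ranking step consistent with that description; the exact loss form (binary cross-entropy on the score difference) is an assumption, and `classifier` stands in for the already-trained image binary classifier.

```python
import itertools

import torch
import torch.nn.functional as F


def pairwise_training_step(model, classifier, optimizer, frames, keyword_ids):
    """One training step on the training images of a single video sample.
    `model` is the deep learning network being trained; `classifier` is the
    already-trained image binary classifier used as a teacher (assumed interface)."""
    scores = model(frames, keyword_ids)                      # first sample screening parameters, shape (B,)
    with torch.no_grad():
        # second sample screening parameters: probability of the "standard" label;
        # column 0 = P(standard) is an assumption about the classifier's output layout
        teacher = classifier(frames, keyword_ids)[:, 0]

    losses = []
    for i, j in itertools.combinations(range(len(frames)), 2):
        label = (teacher[i] > teacher[j]).float()            # pseudo-binary label: which frame the classifier rates higher
        diff = scores[i] - scores[j]                         # difference of the first sample screening parameters
        # one plausible pairwise loss consistent with the description:
        # binary cross-entropy on the sigmoid of the score difference against the pseudo-binary label
        losses.append(F.binary_cross_entropy_with_logits(diff, label))

    loss = torch.stack(losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```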
In implementation, the training process of the deep learning network is divided into two parts: first the image binary classifier is trained, and then the deep learning network is trained. Before training the deep learning network, a video sample training set for training it needs to be collected.
In the embodiments of the present disclosure, the video sample training set includes standard video samples and non-standard video samples;

the video sample training set is obtained as follows:

determining videos corresponding to a plurality of preset search keywords;

ranking the videos corresponding to each preset search keyword by click-through rate, taking the top-ranked N videos as standard video samples, and taking the bottom-ranked M videos as non-standard video samples;

wherein N and M are positive integers.
In the embodiments of the present disclosure, whether a video is a standard video sample or a non-standard video sample is determined from its click-through rate. Since the click-through rate is determined by how many times the video is shown in search results and how many times users click to watch it, it measures how appealing the video's cover is and therefore has reference value. The click-through rate of a video is also easy to obtain, so standard and non-standard video samples can be selected quickly and accurately according to it.
Statistics on short video click-through rates show that the cover images of videos with high click-through rates are generally more appealing. Video samples with high click-through rates can therefore be taken as standard video samples, whose covers form the standard image set; video samples with low click-through rates are taken as non-standard video samples, whose covers form the non-standard image set.
Specifically, when determining the standard and non-standard video samples, whether a video is a standard or non-standard video sample can be decided by its click-through rate: a plurality of preset search keywords are determined, the videos retrieved by each preset search keyword are ranked by click-through rate, the top-ranked N videos are taken as standard video samples, and the bottom-ranked M videos are taken as non-standard video samples.
For example, select the 1000 search keywords with the most searches in a month; for each search keyword, take the 1000 videos with the highest click-through rates as standard video samples and the 1000 videos with the lowest click-through rates as non-standard video samples. This yields 100,000 standard video samples, whose cover images form a standard image set of 100,000 standard images, and 100,000 non-standard video samples, whose cover images form a non-standard image set of 100,000 non-standard images; the 100,000 standard video samples and the 100,000 non-standard video samples together constitute the video sample training set.
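A minimal sketch of this sampling scheme, assuming per-video click-through rates are already available; the input format and function name are illustrative only:

```python
def build_video_sample_training_set(videos_by_keyword, n_standard=1000, m_nonstandard=1000):
    """For each preset search keyword, rank its videos by click-through rate and
    take the top N as standard samples and the bottom M as non-standard samples.
    `videos_by_keyword` maps keyword -> list of (video_id, ctr) pairs (assumed input format)."""
    standard_samples, nonstandard_samples = [], []
    for keyword, videos in videos_by_keyword.items():
        ranked = sorted(videos, key=lambda item: item[1], reverse=True)  # highest CTR first
        standard_samples += [(keyword, vid) for vid, _ in ranked[:n_standard]]
        nonstandard_samples += [(keyword, vid) for vid, _ in ranked[-m_nonstandard:]]
    return standard_samples, nonstandard_samples
```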
After the video sample training set, the standard image set, and the non-standard image set have been determined, the two parts, training the image binary classifier and training the deep learning network, are introduced in turn below.

First, training the image binary classifier
In an optional implementation, the image binary classifier is trained as follows:

some or all of the standard images are selected from the standard image set, and some or all of the non-standard images are selected from the non-standard image set; the selected standard images, the selected non-standard images, the search keyword corresponding to the video sample to which each image belongs, and the classification label corresponding to each image are taken as the input of the image binary classifier, a first probability value that the classification label of each image is the standard label and a second probability value that the classification label of each image is the non-standard label are taken as the output of the image binary classifier, and the image binary classifier is trained.
In the embodiments of the present disclosure, the image binary classifier is trained on standard and non-standard images, and the resulting classifier can judge both the category of an image and its degree of matching with the keyword. The trained image binary classifier is used to determine the second sample screening parameters corresponding to the training images, which in turn are used to determine the loss values during training of the deep learning network. In effect, the second sample screening parameter output by the trained image binary classifier can be regarded as the true screening parameter of the image: during training of the deep learning network, the loss value is determined from this true screening parameter and the predicted screening parameter output by the deep learning network, so that the predicted screening parameters determined by the deep learning network move closer to the true screening parameters, achieving the purpose of the training.
Specifically, the image binary classifier can determine which category an input image belongs to and its degree of matching with the keyword. The image binary classifier trained in the embodiments of the present disclosure can judge whether an image belongs to the "standard" label or the "non-standard" label and how well the image matches the search keyword corresponding to the video sample to which it belongs. It should be noted that "standard" in the embodiments of the present disclosure means that the image is highly appealing; after an image whose classification label is the standard label is used as a video cover, the video will often obtain a higher click-through rate.
It should be noted that the standard image may also be referred to as a positive sample image, and the classification label corresponding to the positive sample image is a positive label; the non-standard image may also be referred to as a negative exemplar image, and the classification label corresponding to the negative exemplar image is a negative label.
During training, some or all of the standard images are selected from the standard image set and some or all of the non-standard images are selected from the non-standard image set, and the selected standard and non-standard images, the search keyword corresponding to the video sample to which each image belongs, and the classification label corresponding to each image are input into the image binary classifier. For each image, the image binary classifier outputs a first probability value that its classification label is the standard label and a second probability value that its classification label is the non-standard label; these two probability values are then compared with the labelled classification label supplied as input, the parameters of the image binary classifier are adjusted, and the process iterates until training of the image binary classifier is finished.
In implementation, the training process of the image binary classifier in the embodiments of the present disclosure is described using an example in which 5 standard images are selected from the standard image set and 5 non-standard images are selected from the non-standard image set.
For example, images a, b, c, d and e are standard images, each labelled with the standard label; the search keyword corresponding to the video sample to which images a and b belong is "funny", and the search keyword corresponding to the video sample to which images c, d and e belong is "pet". Images f, g, h, i and j are non-standard images, each labelled with the non-standard label; the search keyword corresponding to the video sample to which images f and g belong is "funny", and the search keyword corresponding to the video sample to which images h, i and j belong is "food". The labelled images are input into the image binary classifier, which outputs two probability values for each image, representing its predicted classification of the input image.
Assume that the output of the image binary classifier is as shown in Table 1:
Image    First probability value    Second probability value
a        0.4                        0.8
b        0.8                        0.2
c        0.6                        0.2
d        0.3                        0.1
e        0.5                        0.3
f        0.1                        0.6
g        0.8                        0.5
h        0.4                        0.9
i        0.8                        0.7
j        0.2                        0.3

Table 1
Taking image a as an example: the image binary classifier outputs a first probability value of 0.4 that the classification label of image a is the standard label and a second probability value of 0.8 that it is the non-standard label, which means the classifier considers the probability that image a is a standard image to be 0.4 and the probability that it is a non-standard image to be 0.8. Since image a is labelled as a standard image, those probabilities should be 1 and 0 respectively, so the image binary classifier has classified image a incorrectly. The classifier determines the first and second probability values partly according to how well image a matches the search keyword "funny".
For the data shown in Table 1, the image binary classifier can be considered to have classified images b, c, d, e, f, h, i and j correctly and images a and g incorrectly, and the image binary classifier continues to be trained with the next group of standard and non-standard images.
It should be noted that the above is only an example, and the criterion for judging whether the classification result of the image binary classifier is correct is not unique. For example, for a standard image, the classifier may be considered correct only if the first probability value is greater than a preset threshold and the second probability value is smaller than a preset threshold. Suppose that, for standard images, the classification result is considered correct when the first probability value is greater than 0.7 and the second probability value is less than 0.3; then standard image d, with a first probability value of 0.3 and a second probability value of 0.1, does not satisfy this condition and the image binary classifier is considered to have identified it incorrectly.
In addition, training of the image binary classifier is not complete only when the first and second probability values it outputs for every image are exactly 0 or 1; in practice the output of a trained image binary classifier may still have some error. For example, if a standard image is input into the trained image binary classifier and the output first probability value is 0.9 and the output second probability value is 0.1, this does not mean the image binary classifier has classified it incorrectly.
In the embodiment of the present disclosure, the training process of the image two classifier described above is only an example and does not limit the scope of the present disclosure. Optionally, the embodiment of the present disclosure may choose to train an image two classifier that uses softmax as the objective function and Inception-v3 as the network structure.
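As a concrete illustration, the following is a minimal sketch of what such an image two classifier could look like. The patent only names softmax and Inception-v3; the keyword-embedding branch, the dimensions, and all module and parameter names below are assumptions made for illustration, not the disclosed implementation.

```python
# Illustrative sketch only: the patent names softmax and Inception-v3 for the image
# two classifier; the keyword branch, dimensions and module names are assumptions.
import torch
import torch.nn as nn
from torchvision import models

class ImageTwoClassifier(nn.Module):
    def __init__(self, vocab_size=10000, keyword_dim=128):
        super().__init__()
        backbone = models.inception_v3(weights=None, aux_logits=False)
        backbone.fc = nn.Identity()                      # keep the 2048-d pooled image feature
        self.backbone = backbone
        self.keyword_emb = nn.EmbeddingBag(vocab_size, keyword_dim)  # assumed keyword encoder
        self.head = nn.Linear(2048 + keyword_dim, 2)     # two logits: standard / non-standard

    def forward(self, images, keyword_ids):
        img_feat = self.backbone(images)                 # images expected as (B, 3, 299, 299)
        kw_feat = self.keyword_emb(keyword_ids)          # keyword_ids as (B, L) token ids
        logits = self.head(torch.cat([img_feat, kw_feat], dim=1))
        # softmax yields the "first probability value" (standard) and the
        # "second probability value" (non-standard) used in the examples above
        return torch.softmax(logits, dim=1)
```

Note that a softmax head makes the two probability values sum to 1, whereas the illustrative numbers in Table 1 are independent of each other; reproducing that behaviour would require, for example, two sigmoid outputs, so this detail is likewise an assumption. In practice such a model would typically be trained with a cross-entropy objective on the two logits, which corresponds to the softmax objective named above.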
Second, training deep learning network
According to the video cover image generation method provided by the embodiment of the present disclosure, among the at least two frames of alternative images determined from the same video, the alternative image with high wonderfulness and a high degree of matching with the search keyword is used as the cover image. Therefore, when the deep learning network is trained, the training images are likewise determined from the same video, and the training is performed based on ranking logic. As a result, the trained deep learning network can determine, from a piece of video, the video frame with high wonderfulness and a high degree of matching with the search keyword, and that frame is used as the cover of the video, thereby improving the click rate of the video.
In implementation, for each video sample in the video sample training set, at least two training images determined from the video sample and the search keyword corresponding to the video sample are input into the deep learning network, and the first sample screening parameters corresponding to the training images output by the deep learning network are obtained. The first sample screening parameter is a numerical value determined by the deep learning network and used for expressing the degree of matching with the search keyword.
For example, for any video sample V_n, suppose the search keyword corresponding to V_n is "fun". A training image V_n^(i) determined from the video sample V_n is input into the deep learning network together with the search keyword "fun". Assuming the first sample screening parameter output by the deep learning network for V_n^(i) is 0.6, this indicates that the deep learning network considers the combined degree of wonderfulness of V_n^(i) and of its matching with the search keyword "fun" to be 60%.
And inputting the determined at least two training images and the search keywords corresponding to the video samples into a trained image two-classifier, and acquiring second sample screening parameters corresponding to the training images output by the trained image two-classifier. The second sample screening parameter is a numerical value determined by the image second classifier and used for representing the matching degree with the search keyword.
In an optional implementation manner, the first probability value output by the image two classifier, namely the probability that the classification label of the training image is the standard label, is used as the second sample screening parameter corresponding to the training image.
In the embodiment of the disclosure, the first probability value that the image two classifier outputs for a training image is used as the second sample screening parameter corresponding to that training image. The second sample screening parameter determined by the trained image two classifier can be regarded as the real screening parameter corresponding to the image, so the real screening parameter of a training image does not need to be determined manually; this is efficient, accurate, and accelerates the training process of the deep learning network.
Specifically, the image two classifier outputs a first probability value and a second probability value corresponding to the training image. The first probability value, namely the probability that the classification label of the training image is the standard label, is determined according to the degree of matching between the training image and the search keyword, so the first probability value output for the training image can be selected as the second sample screening parameter expressing the degree of matching between the training image and the search keyword.
For example, a training image V_n^(i) determined from the video sample V_n and the search keyword "fun" corresponding to V_n are input into the trained image two classifier. Assuming the first probability value output by the image two classifier for V_n^(i) is 0.8 and the second probability value is 0.3, the first probability value 0.8 is taken as the second sample screening parameter corresponding to the training image.
In essence, the embodiment of the present disclosure uses the second sample screening parameter corresponding to the training image determined by the image two classifier as the real screening parameter corresponding to the training image, and uses the output of the image two classifier as the ideal output of the deep learning network.
After the first sample screening parameter and the second sample screening parameter corresponding to the training images are determined, the loss values corresponding to any two frames of training images need to be determined according to the first sample screening parameter and the second sample screening parameter corresponding to any two frames of training images, and the deep learning network parameters need to be adjusted according to the determined loss values.
According to the second sample screening parameters respectively corresponding to the two frames of training images, determining pseudo-binary labels for representing the sizes of the second sample screening parameters respectively corresponding to the two frames of training images, and determining the difference value of the first sample screening parameters respectively corresponding to the two frames of training images; and determining the loss values corresponding to the two training images according to the pseudo-binary label and the difference value.
In the embodiment of the disclosure, the pseudo-binary labels are determined according to the second sample screening parameters that the image two classifier outputs for the training images, so no labels need to be manually annotated on the training images and the execution efficiency is high. Moreover, because the pseudo-binary label represents the size ordering of the second sample screening parameters respectively corresponding to the two frames of training images, the label itself carries the sorting logic, and the loss value determined from the pseudo-binary label and the difference of the first sample screening parameters respectively corresponding to the two frames of training images is likewise a loss value determined based on the sorting logic.
Specifically, in the embodiment of the present disclosure, the pseudo-binarization label corresponding to any two frames of training images represents the size ordering of their second sample screening parameters. In the ideal state, the size ordering of the first sample screening parameters output by the deep learning network for the two frames of training images should be consistent with the size ordering of the second sample screening parameters. A loss value is therefore determined according to the pseudo-binarization label and the difference of the first sample screening parameters, and the parameters of the deep learning network are adjusted according to the loss value, so that the trained deep learning network has the ordering logic.
In implementation, the pseudo-binary label is determined as shown in Formula 1:

Z_n^(i,j) = 1 if f(V_n^(i)) > f(V_n^(j)), and Z_n^(i,j) = -1 otherwise        (Formula 1)

where Z_n^(i,j) represents, for any video sample V_n, the pseudo-binary label of the i-th frame training image V_n^(i) and the j-th frame training image V_n^(j); f(V_n^(i)) represents the second sample screening parameter corresponding to the training image V_n^(i); and f(V_n^(j)) represents the second sample screening parameter corresponding to the training image V_n^(j). That is, when f(V_n^(i)) is greater than f(V_n^(j)), the pseudo-binary label Z_n^(i,j) has a value of 1; when f(V_n^(i)) is less than or equal to f(V_n^(j)), the pseudo-binary label Z_n^(i,j) has a value of -1.
For example, for any two frames of training images V_n^(1) and V_n^(2), if the second sample screening parameter f(V_n^(1)) corresponding to the training image V_n^(1) is 0.8 and the second sample screening parameter f(V_n^(2)) corresponding to the training image V_n^(2) is 0.6, then the pseudo-binary label Z_n^(1,2) corresponding to the two frames is 1.
Suppose the two frames of training images V_n^(1) and V_n^(2), together with the search keyword corresponding to the video sample, are input into the deep learning network, and the resulting first sample screening parameters are S(V_n^(1)) = 0.7 and S(V_n^(2)) = 0.9. The size ordering of the first sample screening parameters output by the deep learning network for the two frames is then inconsistent with the size ordering of the second sample screening parameters, and the deep learning network needs to be adjusted.
The embodiment of the present disclosure provides a loss function for determining the loss value, as shown in Formula 2:

rank(i,j) = max(0, η - Z_n^(i,j) · (S(V_n^(i)) - S(V_n^(j))))        (Formula 2)

where rank(i,j) represents, for any video sample V_n, the loss value corresponding to the i-th frame training image V_n^(i) and the j-th frame training image V_n^(j); S(V_n^(i)) and S(V_n^(j)) represent the first sample screening parameters corresponding to V_n^(i) and V_n^(j); and η is a fixed value, which can be an empirical value set by a person skilled in the art and is generally 0. It should be noted that η should not be too large, otherwise it would overwhelm the positive or negative sign of the difference of the first sample screening parameters respectively corresponding to the two frames of training images and thereby distort the loss value.
In the embodiment of the disclosure, when the loss value is 0, it indicates that the size ordering of the first sample screening parameter corresponding to the training image determined by the deep learning network is consistent with the size ordering of the second sample screening parameter, and the parameter of the deep learning network is not required to be adjusted.
For example, the second sample screening parameter f(V_n^(1)) corresponding to the training image V_n^(1) is 0.8 and the second sample screening parameter f(V_n^(2)) corresponding to the training image V_n^(2) is 0.6, so for the two frames of training images V_n^(1) and V_n^(2) the corresponding pseudo-binary label Z_n^(1,2) is 1. Suppose the first sample screening parameter S(V_n^(1)) corresponding to the training image V_n^(1) is 0.7 and the first sample screening parameter S(V_n^(2)) corresponding to the training image V_n^(2) is 0.9. Assuming that η is equal to 0, rank(1, 2) is equal to 0.2, which indicates that the size ordering of the first sample screening parameters determined by the deep learning network for the training images is inconsistent with the size ordering of the second sample screening parameters, and the parameters of the deep learning network need to be adjusted;
assume again that the first sample screening parameter S(V_n^(1)) corresponding to the training image V_n^(1) is 0.9 and the first sample screening parameter S(V_n^(2)) corresponding to the training image V_n^(2) is 0.7. Assuming that η is 0, rank(1, 2) is 0, which means that the size ordering of the first sample screening parameters determined by the deep learning network for the training images is consistent with the size ordering of the second sample screening parameters, and the parameters of the deep learning network are not adjusted.
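Putting the two formulas together, the following is a minimal sketch, assuming a PyTorch implementation, of how the pseudo-binary labels and the pair-wise loss could be computed for the n training images of one video sample; the function name and tensor shapes are illustrative rather than part of the disclosure.

```python
# Sketch of Formulas 1 and 2: pseudo-binary labels from the classifier scores and a
# hinge-style ranking loss on the deep learning network scores (names assumed).
import itertools
import torch

def pairwise_ranking_loss(first_params, second_params, eta=0.0):
    """first_params:  S(V_n^(i)) output by the deep learning network, shape (n,)
       second_params: f(V_n^(i)) output by the trained image two classifier, shape (n,)"""
    losses = []
    for i, j in itertools.combinations(range(first_params.shape[0]), 2):  # the C(n,2) pairs
        z = 1.0 if second_params[i] > second_params[j] else -1.0          # Formula 1
        diff = first_params[i] - first_params[j]
        losses.append(torch.clamp(eta - z * diff, min=0.0))               # Formula 2
    return torch.stack(losses).mean()

# Worked example from the text: f = (0.8, 0.6) gives z = 1; S = (0.7, 0.9) gives rank(1,2) = 0.2.
print(pairwise_ranking_loss(torch.tensor([0.7, 0.9]), torch.tensor([0.8, 0.6])))  # tensor(0.2000)
```

Averaging over the pairs is only a convenience for batching; the patent itself only requires that the loss determined for each pair eventually be no greater than the preset threshold.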
In the training process, if n training images are extracted from the same video sample, the number of loss values that can be determined is C(n,2); from the m video samples, m·C(n,2) loss values can therefore be determined. The parameters of the deep learning network are adjusted according to the determined loss values until the loss values are not greater than a preset threshold, and the trained deep learning network is obtained.
It should be noted that, in the embodiment of the present disclosure, the preset threshold may be 0. Within an allowable error range, if most of the determined loss values are not greater than the preset threshold, the deep learning network training may be considered complete; for example, if 99% of the loss values determined from the training images are not greater than the preset threshold, the deep learning network training is considered complete.
Obviously, after the deep learning network is trained according to the training method provided by the embodiment of the present disclosure, the trained deep learning network can recognize the input alternative images and obtain the screening parameters corresponding to the alternative images. Because the deep learning network is trained based on ranking logic, the screening parameters obtained for alternative images from the same target video are also determined based on ranking logic. Therefore, for the same target video, the alternative image with the largest screening parameter obtained by the trained deep learning network is the video frame with the highest wonderfulness among all video frames of the target video and the highest degree of matching with the search keyword corresponding to the video.
FIG. 4 is a flowchart illustrating an overall deep learning network training method according to an exemplary embodiment. As shown in FIG. 4, the method includes the following steps:
in step S41, acquiring a video sample training set, forming a standard image set according to covers of standard video samples in the video sample training set, and forming a non-standard image set according to covers of non-standard video samples in the video sample training set;
in step S42, selecting some or all of the standard images from the standard image set and some or all of the non-standard images from the non-standard image set, and training an image two classifier;
in step S43, inputting at least two training images determined from the same video sample and search keywords corresponding to the video sample into a deep learning network, and obtaining first sample screening parameters corresponding to the training images output by the deep learning network;
in step S44, inputting the at least two training images and the search keywords corresponding to the video samples into the trained image two-classifier, and obtaining second sample screening parameters corresponding to the training images output by the trained image two-classifier;
in step S45, for any two frames of training images, determining loss values corresponding to the two frames of training images according to the first sample screening parameter and the second sample screening parameter respectively corresponding to the two frames of training images;
in step S46, parameters of the deep learning network are adjusted according to the determined loss value until the determined loss value is not greater than a preset threshold value, so as to obtain a trained deep learning network.
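Read as code, steps S43 to S46 amount to an ordinary training loop. The sketch below is one possible reading of FIG. 4 under stated assumptions: scoring_network stands for the deep learning network, sample_training_images for the frame-sampling step, and pairwise_ranking_loss for the helper sketched above; none of these names or interfaces come from the patent.

```python
# Assumed outer loop for steps S43-S46 of FIG. 4; all helper names are illustrative.
import torch

def train_deep_learning_network(scoring_network, image_two_classifier, video_samples,
                                optimizer, eta=0.0, threshold=0.0, max_epochs=10):
    image_two_classifier.eval()                           # already trained in step S42
    for _ in range(max_epochs):
        all_below_threshold = True
        for video, keyword_ids in video_samples:
            images = sample_training_images(video)        # S43: at least two frames per sample
            first = scoring_network(images, keyword_ids).squeeze(-1)      # S43: first sample params
            with torch.no_grad():
                second = image_two_classifier(images, keyword_ids)[:, 0]  # S44: first probability
            loss = pairwise_ranking_loss(first, second, eta)              # S45: pair-wise loss
            optimizer.zero_grad()
            loss.backward()                               # S46: adjust the network parameters
            optimizer.step()
            if loss.item() > threshold:
                all_below_threshold = False
        if all_below_threshold:
            break                                         # S46: losses no greater than threshold
    return scoring_network
```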
It should be noted that, in the video cover image generation method provided in the embodiment of the present disclosure, the screening parameters corresponding to the candidate images are determined according to the pixel features of the candidate images and the search keyword corresponding to the video search instruction. A candidate image with a larger screening parameter has higher wonderfulness and a higher degree of matching with the search keyword, and is therefore more suitable to be used as the cover of the target video.
In addition, the video cover image generation method provided in the embodiment of the present disclosure may also be used to recommend a more wonderful cover for a video. For example, when the user opens a short video application, a plurality of short videos may be displayed on the home page of the application. In such a scenario the user does not trigger a video search instruction, but decides which video to click and watch by browsing the covers of the recommended videos; if the covers of the short videos displayed on the home page are not wonderful, the click rate of the short videos may be low.
In this scenario, for any target video, at least two frames of alternative images are determined from the target video, where the target video need not be a video searched by the user through a search instruction. The alternative images are input into the trained deep learning network, and the screening parameters corresponding to the alternative images output by the trained deep learning network are obtained; these screening parameters are determined by the deep learning network according to the pixel features of the alternative images. A target image is then selected from the at least two frames of alternative images according to the screening parameters corresponding to the alternative images, and the target image is used as the cover of the target video.
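As a sketch, this no-keyword path reduces to scoring each alternative image and taking the one with the largest screening parameter; the network is assumed here to have been trained without the keyword input (as discussed further below), and the function and parameter names are illustrative.

```python
# Assumed cover-recommendation path when no video search instruction is triggered.
import torch

def recommend_cover(candidate_frames, scoring_network):
    """candidate_frames: tensor of shape (k, 3, H, W); scoring_network maps frames to one
       screening parameter per frame, based only on pixel features (no search keyword)."""
    with torch.no_grad():
        params = scoring_network(candidate_frames).squeeze(-1)   # screening parameter per frame
    best = int(torch.argmax(params))                             # largest screening parameter
    return candidate_frames[best]                                # used as the cover of the video
```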
In the method, the trained deep learning network is trained in the following way:
firstly, acquiring a video sample training set comprising a standard video sample and a non-standard video sample, and determining videos corresponding to a plurality of preset search keywords when the video sample training set is acquired; sequencing videos corresponding to each preset search keyword according to the click rate, taking N videos in the front of the sequence as standard video samples, and taking M videos in the back of the sequence as non-standard video samples; wherein N, M is a positive integer.
And forming a standard image set according to cover images of the standard video samples in the video sample set, and forming a non-standard image set according to cover images of the non-standard video samples in the video sample set.
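A minimal sketch of this construction follows, assuming the per-keyword video lists with click-rate statistics and cover images are available as plain dictionaries (the field names are illustrative):

```python
# Assumed construction of the standard / non-standard image sets from click-rate rankings.
def build_image_sets(videos_by_keyword, n, m):
    standard_images, nonstandard_images = [], []
    for keyword, videos in videos_by_keyword.items():
        ranked = sorted(videos, key=lambda v: v["click_rate"], reverse=True)
        for video in ranked[:n]:                                    # top-N: standard video samples
            standard_images.append((video["cover"], keyword, 1))    # label 1: standard label
        for video in ranked[-m:]:                                   # bottom-M: non-standard samples
            nonstandard_images.append((video["cover"], keyword, 0)) # label 0: non-standard label
    return standard_images, nonstandard_images
```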
Next, an image two classifier is trained.
Selecting a part or all of standard images from a standard image set, selecting a part or all of non-standard images from a non-standard image set, taking the selected standard images, the selected non-standard images and classification labels corresponding to the images as the input of a second image classifier, and taking a first probability value that the classification label of each image is the standard label and a second probability value that the classification label of each image is the non-standard label as the output of the second image classifier to train the second image classifier.
And after the image two classifier is trained, training the deep learning network according to the image two classifier.
One optional implementation manner is that, for a video sample in a video sample training set, at least two frames of training images determined from the video sample are input into a deep learning network, and first sample screening parameters corresponding to the training images output by the deep learning network are obtained; and
inputting the at least two frames of training images into a trained image two classifier, and acquiring second sample screening parameters corresponding to the training images output by the trained image two classifier;
the first probability value output by the image two classifier, namely the probability that the classification label of the training image is the standard label, is used as the second sample screening parameter corresponding to the training image;
aiming at any two frames of training images, determining loss values corresponding to the two frames of training images according to a first sample screening parameter and a second sample screening parameter respectively corresponding to the two frames of training images;
according to second sample screening parameters respectively corresponding to the two frames of training images, determining a pseudo-binary label for representing the size of the second sample screening parameters respectively corresponding to the two frames of training images, and determining the difference value of the first sample screening parameters respectively corresponding to the two frames of training images; and determining the loss values corresponding to the two training images according to the pseudo-binarization label and the difference value.
And adjusting parameters of the deep learning network according to the determined loss value until the determined loss value is not greater than a preset threshold value to obtain the trained deep learning network.
In the embodiment of the disclosure, statistics on short video click rates show that the cover images of videos with higher click rates are usually more wonderful, so the cover of a video with a high click rate can be used as a standard image and the cover of a video with a low click rate can be used as a non-standard image. In the process of training the deep learning network, because the pixel features of standard images differ from those of non-standard images, the deep learning network trained with standard and non-standard images has the ability to determine the wonderfulness of an alternative image, and the obtained screening parameter represents the wonderfulness of the alternative image. Therefore, even without a search keyword, screening parameters identifying the wonderfulness of the alternative images can be determined through the trained deep learning network, and the alternative image with the largest screening parameter is used as the cover of the target video; such a cover is highly wonderful, which can improve the click rate of the video.
It should be noted that the deep learning network and its training method involved in the video cover image generation method for the scenario where the user does not trigger a video search instruction may be the same as the deep learning network and training method involved in the video cover image generation method for the video search scenario provided in the embodiment of the present disclosure, except that the sample data used during training of the deep learning network may not include the search keywords corresponding to the video samples; this is not described in detail again in the embodiment of the present disclosure.
The embodiment of the disclosure also provides a video cover image generation device. Since the device corresponds to the video cover image generation method in the embodiment of the disclosure and solves the problem on a similar principle, the implementation of the device can refer to the implementation of the method, and repeated details are not repeated.
FIG. 5 is a block diagram illustrating a video cover image generation apparatus according to one exemplary embodiment. Referring to fig. 5, the apparatus includes a search unit 500, a determination unit 501, and a generation unit 502.
The searching unit 500 is configured to execute the steps of responding to a video searching instruction, and determining a target video corresponding to the video searching instruction;
a determining unit 501, configured to determine at least two frames of candidate images from the target video, and determine, according to pixel features of the candidate images and a search keyword corresponding to the video search instruction, a screening parameter corresponding to the candidate images and used for indicating a degree of matching with the search keyword;
a generating unit 502 configured to screen a target image from the at least two frames of candidate images according to the screening parameters corresponding to the candidate images, and generate a cover of the target video according to the target image.
In a possible implementation manner, the at least two frames of candidate images are images separated by a preset time length in the target video.
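For illustration, one common way to obtain frames separated by a preset time length is to step through the video at a fixed interval. The sketch below uses OpenCV; the two-second interval is an assumed value, not one specified by the disclosure.

```python
# Assumed frame sampling at a preset time interval using OpenCV.
import cv2

def extract_candidate_frames(video_path, interval_seconds=2.0):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0            # fall back if the FPS is not reported
    step = max(1, int(round(fps * interval_seconds)))
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:                           # keep one frame every interval_seconds
            frames.append(frame)
        index += 1
    cap.release()
    return frames
```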
In a possible implementation manner, the determining unit 501 is specifically configured to:
inputting the alternative images and the search keywords into a trained deep learning network, and acquiring screening parameters corresponding to the alternative images output by the trained deep learning network;
and determining the screening parameters corresponding to the candidate images according to the pixel characteristics of the candidate images and the search keywords by the deep learning network.
In a possible implementation, the determining unit 501 is further configured to perform training of the deep learning network according to the following manner:
aiming at video samples in a video sample training set, inputting at least two training images determined from the video samples and search keywords corresponding to the video samples into a deep learning network, and acquiring first sample screening parameters corresponding to the training images output by the deep learning network; and
inputting the at least two training images and the search keywords corresponding to the video samples into a trained image two classifier, and acquiring second sample screening parameters corresponding to the training images output by the trained image two classifier;
aiming at any two frames of training images, determining loss values corresponding to the two frames of training images according to a first sample screening parameter and a second sample screening parameter respectively corresponding to the two frames of training images;
and adjusting parameters of the deep learning network according to the determined loss value until the determined loss value is not greater than a preset threshold value to obtain the trained deep learning network.
In a possible implementation, the determining unit 501 is specifically configured to perform:
according to second sample screening parameters respectively corresponding to the two frames of training images, determining a pseudo-binary label for representing the size of the second sample screening parameters respectively corresponding to the two frames of training images, and determining a difference value of first sample screening parameters respectively corresponding to the two frames of training images;
and determining the loss value corresponding to the two frames of training images according to the pseudo-binarization label and the difference value.
In one possible implementation, the video sample training set includes standard video samples and non-standard video samples;
the determination unit is further configured to perform the obtaining of the training set of video samples according to:
determining videos corresponding to a plurality of preset search keywords;
sequencing videos corresponding to each preset search keyword according to the click rate, taking N videos in the front of the sequence as standard video samples, and taking M videos in the back of the sequence as non-standard video samples;
wherein N, M is a positive integer.
In one possible implementation, the determining unit 501 is further configured to perform training of the image two classifier according to:
selecting part or all of the standard images from the standard image set, and selecting part or all of the non-standard images from the non-standard image set; wherein the standard image set consists of cover images of the standard video sample; the non-standard image set is composed of cover images of the non-standard video sample;
and taking the selected standard image, the selected non-standard image, the search keyword corresponding to the video sample to which each image belongs and the classification label corresponding to each image as the input of the image second classifier, taking a first probability value that the classification label of each image is the standard label and a second probability value that the classification label of each image is the non-standard label as the output of the image second classifier, and training the image second classifier.
In a possible implementation manner, the determining unit 501 is specifically configured to perform:
and taking the classification label of the training image output by the image two classifier as a first probability value of a standard label as a second sample screening parameter corresponding to the training image.
In one possible implementation, the generating unit 502 is specifically configured to perform:
and taking the candidate image with the largest screening parameter in the at least two frames of candidate images as the target image.
With regard to the apparatus in the above-described embodiment, the specific manner in which each unit executes the request has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 6 is a block diagram illustrating an electronic device 600 according to an example embodiment, the electronic device including:
a processor 610;
a memory 620 for storing instructions executable by the processor 610;
wherein the processor 610 is configured to execute the instructions to implement the video cover image generation method in the embodiments of the present disclosure.
In an exemplary embodiment, a non-volatile storage medium including instructions, such as the memory 620 including instructions, that are executable by the processor 610 of the electronic device 600 to perform the above-described method is also provided. Alternatively, the storage medium may be a non-transitory computer readable storage medium, which may be, for example, a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
The embodiment of the present disclosure further provides a computer program product which, when run on an electronic device, causes the electronic device to execute any one of the video cover image generation methods or video cover recommendation methods described in the embodiments of the present disclosure.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A method for generating a video cover image, the method comprising:
responding to a video searching instruction, and determining a target video corresponding to the video searching instruction;
determining at least two frames of alternative images from the target video, and determining screening parameters corresponding to the alternative images and used for expressing the matching degree with the search keywords according to the pixel characteristics of the alternative images and the search keywords corresponding to the video search instruction;
and screening a target image from the at least two frames of alternative images according to the screening parameters corresponding to the alternative images, and generating a cover of the target video according to the target image.
2. The method of claim 1, wherein the screening parameters corresponding to the candidate images are determined according to the following:
inputting the alternative images and the search keywords into a trained deep learning network, and acquiring screening parameters corresponding to the alternative images output by the trained deep learning network;
and determining the screening parameters corresponding to the candidate images according to the pixel characteristics of the candidate images and the search keywords by the deep learning network.
3. The method of claim 2, wherein the deep learning network is trained according to the following:
aiming at video samples in a video sample training set, inputting at least two training images determined from the video samples and search keywords corresponding to the video samples into a deep learning network, and acquiring first sample screening parameters corresponding to the training images output by the deep learning network; and
inputting the at least two training images and the search keywords corresponding to the video samples into a trained image two classifier, and acquiring second sample screening parameters corresponding to the training images output by the trained image two classifier;
aiming at any two frames of training images, determining loss values corresponding to the two frames of training images according to a first sample screening parameter and a second sample screening parameter respectively corresponding to the two frames of training images;
and adjusting parameters of the deep learning network according to the determined loss value until the determined loss value is not greater than a preset threshold value to obtain the trained deep learning network.
4. The method of claim 3, wherein the determining the loss values corresponding to the two frames of training images according to the first sample screening parameter and the second sample screening parameter corresponding to the two frames of training images respectively comprises:
according to second sample screening parameters respectively corresponding to the two frames of training images, determining a pseudo-binary label for representing the size of the second sample screening parameters respectively corresponding to the two frames of training images, and determining a difference value of first sample screening parameters respectively corresponding to the two frames of training images;
and determining the loss value corresponding to the two frames of training images according to the pseudo-binarization label and the difference value.
5. The method of claim 3, wherein the training set of video samples includes standard video samples and non-standard video samples;
obtaining the video sample training set according to the following modes:
determining videos corresponding to a plurality of preset search keywords;
sequencing videos corresponding to each preset search keyword according to the click rate, taking N videos in the front of the sequence as standard video samples, and taking M videos in the back of the sequence as non-standard video samples;
wherein N, M is a positive integer.
6. The method of claim 5, wherein the image two classifier is trained according to:
selecting part or all of the standard images from the standard image set, and selecting part or all of the non-standard images from the non-standard image set; wherein the standard image set consists of cover images of the standard video sample; the non-standard image set is composed of cover images of the non-standard video sample;
and taking the selected standard image, the selected non-standard image, the search keyword corresponding to the video sample to which each image belongs and the classification label corresponding to each image as the input of the image second classifier, taking a first probability value that the classification label of each image is the standard label and a second probability value that the classification label of each image is the non-standard label as the output of the image second classifier, and training the image second classifier.
7. The method of claim 3, wherein the obtaining of the second sample screening parameter corresponding to the training image output by the trained image two classifier comprises:
and taking the classification label of the training image output by the image two classifier as a first probability value of a standard label as a second sample screening parameter corresponding to the training image.
8. A video cover image generation apparatus, comprising:
the searching unit is configured to execute the steps of responding to a video searching instruction, and determining a target video corresponding to the video searching instruction;
the determining unit is configured to determine at least two frames of alternative images from the target video, and determine a screening parameter which is corresponding to the alternative images and used for expressing the matching degree with the search keyword according to the pixel characteristics of the alternative images and the search keyword corresponding to the video search instruction;
and the generating unit is configured to screen a target image from the at least two frames of candidate images according to the screening parameters corresponding to the candidate images, and generate a cover of the target video according to the target image.
9. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the video cover image generation method of any one of claims 1 to 7.
10. A storage medium, wherein instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the video cover image generation method of any one of claims 1 to 7.
CN202010449182.6A 2020-05-25 2020-05-25 Video cover image generation method and device, electronic equipment and storage medium Active CN111581435B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010449182.6A CN111581435B (en) 2020-05-25 2020-05-25 Video cover image generation method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111581435A true CN111581435A (en) 2020-08-25
CN111581435B CN111581435B (en) 2023-12-01



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150371426A1 (en) * 2014-06-20 2015-12-24 Joshua Levy Motion covers
CN104244024A (en) * 2014-09-26 2014-12-24 北京金山安全软件有限公司 Video cover generation method and device and terminal
CN106503693A (en) * 2016-11-28 2017-03-15 北京字节跳动科技有限公司 The offer method and device of video front cover
CN108600781A (en) * 2018-05-21 2018-09-28 腾讯科技(深圳)有限公司 A kind of method and server of the generation of video cover
CN108833938A (en) * 2018-06-20 2018-11-16 上海连尚网络科技有限公司 Method and apparatus for selecting video cover

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112487242A (en) * 2020-11-27 2021-03-12 百度在线网络技术(北京)有限公司 Method and device for identifying video, electronic equipment and readable storage medium
CN113239233A (en) * 2021-05-31 2021-08-10 平安科技(深圳)有限公司 Personalized image recommendation method, device, equipment and storage medium
CN113239233B (en) * 2021-05-31 2024-06-25 平安科技(深圳)有限公司 Personalized image recommendation method, device, equipment and storage medium
CN113361376A (en) * 2021-06-02 2021-09-07 北京三快在线科技有限公司 Method and device for acquiring video cover, computer equipment and readable storage medium
CN113361376B (en) * 2021-06-02 2023-01-17 北京三快在线科技有限公司 Method and device for acquiring video cover, computer equipment and readable storage medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant