CN108650524B - Video cover generation method and device, computer equipment and storage medium - Google Patents


Info

Publication number: CN108650524B
Authority: CN (China)
Prior art keywords: image, video, memorability, frame
Legal status: Active
Application number: CN201810504021.5A
Other languages: Chinese (zh)
Other versions: CN108650524A (en)
Inventors: 费梦娟, 高永强, 谯睿智, 戴宇荣, 沈小勇
Current Assignee: Tencent Technology (Shenzhen) Co., Ltd.
Original Assignee: Tencent Technology (Shenzhen) Co., Ltd.
Application filed by Tencent Technology (Shenzhen) Co., Ltd.
Priority to CN201810504021.5A
Publication of CN108650524A
Application granted
Publication of CN108650524B

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/25Management operations performed by the server for facilitating the content distribution or administrating data related to end-users or client devices, e.g. end-user or client device authentication, learning user preferences for recommending movies
    • H04N21/258Client or end-user data management, e.g. managing client capabilities, user preferences or demographics, processing of multiple end-users preferences to derive collaborative data
    • H04N21/25866Management of end-user data
    • H04N21/25875Management of end-user data involving end-user authentication
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/478Supplemental services, e.g. displaying phone caller identification, shopping application
    • H04N21/4788Supplemental services, e.g. displaying phone caller identification, shopping application communicating with other users, e.g. chatting

Abstract

The application discloses a video cover generation method and apparatus, a computer device, and a storage medium. The method includes: acquiring a plurality of frames of images in a video; for each frame of image, determining a memorability score of the image according to image features of the image that reflect the depth of impression the image leaves, where the memorability score reflects the user's degree of interest in the image; selecting, from the plurality of frames of images, at least one target image for generating a video cover based on the memorability scores of the plurality of frames of images; and generating a video cover of the video based on the at least one target image. The scheme of the application helps improve the attractiveness of the video cover to users.

Description

Video cover generation method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of video processing technologies, and in particular, to a video cover generation method and apparatus, a computer device, and a storage medium.
Background
With the continuous development of Internet technology, more and more users prefer to publish videos to network platforms (such as social platforms or video publishing platforms) so as to share them with other users on the platform.
Before publishing a video uploaded by a user, the network platform selects one frame of image from the video as the video cover of the video, and then publishes the video with that cover. As a visual summary of the video's content, the importance of the video cover (also referred to as the cover icon of the video) is self-evident. At present, however, network platforms simply use the first frame of the video, or a randomly selected frame, as the video cover; such covers have difficulty attracting users' attention, resulting in a low video click-through rate.
Disclosure of Invention
In view of this, the present application provides a video cover generation method and apparatus, a computer device, and a storage medium, so that the generated video cover better reflects the content of the video that interests users, improving the attractiveness of the video cover to users and increasing the click-through rate of the video.
To achieve the above object, in one aspect, the present application provides a video cover generation method, including:
acquiring a multi-frame image in a video;
for each frame of image, determining a memorability score of the image according to image features of the image that reflect the depth of impression the image leaves, where the memorability score reflects the user's degree of interest in the image;
selecting at least one frame of target image for generating a video cover from the multi-frame images based on the memorability scores of the multi-frame images;
generating a video cover of the video based on at least one frame of the target image.
In one possible implementation, the determining the memorability score of the image includes:
calculating the memorability score of the image by using a pre-trained image memorability model, where the image memorability model is trained with a plurality of sample images labeled with memorability scores.
In a possible implementation manner, the acquiring multiple frames of images in a video includes:
acquiring a video of a video cover to be generated;
splitting the video into a plurality of continuous video segments, wherein each video segment comprises at least one frame of image;
and selecting at least one frame of image from each video segment as a candidate cover, and obtaining a plurality of frames of images used as candidate covers.
Preferably, the selecting at least one frame of image from each video segment as a candidate cover includes:
respectively calculating the definition of each frame of image in each video segment;
and selecting at least one frame of image with the definition meeting preset conditions from each video segment as a candidate cover.
In another aspect, the present application further provides a video cover generation apparatus, including:
the video acquisition unit is used for acquiring multi-frame images in the video;
the image scoring unit is configured to determine, for each frame of image, a memorability score of the image according to image features of the image that reflect the depth of impression the image leaves, where the memorability score reflects the user's degree of interest in the image;
the image screening unit is used for selecting at least one frame of target image for generating a video cover from the multi-frame images based on the memorability scores of the multi-frame images;
and the cover generation unit is used for generating a video cover of the video based on at least one frame of the target image.
In yet another aspect, the present application further provides a computer device, including:
a processor and a memory;
wherein the processor is configured to execute a program stored in the memory;
the memory is configured to store a program for at least:
acquiring a plurality of frame images in a video;
for each frame of image, determining a memorability score of the image according to image features of the image that reflect the depth of impression the image leaves, where the memorability score reflects the user's degree of interest in the image;
selecting at least one frame of target image for generating a video cover from the multi-frame images based on the memorability scores of the multi-frame images;
and generating a video cover of the video based on at least one frame of the target image.
In yet another aspect, the present application further provides a storage medium having stored therein computer-executable instructions, which when loaded and executed by a processor, implement the video cover generation method according to any one of the embodiments of the present application.
It can be seen that, in the embodiment of the application, after the multiple frames of images serving as candidate covers in the video are acquired, a memorability score is determined for each image. Because the memorability score of an image reflects the user's degree of interest in the image, selecting the target image for generating the video cover based on the memorability scores of the multiple frames of images helps select the images that best reflect the content of the video that interests users, so that the generated video cover is more attractive to users, which in turn improves the click-through rate of the video.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings based on these drawings without creative effort.
FIG. 1 is a block diagram of a video cover generation system according to an embodiment of the present application;
FIG. 2 is a schematic flow chart diagram illustrating one embodiment of a method for generating video covers in accordance with the present disclosure;
FIG. 3 shows an example of selecting a cover of a video from a video in an embodiment of the present application;
FIG. 4 is a diagram illustrating the memorability of images with different saliency;
FIG. 5 is a diagram illustrating the degree of memorability for a plurality of images expressing different emotions;
FIG. 6 is a schematic diagram of training an image memorability model with a plurality of sample images in an embodiment of the present application;
FIG. 7 is a schematic flow chart illustrating a process of training an image memorability model in an embodiment of the present application;
fig. 8 is a schematic diagram illustrating an application scenario to which a video cover generation method according to an embodiment of the present application is applied;
FIG. 9 is a flow interaction diagram of a video cover generation method according to an embodiment of the present application;
fig. 10 is a schematic diagram illustrating another application scenario to which a video cover generation method according to an embodiment of the present application is applied;
fig. 11 is a schematic diagram illustrating another application scenario to which a video cover generation method according to an embodiment of the present application is applied;
FIG. 12 is a schematic view illustrating an interaction of another flow of a video cover generation method according to an embodiment of the present application;
FIG. 13 is a schematic diagram illustrating the components of an embodiment of a video cover generation apparatus according to an embodiment of the present application;
fig. 14 shows a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The video cover generation method of the present application is suitable for selecting, from a video, the images used to generate the video cover, so that the selected images reflect the content of the video that interests users, improving the attractiveness of the video cover to users.
The inventors of the present application found through research that the more memorable an image is, the greater the user's interest in it tends to be; therefore, the user's degree of interest in an image can be analyzed through the memorability of the image. The memorability of an image represents how deep an impression the image leaves, which characterizes the degree to which the user is interested in the image. Based on this finding, when selecting images from a video for generating the video cover, the present application takes the memorability of the images in the video into account to improve the memorability of the generated video cover, thereby increasing users' interest in the video cover.
The video cover generation method of the present application is applicable to servers of network platforms, such as servers of multimedia platforms, so that the server automatically selects a video cover for a video uploaded by a user. The method is also applicable to terminals, such as mobile phones, tablet computers, and notebook computers, so that when a user uploads a video to a network platform through a terminal, images suitable for generating the video cover are selected from the video chosen by the user.
For ease of understanding, a scenario in which the scheme of the present application is applicable will be described. For example, referring to fig. 1, a schematic diagram of a component structure of a video cover generation system of the present application is shown.
In the system shown in fig. 1, it comprises: a terminal 10 and a server 20 in a network platform, wherein the terminal 10 and the server 20 are connected through a network 30 in a communication way.
The network platform may be a social platform, a multimedia platform, or the like. The network platform may include one or more servers, and fig. 1 illustrates one server in the network platform as an example, but when the network platform includes a plurality of servers, the operations performed by any of the plurality of servers are the same.
Wherein, the terminal 10 is used for uploading the video to be published to the server 20 in the network platform.
The server 20 in the network platform is used for determining a video cover corresponding to the video uploaded by the terminal and publishing the video with the video cover.
When the terminal does not specify the video cover of the video, the server of the network platform needs to select at least one frame of image for generating the video cover from the video to be published, and generate the video cover of the video by using the selected at least one frame of image.
The video cover generation method is first introduced from the server side. For example, refer to fig. 2, which shows a flowchart of an embodiment of the video cover generation method of the present application. The method of this embodiment may include:
s201, receiving the video uploaded by the terminal.
For example, the terminal requests the server to upload a video, and after the server agrees to the request of the terminal, the terminal transmits the video to be distributed to the server.
It is understood that step S201 is not a necessary step for the server to select a video cover for a video; it is described only to facilitate understanding of the scheme, using one possible source of the video for which a cover is to be generated as an example. In practical applications, the video for which a cover needs to be generated may also be uploaded by an administrator on the server side, or transmitted to the server by another network platform, which is not limited herein.
S202, acquiring a plurality of frame images in the video.
The video is a video of a video cover to be generated.
In a possible implementation manner, the step S202 may be to determine multiple frames of images in the video, so as to select an image used for generating the video cover from the multiple frames of images. In this case, it can be considered that each frame image in the video is taken as a candidate image that can be used to generate a video cover. Wherein the candidate cover refers to an image in the video that can be used for generating a video cover.
In another possible scenario, in order to reduce the data processing amount and more fully reflect the content covered by the video, the server may extract partial images from the video as candidate cover pages. For example, multiple frames of images may be randomly extracted from the video as candidate covers.
If images are randomly extracted from the video as candidate covers, the positions of the candidate covers in the video can easily be concentrated, so that the selected candidate covers cannot comprehensively reflect the content of the video. For example, a video contains 1000 frames of images, but the extracted candidate covers may all fall between the 10th frame and the 100th frame, so the candidate covers reflect only part of the video's content and easily miss some highlight content; as a result, the video cover subsequently selected from these candidates may fail to show the highlights of the video that interest users. Optionally, so that the screened candidate covers more fully reflect the content of the video, the server may first split the video into a plurality of consecutive video segments, and then select at least one image from each video segment as a candidate cover.
For example, the video can be uniformly split into a plurality of video segments, for example, the number of frames of images contained in each video segment is the same, or the difference between the number of frames of any two video segments can be at most one in consideration of the situation that a plurality of frames of images in the video cannot be equally split; then, one frame of image is selected from each video segment as a candidate cover, so as to obtain a plurality of frames of images as candidate covers.
Optionally, to ensure the definition of the video cover, after the video is split into a plurality of video segments, the definition of each frame of image in each video segment may be calculated, and then at least one frame of image whose definition meets a preset condition is selected from each video segment as a candidate cover. The preset condition may be set as needed; for example, it may be that the definition exceeds a preset threshold, or that the definition is the highest within the video segment. There are many methods for calculating image definition, and the present application does not limit how image definition is judged.
For ease of understanding, reference may be made to fig. 3, which shows an example of selecting candidate covers from a video. In fig. 3, the video is uniformly split into a plurality of video segments, each containing multiple frames of images. Then, for each video segment, based on the definition of each frame of image in the segment, the frame with the highest definition is selected from the segment as a candidate cover, thereby obtaining multiple frames of candidate covers.
It can be understood that screening candidate covers from each video segment in combination with image definition avoids the situation where the candidate covers are concentrated and similar and thus cannot comprehensively reflect the content of the video, while ensuring that the definition of the screened candidate covers meets the requirement, which helps improve both the definition of the subsequently selected video cover and users' interest in it. A minimal sketch of this step is given below.
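For ease of understanding, the following is a minimal sketch of the candidate-cover selection described above, assuming OpenCV (cv2) is available and using the variance of the Laplacian as the definition (sharpness) measure; the application does not mandate any specific definition metric, so this choice is illustrative only.

```python
import cv2
import numpy as np

def definition(frame: np.ndarray) -> float:
    """Variance of the Laplacian: a common sharpness measure (higher = sharper)."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var()

def pick_candidate_covers(video_path: str, num_segments: int = 10):
    """Uniformly split the video into consecutive segments and keep the
    sharpest frame of each segment as a candidate cover."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    ok, frame = cap.read()
    while ok:
        frames.append(frame)
        ok, frame = cap.read()
    cap.release()

    seg_len = max(1, len(frames) // num_segments)
    candidates = []  # (frame index in video, frame) pairs
    for start in range(0, len(frames), seg_len):
        segment = frames[start:start + seg_len]
        best = max(range(len(segment)), key=lambda i: definition(segment[i]))
        candidates.append((start + best, segment[best]))
    return candidates
```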
S203, for each frame of image, determining a memorability score reflecting the user's degree of interest in the image, according to image features of the image that reflect the depth of impression the image leaves.
The memorability score of an image is used to reflect how memorable the image is. The memorability score may also be called a memorability index; it may be an integer score, a probability value, or a memorability grade, and of course may take other forms that represent the degree of memorability.
It can be understood that certain features of an image tend to attract attention and leave a deep impression, so the memorability score of an image can be obtained by determining the image features that reflect the depth of impression and scoring the image based on those features. For example, whether the image contains a person or another preset target object, the position of the person or object in the image, and the features of the parts of the image that express emotion may all serve as features reflecting the depth of impression. For instance, an image with an object such as a person in the middle is more impressive than an image with such an object at the edge, or with no such object at all.
In the present application, the inventors found through research that the memorability of an image is related to the popularity of the image, the saliency of the image, the emotion expressed by the image's content, and so on. Therefore, the saliency of the image, the emotion expressed by the image, and the popularity of the image can all be used as image features reflecting the depth of impression. The higher the saliency of an image, the higher its memorability; the higher the popularity of an image, the higher its memorability; and among images expressing emotion, an image expressing certain emotions has higher memorability than images expressing other emotions.
The saliency of an image represents visual attention, that is, the degree to which an image region attracts attention. There are many algorithms for measuring image saliency, and the present application does not limit this. To facilitate understanding of the relationship between saliency and image memorability, refer to fig. 4, which illustrates the memorability of multiple images with different saliency. Among the three images from left to right in fig. 4, the person in the first image is at the center of the image; in the second image, the person is at the right of the image; and there is no person in the third image. The saliency of the three images decreases from left to right, with the first image having the highest saliency; correspondingly, extensive tests show that the first image also has the highest memorability. As shown in fig. 4, the memorability of the first image is 0.751, that of the second image is 0.39, and that of the third image is 0.241.
For example, an image expressing a strong emotion such as sadness, surprise, or anger is more impressive than an image expressing an emotion such as contentment or fear. See fig. 5, which shows the memorability of multiple images expressing different emotions. Among the three images from left to right in fig. 5, the person in the first image shows an angry expression; the person in the second image shows a sad expression; and the person in the third image shows a contented expression. Correspondingly, extensive tests show that the first image has the highest memorability, 0.95; the memorability of the second image is 0.88; and the third image has the lowest memorability, 0.79.
As another example, the popularity of an image may reflect how many times the image is liked, recommended, or viewed by users in a social network. The more times an image is viewed and recommended by users, the more popular it is. The popularity of an image can be determined by counting users' operations on the image in the social network, such as recommending and viewing it; the relationship between popularity and memorability is similar to the previous cases and is not repeated here.
As is clear from the above analysis, since the memorability of an image can be reflected by the image's saliency, the emotional features it depicts, and so on, the memorability score of an image can be determined by analyzing image features of multiple dimensions, such as the image's saliency and the emotion it expresses. For example, multiple feature dimensions of the image to be analyzed may be set, for example including the saliency of the image and the emotion expressed by the image (different expressed emotions correspond to different emotion-dimension scores); different weights are set for the different dimensions, and the scores of the dimensions are weighted and summed to determine the memorability of the image.
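As an illustration of the weighted-summation idea above, the following sketch combines per-dimension scores into a memorability score. The feature names and the weights are hypothetical placeholders; the application only states that the scores of multiple dimensions are weighted and summed.

```python
# Hypothetical weights; in practice these would be tuned or learned.
FEATURE_WEIGHTS = {"saliency": 0.5, "emotion": 0.3, "popularity": 0.2}

def memorability_score(feature_scores: dict) -> float:
    """Weighted sum of per-dimension scores (each assumed to lie in [0, 1])."""
    return sum(FEATURE_WEIGHTS[name] * feature_scores.get(name, 0.0)
               for name in FEATURE_WEIGHTS)

# Example: memorability_score({"saliency": 0.9, "emotion": 0.8, "popularity": 0.4})
# returns 0.5*0.9 + 0.3*0.8 + 0.2*0.4 = 0.77
```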
Optionally, to determine the memorability score of an image more conveniently and quickly, an image memorability model may be obtained by pre-training; the image memorability model may be trained with a plurality of sample images labeled with memorability scores. The image memorability model converts image features representing the depth of impression in the image into a memorability score and outputs it. Correspondingly, the memorability score of each frame of image can be calculated by using the pre-trained image memorability model. For example, each frame of image is input into the image memorability model, and the memorability score output by the model for each image is obtained.
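A minimal inference sketch follows, assuming a PyTorch model with a single score output (such as the one trained in the sketch further below) and frames already decoded as RGB arrays; the input resolution and preprocessing are assumptions, not fixed by the application.

```python
import torch
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

@torch.no_grad()
def score_frames(model, frames):
    """Return one memorability score per frame."""
    model.eval()
    batch = torch.stack([preprocess(f) for f in frames])
    return model(batch).squeeze(1).tolist()  # model outputs shape (N, 1)
```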
For example, based on the plurality of sample images labeled with memorability scores, a deep learning network or convolutional neural network model is trained iteratively, and the network model finally obtained from the training is determined as the image memorability model.
For ease of understanding, training a deep learning network model with a plurality of sample images is taken as an example. Referring to fig. 6, which shows an example of training a deep learning network with a plurality of sample images: the sample images labeled with memorability scores are input into the deep learning network to be trained; then, the memorability score output by the deep learning network for each sample image is compared with the actually labeled memorability score of that sample image, and the network is continuously adjusted, so that a deep learning network whose output memorability scores are close to the actually labeled scores of the sample images, that is, the image memorability model, is finally obtained. Referring to fig. 7 in conjunction with the example of fig. 6, which shows a flowchart of training a deep learning network with a plurality of sample images, the process may include:
s701, obtaining a plurality of sample images, wherein each sample image is marked with a memorability score.
The memorability scores of the sample images may be labeled manually in advance. For example, the memorability score of each sample image is obtained through tests on a large number of users. As another example, memorability scores are set manually for the sample images based on experience.
It will be appreciated that the memorability scores of different sample images will also vary due to differences in the sample images.
S702, inputting a plurality of sample images into a deep learning network to be trained, and obtaining the memorability score of each sample image output by the deep learning network.
The deep learning network may have a variety of possibilities, e.g., may be a lightweight neural network, e.g., MobileNet, or the like.
S703, determining the accuracy with which the deep learning network predicts image memorability scores, based on the memorability scores labeled on the respective sample images and the memorability scores output by the deep learning network for those images.
It can be understood that the deep learning network estimates a memorability score for each image. To verify whether the memorability scores estimated by the deep learning network are accurate, the score estimated for each sample image needs to be compared with the score actually labeled on that sample image; the degree of difference between the estimated memorability score and the actually labeled memorability score reflects the accuracy with which the deep learning network predicts image memorability scores. For example, the degree of difference between the estimated and actually labeled memorability scores can be measured by a loss function, such as a cross-entropy function.
Of course, other methods of judging the accuracy of the memorability scores predicted by the deep learning network are also applicable to this embodiment.
S704, judging whether the accuracy of the memorability scores predicted by the deep learning network meets the preset requirement; if so, determining the current deep learning network as the image memorability model, and ending the training; if not, adjusting the parameter values of the parameters in the deep learning network, and returning to step S702.
For example, if the degree of difference between the estimated memorability score of a sample image and the memorability score actually labeled on it is within a preset deviation, the accuracy can be determined to meet the preset requirement.
It should be noted that fig. 6 is only a simple example of training the deep learning network. In practical applications, during the process of training the deep learning network with sample images, the network needs to be tested after each round of training with a plurality of test sample images, and the finally required model is determined, in combination with the test results, from the deep learning networks trained over multiple rounds.
Of course, figs. 6 and 7 show only one possible way of training the image memorability model; network models trained with a plurality of sample images by other methods, such that image memorability can be evaluated, are also applicable to this embodiment, which is not limited herein.
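The following training sketch is consistent with the flow of figs. 6 and 7, assuming PyTorch and a lightweight MobileNetV2 backbone (MobileNet is named above as one option). Memorability labels are assumed to be values in [0, 1], so a binary cross-entropy loss stands in for the cross-entropy function mentioned above, and the stopping threshold is a placeholder for the "preset requirement" on accuracy.

```python
import torch
import torch.nn as nn
from torchvision import models

def build_memorability_model() -> nn.Module:
    backbone = models.mobilenet_v2(weights=None)
    # Replace the 1000-way classifier head with a single score output.
    backbone.classifier[1] = nn.Linear(backbone.last_channel, 1)
    return nn.Sequential(backbone, nn.Sigmoid())  # scores in (0, 1)

def train(model, loader, epochs: int = 10, target_loss: float = 0.05):
    criterion = nn.BCELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    for _ in range(epochs):
        total, batches = 0.0, 0
        for images, labels in loader:       # labels: annotated memorability scores
            pred = model(images).squeeze(1)
            loss = criterion(pred, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total, batches = total + loss.item(), batches + 1
        if total / batches <= target_loss:  # "accuracy meets the preset requirement"
            break
    return model
```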
S204, selecting, from the multiple frames of images, at least one target image for generating the video cover, based on the memorability scores of the multiple frames of images.
Based on the memorability scores of the multiple frames of images serving as candidate covers, images with relatively high memorability can be selected as the images for generating the video cover.
For example, at least one top-ranked target image may be selected from the multiple frames in descending order of memorability score.
As shown in fig. 3, after one frame of image is selected from each video segment as a candidate cover, the candidate cover with the highest memorability score may be selected from the multiple frames of candidate covers as the video cover, based on the memorability score of each candidate cover.
And S205, generating a video cover of the video based on the selected at least one frame of target image.
It is understood that video covers can be classified into static video covers and dynamic video covers. For ease of understanding, these two types of video covers are described through several cover-generation scenarios. When the video cover to be generated is a static video cover, the target image with the highest memorability score may be selected from the plurality of images and used to generate the static video cover of the video. For example, the selected target image is directly determined as the static cover, or specific processing, such as adding a title or description, is applied to the selected target image, and the processed target image serves as the static cover of the video. Of course, a plurality of top-scoring target images may also be selected and combined into one video cover.
In one possible way, when the video cover to be generated is a dynamic video cover, the image with the highest memorability score may be selected from the multiple frames of images serving as candidate covers; for ease of distinction, this image is referred to as the reference image. Then, in the video segment to which the reference image belongs, consecutive frames of images including the reference image are selected as the target images for creating the dynamic video cover; accordingly, an animation can be created from these consecutive target images and used as the dynamic video cover. For example, after the candidate cover with the highest memorability score is selected, the 5 nearest frames before it and the 5 nearest frames after it are selected from the video segment to which it belongs, so as to obtain 11 frames of images, and the animation generated from the 11 frames is used as the dynamic video cover.
For the case of generating a dynamic video cover, in yet another possible way, multiple frames of target images for generating the video cover may be selected from the multiple frames of images serving as candidate covers, for example, the target images whose memorability scores exceed a preset threshold, or the top-scoring target images in descending order of score; then an animation serving as the dynamic video cover is generated from the multiple frames of target images.
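A sketch of the first dynamic-cover idea above: take a window of consecutive frames around the highest-scoring ("reference") frame and encode them as an animation. The imageio library is assumed for GIF encoding, and the window radius is illustrative (a radius of 5 yields the 11 frames of the example above).

```python
import imageio

def make_dynamic_cover(frames, ref_index: int, radius: int = 5,
                       out_path: str = "cover.gif") -> str:
    """frames: decoded RGB frames of the segment containing the reference image;
    ref_index: position of the reference frame within `frames`."""
    lo = max(0, ref_index - radius)
    hi = min(len(frames), ref_index + radius + 1)
    imageio.mimsave(out_path, frames[lo:hi], duration=0.1)  # ~11-frame animation
    return out_path
```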
It is understood that after the target image for generating the video cover is selected, the video cover can be generated in various ways for different types of video covers, which is not limited in the present application.
It should be noted that, after the server selects at least one frame of target image for generating the video cover, the generation of the video cover may be completed by the server itself, or by another server or device, which is not limited in this application. The process of generating the video cover based on the selected target images, that is, step S205, is described only to facilitate understanding of the whole cover-generation process and is not a step that must be executed when selecting the video cover.
Therefore, in the embodiment of the application, after the server acquires the multiple frames of images serving as candidate covers in the video for which a cover is to be generated, the memorability score of each image is determined. Because the memorability score of an image reflects the user's degree of interest in the image, selecting the target image for generating the video cover based on the memorability scores of the multiple frames of images helps select, from the video, the images that best reflect the content that interests users as the video cover, so that the generated video cover is more attractive to users and the click-through rate of the video is further improved.
Meanwhile, because the user's degree of interest in an image is positively correlated with how exciting the image is, the scheme of the application, when selecting images that interest users from the video as the video cover, also tends to select the more exciting images in the video as the cover, which helps improve the attractiveness of the video cover.
It can be understood that the video cover generation method of the present application can be applied to various application scenarios for realizing video distribution. For ease of understanding, the following describes a process of selecting and generating a video cover on the server side, taking an application scenario as an example.
For example, refer to fig. 8, which shows an example of an application scenario to which the video cover generation method of the present application is applied. In this scenario, the network platform is a video distribution platform. The terminal 10 uploads a video A to be distributed to a server of the video distribution platform, without designating any image in video A as the cover of video A.
Correspondingly, after receiving video A, the server 20 of the video distribution platform selects, based on the memorability scores of the images in video A that can serve as candidate covers, at least one frame of target image for generating the video cover of video A, and generates video cover a of video A using the selected target images; the server then stores video A together with its video cover a in the shared storage area.
The video cover a of the video a can be a static video cover or a dynamic video cover.
The shared storage area is a storage area accessible by different terminals to store videos published by different users, and may be considered as a part of the storage area of the server 20 or may be a storage area in a storage device other than the server 20.
A user of a terminal who accesses the shared storage area can see the videos published by all users within the user's access rights (for example, videos published by the user and videos published by other users).
With reference to the application scenario of fig. 8, taking as an example a user publishing, through a terminal, a video into the personal shared storage space allocated to the user by the server, refer to fig. 9, which shows a flow interaction diagram of another embodiment of the video cover generation method of the present application. The method of this embodiment may include:
s901, the terminal sends a user login request to a server of the video publishing platform.
The user login request may carry a user identification and an authentication code of the user. For example, the user identification may be a user name of the user, and the authentication code may be a login password.
And S902, the server responds to the user login request and completes the user login after the user identity is verified.
For example, if the server verifies that the user name and login password of the user match, the user is allowed to log in, and a connection between the server and the terminal is established, so that the user logs in to the server through the terminal.
For example, taking a terminal as an instant messaging client as an example, a user may log in an instant messaging server through the terminal to access a personal shared storage space, such as a commonly-known circle of friends or a personal space, allocated by the instant messaging server for the user.
Steps S901 and S902 are not necessary steps for the terminal to publish a video to the server; they are described only as an example scenario to facilitate understanding of the scheme.
S903, the terminal sends a video publishing request to the server, wherein the video publishing request carries the video to be published and the user identifier of the user.
For example, the video publishing request is used to request that the video be published into the user's personal shared storage space, so that other users accessing that space can view the video. For example, a user publishes a short video to the personal shared storage space to share it with others. Here, a short video generally refers to a video whose duration is less than a particular length (e.g., less than three minutes).
Of course, the step S903 is only described by taking one video distribution scenario as an example, and the same applies to other video distribution scenarios.
S904, the server splits the video into a plurality of consecutive video segments.
Wherein, each video segment comprises at least one frame of image.
Optionally, the video is split according to its length, and the lengths of the split video segments may be the same or different. For example, according to the length of the video, the video is split into a plurality of video segments of the same or similar length; for instance, a video may be split into 10 video segments, each containing the same number of images.
S905, the server calculates the definition of each frame of image in each video segment.
S906, the server selects a frame of image with the highest definition from each video segment as a candidate cover, and a plurality of frames of candidate covers are obtained.
This embodiment is described by taking the selection of the frame with the highest definition as the candidate cover as an example; however, selecting one or more frames whose definition exceeds a preset threshold, or selecting candidate covers in other ways based on image definition, is also applicable to this embodiment.
And S907, the server calculates the memorability score of each frame of the candidate cover by using the image memorability model obtained by pre-training.
The image memorability model is obtained by training a network model by utilizing a plurality of sample images with memorability scores.
S908, the server selects a candidate cover with the highest memorability score from the candidate covers as a still video cover of the video.
In this embodiment, a static video cover is generated, and the candidate cover with the highest memorability score is directly used as the static video cover; therefore, only that one frame of candidate cover is selected, and it can be determined as the static video cover without further processing. It can be understood, however, that in practical applications, selecting multiple frames of candidate covers and processing them to generate a static or dynamic video cover is also applicable to this embodiment, which is not limited herein.
It is understood that specific implementation of steps S904 to S908 can refer to related description of the foregoing embodiments, and will not be described herein.
S909, according to the user identifier of the user, the server stores the video into the user's personal shared storage space in the shared storage area, and sets the selected static video cover as the cover displayed for the video.
Storing the video in the personal shared storage space and setting its cover to the selected static video cover completes the publishing of the video. Correspondingly, the user, and other users with permission to access the user's personal shared storage space, can access that space and see the static video cover of the published video.
S910, the server returns a successful distribution prompt for indicating the successful distribution of the video to the terminal.
It should be noted that steps S909 and S910 are optional and represent only one possible way of processing after the server selects the images for generating the video cover. In practical applications, after selecting the target images for generating the video cover, the server may also present them to the user, so that the user can designate one or more of the target images for generating a static or dynamic video cover.
For example, referring to fig. 10, which shows an example of yet another application scenario to which the video cover generation method of the present application is applied: after the server selects from the video at least one frame of target image for generating the video cover, it recommends the selected target images to the user, so that the user makes the final selection of the video cover.
As can be seen from fig. 10, in step S10, the terminal sends a video to be published to the server;
in step S11, the server selects at least one top-ranked target image, such as multiple frames of target images, from the video. For the process of selecting the target images, refer to the description of the embodiment of fig. 2 or of steps S904 to S908 in fig. 9; the difference is that here the server may select one or more frames as target images for generating the video cover.
In step S12, the server recommends the selected at least one target image that can be used to generate the video cover to the terminal, to instruct the user of the terminal to select at least one of the recommended target images as the video cover.
In step S13, the terminal notifies the server of the video cover selected by the user. For example, if the server recommended three target images and the user selected one of them as the video cover, the terminal sends the identifier of the selected target image to the server.
In step S14, the server takes the cover selected by the user as the video cover of the video, and publishes it to the shared storage area together with the video.
Using the scheme of the above embodiments, the inventors of the present application tested a number of short videos published on a platform; the tested videos covered various life scenes such as user selfies, parties, food, indoor and outdoor scenes, and sports. In such user-shot short videos, people are usually the main subject, mixed with various other content. With the scheme of the application, taking the person as the central object, images are selected from the videos as covers based on their memorability scores. Comparing the video covers generated for these short videos by the scheme of the application with covers determined by existing approaches such as random extraction, it is evident that the covers generated by the scheme of the application are more exciting and clearer, and achieve a better effect.
It is understood that the video cover selection methods in the above embodiments are described by taking as an example the server selecting from the video at least one frame of target image for generating the video cover. However, before the terminal uploads the video to be published to the server, the terminal may itself determine at least one frame of target image for generating the video cover from the video, and then either generate the video cover based on the selected target images, or transmit information about the at least one frame of target image together with the video to be published to the server, so that the server publishes the video and generates its cover using the at least one frame of target image.
For example, see fig. 11, which shows an example of yet another application scenario of the video cover generation method of the present application. As shown in fig. 11, in this scenario, after the terminal 10 acquires the video to be published, it selects from the video a video cover, or at least one frame of image for generating the video cover, and transmits information about the selected cover or images together with the video to the server 20.
Referring to fig. 12 in conjunction with fig. 11, which shows a schematic flow interaction diagram of still another embodiment of the video cover generation method of the present application, the method of this embodiment may include:
s1201, the terminal determines the video to be released.
For example, the terminal receives a video to be published selected by the user.
S1202, the terminal splits the video into a plurality of continuous video segments.
S1203, the terminal calculates the definition of each frame of image in each video segment respectively.
This step is optional: on the premise that the definition of every frame of image in the video segments is considered to meet the requirement, or when definition is not taken into account, step S1203 is not executed, and one or more frames of images are directly selected at random from each video segment as candidate covers.
S1204, the terminal respectively selects at least one frame of image with the definition meeting the preset conditions from each video segment to serve as a candidate cover, and a plurality of frames of candidate covers are obtained.
For example, a frame of the highest definition image is selected from each video segment as a candidate cover.
For the specific operations of steps S1202 to S1204 performed on the terminal side, refer to the corresponding operations performed on the server side, described in detail above, which are not repeated here.
It should be noted that steps S1202 to S1204 are only one implementation of the terminal acquiring the multiple frames of images in the video as candidate covers; in practical applications, the terminal may also treat every frame of image in the video as a candidate cover. Of course, there may be other ways: the specific ways described above for the server side to acquire the multiple frames of images in the video as candidate covers are also applicable to the terminal side, and are not repeated here.
And S1205, the terminal respectively calculates the memorability scores of the candidate cover of each frame by using a preset image memorability model.
This step is only one implementation of calculating the memorability scores of the candidate covers; the ways described above for the server side to determine the memorability score of each frame of image serving as a candidate cover are also applicable to the terminal side.
And S1206, selecting the candidate cover with the highest memorability score from the candidate covers by the terminal as the video cover.
In practical applications, the terminal may select a frame of candidate cover whose memorability score exceeds a preset threshold as the video cover, or may select the video cover from the candidate covers in other ways, which is not limited herein.
Step S1206 is described by taking the terminal selecting one frame of candidate cover as the video cover as an example; in this case, the video cover selected by the terminal is actually a static video cover of the video. In practical applications, the terminal may also select multiple candidate covers from the plurality of candidate covers as the video cover. Alternatively, after selecting the candidate cover with the highest memorability score, the terminal may extract, from the video segment to which that candidate cover belongs, the frames of images nearest to it, and use them as the video cover. Of course, other implementations of selecting, based on the memorability scores of the images serving as candidate covers, at least one frame of target image for generating the video cover are also applicable to this embodiment, and are not repeated here.
It is understood that, in the case where the terminal picks up a plurality of frames of target images for generating a video cover, the terminal may also recommend the picked-up plurality of frames of target images to the user, and the user finally picks up the image required as the video cover.
S1207, the terminal transmits the identification of the cover of the video and the video to the server.
For example, the terminal transmits the video to the server and indicates the frame serial number, within the video, of the image chosen as the video cover, so that the server can determine which image in the video was selected as the cover.
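As an illustration only, a publish request of step S1207 might look like the following; the field names are hypothetical, since the application only requires that the video and the frame serial number of the chosen cover be conveyed.

```python
# Hypothetical payload for the terminal's publish request (field names assumed).
publish_request = {
    "user_id": "user-123",              # identifier of the publishing user
    "video": "<binary video data>",     # the video to be published
    "cover_frame_index": 42,            # frame serial number of the chosen cover
}
```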
S1208, the server publishes the video with the video cover.
In the case that the video cover page of the video is determined, the server may issue the video in various ways, for example, the server may store the video in association with the video cover page in a shared storage area, and the like.
Step S1208 is described for the case where the video cover is directly selected by the terminal. In practical applications, the terminal may instead select one or more frames of target images for generating the video cover and then indicate them to the server, so that the server generates a static or dynamic video cover based on the one or more frames of target images; alternatively, the terminal generates a static or dynamic video cover using the one or more frames of target images and transmits it to the server.
Corresponding to the above video cover generation method, the embodiment of the present application further provides a video cover generation device. For example, fig. 13 shows a schematic diagram of the composition structure of an embodiment of the video cover generation device of the present application. The device may be applied to a computer device, and the computer device may be the aforementioned server or the aforementioned terminal. The device of this embodiment may include:
a video obtaining unit 1301, configured to obtain multiple frames of images in a video;
the image scoring unit 1302, configured to determine, for each frame of image, a memorability score of the image according to image features of the image that reflect an impression-depth scale, where the memorability score is used to reflect the degree of the user's interest in the image;
the image screening unit 1303, configured to select, from the multi-frame images, at least one frame of target image for generating a video cover, based on the memorability scores of the multi-frame images;
a cover generation unit 1304 for generating a video cover of the video based on at least one frame of the target image.
In one possible implementation, the image scoring unit includes:
and the scoring subunit, configured to calculate, for each frame of image, the memorability score of the image using a pre-trained image memorability model, where the image memorability model is trained using a plurality of sample images labeled with memorability scores.
Optionally, the device may further include: the model training unit, configured to obtain the image memorability model through training in the following manner (a training-loop sketch follows these steps):
acquiring a plurality of sample images, wherein each sample image is marked with a memorability score;
inputting the plurality of sample images into a deep learning network to be trained, to obtain the memorability score predicted by the deep learning network for each sample image;
determining the accuracy with which the deep learning network predicts image memorability, based on the memorability scores respectively labeled on the plurality of sample images and the memorability scores output by the deep learning network for those images;
and, when the accuracy does not meet a preset requirement, adjusting the parameter values of the parameters in the deep learning network and returning to the operation of inputting the plurality of sample images into the deep learning network to be trained, until the accuracy meets the preset requirement.
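A minimal training-loop sketch of these four steps, assuming PyTorch, a network ending in a sigmoid so its outputs lie in (0, 1), and memorability labels normalized to the same range; the claims below name a cross-entropy loss, and binary cross-entropy over such normalized scores is one concrete reading of that. The batch size, learning rate, and stopping threshold are illustrative:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

def train_memorability_model(model: nn.Module,
                             images: torch.Tensor,   # (N, 3, H, W)
                             labels: torch.Tensor,   # (N,) scores in [0, 1]
                             target_loss: float = 0.05,
                             max_epochs: int = 100) -> nn.Module:
    loader = DataLoader(TensorDataset(images, labels), batch_size=32, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    criterion = nn.BCELoss()  # assumes the model's outputs are already in (0, 1)
    for _ in range(max_epochs):
        epoch_loss = 0.0
        for batch_images, batch_labels in loader:
            optimizer.zero_grad()
            predicted = model(batch_images).squeeze(1)  # predicted memorability scores
            loss = criterion(predicted, batch_labels)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item() * len(batch_images)
        # "Accuracy meets the preset requirement" is read here as: the average
        # loss over all sample images falls below a chosen threshold.
        if epoch_loss / len(images) <= target_loss:
            break
    return model
```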
In one possible implementation manner, the video obtaining unit includes:
the video acquisition subunit is used for acquiring a video for which a video cover is to be generated;
the video splitting subunit is used for splitting the video into a plurality of continuous video segments, and each video segment comprises at least one frame of image;
and the image candidate subunit is used for selecting at least one frame of image from each video segment as a candidate cover, thereby obtaining multiple frames of images serving as candidate covers.
Further, the image candidate subunit may include:
the sharpness calculation subunit is used for calculating the sharpness of each frame of image in each video segment;
and the first candidate subunit is used for selecting, from each video segment, at least one frame of image whose sharpness meets a preset condition as a candidate cover (a per-segment sketch follows).
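A per-segment sketch of this sharpness filter, assuming OpenCV frames and using variance of the Laplacian as the sharpness measure (one common choice; the application itself does not fix the measure):

```python
import cv2
import numpy as np

def sharpest_frame_per_segment(frames, num_segments):
    """Split `frames` (a list of BGR images) into contiguous segments
    and return, for each segment, the index of its sharpest frame."""
    def sharpness(frame):
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        return cv2.Laplacian(gray, cv2.CV_64F).var()  # variance of the Laplacian

    bounds = np.linspace(0, len(frames), num_segments + 1, dtype=int)
    candidates = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        if start < end:  # skip segments shorter than one frame
            candidates.append(max(range(start, end),
                                  key=lambda i: sharpness(frames[i])))
    return candidates
```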
In one possible implementation manner, the image screening unit includes:
the first screening subunit is used for selecting a target image with the highest memorability score from the multi-frame images;
the cover generation unit includes:
a first generating subunit for generating a still video cover of the video using the target image.
In yet another possible implementation manner, the image screening unit may include:
the second screening subunit is used for selecting a reference image with the highest memorability score from the multi-frame images;
and the third screening subunit is used for selecting, from the video segment to which the reference image belongs, continuous multi-frame images including the reference image as the target images for generating a dynamic video cover.
In another aspect, the present application further provides a computer device, which may be the aforementioned server or the aforementioned terminal. For example, referring to fig. 14, a schematic diagram of a component structure of a computer device of the present application is shown.
As can be seen from fig. 14, the computer device 1400 comprises at least: a processor 1401, and a memory 1402.
The processor 1401 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), a digital signal processor (DSP), a field-programmable gate array (FPGA), or another programmable logic device.
Wherein the processor is configured to execute a program stored in the memory;
the memory 1402 is used to store one or more programs, which may include program code containing computer operating instructions.
In the embodiment of the present application, the memory stores at least a program for realizing the following functions (a sketch of the final cover-writing step follows the list):
acquiring a plurality of frame images in a video;
for each frame of image, determining a memorability score of the image according to image features of the image that reflect an impression-depth scale, where the memorability score is used to reflect the degree of the user's interest in the image;
selecting at least one frame of target image for generating a video cover from the multi-frame images based on the memorability scores of the multi-frame images;
generating a video cover of the video based on at least one frame of the target image.
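To make the last function concrete, a minimal sketch of writing the selected target image(s) out as a cover file, assuming RGB numpy frames and the imageio library; the file names and frame rate are illustrative:

```python
import imageio.v2 as imageio

def write_cover(target_frames, static_path="cover.jpg", dynamic_path="cover.gif"):
    """One target frame yields a static cover; several frames yield an
    animated (dynamic) cover. Frames are assumed to be RGB arrays."""
    if len(target_frames) == 1:
        imageio.imwrite(static_path, target_frames[0])
    else:
        imageio.mimsave(dynamic_path, target_frames, fps=8)
```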
In one possible implementation, the memory 1402 may include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required by at least one function (such as an image playing function), and the like, and the data storage area may store data created during use of the computer device, such as scoring data and models.
The memory 1402 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device or another non-volatile solid-state storage device.
Optionally, the computer device may further include a communication interface 1403, an input unit 1404, a display 1405, and a communication bus 1406.
The processor 1401, the memory 1402, the communication interface 1403, the input unit 1404 and the display 1405 all communicate with each other via the communication bus 1406.
Of course, the structure shown in fig. 14 does not limit the computer device in the embodiment of the present application; in practical applications, the computer device may include more or fewer components than those shown in fig. 14, or combine certain components.
In another aspect, the present application further provides a storage medium storing a computer program which, when loaded and executed by a processor, implements the video cover generation method described in any one of the above embodiments.
It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. For the device-like embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
Finally, it should also be noted that, herein, relational terms such as first and second are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual such relationship or order between those entities or actions. Moreover, the terms "comprises," "comprising," and any variation thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to it. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises that element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that it is obvious to those skilled in the art that various modifications and improvements can be made without departing from the principle of the present invention, and these modifications and improvements should also be considered as the protection scope of the present invention.

Claims (9)

1. A method for generating a video cover, comprising:
acquiring a plurality of frame images in a video;
for each frame of image, using an image memorability model obtained by pre-training, weighting and summing the scores corresponding to the image features of each dimension according to the different weights corresponding to image features of different dimensions and the image features of different dimensions reflected in the image, to determine a memorability score of the image, wherein the image memorability model is used for converting the image features representing an impression-depth scale in the image into a memorability score and outputting it, the memorability score is used for reflecting the degree of the user's interest in the image, the degree of the user's interest in the image is positively correlated with the wonderfulness of the image, and the image features at least comprise: the saliency of the image, the emotion expressed by the image, and the popularity of the image, wherein the saliency of the image represents the degree to which an image region attracts the user's attention, the popularity of the image reflects the number of times the image has been liked, recommended and browsed by users in a social network, and the saliency of the image is determined according to the presence or absence of a person or other target object in the image and the position features of the person or other target object in the image;
selecting, based on the memorability scores of the multi-frame images, at least one frame of target image for generating a video cover from the multi-frame images, comprising: selecting, from the multi-frame images, the reference image with the highest memorability score; and selecting, from the video segment to which the reference image belongs, continuous multi-frame images that are adjacent to the reference image on the left and right and contain the reference image, as the target images for generating a dynamic video cover;
generating a video cover of the video based on at least one frame of the target image;
the image memorability model is obtained by training in the following way:
acquiring a plurality of sample images, wherein each sample image is marked with a memorability score;
inputting a plurality of sample images into a deep learning network to be trained to obtain the memorability score of each sample image predicted by the deep learning network;
comparing, through a cross-entropy loss function, the degree of difference between the memorability scores output by the deep learning network and the actually labeled memorability scores, based on the memorability scores respectively labeled on the plurality of sample images and the memorability scores output by the deep learning network;
and, when the degree of difference does not meet a preset deviation degree, adjusting the parameter values of the parameters in the deep learning network and returning to the operation of inputting the plurality of sample images into the deep learning network to be trained, until the degree of difference meets the preset deviation degree.
2. The method for generating video covers according to claim 1, wherein the acquiring multiple frames of images in the video comprises:
acquiring a video of a video cover to be generated;
splitting the video into a plurality of continuous video segments, wherein each video segment comprises at least one frame of image;
and selecting at least one frame of image from each video segment as a candidate cover, and obtaining a plurality of frames of images as candidate covers.
3. The method for generating video covers according to claim 2, wherein said selecting at least one image from each of said video segments as a candidate cover comprises:
respectively calculating the sharpness of each frame of image in each video segment;
and selecting, from each video segment, at least one frame of image whose sharpness meets a preset condition as a candidate cover.
4. The method for generating video covers according to any one of claims 1 to 3, wherein the selecting at least one target image from the plurality of frames of images for generating video covers based on the memorability scores of the plurality of frames of images comprises:
selecting a target image with the highest memorability score from the multi-frame images;
the generating a video cover of the video based on at least one frame of the target image comprises:
generating a static video cover of the video using the target image.
5. A video cover generation device, comprising:
the video acquisition unit is used for acquiring multi-frame images in the video;
the image scoring unit is used for, for each frame of the images and by using an image memorability model obtained by pre-training, weighting and summing the scores corresponding to the image features of each dimension according to the different weights corresponding to image features of different dimensions reflecting an impression-depth scale and the image features of different dimensions in the image, to determine the memorability score of the image, wherein the image memorability model is used for converting the image features representing the impression-depth scale in the image into a memorability score and outputting it, the memorability score is used for reflecting the degree of the user's interest in the image, the degree of the user's interest in the image is positively correlated with the wonderfulness of the image, and the image features at least comprise: the saliency of the image, the emotion expressed by the image, and the popularity of the image, wherein the saliency of the image represents the degree to which an image region attracts the user's attention, the popularity of the image reflects the number of times the image has been liked, recommended and browsed by users in a social network, and the saliency of the image is determined according to the presence or absence of a person or other target object in the image and the position features of the person or other target object in the image;
the image screening unit is used for selecting, based on the memorability scores of the multi-frame images, at least one frame of target image for generating a video cover from the multi-frame images, including: selecting, from the multi-frame images, the reference image with the highest memorability score; and selecting, from the video segment to which the reference image belongs, continuous multi-frame images that are adjacent to the reference image on the left and right and contain the reference image, as the target images for generating a dynamic video cover;
a cover generation unit for generating a video cover of the video based on at least one frame of the target image;
the image memorability model is obtained by training in the following mode:
acquiring a plurality of sample images, wherein each sample image is marked with a memorability score;
inputting a plurality of sample images into a deep learning network to be trained to obtain the memorability score of each sample image predicted by the deep learning network;
comparing, through a cross-entropy loss function, the degree of difference between the memorability scores output by the deep learning network and the actually labeled memorability scores, based on the memorability scores respectively labeled on the plurality of sample images and the memorability scores output by the deep learning network;
and, when the degree of difference does not meet a preset deviation degree, adjusting the parameter values of the parameters in the deep learning network and returning to the operation of inputting the plurality of sample images into the deep learning network to be trained, until the degree of difference meets the preset deviation degree.
6. The video cover generation device according to claim 5, wherein the video acquisition unit includes:
the video acquisition subunit is used for acquiring a video for which a video cover is to be generated;
the video splitting subunit is used for splitting the video into a plurality of continuous video segments, and each video segment comprises at least one frame of image;
and the image candidate subunit is used for selecting at least one frame of image from each video segment as a candidate cover, thereby obtaining multiple frames of images serving as candidate covers.
7. The video cover generation device of claim 6, wherein the image candidate subunit comprises:
the sharpness calculation subunit is used for calculating the sharpness of each frame of image in each video segment;
and the first candidate subunit is used for selecting, from each video segment, at least one frame of image whose sharpness meets a preset condition as a candidate cover.
8. A computer device, comprising:
a processor and a memory;
wherein the processor is configured to execute a program stored in the memory;
the memory is configured to store a program at least for:
acquiring a plurality of frame images in a video;
for each frame of image, using an image memorability model obtained by pre-training, weighting and summing the scores corresponding to the image features of each dimension according to the different weights corresponding to image features of different dimensions and the image features of different dimensions reflected in the image, to determine a memorability score of the image, wherein the image memorability model is used for converting the image features representing an impression-depth scale in the image into a memorability score and outputting it, the memorability score is used for reflecting the degree of the user's interest in the image, the degree of the user's interest in the image is positively correlated with the wonderfulness of the image, and the image features at least comprise: the saliency of the image, the emotion expressed by the image, and the popularity of the image, wherein the saliency of the image represents the degree to which an image region attracts the user's attention, the popularity of the image reflects the number of times the image has been liked, recommended and browsed by users in a social network, and the saliency of the image is determined according to the presence or absence of a person or other target object in the image and the position features of the person or other target object in the image;
selecting, based on the memorability scores of the multi-frame images, at least one frame of target image for generating a video cover from the multi-frame images, comprising: selecting, from the multi-frame images, the reference image with the highest memorability score; and selecting, from the video segment to which the reference image belongs, continuous multi-frame images that are adjacent to the reference image on the left and right and contain the reference image, as the target images for generating a dynamic video cover;
generating a video cover of the video based on at least one frame of the target image;
the image memorability model is obtained by training in the following mode:
acquiring a plurality of sample images, wherein each sample image is marked with a memorability score;
inputting a plurality of sample images into a deep learning network to be trained to obtain the memorability score of each sample image predicted by the deep learning network;
comparing, through a cross-entropy loss function, the degree of difference between the memorability scores output by the deep learning network and the actually labeled memorability scores, based on the memorability scores respectively labeled on the plurality of sample images and the memorability scores output by the deep learning network;
and, when the degree of difference does not meet a preset deviation degree, adjusting the parameter values of the parameters in the deep learning network and returning to the operation of inputting the plurality of sample images into the deep learning network to be trained, until the degree of difference meets the preset deviation degree.
9. A storage medium having stored thereon computer-executable instructions which, when loaded and executed by a processor, implement the video cover generation method according to any one of claims 1 to 4.
CN201810504021.5A 2018-05-23 2018-05-23 Video cover generation method and device, computer equipment and storage medium Active CN108650524B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810504021.5A CN108650524B (en) 2018-05-23 2018-05-23 Video cover generation method and device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN108650524A CN108650524A (en) 2018-10-12
CN108650524B true CN108650524B (en) 2022-08-16

Family

ID=63757992

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810504021.5A Active CN108650524B (en) 2018-05-23 2018-05-23 Video cover generation method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN108650524B (en)

Families Citing this family (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111491202B (en) * 2019-01-29 2021-06-15 广州市百果园信息技术有限公司 Video publishing method, device, equipment and storage medium
CN110069664B (en) * 2019-04-24 2021-04-06 北京博视未来科技有限公司 Method and system for extracting cover picture of cartoon work
CN110191357A (en) * 2019-06-28 2019-08-30 北京奇艺世纪科技有限公司 The excellent degree assessment of video clip, dynamic seal face generate method and device
CN110390025A (en) * 2019-07-24 2019-10-29 百度在线网络技术(北京)有限公司 Cover figure determines method, apparatus, equipment and computer readable storage medium
CN110381339B (en) * 2019-08-07 2021-08-27 腾讯科技(深圳)有限公司 Picture transmission method and device
CN110633377A (en) * 2019-09-23 2019-12-31 三星电子(中国)研发中心 Picture cleaning method and device
CN110572711B (en) * 2019-09-27 2023-03-24 北京达佳互联信息技术有限公司 Video cover generation method and device, computer equipment and storage medium
CN110879851A (en) * 2019-10-15 2020-03-13 北京三快在线科技有限公司 Video dynamic cover generation method and device, electronic equipment and readable storage medium
CN110856037B (en) * 2019-11-22 2021-06-22 北京金山云网络技术有限公司 Video cover determination method and device, electronic equipment and readable storage medium
CN111062314B (en) * 2019-12-13 2021-11-02 腾讯科技(深圳)有限公司 Image selection method and device, computer readable storage medium and electronic equipment
CN111143613B (en) * 2019-12-30 2024-02-06 携程计算机技术(上海)有限公司 Method, system, electronic device and storage medium for selecting video cover
CN111182295B (en) * 2020-01-06 2023-08-25 腾讯科技(深圳)有限公司 Video data processing method, device, equipment and readable storage medium
CN111246255B (en) * 2020-01-21 2022-05-06 北京达佳互联信息技术有限公司 Video recommendation method and device, storage medium, terminal and server
CN111369434B (en) * 2020-02-13 2023-08-25 广州酷狗计算机科技有限公司 Method, device, equipment and storage medium for generating spliced video covers
CN111327819A (en) * 2020-02-14 2020-06-23 北京大米未来科技有限公司 Method, device, electronic equipment and medium for selecting image
CN112749298B (en) * 2020-04-08 2024-02-09 腾讯科技(深圳)有限公司 Video cover determining method and device, electronic equipment and computer storage medium
CN111984821A (en) * 2020-06-22 2020-11-24 汉海信息技术(上海)有限公司 Method and device for determining dynamic cover of video, storage medium and electronic equipment
CN111918130A (en) * 2020-08-11 2020-11-10 北京达佳互联信息技术有限公司 Video cover determining method and device, electronic equipment and storage medium
CN112383830A (en) * 2020-11-06 2021-02-19 北京小米移动软件有限公司 Video cover determining method and device and storage medium
CN112598453B (en) * 2020-12-29 2022-11-29 上海硬通网络科技有限公司 Advertisement putting method and device and electronic equipment
CN115086709A (en) * 2021-03-10 2022-09-20 上海哔哩哔哩科技有限公司 Dynamic cover setting method and system
CN113301395B (en) * 2021-04-30 2023-07-07 当趣网络科技(杭州)有限公司 Voice searching method combined with user grade in video playing state
CN113784152A (en) * 2021-07-20 2021-12-10 阿里巴巴达摩院(杭州)科技有限公司 Video processing method and storage medium
CN113656642A (en) * 2021-08-20 2021-11-16 北京百度网讯科技有限公司 Cover image generation method, device, equipment, storage medium and program product
CN113641853A (en) * 2021-08-23 2021-11-12 北京字跳网络技术有限公司 Dynamic cover generation method, device, electronic equipment, medium and program product
CN113727200A (en) * 2021-08-27 2021-11-30 游艺星际(北京)科技有限公司 Video abstract information determination method and device, electronic equipment and storage medium
CN116311533B (en) * 2023-05-11 2023-10-03 广东中科凯泽信息科技有限公司 Sports space highlight moment image acquisition method based on AI intelligence

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101807198A (en) * 2010-01-08 2010-08-18 中国科学院软件研究所 Video abstraction generating method based on sketch
US8867891B2 (en) * 2011-10-10 2014-10-21 Intellectual Ventures Fund 83 Llc Video concept classification using audio-visual grouplets
CN104244024A (en) * 2014-09-26 2014-12-24 北京金山安全软件有限公司 Video cover generation method and device and terminal
CN104850434A (en) * 2015-04-30 2015-08-19 腾讯科技(深圳)有限公司 Method and apparatus for downloading multimedia resources
CN106021485A (en) * 2016-05-19 2016-10-12 中国传媒大学 Multi-element attribute movie data visualization system
CN106503693A (en) * 2016-11-28 2017-03-15 北京字节跳动科技有限公司 The offer method and device of video front cover
CN107239203A (en) * 2016-03-29 2017-10-10 北京三星通信技术研究有限公司 A kind of image management method and device
CN107657468A (en) * 2016-07-25 2018-02-02 北京金山云网络技术有限公司 Material evaluating method and device
CN107958030A (en) * 2017-11-17 2018-04-24 北京奇虎科技有限公司 Video front cover recommended models optimization method and device

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101853286B (en) * 2010-05-20 2016-08-10 上海全土豆网络科技有限公司 Intelligent selection method of video thumbnails
US8744237B2 (en) * 2011-06-20 2014-06-03 Microsoft Corporation Providing video presentation commentary
US9454289B2 (en) * 2013-12-03 2016-09-27 Google Inc. Dyanmic thumbnail representation for a video playlist
WO2016109450A1 (en) * 2014-12-29 2016-07-07 Neon Labs Inc. Selecting a high-valence representative image
CN104657468B (en) * 2015-02-12 2018-07-31 中国科学院自动化研究所 The rapid classification method of video based on image and text
CN106792085B (en) * 2016-12-09 2019-11-05 广州华多网络科技有限公司 A kind of method and apparatus generating video cover image
CN107093164A (en) * 2017-04-26 2017-08-25 北京百度网讯科技有限公司 Method and apparatus for generating image
CN107707967A (en) * 2017-09-30 2018-02-16 咪咕视讯科技有限公司 The determination method, apparatus and computer-readable recording medium of a kind of video file front cover
CN107918656A (en) * 2017-11-17 2018-04-17 北京奇虎科技有限公司 Video front cover extracting method and device based on video title
CN107832725A (en) * 2017-11-17 2018-03-23 北京奇虎科技有限公司 Video front cover extracting method and device based on evaluation index

Also Published As

Publication number Publication date
CN108650524A (en) 2018-10-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant