CN115474084A - Method, device, equipment and storage medium for generating video cover image


Info

Publication number
CN115474084A
CN115474084A (application number CN202210954874.5A)
Authority
CN
China
Prior art keywords
image
semantic
video
guide
images
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210954874.5A
Other languages
Chinese (zh)
Other versions
CN115474084B (en)
Inventor
宁本德
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing QIYI Century Science and Technology Co Ltd filed Critical Beijing QIYI Century Science and Technology Co Ltd
Priority to CN202210954874.5A priority Critical patent/CN115474084B/en
Publication of CN115474084A publication Critical patent/CN115474084A/en
Application granted granted Critical
Publication of CN115474084B publication Critical patent/CN115474084B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/431Generation of visual interfaces for content selection or interaction; Content or additional data rendering
    • H04N21/4312Generation of visual interfaces for content selection or interaction; Content or additional data rendering involving specific graphical features, e.g. screen layout, special fonts or colors, blinking icons, highlights or animations
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention provides a method, a device, equipment and a storage medium for generating a video cover image. The method comprises the following steps: acquiring each video image of a target video and determining a first semantic image corresponding to each video image; acquiring a guide image and a second semantic image corresponding to the guide image; comparing each first semantic image with the second semantic image to obtain a first comparison result corresponding to each first semantic image; comparing each video image with the guide image to obtain a second comparison result corresponding to each video image; and screening out a cover image of the target video from the video images according to the first comparison results and the second comparison results. The first semantic images are compared with the second semantic image, the video images are compared with the guide image in color, and the cover image is screened out according to the comparison results, which reduces the time needed to screen the cover image.

Description

Method, device, equipment and storage medium for generating video cover image
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method and an apparatus for generating a video cover image, an electronic device, and a computer-readable storage medium.
Background
With the development of internet technology, internet applications have become increasingly widespread and powerful. Users can watch a wide variety of videos on the internet. Each video usually displays a corresponding cover image, and a good cover image can effectively attract user clicks, so extracting a cover image from a video is important for a video website.
In the traditional approach, frames are usually extracted directly from the video, and various kinds of information in the extracted frames, such as faces, audio, and text, are analyzed to obtain a cover image. However, this requires invoking face detection and recognition, audio detection and recognition, and text detection and recognition, which increases the complexity and time consumption of the cover-generation process, and the resulting cover image may still not meet the designer's standard. How to quickly select a cover image that meets the designer's standard from a video is therefore a problem to be solved in this technical field.
Disclosure of Invention
In view of the above problems, embodiments of the present invention are proposed to provide a method of generating a video-cover image, an apparatus for generating a video-cover image, an electronic device, and a computer-readable storage medium that overcome or at least partially solve the above problems.
In order to solve the above problem, an embodiment of the present invention discloses a method for generating a video cover image, where the method includes:
acquiring each video image of a target video, and determining a first semantic image corresponding to each video image; the first semantic image is used for representing the position distribution of a target object in the video image;
acquiring a guide image and a second semantic image corresponding to the guide image; the second semantic image is used for representing the position distribution of the reference object in the guide image;
respectively comparing each first semantic image with the second semantic image to obtain a first comparison result corresponding to each first semantic image;
respectively comparing each video image with the guide image to obtain a second comparison result corresponding to each video image;
screening a target image from the video images as a cover image of the target video according to the first comparison result corresponding to each first semantic image and the second comparison result corresponding to each video image.
Optionally, the acquiring each video image of the target video and determining a first semantic image corresponding to each video image includes:
acquiring the target video, and performing frame extraction processing on the target video to obtain each video image;
inputting each video image into a preset semantic generation network for processing to obtain a first semantic image corresponding to each video image; the pixels of the first semantic image are used for representing the position distribution of the target object in the first semantic image through different pixel colors; the pixel color is used for comparing the first semantic image with the second semantic image.
Optionally, the comparing each of the first semantic images with the second semantic image respectively to obtain a first comparison result corresponding to each of the first semantic images includes:
respectively calculating pixel difference values at corresponding positions of each first semantic image and the second semantic image;
and respectively calculating the semantic similarity of each first semantic image and the second semantic image according to the pixel difference value of the corresponding position of the first semantic image and the second semantic image.
Optionally, the respectively comparing each of the video images with the guide image to obtain a second comparison result corresponding to each of the video images includes:
respectively calculating pixel difference values of corresponding positions of the video images and the guide image;
and respectively calculating the color similarity of each video image and the guide image according to the pixel difference value of the corresponding position of the video image and the guide image.
Optionally, the screening, according to the first comparison result corresponding to each of the first semantic images and the second comparison result corresponding to each of the video images, a target image from the video images as a cover image of the target video includes:
calculating the total similarity of each video image according to the semantic similarity of the first semantic image and the second semantic image corresponding to each video image and the color similarity of the video image and the guide image;
and screening a target image from the video images as a cover image of the target video according to the total similarity of all the video images.
Optionally, the calculating semantic similarity between each of the first semantic images and the second semantic images according to the pixel difference at the corresponding position of the first semantic image and the second semantic image includes:
respectively calculating the similarity of each first semantic image and each second semantic image according to the following formula:
[Formula for the semantic similarity ED_s: published only as an image (BDA0003790889900000031) in the original document]
where ED_s denotes the semantic similarity between the first semantic image and the second semantic image, C_s denotes an image channel of the first semantic image, j_s denotes the j-th pixel in the first semantic image, M_s denotes the number of pixels of the first semantic image in the same image channel, Si denotes the first semantic image, and SG denotes the second semantic image.
Optionally, the calculating the color similarity between each of the video images and the guide image according to the pixel difference values of the corresponding positions of the video images and the guide image respectively includes:
respectively calculating the color similarity of each video image and the guide image according to the following formula:
[Formula for the color similarity ED_i: published only as images (BDA0003790889900000032, BDA0003790889900000033) in the original document]
where ED_i denotes the color similarity between the video image and the guide image, C_I denotes an image channel of the video image, Ii denotes the video image, IG denotes the guide image, the barred Ii term denotes the average value of the Ii pixels, j_i denotes the j-th pixel in the video image, M_i denotes the number of pixels of the video image in the same image channel, and θ denotes a preset constant.
The embodiment of the invention discloses a device for generating a video cover image, which comprises:
the determining module is used for acquiring each video image of a target video and determining a first semantic image corresponding to each video image; the first semantic image is used for representing the position distribution of a target object in the video image;
the acquisition module is used for acquiring a guide image and a second semantic image corresponding to the guide image; the second semantic image is used for representing the position distribution of the reference object in the guide image;
the first comparison module is used for respectively comparing each first semantic image with the second semantic image to obtain a first comparison result corresponding to each first semantic image;
the second comparison module is used for respectively comparing each video image with the guide image to obtain a second comparison result corresponding to each video image;
and the screening module is used for screening a target image from the video images as a cover image of the target video according to the first comparison result corresponding to each first semantic image and the second comparison result corresponding to each video image.
Optionally, the determining module includes:
the frame extracting sub-module is used for acquiring the target video and performing frame extracting processing on the target video to obtain each video image;
the acquisition submodule is used for inputting each video image into a preset semantic generation network for processing to obtain a first semantic image corresponding to each video image; the pixels of the first semantic image are used for representing the position distribution of the target object in the first semantic image through different pixel colors; the pixel color is used for comparing the first semantic image with the second semantic image.
Optionally, the first comparison module comprises:
the first calculation submodule is used for respectively calculating the pixel difference value of the corresponding position of each first semantic image and the corresponding position of each second semantic image;
and the second calculation submodule is used for respectively calculating the semantic similarity of each first semantic image and the second semantic image according to the pixel difference value of the corresponding position of the first semantic image and the second semantic image.
Optionally, the second comparison module comprises:
the third calculation submodule is used for respectively calculating the pixel difference value of the corresponding position of each video image and the guide image;
and the fourth calculating submodule is used for respectively calculating the color similarity of each video image and the guide image according to the pixel difference value of the corresponding position of the video image and the guide image.
Optionally, the screening module comprises:
a fifth calculation submodule, configured to calculate a total similarity of each video image according to a semantic similarity between the first semantic image and the second semantic image corresponding to each video image and a color similarity between the video image and the guide image;
and the screening submodule is used for screening a target image from the video images as a cover image of the target video according to the total similarity of all the video images.
Optionally, the second computation submodule includes:
a first calculating unit, configured to calculate, according to the following formula, semantic similarity between each of the first semantic images and the second semantic image:
[Formula for the semantic similarity ED_s: published only as an image (BDA0003790889900000051) in the original document]
where ED_s denotes the semantic similarity between the first semantic image and the second semantic image, C_s denotes an image channel of the first semantic image, j_s denotes the j-th pixel in the first semantic image, M_s denotes the number of pixels of the first semantic image in the same image channel, Si denotes the first semantic image, and SG denotes the second semantic image.
Optionally, the fourth computing submodule includes:
a second calculating unit, configured to calculate color similarity between each of the video images and the guide image according to the following formula:
[Formula for the color similarity ED_i: published only as images (BDA0003790889900000052, BDA0003790889900000061) in the original document]
where ED_i denotes the color similarity between the video image and the guide image, C_I denotes an image channel of the video image, Ii denotes the video image, IG denotes the guide image, the barred Ii term denotes the average value of the Ii pixels, j_i denotes the j-th pixel in the video image, M_i denotes the number of pixels of the video image in the same image channel, and θ denotes a preset constant.
The invention also discloses an electronic device which comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, wherein the processor realizes the steps of the video cover image generation method when executing the computer program.
The invention also discloses a computer readable storage medium, wherein a computer program is stored on the computer readable storage medium, and when the computer program is executed by a processor, the steps of the video cover image generation method are realized.
The embodiment of the invention has the following advantages:
In the embodiment of the invention, a first comparison result is obtained by comparing the first semantic image corresponding to each video image with the second semantic image corresponding to the guide image, and a second comparison result is obtained by comparing the video image with the guide image. Comparing the first semantic image with the second semantic image and comparing the video image with the guide image involve only pixel-level comparisons, so complex algorithms such as face detection and recognition, audio detection and recognition, and text detection and recognition do not need to be invoked. According to the first comparison results and the second comparison results, a target image can be screened from the video images and used as the cover image of the target video.
Drawings
FIG. 1 is a flowchart illustrating steps of a method for generating a video cover image according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a video image and a first semantic image provided by an embodiment of the present invention;
FIG. 3 is a schematic diagram of a guide image and a second semantic image provided by an embodiment of the invention;
FIG. 4 is a flowchart illustrating steps provided by an embodiment of the present invention to obtain a first comparison result;
FIG. 5 is a flowchart illustrating steps provided by an embodiment of the present invention to obtain a second comparison result;
fig. 6 is a block diagram of a device for generating a video cover image according to an embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention more comprehensible, the present invention is described in detail with reference to the accompanying drawings and the detailed description thereof.
In the prior art, face detection and recognition, audio detection and recognition, and text detection and recognition need to be invoked to obtain a cover image for a video. This approach increases the complexity and time consumption of the cover-generation process, and the cover image selected from the video may not meet the designer's standard. To solve this technical problem, the invention provides a method for generating a video cover image. Its core idea is to compare the first semantic image corresponding to each video image with the second semantic image corresponding to a guide image, compare the video images with the guide image in color, and determine a cover image from the video images according to the comparison results.
Referring to fig. 1, a flowchart illustrating steps of a method for generating a video cover image according to an embodiment of the present invention is shown, where the method specifically includes the following steps:
step 101, acquiring each video image of a target video, and determining a first semantic image corresponding to each video image; the first semantic image is used for representing the position distribution of the target object in the video image.
In the embodiment of the invention, the target object may be a person, an animal, an object, and so on. After the target video is obtained, frame extraction is performed on it to obtain the video images. Frames may be extracted at a preset time period or at a preset frame-number period, and the video images comprise one or more frames obtained from the target video in this way.
For example, if the video duration of the target video is 60 minutes and the preset time period is 5 seconds, extracting one frame every 5 seconds yields 720 video images. The shorter the preset time period, the more video images are obtained; the more images are available for comparison with the guide image, the better the finally generated cover image of the target video can meet the designer's standard. For another example, if the video frame rate of the target video is 30 frames per second, the video duration is 60 minutes, and the preset frame-number period is 60 frames, extracting one frame every 60 frames yields 1800 video images. The smaller the preset frame-number period, the more video images are obtained, with the same effect on the quality of the generated cover image.
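A minimal sketch of such periodic frame extraction, using OpenCV in Python, is given below; the function name extract_frames and the default 5-second interval are illustrative assumptions rather than part of the patent text.

```python
# Illustrative sketch only: periodic frame extraction with OpenCV.
# extract_frames and its parameters are assumed names, not from the patent.
import cv2

def extract_frames(video_path, interval_seconds=5):
    """Return one frame every `interval_seconds` seconds of the target video."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30               # fall back if FPS is unavailable
    step = max(int(round(fps * interval_seconds)), 1)   # frames between two samples
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(frame)                        # candidate cover frame (BGR)
        index += 1
    cap.release()
    return frames
```

The frame-number-period variant described above corresponds to setting step directly to the preset frame-number period instead of deriving it from the frame rate.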
The video images are input into a preset semantic generation network for processing to obtain the first semantic image corresponding to each video image. The preset semantic generation network may be pre-trained to recognize specific target objects, so that at least a first semantic image corresponding to a target object can be obtained from it. For example, the preset semantic generation network may be trained in advance to recognize objects such as people, animals, and vehicles. When the target object is a person, a first semantic image of that person is obtained after processing; when the target objects are a person and a dog, a first semantic image of both the person and the dog is obtained.
The pixels of the first semantic image can represent the position distribution of the target object in the first semantic image through different pixel colors, and can also distinguish the target object from non-target objects, thereby representing the position distribution of the target object in the video image. For example, a pixel with value 0 in the first semantic image represents black and a pixel with value 1 represents white; if the target object is represented by black and non-target objects by white, the pixels corresponding to the target object are 0 and the pixels corresponding to non-target objects are 1. For another example, the first semantic image may have three colors, black, white, and red, and contain two target objects: the pixels of one target object are black, the pixels of the other are red, and the background pixels are white. The pixel colors are used to compare the first semantic image with the second semantic image. For example, if both semantic images use only black and white, the black region of the first semantic image is the region of the target object and the black region of the second semantic image is the region of the reference object; the more similar the black regions, the more similar the first semantic image and the second semantic image.
In the embodiment of the invention, after the video images of the target video are obtained, they are input into a preset semantic generation network for processing to obtain the first semantic image corresponding to each video image. The semantic images comprise the first semantic images corresponding to the video images and the second semantic image corresponding to the guide image. The preset semantic generation network is obtained in advance by training a deep-learning neural network on multiple groups of sample data; the deep-learning neural network may be, for example, Mask RCNN, background matching, and the like, but the invention is not limited thereto.
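The patent does not publish the semantic generation network itself. As one plausible realization, the sketch below uses the off-the-shelf Mask R-CNN from torchvision to build a black-and-white semantic image in which pixels belonging to detected persons are 1 and all other pixels are 0; the weights="DEFAULT" argument, the COCO person label 1, and the 0.5 thresholds are assumptions for illustration only.

```python
# Illustrative sketch: deriving a binary semantic image from Mask R-CNN detections.
# The network choice, label id, and thresholds are assumptions, not from the patent.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

def semantic_image(frame_rgb, target_label=1, score_thresh=0.5):
    """Return an HxW tensor that is 1 where the target object (COCO label 1 = person) is, else 0."""
    with torch.no_grad():
        pred = model([to_tensor(frame_rgb)])[0]
    h, w = frame_rgb.shape[:2]
    mask = torch.zeros((h, w))
    for label, score, m in zip(pred["labels"], pred["scores"], pred["masks"]):
        if label.item() == target_label and score.item() >= score_thresh:
            mask = torch.maximum(mask, (m[0] > 0.5).float())  # merge instance masks
    return mask
```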
Referring to fig. 2, which shows a schematic diagram of video images and first semantic images according to an embodiment of the present invention, the target video is the first episode of drama A, and 21A, 22A, 23A, and 24A in fig. 2 are video images of that episode. After the video images 21A, 22A, 23A, and 24A are each input into the preset semantic generation network and processed, the first semantic images 21B, 22B, 23B, and 24B corresponding to 21A, 22A, 23A, and 24A, respectively, are obtained.
102, acquiring a guide image and a second semantic image corresponding to the guide image; the second semantic image is used for representing the position distribution of the reference object in the guide image;
the method includes the steps that a guide image and a second semantic image corresponding to the guide image are obtained, the guide image is an image screened from a cover image of a preset movie play in advance, the preset movie play can be screened according to the heat or the score, if the heat or the score of the movie play is greater than a preset heat threshold or the score is greater than a preset score, the movie play is the preset movie play, the cover image of the movie play is the guide image, and the cover image of one movie play generally does not have only one cover image, so that the number of the guide image can be the cover images of a plurality of classical movie plays, and the cover images can be screened from different classical movie plays to serve as the guide image. For example, the series B is screened out according to the heat of the series, that is, the series B is a preset series, the cover image of the series B is acquired, and the cover image of the series B is used as a guide image. The resolution of each image may be different, so that the resolution of the guide image and the resolution of the video image are detected after the guide image is acquired, and if the resolution of the guide image is different from the resolution of the video image, the resolution of the guide image is processed to make the resolution of the guide image the same as that of the video image, so that each pixel between each image can be kept corresponding when the images are compared subsequently. And then inputting the guide image into a preset semantic generation network for processing to obtain a second semantic image corresponding to the guide image. The second semantic image may distinguish the reference object from the non-reference object by the color of the pixel, and may represent a position distribution of the reference object in the guide image. The reference object in the present application may be a human, an animal, an object, or the like.
Referring to fig. 3, which illustrates an illustration of a guide image and a second semantic image according to an embodiment of the present invention, a cover image of a tv series B is used as the guide image, and the guide image 3A is input into a preset semantic generation network and processed to obtain the second semantic image 3B corresponding to the guide image 3A.
103, comparing each first semantic image with each second semantic image respectively to obtain a first comparison result corresponding to each first semantic image;
In the present application, for the first semantic images of the plurality of video images, each first semantic image can be compared with the second semantic image to obtain the first comparison result corresponding to that first semantic image.
For example, frame extraction on a target video yields 1000 video images, which are input into the preset semantic generation network to obtain 1000 corresponding first semantic images; there is 1 guide image, which yields 1 second semantic image after processing. Each of the 1000 first semantic images is compared with the second semantic image: every pixel of a first semantic image is compared with the pixel at the corresponding position in the second semantic image, and the first comparison result for that first semantic image is obtained from the per-pixel comparison results. After the 1000 first semantic images are compared with the second semantic image, 1000 first comparison results are obtained.
The first semantic image may be compared with the second semantic image as follows: the pixel at each position of the first semantic image is compared with the pixel at the corresponding position of the second semantic image to obtain a per-position comparison result, and the first comparison result of the first semantic image is obtained from these per-position results. By comparing the first semantic images with the second semantic image, video images whose composition is highly similar to that of the guide image can be screened out.
Referring to fig. 4, a flowchart illustrating steps of obtaining a first comparison result according to an embodiment of the present invention is shown;
step S11, respectively calculating pixel difference values of corresponding positions of the first semantic image and the second semantic image;
the first semantic image and the second semantic image may be represented in a variety of color spaces, which may include RGB, HSV, HLV, YCBCR, LUV, and the like. For different color spaces, the pixel difference value of the corresponding position of the first semantic image and the second semantic image can be calculated according to the color attribute of the color space. For example, taking an RGB color space as an example, the corresponding attributes, i.e., color attributes, are an R channel, a G channel, and a B channel, and pixel difference values of corresponding positions of the first semantic image and the second semantic image in the R channel, the G channel, and the B channel can be calculated respectively. A person skilled in the art may calculate a pixel difference value of a corresponding position of the first semantic image and the second semantic image based on a color space of the image, which is not specifically limited herein.
And S12, respectively calculating the semantic similarity of each first semantic image and the second semantic image according to the pixel difference value of the corresponding position of the first semantic image and the second semantic image.
In one embodiment, semantic similarity between pixels at corresponding positions of each first semantic image and each second semantic image can be respectively calculated according to pixel difference values at corresponding positions of the first semantic image and the second semantic image; and respectively calculating the semantic similarity of the first semantic image and the second semantic image according to the semantic similarity between the pixels at the corresponding positions of the first semantic image and the second semantic image. The higher the semantic similarity corresponding to the first semantic image, the more similar the position distribution of the target object in the video image and the reference object in the guide image.
In one embodiment, the semantic similarity between the first semantic image and the second semantic image can be calculated according to the following formula:
[Formula for the semantic similarity ED_s: published only as an image (BDA0003790889900000111) in the original document]
where ED_s denotes the semantic similarity between the first semantic image and the second semantic image, C_s denotes an image channel of the first semantic image, j_s denotes the j-th pixel in the first semantic image, M_s denotes the number of pixels of the first semantic image in the same image channel, Si denotes the first semantic image, and SG denotes the second semantic image. If the first semantic image has only one image channel, C_s = 1 in the above formula. A larger ED_s value indicates a higher semantic similarity between the first semantic image and the second semantic image.
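The exact expression for ED_s is available only as an image in the original publication. The sketch below therefore shows just one plausible pixel-difference-based measure consistent with the description (computed over corresponding positions and channels, larger value = more similar); it should not be read as the patented formula.

```python
# Illustrative sketch: a pixel-difference-based semantic similarity.
# NOT the patented formula (which is published only as an image);
# one plausible measure with the same monotonic behavior.
import numpy as np

def semantic_similarity(first_sem, second_sem):
    """Larger return value = more similar position distributions."""
    diff = np.abs(first_sem.astype(np.float32) - second_sem.astype(np.float32))
    return -float(diff.mean())   # 0 for identical semantic images, more negative otherwise
```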
Step 104, comparing each video image with the guide image respectively to obtain a second comparison result corresponding to each video image;
In the present application, each of the plurality of video images can be compared with the guide image, so that the second comparison result corresponding to each video image is obtained.
The way of comparing the video image with the guide image may be: respectively comparing the pixels at each position of the video image with the pixels at the corresponding position in the guide image to obtain the comparison result of the pixels at each position; and obtaining a second comparison result of the video image according to the comparison result of the pixels at the positions. The video image with high similarity to the guide image in color matching can be screened out by comparing the video image with the guide image.
Referring to fig. 5, a flowchart illustrating steps provided by an embodiment of the present invention to obtain a second comparison result is shown;
step S21, respectively calculating pixel difference values of corresponding positions of each video image and the guide image;
the video image and the guide image may be represented in a variety of color spaces, which may include RGB, HSV, HLV, YCBCR, LUV, and the like. For different color spaces, pixel difference values of corresponding positions of the video image and the guide image can be calculated according to color attributes of the color spaces. For example, taking an RGB color space as an example, the corresponding attributes, i.e., color attributes, are an R channel, a G channel, and a B channel, and pixel difference values of corresponding positions of the video image and the guide image in the R channel, the G channel, and the B channel can be calculated respectively. Taking the LUV color space as an example, the parameter value corresponding to the LUV color space may be converted into the parameter value corresponding to the RGB color space, and then the pixel difference values of the corresponding positions of the video image and the guide image in the R channel, the G channel, and the B channel are calculated respectively. A person skilled in the art may calculate a pixel difference value of a corresponding position of the first semantic image and the second semantic image based on a color space of the image, which is not specifically limited herein.
And S22, respectively calculating the color similarity of each video image and the guide image according to the pixel difference value of the corresponding position in the video image and the guide image.
In one embodiment, the color similarity between the pixels at the corresponding positions of the video image and the guide image can be respectively calculated according to the pixel difference values at the corresponding positions of the video image and the guide image; and respectively calculating the color similarity of each video image and the guide image according to the color similarity between the pixels at the corresponding positions in the video image and the guide image. The higher the color similarity, the more similar the color matching between the video image and the guide image.
In the embodiment of the invention, the color similarity between each video image and the guide image is respectively calculated according to the following formula:
[Formula for the color similarity ED_i: published only as images (BDA0003790889900000121, BDA0003790889900000122) in the original document]
where ED_i denotes the color similarity between the video image and the guide image, C_I denotes an image channel of the video image, Ii denotes the video image, IG denotes the guide image, the barred Ii term denotes the average value of the pixels of the video image Ii, j_i denotes the j-th pixel in the video image, M_i denotes the number of pixels of the video image in the same image channel, and θ denotes a preset constant. θ is added to prevent the denominator in the formula from being 0; according to tests in practical applications, its value can be set to 10^-5, and the specific value of the preset constant θ may be set according to the actual application without particular limitation here. If the video image has three image channels, for example when the color space of the video image and the guide image is RGB with the R, G, and B channels as color attributes, then C_I = 3 in the above formula. A larger ED_i value indicates a higher color similarity between the video image and the guide image.
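Because the published ED_i expression is likewise available only as an image, the sketch below uses a reciprocal mean-squared-difference as one plausible stand-in, keeping the preset constant θ = 10^-5 from the description to avoid a zero denominator; it is not the patented formula.

```python
# Illustrative sketch: a color similarity with the preset constant theta = 1e-5.
# NOT the patented formula; a plausible stand-in with the same qualitative behavior.
import numpy as np

THETA = 1e-5  # prevents the denominator from being 0, as described above

def color_similarity(video_img, guide_img):
    """Larger return value = more similar color composition across all image channels."""
    v = video_img.astype(np.float32)
    g = guide_img.astype(np.float32)
    mse = float(np.mean((v - g) ** 2))   # mean over pixels and image channels
    return 1.0 / (mse + THETA)
```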
And 105, screening a target image from the video images as a cover image of the target video according to the first comparison result corresponding to each first semantic image and the second comparison result corresponding to each video image.
In an embodiment, the first comparison result is a semantic similarity between the first semantic image and the second semantic image, and the second comparison result is a color similarity between the video image and the guide image, so that the total similarity of the video images can be calculated according to the semantic similarity corresponding to the first semantic image and the color similarity corresponding to the video image.
In the embodiment of the invention, the semantic similarity between the first semantic image corresponding to a video image and the second semantic image corresponding to the guide image, and the color similarity between the video image and the guide image, contribute with different weights to the total similarity of the video image. Therefore, the total similarity of each video image with the guide image can be calculated from the semantic similarity and its corresponding first preset weight coefficient, together with the color similarity and its corresponding second preset weight coefficient.
In one embodiment, the total similarity of the video images can be calculated separately as follows:
[Formula for the total similarity ED: published only as an image (BDA0003790889900000131) in the original document]
where ED denotes the total similarity of the video image, α denotes the first preset weight coefficient, M_s denotes the number of pixels of the first semantic image in the same image channel, ED_s denotes the semantic similarity between the first semantic image and the second semantic image, β denotes the second preset weight coefficient, and ED_i denotes the color similarity between the video image and the guide image. According to tests in practical applications, the first preset weight coefficient can be set to 0.9 and the second preset weight coefficient to 0.1; the specific values of the two coefficients may be set according to the actual application without particular limitation here.
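A minimal sketch of the weighted combination and the selection of the highest-scoring frame is given below, using the example weights 0.9 and 0.1 mentioned above. The published formula also involves M_s, which is omitted here; the plain weighted sum, and bringing the two similarities to comparable scales beforehand, are assumptions.

```python
# Illustrative sketch: weighted total similarity and cover selection.
# alpha/beta use the example values 0.9 and 0.1; the exact published formula differs,
# and in practice the two similarities would first be normalized to comparable scales.
def total_similarity(sem_sim, col_sim, alpha=0.9, beta=0.1):
    return alpha * sem_sim + beta * col_sim

def pick_cover(frames, sem_sims, col_sims):
    """Return the frame with the highest total similarity to the guide image."""
    scores = [total_similarity(s, c) for s, c in zip(sem_sims, col_sims)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return frames[best]
```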
And screening a target image from the video images as a cover image of the target video according to the total similarity of all the video images.
The total similarity between each video image and the guide image is calculated, the maximum value among these total similarities is determined, the video image corresponding to that maximum value is taken as the target image, and the target image is used as the cover image of the target video.
For example, in the example where the target video is the first episode of the drama a and the cover image of the drama B is used as the guide image, the total similarity between the video image 21A and the guide image 3A is 0.5, the total similarity between the video image 22A and the guide image 3A is 0.55, the total similarity between the video image 23A and the guide image 3A is 0.7, and the total similarity between the video image 24A and the guide image 3A is 0.85, the video image 24A is used as the cover image of the first episode of the target video drama a.
The preset semantic generation network in the embodiment of the invention can also attach labels to the target objects and the reference object according to their recognized categories. After the total similarity between each video image and the guide image is calculated, a preset number of video images can be selected in descending order of total similarity, and from these, a video image whose label matches the label of the guide image can be used as the cover image of the target video (see the sketch after this paragraph). For example, if the preset number is 5, the 5 video images with the highest total similarity to the guide image are selected. Suppose the total similarities between the guide image and video images A, B, C, D, and E are 0.95, 0.94, 0.92, 0.90, and 0.85 respectively, the labels of video images A, B, D, and E are all "animal", the label of video image C is "person", and the label of the guide image is "person"; then video image C, labeled "person", is determined as the cover image of the target video.
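A sketch of the label-based variant just described: keep the top-N frames by total similarity and prefer the one whose label matches the guide image's label. The candidate tuple structure and the fallback to the best-scoring frame are assumptions.

```python
# Illustrative sketch: top-N selection followed by a label match with the guide image.
# candidates: list of (frame, total_similarity, label) tuples -- an assumed structure.
def pick_cover_by_label(candidates, guide_label, top_n=5):
    top = sorted(candidates, key=lambda c: c[1], reverse=True)[:top_n]
    for frame, score, label in top:
        if label == guide_label:
            return frame       # first (highest-scoring) frame whose label matches
    return top[0][0]           # assumption: fall back to the best-scoring frame
```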
In the embodiment of the invention, it is only necessary to compare the first semantic image corresponding to each video image with the second semantic image corresponding to the guide image, compare the video images with the guide image, and screen out the cover image according to the comparison results. This reduces the complexity and time consumption of the cover-generation process, and a cover image meeting the designer's standard can be selected from the video quickly and efficiently.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.
Referring to fig. 6, a block diagram of a structure of a device for generating a video cover image according to an embodiment of the present invention is shown, and specifically includes the following modules:
a determining module 601, configured to obtain each video image of a target video, and determine a first semantic image corresponding to each video image; the first semantic image is used for representing the position distribution of a target object in the video image;
an obtaining module 602, configured to obtain a guide image and a second semantic image corresponding to the guide image; the second semantic image is used for representing the position distribution of the reference object in the guide image;
a first comparison module 603, configured to compare each of the first semantic images with the second semantic image, respectively, to obtain a first comparison result corresponding to each of the first semantic images;
a second comparison module 604, configured to compare each of the video images with the guide image, respectively, to obtain a second comparison result corresponding to each of the video images;
a screening module 605, configured to screen a target image from the video images as a cover image of the target video according to the first comparison result corresponding to each of the first semantic images and the second comparison result corresponding to each of the video images.
Optionally, the determining module 601 includes:
the frame extracting submodule is used for acquiring the target video and extracting the frame of the target video to obtain each video image;
the acquisition submodule is used for inputting each video image into a preset semantic generation network for processing to obtain a first semantic image corresponding to each video image; the pixels of the first semantic image are used for representing the position of the target object in the first semantic image through different pixel colors; the pixel color is used for comparing the first semantic image with the second semantic image.
Optionally, the first comparing module 603 includes:
the first calculation submodule is used for respectively calculating pixel difference values of corresponding positions in each first semantic image and the second semantic image;
and the second calculation submodule is used for respectively calculating the semantic similarity of each first semantic image and the second semantic image according to the pixel difference value of the corresponding position in the first semantic image and the second semantic image.
Optionally, the second comparing module 604 includes:
the third calculation submodule is used for respectively calculating the pixel difference value of each video image and the corresponding position in the guide image;
and the fourth calculating submodule is used for respectively calculating the color similarity of each video image and the guide image according to the pixel difference value of the corresponding position of the video image and the guide image.
Optionally, the screening module 605 includes:
a fifth calculation submodule, configured to calculate a total similarity of each video image according to a semantic similarity between the first semantic image and the second semantic image corresponding to each video image and a color similarity between the video image and the guide image;
and the screening submodule is used for screening a target image from the video images as a cover image of the target video according to the total similarity of all the video images.
Optionally, the second computation submodule includes:
a first calculating unit, configured to calculate semantic similarity between each of the first semantic images and the second semantic image according to the following formula:
[Formula for the semantic similarity ED_s: published only as an image (BDA0003790889900000161) in the original document]
where ED_s denotes the semantic similarity between the first semantic image and the second semantic image, C_s denotes an image channel of the first semantic image, j_s denotes the j-th pixel in the first semantic image, M_s denotes the number of pixels of the first semantic image in the same image channel, Si denotes the first semantic image, and SG denotes the second semantic image.
Optionally, the fourth computing submodule includes:
a second calculating unit, configured to calculate color similarity between each of the video images and the guide image according to the following formula:
[Formula for the color similarity ED_i: published only as images (BDA0003790889900000171, BDA0003790889900000172) in the original document]
where ED_i denotes the color similarity between the video image and the guide image, C_I denotes an image channel of the video image, Ii denotes the video image, IG denotes the guide image, the barred Ii term denotes the average value of the pixels of the video image Ii, j_i denotes the j-th pixel in the video image, M_i denotes the number of pixels of the video image in the same image channel, and θ denotes a preset constant.
The invention discloses a generating device of a video cover image, which comprises: the determining module is used for acquiring each video image of a target video and determining a first semantic image corresponding to each video image; the acquisition module is used for acquiring a guide image and a second semantic image corresponding to the guide image; the second semantic image is used for representing the position distribution of the reference object in the guide image; the first comparison module is used for respectively comparing each first semantic image with the second semantic image to obtain a first comparison result corresponding to each first semantic image; the second comparison module is used for respectively comparing each video image with the guide image to obtain a second comparison result corresponding to each video image; and the screening module is used for screening a target image from the video images as a cover image of the target video according to the first comparison result corresponding to each first semantic image and the second comparison result corresponding to each video image. Compared with the prior art, the method only needs to compare the first semantic image corresponding to the video image with the second semantic image corresponding to the guide image, perform color comparison on the video image and the guide image, and screen out the cover image according to the comparison result, so that the complexity and time consumption of the project for generating the cover image are reduced, and the cover image meeting the standard of a designer can be quickly and efficiently selected from the video.
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
An embodiment of the present invention further provides an electronic device, including:
the video cover image generation method comprises a processor, a memory and a computer program which is stored on the memory and can run on the processor, wherein when the computer program is executed by the processor, each process of the video cover image generation method embodiment is realized, the same technical effect can be achieved, and in order to avoid repetition, the description is omitted.
The embodiment of the invention also provides a computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and when being executed by a processor, the computer program realizes each process of the embodiment of the video cover image generation method, can achieve the same technical effect, and is not repeated here to avoid repetition.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will be appreciated by one of skill in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiment and all changes and modifications that fall within the true scope of the embodiments of the present invention.
Finally, it should also be noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between those entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or terminal device that comprises the element.
The method, the apparatus, the electronic device and the computer-readable storage medium for generating a video cover image according to the present invention are described in detail, and a specific example is applied to illustrate the principle and the implementation of the present invention, and the description of the embodiment is only used to help understanding the method and the core idea of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (10)

1. A method for generating a video cover image, the method comprising:
acquiring each video image of a target video, and determining a first semantic image corresponding to each video image; the first semantic image is used for representing the position distribution of a target object in the video image;
acquiring a guide image and a second semantic image corresponding to the guide image; the second semantic image is used for representing the position distribution of the reference object in the guide image;
respectively comparing each first semantic image with the second semantic image to obtain a first comparison result corresponding to each first semantic image;
respectively comparing each video image with the guide image to obtain a second comparison result corresponding to each video image;
and screening a target image from the video images as a cover image of the target video according to the first comparison result corresponding to each first semantic image and the second comparison result corresponding to each video image.
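Read procedurally, claim 1 describes a five-step selection flow over the extracted frames. The following is a minimal Python sketch of that flow; the function name generate_cover and the injected callables semantic_of, compare_semantic, compare_color and screen are hypothetical placeholders for the operations detailed in the dependent claims, not names taken from the patent.

```python
def generate_cover(frames, guide_image, guide_semantic,
                   semantic_of, compare_semantic, compare_color, screen):
    """Sketch of the claim-1 flow: determine semantics, compare twice, screen."""
    # Determine a first semantic image for every video image of the target video.
    first_semantics = [semantic_of(frame) for frame in frames]
    # The guide image and its second semantic image are given inputs here.
    # First comparison results: each first semantic image vs. the guide's semantic image.
    first_results = [compare_semantic(s, guide_semantic) for s in first_semantics]
    # Second comparison results: each raw video image vs. the guide image.
    second_results = [compare_color(f, guide_image) for f in frames]
    # Screen a target image from the frames using both result lists.
    return screen(frames, first_results, second_results)
```

Possible stand-ins for these callables are sketched after claims 2, 5, 6 and 7 below.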
2. The method of claim 1, wherein the obtaining each video image of the target video and determining the first semantic image corresponding to each video image comprises:
acquiring the target video, and performing frame extraction processing on the target video to obtain each video image;
inputting each video image into a preset semantic generation network for processing to obtain a first semantic image corresponding to each video image; the pixels of the first semantic image are used for representing the position distribution of the target object in the first semantic image through different pixel colors; the pixel color is used for comparing the first semantic image with the second semantic image.
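Claim 2 leaves the frame-extraction interval and the "preset semantic generation network" unspecified. As a rough sketch under those gaps, the snippet below samples frames with OpenCV and substitutes an off-the-shelf torchvision segmentation model (DeepLabV3-ResNet50, requires torchvision >= 0.13) whose per-pixel class predictions are colour-coded so that different pixel colours mark where target objects lie; the sampling stride every_n, the random palette, and the omission of input normalisation are all illustrative simplifications.

```python
import cv2
import numpy as np
import torch
from torchvision.models.segmentation import (DeepLabV3_ResNet50_Weights,
                                              deeplabv3_resnet50)

def extract_frames(video_path, every_n=30):
    """Frame extraction: keep one frame out of every `every_n` decoded frames."""
    frames, capture, index = [], cv2.VideoCapture(video_path), 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % every_n == 0:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        index += 1
    capture.release()
    return frames

def first_semantic_image(frame, model, palette):
    """Colour-code each pixel's predicted class to mark the target's position."""
    tensor = torch.from_numpy(frame).permute(2, 0, 1).float().unsqueeze(0) / 255.0
    with torch.no_grad():
        logits = model(tensor)["out"][0]        # (num_classes, H, W)
    labels = logits.argmax(dim=0).numpy()        # per-pixel class index, (H, W)
    return palette[labels]                       # (H, W, 3) colour-coded semantic map

model = deeplabv3_resnet50(weights=DeepLabV3_ResNet50_Weights.DEFAULT).eval()
palette = np.random.default_rng(0).integers(0, 256, size=(21, 3), dtype=np.uint8)  # 21 classes for the default weights
```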
3. The method according to claim 1, wherein the comparing each of the first semantic images with the second semantic image to obtain a first comparison result corresponding to each of the first semantic images comprises:
respectively calculating pixel difference values of corresponding positions of the first semantic image and the second semantic image;
and respectively calculating the semantic similarity of each first semantic image and the second semantic image according to the pixel difference value of the corresponding position of the first semantic image and the second semantic image.
4. The method according to claim 1, wherein the comparing each of the video images with the guide image to obtain a second comparison result corresponding to each of the video images comprises:
respectively calculating pixel difference values of corresponding positions of the video images and the guide image;
and respectively calculating the color similarity of each video image and the guide image according to the pixel difference value of the corresponding position of the video image and the guide image.
5. The method according to claim 4, wherein the screening a target image from the video images as a cover image of the target video according to the first comparison result corresponding to each of the first semantic images and the second comparison result corresponding to each of the video images comprises:
calculating the total similarity of the video images according to the semantic similarity of the first semantic image and the second semantic image corresponding to each video image and the color similarity of the video images and the guide image respectively;
and screening a target image from the video images as a cover image of the target video according to the total similarity of all the video images.
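A minimal sketch of the screening in claim 5, assuming the two per-frame comparison results are pixel-difference values where a smaller value means the frame is closer to the guide, and that the total is a weighted sum; the weight alpha and the use of argmin are illustrative choices that the claim does not fix.

```python
import numpy as np

def screen(frames, semantic_results, color_results, alpha=0.5):
    """Combine both comparison results per frame and keep the best frame."""
    totals = [alpha * s + (1.0 - alpha) * c
              for s, c in zip(semantic_results, color_results)]
    best = int(np.argmin(totals))   # smallest combined difference = chosen cover
    return frames[best], totals[best]
```

If the comparison results were instead defined so that larger values mean more similar, argmax would replace argmin; nothing else changes.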
6. The method according to claim 3, wherein the calculating the semantic similarity between each of the first semantic images and the second semantic images according to the pixel difference value of the corresponding position of the first semantic image and the second semantic image respectively comprises:
respectively calculating the semantic similarity of each first semantic image and each second semantic image according to the following formula:
[formula image FDA0003790889890000021, not reproduced]
wherein ED_s represents the semantic similarity of the first semantic image and the second semantic image, C_s represents an image channel of the first semantic image, j_s represents the jth pixel in the first semantic image, M_s represents the number of pixels of the first semantic image in the same image channel, Si represents the first semantic image, and SG represents the second semantic image.
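Since the claim-6 formula is only available as an image above, the sketch below assumes one reading consistent with the listed variables: ED_s is the sum over image channels C_s of the average absolute pixel difference between Si and SG at corresponding positions, i.e. ED_s = (1/M_s) * sum over C_s and j_s of |Si - SG|. Both that reading and the function name semantic_distance are assumptions, not the patented definition.

```python
import numpy as np

def semantic_distance(si, sg):
    """Assumed ED_s: per-channel mean absolute difference between the first
    semantic image Si and the guide's semantic image SG (same H x W x C shape),
    summed over channels."""
    si = si.astype(np.float64)
    sg = sg.astype(np.float64)
    m_s = si.shape[0] * si.shape[1]                      # pixels per channel, M_s
    per_channel = np.abs(si - sg).reshape(m_s, -1).sum(axis=0) / m_s
    return float(per_channel.sum())
```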
7. The method according to claim 4, wherein the calculating the color similarity between each of the video images and the guide image according to the pixel difference at the corresponding position of the video image and the guide image comprises:
respectively calculating the color similarity of each video image and the guide image according to the following formula:
[formula image FDA0003790889890000031, not reproduced]
wherein ED_i represents the color similarity of the video image and the guide image, C_I represents an image channel of the video image, Ii represents the video image, IG represents the guide image, Īi (image FDA0003790889890000032) represents the average value of the pixels of Ii, j_i represents the jth pixel in the video image, M_i represents the number of pixels of the video image in the same image channel, and theta represents a preset constant.
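As with claim 6, the claim-7 formula is only an image, so the sketch below assumes one plausible reading of the listed variables: the absolute pixel difference between Ii and IG at each position is divided by the average pixel value of Ii plus the preset constant theta (which keeps the denominator non-zero), then averaged over the M_i positions and summed over the channels C_I. The function name color_distance and the default value of theta are likewise assumptions.

```python
import numpy as np

def color_distance(ii, ig, theta=1e-6):
    """Assumed ED_i: mean absolute difference between the video image Ii and the
    guide image IG, normalised by Ii's average pixel value plus theta, summed
    over channels."""
    ii = ii.astype(np.float64)
    ig = ig.astype(np.float64)
    m_i = ii.shape[0] * ii.shape[1]                      # pixels per channel, M_i
    diff = np.abs(ii - ig) / (ii.mean() + theta)         # normalised differences
    return float(diff.reshape(m_i, -1).sum(axis=0).sum() / m_i)
```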
8. An apparatus for generating a video cover image, the apparatus comprising:
the determining module is used for acquiring each video image of a target video and determining a first semantic image corresponding to each video image; the first semantic image is used for representing the position distribution of a target object in the video image;
the acquisition module is used for acquiring a guide image and a second semantic image corresponding to the guide image; the second semantic image is used for representing the position distribution of the reference object in the guide image;
the first comparison module is used for respectively comparing each first semantic image with the second semantic image to obtain a first comparison result corresponding to each first semantic image;
the second comparison module is used for respectively comparing each video image with the guide image to obtain a second comparison result corresponding to each video image;
and the screening module is used for screening a target image from the video images as a cover image of the target video according to the first comparison result corresponding to each first semantic image and the second comparison result corresponding to each video image.
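One way to read claim 8 in code is to group the listed modules into a single object whose comparison behaviour is injected; the dataclass below is such a sketch, with attribute and method names chosen for illustration rather than taken from the patent, and with the additive scoring and argmin selection carried over from the earlier assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple
import numpy as np

Image = np.ndarray

@dataclass
class CoverImageApparatus:
    determine_semantic: Callable[[Image], Image]         # determining module
    compare_semantic: Callable[[Image, Image], float]    # first comparison module
    compare_color: Callable[[Image, Image], float]       # second comparison module

    def acquire_guide(self, guide: Image, guide_semantic: Image) -> Tuple[Image, Image]:
        # Acquisition module: in a real system the guide pair would be loaded here.
        return guide, guide_semantic

    def screen(self, frames: Sequence[Image], guide: Image,
               guide_semantic: Image) -> Image:
        # Screening module: fold both comparison results into a total per frame.
        totals = [self.compare_semantic(self.determine_semantic(f), guide_semantic)
                  + self.compare_color(f, guide)
                  for f in frames]
        return frames[int(np.argmin(totals))]
```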
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202210954874.5A 2022-08-10 2022-08-10 Method, device, equipment and storage medium for generating video cover image Active CN115474084B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210954874.5A CN115474084B (en) 2022-08-10 2022-08-10 Method, device, equipment and storage medium for generating video cover image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210954874.5A CN115474084B (en) 2022-08-10 2022-08-10 Method, device, equipment and storage medium for generating video cover image

Publications (2)

Publication Number Publication Date
CN115474084A CN115474084A (en) 2022-12-13
CN115474084B CN115474084B (en) 2023-10-31

Family

ID=84367722

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210954874.5A Active CN115474084B (en) 2022-08-10 2022-08-10 Method, device, equipment and storage medium for generating video cover image

Country Status (1)

Country Link
CN (1) CN115474084B (en)

Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2998960A1 (en) * 2014-09-17 2016-03-23 Xiaomi Inc. Method and device for video browsing
CN108737882A (en) * 2018-05-09 2018-11-02 腾讯科技(深圳)有限公司 Display methods, device, storage medium and the electronic device of image
CN108810657A (en) * 2018-06-15 2018-11-13 网宿科技股份有限公司 A kind of method and system of setting video cover
CN110856037A (en) * 2019-11-22 2020-02-28 北京金山云网络技术有限公司 Video cover determination method and device, electronic equipment and readable storage medium
CN110879851A (en) * 2019-10-15 2020-03-13 北京三快在线科技有限公司 Video dynamic cover generation method and device, electronic equipment and readable storage medium
CN111368127A (en) * 2020-03-06 2020-07-03 腾讯科技(深圳)有限公司 Image processing method, image processing device, computer equipment and storage medium
CN111491209A (en) * 2020-04-08 2020-08-04 咪咕文化科技有限公司 Video cover determining method and device, electronic equipment and storage medium
CN111984825A (en) * 2020-08-28 2020-11-24 北京百度网讯科技有限公司 Method and apparatus for searching video
WO2020238902A1 (en) * 2019-05-29 2020-12-03 腾讯科技(深圳)有限公司 Image segmentation method, model training method, apparatuses, device and storage medium
CN112118467A (en) * 2020-09-17 2020-12-22 咪咕文化科技有限公司 Video cover showing method, electronic equipment and storage medium
CN112363660A (en) * 2020-11-09 2021-02-12 北京达佳互联信息技术有限公司 Method and device for determining cover image, electronic equipment and storage medium
CN113112480A (en) * 2021-04-16 2021-07-13 北京文安智能技术股份有限公司 Video scene change detection method, storage medium and electronic device
US20220036068A1 (en) * 2021-03-25 2022-02-03 Beijing Baidu Netcom Science Technology Co., Ltd. Method and apparatus for recognizing image, electronic device and storage medium
CN114048349A (en) * 2021-11-10 2022-02-15 卓尔智联(武汉)研究院有限公司 Method and device for recommending video cover and electronic equipment
CN114359594A (en) * 2022-03-17 2022-04-15 杭州觅睿科技股份有限公司 Scene matching method and device, electronic equipment and storage medium
US20220147741A1 (en) * 2020-11-06 2022-05-12 Beijing Xiaomi Mobile Software Co., Ltd. Video cover determining method and device, and storage medium


Also Published As

Publication number Publication date
CN115474084B (en) 2023-10-31

Similar Documents

Publication Publication Date Title
Gu et al. Blind quality assessment of tone-mapped images via analysis of information, naturalness, and structure
Gu et al. The analysis of image contrast: From quality assessment to automatic enhancement
CN108229526B (en) Network training method, network training device, image processing method, image processing device, storage medium and electronic equipment
JP6994588B2 (en) Face feature extraction model training method, face feature extraction method, equipment, equipment and storage medium
Zhao et al. Learning saliency-based visual attention: A review
US7773809B2 (en) Method and apparatus for distinguishing obscene video using visual feature
US20210264161A1 (en) Systems and methods for image or video performance heat map generation
EP2568429A1 (en) Method and system for pushing individual advertisement based on user interest learning
Lovato et al. Faved! biometrics: Tell me which image you like and I'll tell you who you are
CN111754267B (en) Data processing method and system based on block chain
CN112215171B (en) Target detection method, device, equipment and computer readable storage medium
US10257569B2 (en) Display apparatus and method for providing service thereof
CN111026914A (en) Training method of video abstract model, video abstract generation method and device
US9679380B2 (en) Emotion modification for image and video content
Siahaan et al. Semantic-aware blind image quality assessment
CN113761253A (en) Video tag determination method, device, equipment and storage medium
CN113297420A (en) Video image processing method and device, storage medium and electronic equipment
CN114596259A (en) Method, device, equipment and storage medium for determining reference-free video quality
Al Sobbahi et al. Low-light image enhancement using image-to-frequency filter learning
CN113128522B (en) Target identification method, device, computer equipment and storage medium
CN116701706B (en) Data processing method, device, equipment and medium based on artificial intelligence
EP3026671A1 (en) Method and apparatus for detecting emotional key frame
CN116261009B (en) Video detection method, device, equipment and medium for intelligently converting video audience
CN112084954A (en) Video target detection method and device, electronic equipment and storage medium
US20230066331A1 (en) Method and system for automatically capturing and processing an image of a user

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant