CN111062314B - Image selection method and device, computer readable storage medium and electronic equipment - Google Patents

Image selection method and device, computer readable storage medium and electronic equipment

Info

Publication number
CN111062314B
CN111062314B (application CN201911286031.7A)
Authority
CN
China
Prior art keywords
image
images
target
sequence
image set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911286031.7A
Other languages
Chinese (zh)
Other versions
CN111062314A (en)
Inventor
高洵
沈招益
刘军煜
杨天舒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201911286031.7A
Publication of CN111062314A
Application granted
Publication of CN111062314B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/40: Scenes; Scene-specific elements in video content
    • G06V20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/23: Clustering techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure provides an image selection method, an image selection apparatus, a computer-readable storage medium, and an electronic device, and relates to the field of computer technology. The method comprises the following steps: extracting video frames from a video file to obtain an image sequence; clustering the image sequence according to the image features corresponding to the images in the image sequence to obtain at least one image set; and performing multi-dimensional evaluation on each image in the image set according to a preset evaluation criterion, and selecting a target image from the image set according to the multi-dimensional evaluation result, wherein the target image is used for synthesizing a specific image representing the video file. The method overcomes the low efficiency of manual material selection: by extracting and analyzing video frames, it determines the material (namely, the target image) used to synthesize the video cover, thereby improving the efficiency of material selection.

Description

Image selection method and device, computer readable storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to an image selection method, an image selection apparatus, a computer-readable storage medium, and an electronic device.
Background
With the diversification of video program types, people can satisfy goals such as entertainment, leisure, and learning by selecting video programs of interest to watch. Generally, people make a preliminary judgment on whether they are interested in a video program by combining its title with its cover image. Therefore, for a video program, the production of the cover image is as important as the title. The cover image is typically produced by a designer who manually selects material from the content of the video program. However, as the number of video programs increases, manual material selection becomes inefficient.
It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present disclosure, and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
The present disclosure is directed to an image selection method, an image selection apparatus, a computer-readable storage medium, and an electronic device, which can overcome the problem of low efficiency in manually selecting a material, and determine a material for synthesizing a video cover by extracting and analyzing a video frame, thereby improving the efficiency of selecting a material.
According to an aspect of the present disclosure, there is provided an image selecting method, including:
extracting video frames of the video file to obtain an image sequence;
clustering the image sequence according to the image characteristics corresponding to the images in the image sequence to obtain at least one image set;
and performing multi-dimensional evaluation on each image in the image set according to a preset evaluation standard, and selecting a target image from the image set according to a multi-dimensional evaluation result, wherein the target image is used for synthesizing a specific image representing a video file.
According to another aspect of the present disclosure, an image selecting apparatus is provided, which includes a video frame extracting unit, an image clustering unit, a target image selecting unit, and a specific image synthesizing unit, wherein:
the video frame extraction unit is used for extracting video frames of the video file to obtain an image sequence;
the image clustering unit is used for clustering the image sequences according to the image characteristics corresponding to the images in the image sequences to obtain at least one image set;
and the target image selecting unit is used for carrying out multi-dimensional evaluation on each image in the image set according to a preset evaluation standard, and selecting a target image from the image set according to a multi-dimensional evaluation result, wherein the target image is used for synthesizing a specific image representing a video file.
According to another aspect of the present disclosure, there is provided an electronic device including: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the method of any one of the above via execution of the executable instructions.
According to another aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of any one of the above.
Exemplary embodiments of the present disclosure may have some or all of the following benefits:
in the image selection method provided in an example embodiment of the present disclosure, a video file (e.g., an episode of a variety show) may be subjected to video frame extraction to obtain an image sequence; the image sequence can then be clustered according to the image features corresponding to its images to obtain at least one image set; each image in the image set may then be evaluated in multiple dimensions according to preset evaluation criteria, and a target image for synthesizing a specific image (e.g., a video cover) representing the video file may be selected from the image set according to the multi-dimensional evaluation results. According to this technical scheme, on the one hand, the low efficiency of manual material selection can be overcome, and the material (namely, the target image) for synthesizing the video cover can be determined through extraction and analysis of video frames, improving the efficiency of material selection; on the other hand, the quality of the selected video cover material can be ensured through clustering and multi-dimensional evaluation of the images.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty.
Fig. 1 is a schematic diagram illustrating an exemplary system architecture of an image selecting method and an image selecting apparatus to which the embodiments of the present disclosure may be applied;
FIG. 2 illustrates a schematic structural diagram of a computer system suitable for use with the electronic device used to implement embodiments of the present disclosure;
FIG. 3 schematically shows a flow diagram of an image selection method according to an embodiment of the present disclosure;
FIG. 4 schematically illustrates a flow diagram for clustering a sequence of images according to image features corresponding to respective images in the sequence of images according to one embodiment of the present disclosure;
FIG. 5 schematically illustrates an implementation of an image selection method according to one embodiment of the present disclosure;
FIG. 6 schematically shows an application diagram of an image selection method according to an embodiment of the present disclosure;
FIG. 7 schematically illustrates a flow diagram of an image selection method according to another embodiment of the present disclosure;
FIG. 8 schematically illustrates a flow chart of an image selection method according to yet another embodiment of the present disclosure;
fig. 9 schematically shows a block diagram of an image selecting apparatus in an embodiment according to the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the subject matter of the present disclosure can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and the like. In other instances, well-known technical solutions have not been shown or described in detail to avoid obscuring aspects of the present disclosure.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
Fig. 1 is a schematic diagram illustrating a system architecture of an exemplary application environment to which an image selecting method and an image selecting apparatus according to an embodiment of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include one or more of terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few. The terminal devices 101, 102, 103 may be various electronic devices having a display screen, including but not limited to desktop computers, portable computers, smart phones, tablet computers, and the like. It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. For example, server 105 may be a server cluster comprised of multiple servers, or the like.
The image selecting method provided by the embodiment of the disclosure is generally executed by the server 105, and accordingly, the image selecting apparatus is generally disposed in the server 105. However, it is easily understood by those skilled in the art that the image selecting method provided in the embodiment of the present disclosure may also be executed by the terminal devices 101, 102, and 103, and accordingly, the image selecting apparatus may also be disposed in the terminal devices 101, 102, and 103, which is not particularly limited in the exemplary embodiment. For example, in an exemplary embodiment, the server 105 may perform video frame extraction on a video file to obtain an image sequence; clustering the image sequence according to the image characteristics corresponding to the images in the image sequence to obtain at least one image set; and performing multi-dimensional evaluation on each image in the image set according to a preset evaluation standard, and selecting a target image from the image set according to a multi-dimensional evaluation result, wherein the target image is used for synthesizing a specific image representing a video file. The image selecting method provided by the embodiment of the present disclosure may also be executed by the terminal device and the server 105 together. In another exemplary embodiment, the server 105 may extract video frames from a video file to obtain an image sequence, cluster the image sequence according to image features corresponding to images in the image sequence to obtain at least one image set, and transmit the image set to at least one of the terminal devices 101, 102, and 103, so that the corresponding terminal device 101, 102, and/or 103 performs multi-dimensional evaluation on the images in the image set according to a preset evaluation criterion, and selects a target image from the image set according to a multi-dimensional evaluation result; wherein the target image is used to compose a particular image representing the video file.
FIG. 2 illustrates a schematic structural diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present disclosure. The electronic device may be the terminal device or the server shown in fig. 1.
It should be noted that the computer system 200 of the electronic device shown in fig. 2 is only an example, and should not bring any limitation to the functions and the scope of the application of the embodiments of the present disclosure.
As shown in fig. 2, the computer system 200 includes a Central Processing Unit (CPU)201 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)202 or a program loaded from a storage section 208 into a Random Access Memory (RAM) 203. In the RAM 203, various programs and data necessary for system operation are also stored. The CPU 201, ROM 202, and RAM 203 are connected to each other via a bus 204. An input/output (I/O) interface 205 is also connected to bus 204.
The following components are connected to the I/O interface 205: an input portion 206 including a keyboard, a mouse, and the like; an output section 207 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 208 including a hard disk and the like; and a communication section 209 including a network interface card such as a LAN card, a modem, or the like. The communication section 209 performs communication processing via a network such as the internet. A drive 210 is also connected to the I/O interface 205 as needed. A removable medium 211 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 210 as necessary, so that a computer program read out therefrom is mounted into the storage section 208 as necessary.
In particular, the processes described below with reference to the flowcharts may be implemented as computer software programs, according to embodiments of the present disclosure. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 209 and/or installed from the removable medium 211. The computer program, when executed by a Central Processing Unit (CPU)201, performs various functions defined in the methods and apparatus of the present application.
The technical solution of the embodiment of the present disclosure is explained in detail below:
for video programs, it is often necessary to produce a cover image representing the program, either for promotion or as the entry of a video program window. The cover image is usually made by manually selecting material and then composing the cover. However, this approach wastes labor and is inefficient when video programs are processed in batches.
In view of one or more of the above problems, the present example embodiment provides an image selecting method. The image selecting method may be applied to the server 105, and may also be applied to one or more of the terminal devices 101, 102, and 103, which is not particularly limited in this exemplary embodiment. Referring to fig. 3, the image selecting method may include the following steps S310 to S340:
step S310: and extracting video frames of the video file to obtain an image sequence.
Step S320: and clustering the image sequence according to the image characteristics corresponding to the images in the image sequence to obtain at least one image set.
Step S330: and performing multi-dimensional evaluation on each image in the image set according to a preset evaluation standard, and selecting a target image from the image set according to a multi-dimensional evaluation result, wherein the target image is used for synthesizing a specific image representing a video file.
The above steps of the present exemplary embodiment will be described in more detail below.
In step S310, video frames of the video file are extracted to obtain an image sequence.
The video file is a multimedia file, and the video file may include one or more videos, where a video is composed of temporally consecutive video frames, and one video frame may be understood as one image and a plurality of video frames may be understood as a plurality of images. Additionally, the format of the video file may include, but is not limited to, MP4, MKV, MOV, AVI, SWF, FLV, and WEBM. The image sequence includes a plurality of images, and the plurality of images in the image sequence may be video frames sequentially extracted in time sequence. The image sequence may be a plurality of images that are continuous in shooting time, or may be a plurality of images that are discontinuous in shooting time, and the embodiment of the present disclosure is not limited thereto.
In this embodiment of the present disclosure, optionally, the extracting video frames from the video file to obtain an image sequence includes: extracting video frames corresponding to important contents according to video frame identifications used for representing the important contents on a time axis of a video file to obtain an image sequence; or extracting the video frames corresponding to the video content containing the target object in the video file to obtain the image sequence.
The video frame identifier may be a data flag at a certain time point in a video file, where the data flag is used to instruct a client or a server to extract one or more video frames; the number of video frame identifiers may be one or more, which is not limited in the embodiment of the present disclosure. The video frame corresponding to the time axis position where the video frame identifier is located can be regarded as important content. A video file may include one or more video frame identifiers, and if a plurality of identifiers are included, they may be the same or different; the embodiments of the present disclosure are not limited thereto. In addition, the important content may be the main body content of the video file; for example, if the video file is a variety show video, the opening and closing credits may be non-important content and the remaining content may be important content. In addition, the time axis of the video file represents the shooting time of the video file, and each time point on the time axis corresponds to one video frame.
In addition, the manner of extracting the video frames corresponding to the important content according to the video frame identifier representing the important content on the time axis of the video file may specifically be: identifying a video frame starting mark and a video frame ending mark which are used for representing important content on a time axis, and extracting a corresponding video frame between the video frame starting mark and the video frame ending mark on the time axis; the video frame identifier may be preset, and the video frame identifier includes a video frame start identifier and a video frame end identifier. By the adoption of the method and the device, the position of the time axis where the video frame needs to be extracted can be located more quickly, and the extraction efficiency of the video frame is improved.
In addition, the manner of extracting the video frames corresponding to the video content containing the target object in the video file to obtain the image sequence may specifically be: identifying video content containing the target object in the video file according to the feature points of the target object, and extracting the video frames corresponding to that video content. The target object may include, but is not limited to, a person, an animal, a plant, a commodity, and the like, and the embodiments of the present disclosure are not limited thereto. The video content may correspond to a period of shooting time or to a single video frame; if it corresponds to a period of shooting time, a plurality of video frames correspond to the video content. In this way, video content relating to the target object can be extracted in a targeted manner. For example, in the film and television field the video file may be a movie, whose performers are generally divided into leading actors and supporting actors. If a movie cover needs to be made from the image of the leading actor, the leading actor can be taken as the target object, the video frames corresponding to the video content containing the target object can be extracted, and a specific image representing the movie (namely, the video cover) can then be determined from these frames, which improves the efficiency of cover production and promotes the rapid development of the film and television industry.
Therefore, by implementing the optional embodiment, the required image can be obtained by extracting the video frame, and compared with the manual image selection, the efficiency of image selection can be improved.
In this embodiment of the present disclosure, another optional method for extracting video frames from a video file to obtain an image sequence includes: extracting video frames of a video file according to preset time (such as 30 seconds) to obtain an image sequence; and every two adjacent images in the image sequence are separated by a preset time length. It will be appreciated that if a video file is 3 minutes, then a video frame decimation is performed every 30 seconds, resulting in an image sequence of 6 video frames.
The preset duration may be a user-defined duration or a default duration, which is not limited in the embodiment of the present disclosure, and the user-defined duration may be a duration manually set by a user, and the default duration may be an interval duration of a captured video frame initialized by the system.
When the video frame extraction process of the video file follows this optional implementation, the method is well suited to video files whose subject changes slowly. For example, it can be applied to a video of a flower blooming: blooming from a bud generally takes a long time (e.g., 30 hours), and by collecting video frames separated by a preset duration (e.g., 30 minutes) the salient characteristics of each stage from bud to bloom can be obtained, so that researchers can study the life cycle of the flower from these characteristics and improve their understanding of the organism.
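A minimal sketch of this preset-duration extraction is given below, assuming OpenCV can open the video file; the 30-second interval and the file name are illustrative.

```python
import cv2

def extract_frames_by_interval(video_path, interval_seconds=30):
    """Sample one frame every `interval_seconds` to build an image sequence."""
    capture = cv2.VideoCapture(video_path)
    fps = capture.get(cv2.CAP_PROP_FPS) or 25          # fall back if FPS metadata is missing
    step = max(1, int(round(fps * interval_seconds)))
    sequence, index = [], 0
    while True:
        success, frame = capture.read()
        if not success:
            break
        if index % step == 0:                           # keep frames spaced by the preset duration
            sequence.append(frame)
        index += 1
    capture.release()
    return sequence

# e.g. a 3-minute video sampled every 30 seconds yields an image sequence of 6 frames
frames = extract_frames_by_interval("program_episode.mp4", interval_seconds=30)
```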
In this embodiment of the present disclosure, yet another optional step of performing video frame extraction on a video file to obtain an image sequence includes: and extracting random video frames of the video file to obtain an image sequence.
In this embodiment of the present disclosure, as yet another option, performing video frame extraction on a video file to obtain an image sequence includes: extracting key frames in the video file to obtain an image sequence, wherein the video file is encoded based on the H.264 coding principle. H.264 is a coding standard that defines three kinds of frames: a completely coded frame is an I frame, a frame generated by referring to a previous I frame and containing only the difference is a P frame, and a frame coded by referring to both preceding and following frames is a B frame. The core algorithms of H.264 are intraframe compression, which generates I frames, and interframe compression, which generates B and P frames. Specifically, the I frame is intra-coded and represents a key frame, that is, the full picture of that frame is retained, and the corresponding picture can be generated from that frame's data alone during decoding. P frames are forward predictive coded frames: a P frame records the difference between itself and the previous key frame (or P frame), and when decoding, this difference is superimposed on the previously buffered picture to generate the corresponding picture. The B frame is a bidirectional predictive interpolation coded frame: it records the difference between the current frame and the preceding and following frames, and the corresponding picture is generated by superimposing the preceding and following pictures with the current frame's data.
In step S320, the image sequences are clustered according to the image features corresponding to the images in the image sequences, so as to obtain at least one image set.
The image feature may be a feature vector, which is used to indicate the position of the image in the vector space. The image feature may be a scene feature, a person feature, a commodity feature, an overall feature, or the like, and the embodiment of the present disclosure is not limited. The image set comprises at least one image, and the images in the image set can be understood as similar images. For example, if the image feature is a scene feature, the images in the image set are images with similar scenes; if the image features are the features of the persons, the images in the image set are the images corresponding to the same persons.
In this embodiment of the present disclosure, optionally, the manner of clustering the image sequence according to the image features corresponding to the images in the image sequence to obtain at least one image set may specifically be: and generating image histograms corresponding to the images in the image sequence, determining the similarity between every two images by calculating the coincidence degree of the image histograms, and clustering the image sequence according to the similarity to obtain at least one image set. The image histogram is used for representing the distribution condition of image pixel values, the range of the representing pixel values is specified through a certain number of cells, and the number of pixels falling into the representing range of each cell is obtained in each cell. A higher coincidence degree of the image histograms of the two images indicates a higher similarity of the two images.
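A sketch of the histogram-overlap comparison described above is shown here, assuming NumPy and grayscale inputs; the 64-bin count and the use of histogram intersection as the "coincidence degree" are illustrative assumptions rather than values stated in the patent.

```python
import numpy as np

def gray_histogram(image, bins=64):
    """Normalized histogram of pixel values over `bins` cells covering [0, 256)."""
    hist, _ = np.histogram(image.ravel(), bins=bins, range=(0, 256))
    return hist / max(hist.sum(), 1)

def histogram_overlap(image_a, image_b, bins=64):
    """Coincidence degree of two image histograms; larger values mean more similar images."""
    ha, hb = gray_histogram(image_a, bins), gray_histogram(image_b, bins)
    return float(np.minimum(ha, hb).sum())      # histogram intersection, in [0, 1]
```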
In another alternative embodiment of the present disclosure, please refer to fig. 4, and fig. 4 schematically illustrates a flowchart of clustering an image sequence according to image features corresponding to images in the image sequence according to an embodiment of the present disclosure. As shown in fig. 4, clustering the image sequence according to the image features corresponding to the images in the image sequence to obtain at least one image set, including step S410 and step S420, where:
step S410: and calculating the hash value corresponding to each image in the image sequence, and taking the hash value as the image characteristic corresponding to the image.
Step S420: and determining the similarity between every two images according to the image characteristics, and clustering the image sequences according to the similarity to obtain at least one image set.
It can be seen that implementing this alternative embodiment facilitates obtaining footage for a composite video cover from different categories of images by clustering the images.
The above steps of the present exemplary embodiment will be described in more detail below.
Step S410: and calculating the hash value corresponding to each image in the image sequence, and taking the hash value as the image characteristic corresponding to the image.
The hash value is a segment of data representing an image, and can be understood as a fingerprint of the image, and the hash values corresponding to the images in the image sequence are all different.
In this embodiment of the present disclosure, optionally, calculating a hash value corresponding to each image in the image sequence includes: adjusting the size of each image in the image sequence to a target size (e.g., 8 × 8) and performing image graying to obtain a grayscale image corresponding to each image; calculating an average gray value corresponding to the gray image according to the gray value of each pixel in the gray image, and resetting the gray value of each pixel according to the average gray value; and combining the gray values of the pixels after the pixels are reset, and determining a combination result as a hash value of the corresponding image to obtain the hash value corresponding to each image.
The size of each image in the image sequence is adjusted to the target size, so that high-frequency information and detail information in the images are removed and the differences between images are reduced; the target size may comprise 64 pixels. Further, graying an image of the target size is understood as converting the image into 64-level grays, that is, all pixel points in the image take at most 64 colors. Further, an average gray value corresponding to the grayscale image can be calculated from the gray value of each pixel in the grayscale image. Further, the manner of resetting the gray value of each pixel according to the average gray value may specifically be: comparing the gray value of each pixel in the grayscale image with the average gray value, setting the gray value of pixels greater than the average gray value to 1, and setting the gray value of pixels less than the average gray value to 0. Furthermore, the reset gray values may be combined to obtain the hash value corresponding to the image, where the hash value may be a 64-bit integer (one bit per pixel of the 8 × 8 grayscale image); the combination order of the reset gray values is the same for each image in the image sequence.
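The average-hash computation just described can be sketched as follows, assuming OpenCV and an 8 × 8 target size as in the text; the interpolation choice and function name are illustrative.

```python
import cv2

def average_hash(image, hash_size=8):
    """64-bit aHash: resize to 8x8, convert to gray, threshold each pixel against the mean."""
    small = cv2.resize(image, (hash_size, hash_size), interpolation=cv2.INTER_AREA)
    gray = cv2.cvtColor(small, cv2.COLOR_BGR2GRAY)
    mean = gray.mean()
    bits = (gray > mean).flatten()              # reset gray values: 1 if above the mean, else 0
    return int("".join("1" if b else "0" for b in bits), 2)   # combine into a 64-bit integer
```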
Therefore, by implementing the optional embodiment, the fingerprints of the images can be obtained by calculating the hash values corresponding to the images, so that the similarity between the images can be conveniently determined according to the fingerprints of the images.
In this embodiment of the present disclosure, optionally, the manner of calculating the hash value corresponding to each image in the image sequence may be: adjusting the size of each image in the image sequence to a target size (e.g., 32 × 32) and performing image graying to obtain a grayscale image corresponding to each image; performing Discrete Cosine Transform (DCT) on the grayscale image to obtain a matrix corresponding to each image, wherein the size of the matrix can be 32 × 32; calculating the mean value of the target data used for representing image low-frequency information in the matrix and comparing the mean value with the target data respectively, wherein the target data can be the 8 × 8 sub-matrix in the upper left corner of the matrix (64 values), so that the resulting hash value can be a 64-bit integer; setting the target data greater than the mean value in the comparison result as first data (such as 1) and the target data smaller than the mean value as second data (such as 0); and determining the hash value corresponding to the image according to the combination of the first data and the second data.
The Discrete Cosine Transform (DCT) applied to the grayscale image to obtain the matrix corresponding to each image can be written as

F(u, v) = c(u)\,c(v) \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} f(i, j)\, \cos\!\left[\frac{(2i+1)\pi u}{2N}\right] \cos\!\left[\frac{(2j+1)\pi v}{2N}\right],
\qquad
c(u) = \begin{cases} \sqrt{1/N}, & u = 0 \\ \sqrt{2/N}, & u \neq 0 \end{cases}

wherein f(i, j) is the pixel value of the grayscale image at pixel point (i, j), i being the abscissa and j the ordinate of the pixel point; F(u, v) is the value of the DCT-transformed matrix at row u and column v; c(u) and c(v) are compensation coefficients; and N is the number of pixel points along each side of the image.
It should be noted that, in the matrix obtained after the image is DCT-transformed, data (i.e., target data) in the upper left corner of the matrix is used to represent low-frequency information of the image, and data in the lower right corner is used to represent high-frequency information of the image; the image low-frequency information is used for representing an image main body frame and representing an area with continuously gradually changed gray level in the image; the high frequency information is used to record image details and to represent areas of the image where the gray level changes rapidly, such as the contours of objects (e.g., human faces) in the image. Where DCT transform is generally used for compression of data or images, spatial signals may be converted into the frequency domain.
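For the DCT-based variant, a sketch under the conventional assumptions (resize to 32 × 32, keep the top-left 8 × 8 low-frequency block as the target data) might look like the following; the sizes and function names are assumptions consistent with the matrix and target-data sizes mentioned above.

```python
import cv2
import numpy as np

def perceptual_hash(image, resize_to=32, low_freq=8):
    """DCT-based hash: threshold the top-left 8x8 low-frequency block of a 32x32 DCT matrix."""
    small = cv2.resize(image, (resize_to, resize_to), interpolation=cv2.INTER_AREA)
    gray = cv2.cvtColor(small, cv2.COLOR_BGR2GRAY).astype(np.float32)
    dct = cv2.dct(gray)                          # frequency-domain matrix; top-left = low frequencies
    block = dct[:low_freq, :low_freq]            # target data representing the image's main frame
    bits = (block > block.mean()).flatten()      # first data (1) / second data (0)
    return int("".join("1" if b else "0" for b in bits), 2)
```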
In this embodiment of the present disclosure, still optionally, the manner of calculating the hash value corresponding to each image in the image sequence may be: adjusting the size of each image in the image sequence to a target size (e.g., 9 × 8) and performing image graying to obtain a grayscale image corresponding to each image; comparing the previous pixel value and the next pixel value of each row in the gray-scale image, setting the previous pixel value which is greater than the next pixel value in the comparison result as first data (such as 1), and setting the previous pixel value which is less than the next pixel value in the comparison result as second data (such as 0); and determining a hash value corresponding to the image according to the combination of the first data and the second data.
Step S420: and determining the similarity between every two images according to the image characteristics, and clustering the image sequences according to the similarity to obtain at least one image set.
Wherein, the similarity between every two images can be represented by similarity and dissimilarity. Alternatively, the similarity between each two images may be represented by a plurality of degrees in a progressive manner, such as similar, more similar, generally similar, … …, less similar, dissimilar, and the like, and the embodiments of the disclosure are not limited.
In this embodiment of the present disclosure, optionally, determining a similarity between every two images according to the image features, and clustering the image sequence according to the similarity to obtain at least one image set, where the method includes:
and calculating the Hamming distance between every two images in the image sequence according to the hash value, taking the Hamming distance as the similarity between every two images, and classifying the images corresponding to the Hamming distance (e.g. 1) smaller than a preset distance (e.g. 5) into the same image set to obtain at least one image set corresponding to the image sequence.
The Hamming distance may be the number of bits that differ between two hash values. For example, if hash value 1 is 1001 and hash value 2 is 1101, they differ only at the second bit, so the Hamming distance between hash value 1 and hash value 2 is 1.
Specifically, the Hamming distance between every two images in the image sequence may be calculated from the hash values by determining the number of positions at which the two corresponding hash values differ and taking that number as the Hamming distance between the two images. In addition, images in an image set may be understood as similar images. The logic of this alternative embodiment is sketched below:
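The sketch below reconstructs only the described logic (pairwise Hamming distance on hash values, grouping images whose distance is below the preset distance); the greedy grouping strategy and function names are illustrative assumptions, not the patent's actual code.

```python
def hamming_distance(hash_a, hash_b):
    """Number of differing bit positions between two 64-bit image hashes."""
    return bin(hash_a ^ hash_b).count("1")

def cluster_by_hamming(hashes, preset_distance=5):
    """Group images whose pairwise Hamming distance is below `preset_distance`."""
    clusters = []                          # each cluster is a list of image indices (an image set)
    for idx, h in enumerate(hashes):
        for cluster in clusters:
            if all(hamming_distance(h, hashes[j]) < preset_distance for j in cluster):
                cluster.append(idx)
                break
        else:
            clusters.append([idx])         # start a new image set
    return clusters
```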
therefore, by implementing the optional embodiment, similar images can be classified into the same image set by comparing the similarity between the images, and the efficiency of selecting the video cover material is improved.
In this embodiment of the present disclosure, another optional method for determining the similarity between every two images according to image features and clustering the image sequence according to the similarity to obtain at least one image set includes: taking the hash value of each image as an image vector, calculating the cosine distance between every two images in the image sequence, and clustering cosine distances larger than a preset threshold to obtain at least one image set; wherein the value range of the cosine distance is [0, 2], and the cosine distance is calculated as

\mathrm{Dist}_{\cos}(A, B) = 1 - \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}

wherein A and B are the vectors corresponding to the two images being compared.

Alternatively, the hash value of each image is taken as an image vector, the Euclidean distance between every two images in the image sequence is calculated, and Euclidean distances larger than a preset threshold are clustered to obtain at least one image set; wherein the Euclidean distance is calculated as

\mathrm{Dist}_{\mathrm{euc}}(A, B) = \sqrt{\sum_{i} (A_i - B_i)^2}.
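Both alternative distances can be computed directly from the image vectors, as in this NumPy sketch; the function names are illustrative.

```python
import numpy as np

def cosine_distance(a, b):
    """1 minus cosine similarity of the two image vectors; lies in [0, 2]."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    """Square root of the sum of squared coordinate differences."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum((a - b) ** 2)))
```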
in this embodiment of the present disclosure, optionally, after clustering the image sequence according to the image features corresponding to the images in the image sequence to obtain at least one image set, the method may further include the following steps:
and carrying out object recognition on each image in the image sequence, and adjusting the images in at least one image set according to the object recognition result so that each image in the image set corresponds to the same object.
The object recognition method for each image in the image sequence may specifically be: identifying an image containing an object to be identified in an image sequence according to the characteristic points of the object to be identified; the object to be recognized may be a person, an animal, a scene, a commodity, a plant, and the like, and the embodiment of the disclosure is not limited. For example, if the object to be recognized is a person, the object recognition method for each image in the image sequence may be: the face in each image in the image sequence is identified according to face feature points, wherein the face feature points include but are not limited to eye features, mouth features and the like.
Therefore, by implementing the optional embodiment, the video cover material containing specific elements can be determined by identifying the object in the image, and the efficiency of making the video cover is improved.
In step S330, multi-dimensional evaluation is performed on each image in the image set according to a preset evaluation criterion, and a target image is selected from the image set according to a multi-dimensional evaluation result.
The preset evaluation standard is used for evaluating the output effect of the image in each dimension. The multi-dimensional evaluation may include, but is not limited to, a sharpness evaluation, an aesthetic evaluation, a brightness evaluation, a contrast evaluation, and the like, and the disclosed embodiments are not limited thereto. In addition, the target image can be understood as an image with better output effect in each dimension than other images in the image set, and the target image can be used as a material of a composite video cover.
In this embodiment of the present disclosure, optionally, performing multidimensional evaluation on each image in the image set according to a preset evaluation criterion includes:
determining the definition type corresponding to each image in the image set according to the definition evaluation standard, and determining the beauty score corresponding to each image in the image set according to the beauty evaluation standard; the preset evaluation standard comprises a definition evaluation standard and an aesthetic evaluation standard.
The definition evaluation criterion is a numerical criterion which is defined by a plurality of non-intersection threshold value ranges and corresponds to each definition type to which the image belongs, and the aesthetic degree evaluation criterion is a numerical criterion which is defined by a plurality of non-intersection threshold value ranges and corresponds to each aesthetic degree type to which the image belongs. The threshold range of the sharpness evaluation criterion and the threshold range of the beauty evaluation criterion may be different or the same, and the embodiment of the disclosure is not limited.
Therefore, by implementing the optional embodiment, the images can be evaluated through definition and attractiveness, and better images can be selected as materials, so that the output effect of the synthesized video cover is optimized.
Further, determining a definition type corresponding to each image in the image set according to a definition evaluation standard includes:
convolving each image in the image set through a definition classification network to obtain a first feature vector corresponding to the image, and applying an activation function to the first feature vector to obtain a second feature vector corresponding to the image;
pooling the second feature vector through a definition classification network to obtain a third feature vector corresponding to the image, and fully connecting the first feature vector, the second feature vector and the third feature vector;
and calculating the probability of the image belonging to each definition type according to the full connection result through a definition classification network, and determining the definition type corresponding to each image in the image set according to the probability and the definition evaluation standard.
The sharpness classification network may be a VGG16 network; the VGG family demonstrated that increasing network depth affects network performance. "VGG16" indicates a depth of 16, that is, the network contains 16 weight layers, and 3 × 3 convolutions and 2 × 2 pooling are used throughout the network.
Specifically, the manner of convolving each image in the image set by the sharpness classification network to obtain the first feature vector corresponding to the image may specifically be: performing convolution calculation on each image in the image set through a convolution kernel with a preset size (e.g., 3 x 3) and a preset step length (e.g., 1) to obtain a first feature vector corresponding to each image, wherein the first feature vector is used for representing the image through a dimension lower than that of the original image.
In addition, the manner of applying the activation function to the first feature vector to obtain the second feature vector corresponding to the image may specifically be: converting the values in the first feature vector into a preset range of value range, for example, [ -1,1] through an activation function; obtaining a second feature vector corresponding to the image, wherein the dimension of the second feature vector is the same as that of the first feature vector, and the data corresponding to each position in the second feature vector belong to a preset value range; the activation function may be an Identity function (Identity function), a Step function (Step function), an S-type function (Sigmoidal function), a Ramp function (Ramp function), a hyperbolic tangent function (TanH), an arctangent function (ArcTan), an Inverse Square root function (Inverse Square root unit, ISRU), an Inverse Square root linear function (Inverse Square root linear unit, ISRLU), a Square nonlinear function (SQNL), a linear rectification function (Rectified linear unit, ReLU), a two-level linear rectification function (Bipolar linear unit, BReLU), a parameterized linear rectification function (parametric linear unit, prlu), and the like, but the disclosed embodiment is not limited thereto.
In addition, the mode of pooling the second feature vector through the definition classification network to obtain a third feature vector corresponding to the image may specifically be: and performing average pooling on the second feature vectors to obtain third feature vectors corresponding to the images, wherein the dimensionality of the third feature vectors is lower than that of the second feature vectors, and the third feature vectors are used for representing the corresponding images through lower dimensionality compared with the second feature vectors. And, optionally, may further include the steps of: and applying an activation function to the third feature vector to enable the numerical value corresponding to each position in the third feature vector to belong to a preset value range.
In addition, the first feature vector, the second feature vector and the third feature vector are fully connected to obtain a full connection result, which can be represented as a feature vector for comprehensively representing the first feature vector, the second feature vector and the third feature vector, and the probability that the image belongs to each sharpness type (for example, the probability of belonging to the sharpness type is 20%, the probability of belonging to the general type is 60% and the probability of belonging to the blur type is 20%) can be calculated through the full connection result; among them, the sharpness types may include, but are not limited to: clear, general, and fuzzy.
In addition, the mode of determining the definition type corresponding to each image in the image set according to the probability and the definition evaluation criterion may specifically be: and determining the definition type with the highest probability as the definition type to which the image belongs.
Therefore, by implementing the optional embodiment, the definition type of the image can be determined through a simple VGG16 network, the occupation of computing resources is reduced, and the definition evaluation efficiency of the image is improved.
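To make the sharpness-classification pipeline concrete, the following is a simplified PyTorch sketch of a VGG-style classifier with a softmax over three sharpness types; the layer counts, channel widths, and input size are assumptions, not the patent's exact VGG16 configuration.

```python
import torch
import torch.nn as nn

class SharpnessClassifier(nn.Module):
    """Simplified VGG-style network: 3x3 convolutions, 2x2 pooling, softmax over 3 classes."""
    def __init__(self, num_classes=3):              # clear / general / blurred
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d((7, 7)),            # pooling step producing a lower-dimensional feature
        )
        self.classifier = nn.Sequential(             # fully connected layers over the pooled features
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 256), nn.ReLU(inplace=True),
            nn.Linear(256, num_classes),
        )

    def forward(self, x):
        logits = self.classifier(self.features(x))
        return torch.softmax(logits, dim=1)          # probability of each sharpness type

# the predicted sharpness type is the one with the highest probability, e.g.:
# probs = SharpnessClassifier()(torch.randn(1, 3, 224, 224)); sharpness_type = probs.argmax(dim=1)
```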
Further, determining an aesthetic score corresponding to each image in the image set according to the aesthetic evaluation criteria includes:
calculating the beauty degree grading distribution of the input sample image set through a beauty degree evaluation network;
calculating a loss function between the aesthetic degree score distribution and the original aesthetic degree score distribution corresponding to the sample image set;
updating parameters of the beauty evaluation network according to the loss function;
and predicting the beauty degree of each image in the image set through the beauty degree evaluation network after the parameters are updated, wherein the prediction result comprises the beauty degree score corresponding to the image.
The aesthetics evaluation network may be an image quality assessment network (NIMA, Neural Image Assessment); NIMA is a neural network built on deep object-recognition models that can predict the distribution of human opinion scores on an image in terms of both direct perception and attractiveness. Optionally, the aesthetic score distribution may include 10 points, i.e., the aesthetic score is 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10 points. Optionally, the original aesthetic score of sample images whose sharpness type is blurred may be set to 1, 2, or 3.
Specifically, the manner of calculating the aesthetic score distribution of the input sample image set through the aesthetics evaluation network may specifically be: calculating the aesthetic score distribution of the input sample image set through NIMA, and calculating the mean value μ (e.g., 5) of the aesthetic score distribution by the expression

\mu = \sum_{i=1}^{N} s_i \times p(s_i)

wherein N is the number of scores in the aesthetic score distribution, s_i is the i-th score, and p(s_i) is the probability that the image belongs to s_i. This step may also include calculating the aesthetic standard deviation σ (e.g., 0.4) by the expression

\sigma = \left( \sum_{i=1}^{N} (s_i - \mu)^2 \times p(s_i) \right)^{1/2}.

The aesthetic score distribution can include 10 scores, which can also be understood as dividing the aesthetics of an image into 10 grades, where grade 10 is more attractive than grade 9, and so on; an image of a higher grade is more attractive than an image of a lower grade.
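A small NumPy sketch of the mean and standard deviation computed from a predicted 10-bucket score distribution, following the expressions above; the example probabilities are made up for illustration.

```python
import numpy as np

scores = np.arange(1, 11)                                   # s_i: aesthetic scores 1..10
probs = np.array([0.01, 0.02, 0.05, 0.10, 0.22, 0.25,
                  0.18, 0.10, 0.05, 0.02])                  # p(s_i), sums to 1

mu = float(np.sum(scores * probs))                          # mean aesthetic score
sigma = float(np.sqrt(np.sum((scores - mu) ** 2 * probs)))  # aesthetic standard deviation
```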
In addition, the loss function between the predicted aesthetic score distribution and the original aesthetic score distribution corresponding to the sample image set may be calculated based on the expression

\mathrm{EMD}(p, \hat{p}) = \left( \frac{1}{N} \sum_{k=1}^{N} \left| \mathrm{CDF}_p(k) - \mathrm{CDF}_{\hat{p}}(k) \right|^{r} \right)^{1/r}

wherein p is the original aesthetic score distribution corresponding to the sample image set, \hat{p} is the aesthetic score distribution predicted by the network, \mathrm{CDF}_p(k) is the cumulative distribution function of p, \mathrm{CDF}_{\hat{p}}(k) is the cumulative distribution function of \hat{p}, and r is a constant.
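The loss above can be sketched as follows in NumPy; taking r = 2 is a common choice for this type of loss and is an assumption here, as the text only states that r is a constant.

```python
import numpy as np

def emd_loss(p_true, p_pred, r=2):
    """Distance between two score distributions compared through their CDFs."""
    cdf_true = np.cumsum(np.asarray(p_true, dtype=float))
    cdf_pred = np.cumsum(np.asarray(p_pred, dtype=float))
    return float(np.mean(np.abs(cdf_true - cdf_pred) ** r) ** (1.0 / r))
```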
In addition, in updating the parameters of the beauty evaluation network based on the loss function, the parameters of the beauty evaluation network may be expressed as matrix weights. Optionally, the parameter of the beauty evaluation network may be updated according to the loss function until the average μ and standard deviation σ of the beauty score distribution both fall within the standard data range.
In addition, the manner of predicting the aesthetics of each image in the image set through the aesthetics evaluation network after updating the parameters may specifically be: resizing each image in the image set (e.g., to 256 × 256), randomly cropping the resized image, and inputting the cropped image (e.g., 224 × 224) into the NIMA network with updated parameters, so that the NIMA network calculates the aesthetic score distribution corresponding to the image (i.e., the probability of belonging to each aesthetic grade); and determining the aesthetic score with the highest probability in the distribution as the aesthetic score of that image, thereby obtaining the aesthetic score corresponding to each image in the image set. The prediction result also includes the aesthetic standard deviation.
Therefore, by implementing the optional embodiment, the quality of the image can be distinguished by evaluating the aesthetic degree of each image, so that the better image can be selected as the video cover material, and the making effect of the video cover can be improved.
In this embodiment of the present disclosure, optionally, selecting a target image from an image set according to a multi-dimensional evaluation result includes:
sequencing the image sequence according to the definition type corresponding to each image in the image set, and adjusting the sequencing result according to the beauty score corresponding to each image in the image set;
identifying the object characteristics of each image in the image sequence, and screening the adjusted sequencing result according to the object characteristics; the object features are used for representing the morphology of the object in each image;
and selecting a target image from the image set according to the screening result.
Suppose the image sequence comprises 3 images whose sharpness type is clear, 2 whose sharpness type is general, and 1 whose sharpness type is unclear. Sorting the image sequence according to the sharpness type corresponding to each image may mean ordering the images from clear to unclear, so the sorted result is: the 3 clear images first, the 2 general images second, and the 1 unclear image third. Suppose further that the aesthetic scores of the 3 clear images are 10, 9, and 8, the scores of the 2 general images are 6 and 5, and the score of the unclear image is 1. Adjusting the ranking according to the aesthetic score corresponding to each image then gives: the 10-point clear image first, the 9-point clear image second, the 8-point clear image third, the 6-point general image fourth, the 5-point general image fifth, and the 1-point unclear image sixth. The object feature may be a feature of the image focus, and the image focus may be a person, an animal, a plant, a commodity, or the like; if the image focus is a person, the object feature may be a face feature, and the form of the object can be understood as the imaging angle of the person. Screening the adjusted ranking according to the object features may filter out the images that do not contain the object features, say the 8-point clear image, the 5-point general image, and the 1-point unclear image, so that the screening result includes the 10-point clear image, the 9-point clear image, and the 6-point general image. The number of target images may be one or more; if it is one, the target image selected from the image set according to the screening result is the 10-point clear image. This target image is optimal in aesthetics and sharpness, contains the required object features, can be used as material for synthesizing the video cover, and its aesthetic score is greater than a preset aesthetic score (e.g., 5).
In addition, optionally, the manner of selecting the target image from the image set according to the screening result may be: selecting target images from each image set separately according to the screening result, where the number of target images may be one or more and the same number of target images is selected from each image set.
Therefore, by implementing the optional embodiment, one or more target images with optimal image quality can be selected according to the definition and the attractiveness of the images to serve as a production material of the video cover, and the quality of the video cover is further improved.
Further, identifying the object feature of each image in the image sequence includes:
extracting human face characteristic points of each image in the image sequence through a human face detection algorithm;
determining object features according to the face feature points, wherein the object features comprise at least one of the following: the human face deflection angle, the human eye closing state, the distance between the human face and the frame of the image and the human face area.
The human face feature points may include, but are not limited to, eye features, nose features, mouth features, ear features, face contour features, and the like, and the embodiments of the present disclosure are not limited thereto. The human face deflection angle can be understood as the shooting angle of the face, and the human eye closing state may include, but is not limited to, eyes open, eyes half open and eyes closed. In addition, the object features may also include, but are not limited to, the opening and closing state of the mouth, and the embodiments of the present disclosure are not limited thereto.
Specifically, the manner of extracting the human face feature points of each image in the image sequence through the human face detection algorithm may specifically be: extracting the human face feature points of each image in the image sequence with the help of FaceBoxes. FaceBoxes is a real-time face detection algorithm comprising Rapidly Digested Convolutional Layers (RDCL) and Multiple Scale Convolutional Layers (MSCL); the RDCL ensures the real-time detection speed, and the MSCL enriches the receptive fields and discretizes anchors over convolutional layers of different scales so that faces of various sizes can be detected.
Therefore, by implementing the alternative embodiment, the advantages and disadvantages of the images can be further distinguished through the identification of the object characteristics in the images, so that the images meeting the requirements can be selected as the video cover materials.
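As an illustrative sketch of turning detected face landmarks into the object features listed above (deflection angle, eye closing state, distance to the image border and face area), the following Python code uses dlib's frontal face detector and 68-point landmark predictor as a stand-in for the FaceBoxes-based pipeline; the landmark model file, the eye-aspect-ratio threshold and the crude yaw proxy are all assumptions made for illustration.

```python
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
# the 68-point landmark model file is assumed to be available locally
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def eye_aspect_ratio(eye):
    # small ratio -> the eye is half open or closed
    v = np.linalg.norm(eye[1] - eye[5]) + np.linalg.norm(eye[2] - eye[4])
    return v / (2.0 * np.linalg.norm(eye[0] - eye[3]))

def object_features(gray):
    img_h, img_w = gray.shape[:2]
    feats = []
    for box in detector(gray):
        pts = np.array([(p.x, p.y) for p in predictor(gray, box).parts()], dtype=float)
        left_eye, right_eye, nose_tip = pts[36:42], pts[42:48], pts[30]
        eye_mid = (left_eye.mean(axis=0) + right_eye.mean(axis=0)) / 2.0
        inter_ocular = np.linalg.norm(left_eye.mean(axis=0) - right_eye.mean(axis=0))
        feats.append({
            # horizontal offset of the nose tip as a crude face deflection proxy
            "deflection": float((nose_tip[0] - eye_mid[0]) / inter_ocular),
            "eyes_closed": min(eye_aspect_ratio(left_eye),
                               eye_aspect_ratio(right_eye)) < 0.2,
            # distance between the face and the border (frame) of the image
            "border_distance": int(min(box.left(), box.top(),
                                       img_w - box.right(), img_h - box.bottom())),
            "face_area": int(box.width() * box.height()),
        })
    return feats
```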
Further, selecting a target image from the image set according to the screening result, comprising:
if the synthesis parameter is 1, selecting the image with the highest aesthetic degree score (such as 10 points) from the image set according to the screening result to determine the image as the target image;
if the synthesis parameter is larger than 1, calculating a first inter-frame distance and a second inter-frame distance between every two images in the screening result, selecting two images with the largest difference value between the first inter-frame distance and the second inter-frame distance, and determining the two images with the largest difference value as target images; the first inter-frame distance and the second inter-frame distance are used for representing the similarity between every two images, and the first inter-frame distance and the second inter-frame distance do not belong to a preset distance range.
Wherein the synthesis parameter indicates the number of target images (for example, 2) required to synthesize the specific image representing the video file. Images whose inter-frame distances fall within the preset distance range are regarded as general images and are not selected. For the selected target images, a larger first inter-frame distance means that the two images are more similar in face height, face area, image saturation, image brightness and image sharpness, while a smaller second inter-frame distance means that the face orientations of the two images differ more; for example, the face in the first image faces right while the face in the second image faces left, and the face height, face area, image saturation, image brightness and image sharpness of the two images are the same.
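The pair-selection rule just described can be sketched as follows; the distance functions d1 and d2 are assumed to be supplied by the caller, and the bounds of the excluded (preset) distance range are placeholders chosen only for illustration.

```python
from itertools import combinations

def pick_pair(frames, d1, d2, excluded_range=(0.4, 0.6)):
    """Return the index pair whose (first distance - second distance) is largest."""
    lo, hi = excluded_range
    best, best_gap = None, float("-inf")
    for i, j in combinations(range(len(frames)), 2):
        a, b = d1(frames[i], frames[j]), d2(frames[i], frames[j])
        if lo <= a <= hi or lo <= b <= hi:   # distances in the preset range: general images, skip
            continue
        gap = a - b                          # prefer a large first and a small second distance
        if gap > best_gap:
            best, best_gap = (i, j), gap
    return best
```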
In addition, optionally, if the screening result contains at least two images with the highest aesthetic score (for example, the aesthetic scores of at least two images are both 10), the manner of selecting the image with the highest aesthetic score from the image set according to the screening result and determining it as the target image may specifically be: selecting the images with the highest aesthetic score from the image set according to the screening result, and then, among those images, selecting as the target image the one that best matches the theme of the video file; for example, if the theme of the video file is a person visiting the zoo, an image showing the person together with an animal may be selected.
Therefore, by implementing the alternative embodiment, the optimal one or more target images can be selected according to the required number of the target images so as to be used for synthesizing the video cover, and the synthesis efficiency and the synthesis effect are improved.
Further, calculating a first inter-frame distance and a second inter-frame distance between every two images in the screening result includes:
determining a first image characteristic and a second image characteristic of each image in the screening result; wherein the first image characteristic comprises at least one of face height, face area, image saturation, image brightness and image sharpness; the second image feature comprises a face orientation;
and calculating a first inter-frame distance between every two images in the screening result according to the first image characteristics, and calculating a second inter-frame distance between every two images in the screening result according to the second image characteristics.
The human face height is used for representing the distance between the human face and the edge of the image, the human face area is used for representing the area occupied by the human face in the image, and the human face orientation is used for representing the direction faced by the human face in the image.
Specifically, the manner of determining the first image feature of each image in the screening result may be: determining the face position of each image in the screening result according to the face characteristic points, and determining the face height according to the distance between the face position and the image edge, wherein the face height is used for indicating the degree of the face deviating from the image center and is used as a first image characteristic; the face position may include coordinates of the face in the image; drawing the face contour of each image in the screening result according to the coordinates of the face characteristic points in the images, and calculating the face area in each image according to the face contour to serve as a first image characteristic; and acquiring the image saturation, the image brightness and the image sharpness of each image in the screening result as first image characteristics.
Specifically, the manner of determining the second image feature of each image in the screening result may be: and determining the face orientation of each image in the screening result according to the face characteristic points. For example, if the face feature points include a left face and a left ear, the face may be considered to face to the right.
Therefore, by implementing the optional embodiment, the image quality of the selected target image can be ensured by determining the first image feature and the second image feature.
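The following sketch shows one possible way to assemble the first and second image features and to compute the two inter-frame distances; the feature fields, the mapping that makes the first distance larger for more similar frames (as described above), and the binary treatment of face orientation are assumptions made for illustration.

```python
import numpy as np

def first_features(im):
    # face height, face area, image saturation, image brightness, image sharpness
    return np.array([im["face_height"], im["face_area"],
                     im["saturation"], im["brightness"], im["sharpness"]])

def first_distance(a, b):
    # mapped so that more similar frames yield a LARGER first inter-frame distance
    return 1.0 / (1.0 + np.linalg.norm(first_features(a) - first_features(b)))

def second_distance(a, b):
    # smaller when the face orientations differ (e.g. one faces left, one faces right)
    return 1.0 if a["face_orientation"] == b["face_orientation"] else 0.0
```

With these two helpers, the pick_pair sketch given earlier could be called as pick_pair(frames, first_distance, second_distance).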
In this disclosure, optionally, the method may further include the following steps:
If the number of target images is 1, adding sticker (map) material to the target image, adjusting the size of the target image (for example, to 1024 × 1024), and determining the resized target image as the specific image representing the video file; if the number of target images is greater than 1, synthesizing the target images into the specific image representing the video file according to an image synthesis rule.
Wherein the specific image used to represent the video file can be understood as the cover of the video file, which is used for presentation or serves as an entry to the video file. In addition, the image synthesis rule may include a splicing rule, an overlapping rule, and the like, and the embodiments of the present disclosure are not limited thereto; the splicing rule specifies how the target images are spliced, for example the left edge of the image whose face faces left serves as a first splicing position, the right edge of the image whose face faces right serves as a second splicing position, and the first splicing position and the second splicing position are joined to obtain the specific image; the overlapping rule specifies how the target images are overlaid on one another. Additionally, the map material added to the target image may be taken from a material library and may include various types of identifiers (for example, a lightning identifier or a battle identifier) used to enhance the expressiveness of the specific image.
In addition, optionally, the manner of synthesizing the target images into the specific image representing the video file according to the image synthesis rule may specifically be: splicing the target images according to the splicing rule in the image synthesis rule, and adjusting the size of the spliced image to obtain the specific image representing the video file.
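A minimal Pillow-based sketch of this splicing step is given below, under the assumption that the right-facing frame is placed on the left and the left-facing frame on the right, and that the final cover size is 1024 × 1024 as in the example above.

```python
from PIL import Image

def stitch_cover(right_facing: Image.Image, left_facing: Image.Image,
                 cover_size=(1024, 1024)) -> Image.Image:
    # scale both frames to the same height before splicing
    h = min(right_facing.height, left_facing.height)
    a = right_facing.resize((right_facing.width * h // right_facing.height, h))
    b = left_facing.resize((left_facing.width * h // left_facing.height, h))
    canvas = Image.new("RGB", (a.width + b.width, h))
    canvas.paste(a, (0, 0))           # right edge of a is the second splicing position
    canvas.paste(b, (a.width, 0))     # left edge of b is the first splicing position
    return canvas.resize(cover_size)  # size adjustment of the spliced image
```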
Therefore, by implementing the optional embodiment, the specific image representing the video file can be synthesized through the selected target image, so that the automation degree of video cover synthesis is improved, and the synthesis effect of the video cover is improved.
Therefore, by implementing the image selection method shown in fig. 3, the problem of low efficiency of material selection performed manually can be solved, and the material (i.e., the target image) for synthesizing the video cover can be determined by extracting and analyzing the video frame, so that the efficiency of material selection is improved; and the selection quality of the video cover material can be ensured through clustering and multi-dimensional evaluation of the images.
Referring to fig. 5, fig. 5 schematically illustrates an implementation of an image selection method according to an embodiment of the disclosure. As shown in fig. 5, the implementation involves a video file 501, an image sequence 502, n image sets 503, a ranking result 504, a screening result 505 and a target image 506, where n is a positive integer.
Specifically, video frames of the video file 501 may be extracted to obtain an image sequence 502 containing a plurality of video frames; the image sequence 502 may then be clustered according to the image features of each video frame (i.e., each image) to obtain n image sets 503, namely image set 1, image set 2, ..., image set n, where each image set may contain one or more images; the images in the n image sets 503 may then be ranked according to the definition evaluation standard and the aesthetic evaluation standard to obtain a ranking result 504, the images in the ranking result 504 being, for example, image 1 in image set 2, image 1 in image set 1, image 2 in image set 2, image 1 in image set n, image 2 in image set 1, image 2 in image set n, ..., image m in image set n, image m in image set 1 and image m in image set 2, where m is a positive integer; images may then be screened from each image set according to the ranking result 504 to obtain a screening result 505, the screening result 505 containing image 1 in image set 2, image 1 in image set 1, ..., and image 1 in image set n; finally, a target image 506 for synthesizing a specific image (i.e., a cover) representing the video file 501 may be selected from the screening result 505, the target image 506 being the optimal one or more images in the screening result, for example image 1 in image set 1.
Therefore, by implementing the embodiment of the image selection method shown in fig. 5, the problem of low efficiency of manually selecting materials can be solved, and the materials (i.e., the target images) for synthesizing the video cover can be determined by extracting and analyzing the video frames, so that the efficiency of selecting the materials is improved.
Referring to fig. 6, fig. 6 schematically illustrates an application diagram of an image selecting method according to an embodiment of the present disclosure. As shown in fig. 6, video frames of a video file 601 may be extracted to obtain a plurality of video frames, i.e., an image sequence 602; furthermore, the image sequence 602 may be clustered according to image features corresponding to each video frame (i.e., each image) in the image sequence, so as to obtain 5 image sets 603, i.e., an image set 6031, an image set 6032, an image set 6033, an image set 6034, and an image set 6035; wherein each image set comprises a plurality of images respectively; furthermore, the ranking result 604 obtained by ranking the 5 image sets 603 according to the sharpness evaluation criterion and the beauty evaluation criterion may include images 6041, 6042, 6043, 6044, … …, and 6049 arranged in order, where the number of images included in the ranking result 604 is equal to the sum of the number of images of each image set; furthermore, the sorting result 604 may be filtered, and the obtained filtering result 605 may include an image 6051, an image 6052, an image 6053, an image 6054, and an image 6055; further, a target image 606 for synthesizing a specific image (i.e., a cover) representing the video file 601 may be selected from the filtering result 605, and in the exemplary illustration of fig. 6, the number of target images may be 2, that is, an image 6061 and an image 6062; in turn, a particular image 608 representing the video file 601 may be synthesized by stitching the image 6061 and the image 6062.
Therefore, by implementing the embodiment of the image selection method shown in fig. 6, the problem of low efficiency of manually selecting materials can be solved, and the materials (i.e., target images) for synthesizing the video cover can be determined by extracting and analyzing the video frames, so that the efficiency of selecting the materials is improved.
Referring to fig. 7, fig. 7 schematically illustrates a flowchart of an image selecting method according to another embodiment of the present disclosure, where the image selecting method may be performed by the server 105 and/or the terminal devices 101, 102, and 103 shown in fig. 1, and the embodiment of the present disclosure is not limited. As shown in fig. 7, the method includes steps S710 to S760, in which:
step S710: and performing video extraction on the video file to obtain an image sequence.
Step S720: and clustering the image sequences to obtain at least one image set.
Step S730: people in the images of each image set are identified.
Step S740: detecting whether the image contains a person, if so, executing step S750; if not, the process is ended.
Step S750: and merging the image sets of the same person.
Step S760: and performing multi-dimensional evaluation on each image in the image set according to a preset evaluation standard, and selecting a target image from the image set according to a multi-dimensional evaluation result, wherein the target image is used for synthesizing a specific image representing a video file.
The video file may be subjected to video frame extraction to obtain an image sequence, and the image sequence may then be clustered to obtain at least one image set; at this point each image set corresponds to one person, for example image set 1 corresponds to person A and image set 2 corresponds to person B, and during clustering a small number of images of person B may be misclassified into image set 1. By identifying the person in the images of each image set, the misclassified images can be moved back to the image set to which they belong, so that multi-dimensional evaluation can subsequently be performed on each image in the image sets according to the preset evaluation standard and a target image selected from the image sets according to the multi-dimensional evaluation result.
Referring to fig. 8, fig. 8 schematically illustrates a flowchart of an image selecting method according to another embodiment of the present disclosure, where the image selecting method may be performed by the server 105 and/or the terminal devices 101, 102, and 103 shown in fig. 1, and the embodiment of the present disclosure is not limited. As shown in fig. 8, the method includes steps S800 to S824, in which:
step S800: and extracting video frames of the video file to obtain an image sequence.
Step S802: and adjusting the size of each image in the image sequence to a target size and carrying out image graying to obtain a grayscale image corresponding to each image.
Step S804: and calculating the average gray value corresponding to the gray image according to the gray value of each pixel in the gray image, and resetting the gray value of each pixel according to the average gray value.
Step S806: and combining the gray values of the pixels after being reset, and determining a combination result as a hash value of the corresponding image to obtain the hash value corresponding to each image.
Step S808: and calculating the Hamming distance between every two images in the image sequence according to the Hash value, taking the Hamming distance as the similarity between every two images, and classifying the images corresponding to the Hamming distance smaller than the preset distance into the same image set to obtain at least one image set corresponding to the image sequence.
Step S810: and carrying out object recognition on each image in the image sequence, and adjusting the images in at least one image set according to the object recognition result so that each image in the image set corresponds to the same object.
Step S812: and determining the definition type corresponding to each image in the image set according to the definition evaluation standard, and determining the beauty score corresponding to each image in the image set according to the beauty evaluation standard.
Step S814: sequencing the image sequence according to the definition type corresponding to each image in the image set, and adjusting the sequencing result according to the beauty score corresponding to each image in the image set; further, if the synthesis parameter is 1, step S816 is executed; if the synthesis parameter is greater than 1, step S818 is performed.
Step S816: and selecting the image with the highest aesthetic degree score from the image set according to the screening result, and determining the image with the highest aesthetic degree score as the target image.
Step S818: determining a first image characteristic and a second image characteristic of each image in the screening result; wherein the first image characteristic comprises at least one of face height, face area, image saturation, image brightness and image sharpness; the second image feature includes a face orientation.
Step S820: and calculating a first inter-frame distance between every two images in the screening result according to the first image characteristics, and calculating a second inter-frame distance between every two images in the screening result according to the second image characteristics.
Step S822: and selecting two images with the largest difference value between the first inter-frame distance and the second inter-frame distance to determine as target images, wherein the first inter-frame distance and the second inter-frame distance are used for representing the similarity between every two images, and the first inter-frame distance and the second inter-frame distance do not belong to a preset distance range.
Step S824: a specific image representing the video file is generated from the target image.
It should be noted that steps S800 to S824 correspond to the embodiment described with reference to fig. 3; for the details and limitations of steps S800 to S824, reference is made to the description of the embodiment corresponding to fig. 3, which is not repeated here.
Therefore, by implementing the method shown in fig. 8, the problem of low efficiency of manually selecting the material can be solved, and the material for synthesizing the video cover can be determined by extracting and analyzing the video frame, so that the efficiency of selecting the material is improved; and the selection quality of the video cover material can be ensured through clustering and multi-dimensional evaluation of the images.
Further, in the present exemplary embodiment, an image selecting apparatus is also provided. The image selecting device can be applied to a server or a terminal device. Referring to fig. 9, the image selecting apparatus 900 may include a video frame extracting unit 901, an image clustering unit 902, and a target image selecting unit 903, where:
a video frame extraction unit 901, configured to extract video frames from a video file to obtain an image sequence;
an image clustering unit 902, configured to cluster the image sequences according to image features corresponding to the images in the image sequences to obtain at least one image set;
and the target image selecting unit 903 is configured to perform multi-dimensional evaluation on each image in the image set according to a preset evaluation standard, and select a target image from the image set according to a multi-dimensional evaluation result, where the target image is used to synthesize a specific image representing a video file.
Therefore, by implementing the image selection device shown in fig. 9, the problem of low efficiency of material selection performed manually can be solved, and the material (i.e., the target image) for synthesizing the video cover can be determined by extracting and analyzing the video frame, so that the efficiency of material selection is improved; and the selection quality of the video cover material can be ensured through clustering and multi-dimensional evaluation of the images.
In an exemplary embodiment of the present disclosure, the video frame extracting unit 901 performs video frame extraction on a video file, and a manner of obtaining an image sequence may specifically be:
the video frame extraction unit 901 extracts video frames corresponding to important content according to video frame identifiers used for representing the important content on a time axis of the video file, to obtain an image sequence; or,
the video frame extraction unit 901 extracts video frames corresponding to video contents including a target object in a video file to obtain an image sequence.
Therefore, by implementing the exemplary embodiment, the required image can be obtained by extracting the video frame, and compared with the manual selection of the image, the efficiency of image selection can be improved.
In an exemplary embodiment of the disclosure, the image clustering unit 902 clusters the image sequence according to the image features corresponding to the images in the image sequence, and a manner of obtaining at least one image set may specifically be:
the image clustering unit 902 calculates a hash value corresponding to each image in the image sequence, and takes the hash value as an image feature corresponding to the image;
the image clustering unit 902 determines similarity between every two images according to the image features, and clusters the image sequence according to the similarity to obtain at least one image set.
It can be seen that implementing this exemplary embodiment, it is possible to facilitate acquisition of footage for a composite video cover from different categories of images by clustering the images.
In an exemplary embodiment of the disclosure, the way for the image clustering unit 902 to calculate the hash value corresponding to each image in the image sequence may specifically be:
the image clustering unit 902 adjusts the size of each image in the image sequence to a target size and performs image graying to obtain a grayscale image corresponding to each image;
the image clustering unit 902 calculates an average gray value corresponding to the gray image according to the gray value of each pixel in the gray image, and resets the gray value of each pixel according to the average gray value;
the image clustering unit 902 combines the gray values after the pixels are reset, and determines the combination result as the hash value of the corresponding image to obtain the hash value corresponding to each image.
Therefore, by implementing the exemplary embodiment, the fingerprints of the images can be obtained by calculating the hash values corresponding to the images, and the similarity between the images can be conveniently determined according to the fingerprints of the images.
In an exemplary embodiment of the disclosure, the image clustering unit 902 determines a similarity between every two images according to the image features, and clusters the image sequence according to the similarity, and a manner of obtaining at least one image set may specifically be:
the image clustering unit 902 calculates a hamming distance between every two images in the image sequence according to the hash value, uses the hamming distance as a similarity between every two images, and classifies the images corresponding to the hamming distance smaller than a preset distance as the same image set to obtain at least one image set corresponding to the image sequence.
Therefore, by implementing the exemplary embodiment, similar images can be classified into the same image set by comparing the similarity between the images, which is beneficial to improving the efficiency of selecting the video cover material.
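The hash computation and Hamming-distance clustering performed by the image clustering unit can be sketched as follows; the 8 × 8 hash size and the distance threshold are assumptions in line with common average-hash practice, and each image set is represented here by the hash of its first member.

```python
import numpy as np
from PIL import Image

def average_hash(img: Image.Image, size=8) -> np.ndarray:
    # resize to the target size, convert to grayscale, then reset each pixel
    # to 0/1 depending on whether it exceeds the average gray value
    gray = np.asarray(img.convert("L").resize((size, size)), dtype=np.float32)
    return (gray > gray.mean()).flatten()

def hamming(h1: np.ndarray, h2: np.ndarray) -> int:
    return int((h1 != h2).sum())

def cluster(frames, max_distance=10):
    image_sets = []                                  # each entry: (reference hash, members)
    for frame in frames:
        h = average_hash(frame)
        for ref, members in image_sets:
            if hamming(ref, h) < max_distance:       # similar enough: same image set
                members.append(frame)
                break
        else:
            image_sets.append((h, [frame]))
    return [members for _, members in image_sets]
```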
In an exemplary embodiment of the present disclosure, the apparatus may further include an object recognition unit (not shown), wherein:
and the object identification unit is used for clustering the image sequence according to the image characteristics corresponding to the images in the image sequence by the image clustering unit to obtain at least one image set, then carrying out object identification on the images in the image sequence, and adjusting the images in the at least one image set according to the object identification result so that the images in the image set all correspond to the same object.
Therefore, by implementing the exemplary embodiment, the video cover material containing specific elements can be determined by identifying the object in the image, and the efficiency of video cover production can be improved.
In an exemplary embodiment of the disclosure, the manner of performing the multidimensional evaluation on each image in the image set by the target image selecting unit 903 according to the preset evaluation criterion may specifically be:
the target image selection unit 903 determines the definition type corresponding to each image in the image set according to the definition evaluation standard, and determines the beauty score corresponding to each image in the image set according to the beauty evaluation standard; the preset evaluation standard comprises a definition evaluation standard and an aesthetic evaluation standard.
Therefore, by implementing the exemplary embodiment, the images can be evaluated through definition and aesthetic degree, and better images can be selected from the images as materials, so that the output effect of the synthesized video cover is optimized.
In an exemplary embodiment of the disclosure, the manner in which the target image selecting unit 903 determines the sharpness type corresponding to each image in the image set according to the sharpness evaluation criterion may specifically be:
the target image selection unit 903 convolves each image in the image set through a definition classification network to obtain a first feature vector corresponding to the image, and applies an activation function to the first feature vector to obtain a second feature vector corresponding to the image;
the target image selection unit 903 pools the second feature vector through a definition classification network to obtain a third feature vector corresponding to the image, and fully connects the first feature vector, the second feature vector and the third feature vector;
the target image selecting unit 903 calculates the probability that the image belongs to each definition type according to the full connection result through a definition classification network, and determines the definition type corresponding to each image in the image set according to the probability and the definition evaluation standard.
Therefore, by implementing the exemplary embodiment, the definition type of the image can be determined through a simple VGG16 network, the occupation of computing resources is reduced, and the definition evaluation efficiency of the image is improved.
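A hedged sketch of such a definition (clarity) classification network is given below; it uses torchvision's VGG16 backbone with a conventional fully connected head and three assumed class labels, and it simplifies the specific scheme described above (fully connecting the first, second and third feature vectors) to a standard classifier head for illustration.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

CLARITY_CLASSES = ["clear", "general", "unclear"]   # assumed definition types

class ClarityNet(nn.Module):
    def __init__(self, num_classes=len(CLARITY_CLASSES)):
        super().__init__()
        backbone = vgg16(weights=None)
        self.features = backbone.features             # convolution, activation and pooling layers
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(512 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, num_classes),              # fully connected head
        )

    def forward(self, x):                             # x: N x 3 x 224 x 224
        # probability of each image belonging to each definition type
        return torch.softmax(self.classifier(self.features(x)), dim=1)
```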
In an exemplary embodiment of the disclosure, the manner in which the target image selection unit 903 determines the beauty score corresponding to each image in the image set according to the beauty evaluation criterion may specifically be:
the target image selection unit 903 calculates the beauty degree score distribution of the input sample image set through a beauty degree evaluation network;
the target image selecting unit 903 calculates a loss function between the aesthetic degree score distribution and the original aesthetic degree score distribution corresponding to the sample image set;
the target image selection unit 903 updates the parameters of the beauty evaluation network according to the loss function;
the target image selection unit 903 predicts the beauty degree of each image in the image set through the beauty degree evaluation network after the parameters are updated, and the prediction result includes the beauty degree score corresponding to the image.
Therefore, by implementing the exemplary embodiment, the quality of the image can be distinguished by evaluating the aesthetic degree of each image, so that the better image can be selected as the video cover material, and the making effect of the video cover can be improved.
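The training loop implied by the four steps above can be sketched as follows; the squared earth mover's distance, commonly paired with NIMA-style networks, is used here as an assumed choice of the loss function between the predicted and the original score distributions.

```python
import torch

def emd_loss(pred, target, r=2):
    # pred, target: N x 10 probability distributions over the aesthetic grades
    cdf_diff = torch.cumsum(pred, dim=1) - torch.cumsum(target, dim=1)
    return (cdf_diff.abs() ** r).mean(dim=1).pow(1.0 / r).mean()

def train_step(model, optimizer, images, target_dist):
    pred = torch.softmax(model(images), dim=1)        # predicted score distribution
    loss = emd_loss(pred, target_dist)                # loss against the original distribution
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                  # update the network parameters
    return float(loss)
```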
In an exemplary embodiment of the disclosure, the manner in which the target image selecting unit 903 selects the target image from the image set according to the multi-dimensional evaluation result may specifically be:
the target image selection unit 903 sorts the image sequence according to the definition type corresponding to each image in the image set, and adjusts the sorting result according to the beauty score corresponding to each image in the image set;
the target image selection unit 903 identifies the object characteristics of each image in the image sequence, and screens the adjusted sorting result according to the object characteristics; the object features are used for representing the morphology of the object in each image;
the target image selecting unit 903 selects a target image from the image set according to the screening result.
Therefore, by implementing the exemplary embodiment, one or more target images with optimal image quality can be selected according to the definition and the aesthetic measure of the images to serve as the production material of the video cover, and the quality of the video cover is further improved.
In an exemplary embodiment of the disclosure, the manner in which the target image selecting unit 903 identifies the object feature of each image in the image sequence may specifically be:
the target image selection unit 903 extracts the face feature points of each image in the image sequence through a face detection algorithm;
the target image selecting unit 903 determines object features according to the face feature points, wherein the object features include at least one of the following: the human face deflection angle, the human eye closing state, the distance between the human face and the frame of the image and the human face area.
It can be seen that, by implementing the exemplary embodiment, the advantages and disadvantages of the images can be further distinguished through the identification of the object features in the images, so that the images meeting the requirements can be selected as the video cover materials.
In an exemplary embodiment of the disclosure, the manner in which the target image selecting unit 903 selects the target image from the image set according to the screening result may specifically be:
if the synthesis parameter is 1, the target image selection unit 903 selects an image with the highest aesthetic degree score from the image set according to the screening result, and determines the image with the highest aesthetic degree score as a target image;
if the synthesis parameter is greater than 1, the target image selection unit 903 calculates a first inter-frame distance and a second inter-frame distance between every two images in the screening result, selects two images with the largest difference between the first inter-frame distance and the second inter-frame distance, and determines the two images with the largest difference as target images; the first inter-frame distance and the second inter-frame distance are used for representing the similarity between every two images, and the first inter-frame distance and the second inter-frame distance do not belong to a preset distance range.
Therefore, by implementing the exemplary embodiment, an optimal target image or images can be selected through the required number of target images to be used for synthesizing the video cover, so that the synthesizing efficiency and the synthesizing effect are improved.
In an exemplary embodiment of the disclosure, the manner in which the target image selecting unit 903 calculates the first inter-frame distance and the second inter-frame distance between every two images in the screening result may specifically be:
the target image selection unit 903 determines a first image characteristic and a second image characteristic of each image in the screening result; wherein the first image characteristic comprises at least one of face height, face area, image saturation, image brightness and image sharpness; the second image feature comprises a face orientation;
the target image selecting unit 903 calculates a first inter-frame distance between every two images in the screening result according to the first image characteristics, and calculates a second inter-frame distance between every two images in the screening result according to the second image characteristics.
Therefore, by implementing the exemplary embodiment, the image quality of the selected target image can be ensured through the determination of the first image characteristic and the second image characteristic.
In an exemplary embodiment of the present disclosure, the apparatus may further include an image synthesizing unit (not shown), wherein:
an image synthesizing unit for adding a map material to the target image and adjusting the size of the target image when the number of the target images is 1, and determining the size-adjusted target image as a specific image for representing a video file; when the number of the target images is larger than 1, the target images are synthesized into a specific image representing the video file according to the image synthesis rule.
Therefore, by implementing the exemplary embodiment, the specific image used for representing the video file can be synthesized through the selected target image, so that the automation degree of video cover synthesis is improved, and the synthesis effect of the video cover is improved.
It should be noted that although in the above detailed description several modules or units of the device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit, according to embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into embodiments by a plurality of modules or units.
For details that are not disclosed in the embodiments of the apparatus of the present disclosure, please refer to the embodiments of the image selecting method of the present disclosure for the details that are not disclosed in the embodiments of the apparatus of the present disclosure.
As another aspect, the present application also provides a computer-readable medium, which may be contained in the electronic device described in the above embodiments; or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by an electronic device, cause the electronic device to implement the method described in the above embodiments.
It should be noted that the computer readable media shown in the present disclosure may be computer readable signal media or computer readable storage media or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer-readable signal medium may include a propagated data signal with computer-readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software, or may be implemented by hardware, and the described units may also be disposed in a processor. Wherein the names of the elements do not in some way constitute a limitation on the elements themselves.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (13)

1. An image selection method, comprising:
extracting video frames of the video file to obtain an image sequence;
clustering the image sequence according to the image characteristics corresponding to the images in the image sequence to obtain at least one image set, wherein the images in each image set correspond to the same object;
carrying out multi-dimensional evaluation on each image in the image set according to a preset evaluation standard, and selecting a target image from the image set according to a multi-dimensional evaluation result, wherein the target image is used for synthesizing a specific image representing the video file, and the preset evaluation standard comprises a definition evaluation standard and an aesthetic degree evaluation standard;
after clustering the image sequence according to the image features corresponding to the images in the image sequence to obtain at least one image set, the method further comprises:
performing object recognition on each image in the image sequence, and adjusting the images in the at least one image set according to an object recognition result so that each image in the image set corresponds to the same object;
performing multidimensional evaluation on each image in the image set according to a preset evaluation standard, wherein the multidimensional evaluation comprises the following steps:
determining a definition type corresponding to each image in the image set according to a definition evaluation standard, determining an aesthetic degree score corresponding to each image in the image set according to an aesthetic degree evaluation standard, and screening the images in the image set according to the definition type and the aesthetic degree score corresponding to each image in the image set;
the method further comprises the following steps: selecting the target image from the image set according to a screening result, wherein the selecting the target image from the image set according to the screening result comprises:
if the synthesis parameter is larger than 1, determining a first image characteristic and a second image characteristic of each image in the screening result; wherein the first image characteristic comprises at least one of face height, face area, image saturation, image brightness and image sharpness; the second image feature comprises a face orientation; calculating a first inter-frame distance between every two images in the screening result according to the first image characteristics, and calculating a second inter-frame distance between every two images in the screening result according to the second image characteristics; selecting two images with the largest difference value between the first inter-frame distance and the second inter-frame distance, and determining the two images with the largest difference value as the target image; the first inter-frame distance and the second inter-frame distance are used for representing the similarity between every two images, the first inter-frame distance and the second inter-frame distance do not belong to a preset distance range, and the synthesis parameter represents the number of target images required for synthesizing a specific image for representing the video file.
2. The method of claim 1, wherein the extracting video frames from the video file to obtain the image sequence comprises:
extracting video frames corresponding to the important content according to the video frame identification which is used for representing the important content on the time axis of the video file, so as to obtain the image sequence; or,
and extracting video frames corresponding to the video content containing the target object in the video file to obtain the image sequence.
3. The method of claim 1, wherein clustering the image sequence according to image features corresponding to each image in the image sequence to obtain at least one image set comprises:
calculating a hash value corresponding to each image in the image sequence, and taking the hash value as an image feature corresponding to the image;
and determining the similarity between every two images according to the image characteristics, and clustering the image sequence according to the similarity to obtain at least one image set.
4. The method of claim 3, wherein computing the hash value for each image in the sequence of images comprises:
adjusting the size of each image in the image sequence to a target size and carrying out image graying to obtain a grayscale image corresponding to each image;
calculating an average gray value corresponding to the gray image according to the gray value of each pixel in the gray image, and resetting the gray value of each pixel according to the average gray value;
and combining the gray values of the pixels after being reset, and determining a combination result as the corresponding hash value of the image to obtain the hash value corresponding to each image.
5. The method of claim 3, wherein determining a similarity between each two of the images according to the image features, and clustering the image sequence according to the similarity to obtain at least one image set comprises:
and calculating the Hamming distance between every two images in the image sequence according to the Hash value, taking the Hamming distance as the similarity between every two images, and classifying the images corresponding to the Hamming distance smaller than a preset distance into the same image set so as to obtain at least one image set corresponding to the image sequence.
6. The method of claim 1, wherein determining the sharpness type corresponding to each image in the image set according to a sharpness evaluation criterion comprises:
convolving each image in the image set through a definition classification network to obtain a first feature vector corresponding to the image, and applying an activation function to the first feature vector to obtain a second feature vector corresponding to the image;
pooling the second feature vector through the definition classification network to obtain a third feature vector corresponding to the image, and fully connecting the first feature vector, the second feature vector and the third feature vector;
and calculating the probability of the image belonging to each definition type according to a full connection result through the definition classification network, and determining the definition type corresponding to each image in the image set according to the probability and the definition evaluation standard.
7. The method of claim 1, wherein determining the aesthetic score for each image in the set of images according to an aesthetic evaluation criterion comprises:
calculating the beauty degree grading distribution of the input sample image set through a beauty degree evaluation network;
calculating a loss function between the aesthetic score distribution and an original aesthetic score distribution corresponding to the sample image set;
updating the parameters of the aesthetic degree evaluation network according to the loss function;
and predicting the beauty degree of each image in the image set through the beauty degree evaluation network after the parameters are updated, wherein the prediction result comprises the beauty degree score corresponding to the image.
8. The method of claim 1, wherein selecting a target image from the set of images based on a multi-dimensional evaluation further comprises:
sequencing the image sequence according to the definition type corresponding to each image in the image set, and adjusting the sequencing result according to the aesthetic degree score corresponding to each image in the image set;
identifying the object characteristics of each image in the image sequence, and screening the adjusted sequencing result according to the object characteristics; wherein the object features are used to characterize the morphology of the object in the respective images;
and selecting the target image from the image set according to the screening result.
9. The method of claim 1, wherein identifying object features for each image in the sequence of images comprises:
extracting human face characteristic points of each image in the image sequence through a human face detection algorithm;
determining object features from the face feature points, the object features including at least one of: a human face deflection angle, a human eye closing state, a distance between the human face and a frame of the image, and a human face area.
10. The method of claim 1, wherein selecting the target image from the set of images based on the screening results further comprises:
and if the synthesis parameter is 1, selecting the image with the highest aesthetic degree score from the image set according to the screening result, and determining the image with the highest aesthetic degree score as the target image.
11. An image selecting apparatus, comprising:
the video frame extraction unit is used for extracting video frames of the video file to obtain an image sequence;
the image clustering unit is used for clustering the image sequence according to the image characteristics corresponding to the images in the image sequence to obtain at least one image set, and the images in each image set correspond to the same object;
the image identification unit is used for carrying out object identification on each image in the image sequence and adjusting the images in the at least one image set according to an object identification result so that each image in the image set corresponds to the same object; the target image selecting unit is used for carrying out multi-dimensional evaluation on each image in the image set according to a preset evaluation standard and selecting a target image from the image set according to a multi-dimensional evaluation result, wherein the target image is used for synthesizing a specific image representing the video file, and the preset evaluation standard comprises a definition evaluation standard and an aesthetic degree evaluation standard;
performing multidimensional evaluation on each image in the image set according to a preset evaluation standard, wherein the multidimensional evaluation comprises the following steps:
determining a definition type corresponding to each image in the image set according to a definition evaluation standard, determining an aesthetic degree score corresponding to each image in the image set according to an aesthetic degree evaluation standard, and screening the images in the image set according to the definition type and the aesthetic degree score corresponding to each image in the image set;
the target image selecting unit is further configured to: if the synthesis parameter is larger than 1, determining a first image characteristic and a second image characteristic of each image in the screening result; wherein the first image characteristic comprises at least one of face height, face area, image saturation, image brightness and image sharpness; the second image feature comprises a face orientation; calculating a first inter-frame distance between every two images in the screening result according to the first image characteristics, and calculating a second inter-frame distance between every two images in the screening result according to the second image characteristics; selecting two images with the largest difference value between the first inter-frame distance and the second inter-frame distance, and determining the two images with the largest difference value as the target image; the first inter-frame distance and the second inter-frame distance are used for representing the similarity between every two images, the first inter-frame distance and the second inter-frame distance do not belong to a preset distance range, and the synthesis parameter represents the number of target images required for synthesizing a specific image for representing the video file.
12. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method of any one of claims 1-10.
13. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the method of any of claims 1-10 via execution of the executable instructions.
CN201911286031.7A 2019-12-13 2019-12-13 Image selection method and device, computer readable storage medium and electronic equipment Active CN111062314B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911286031.7A CN111062314B (en) 2019-12-13 2019-12-13 Image selection method and device, computer readable storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN111062314A CN111062314A (en) 2020-04-24
CN111062314B true CN111062314B (en) 2021-11-02

Family

ID=70301595

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911286031.7A Active CN111062314B (en) 2019-12-13 2019-12-13 Image selection method and device, computer readable storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN111062314B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114520890B (en) * 2020-11-19 2023-07-11 华为技术有限公司 Image processing method and device
CN112381028A (en) * 2020-11-23 2021-02-19 苏州极目机器人科技有限公司 Target feature detection method and device
CN112487906A (en) * 2020-11-23 2021-03-12 苏州极目机器人科技有限公司 Target male parent treatment method and target female parent detection method
CN113489919A (en) * 2021-06-21 2021-10-08 北京德风新征程科技有限公司 Digital video production system based on internet big data
CN113436291B (en) * 2021-06-21 2024-03-19 北京达佳互联信息技术有限公司 Image processing method and device
CN116600132B (en) * 2023-07-19 2023-10-31 华洋通信科技股份有限公司 Coal mine video data self-adaptive compression method
CN116701706B (en) * 2023-07-29 2023-09-29 腾讯科技(深圳)有限公司 Data processing method, device, equipment and medium based on artificial intelligence

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107832725A (en) * 2017-11-17 2018-03-23 北京奇虎科技有限公司 Video front cover extracting method and device based on evaluation index
CN107945175A (en) * 2017-12-12 2018-04-20 百度在线网络技术(北京)有限公司 Evaluation method, device, server and the storage medium of image
CN110149532A (en) * 2019-06-24 2019-08-20 北京奇艺世纪科技有限公司 A kind of cover choosing method and relevant device
CN110381368A (en) * 2019-07-11 2019-10-25 北京字节跳动网络技术有限公司 Video cover generation method, device and electronic equipment

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109729426B (en) * 2017-10-27 2022-03-01 优酷网络技术(北京)有限公司 Method and device for generating video cover image
CN108650524B (en) * 2018-05-23 2022-08-16 腾讯科技(深圳)有限公司 Video cover generation method and device, computer equipment and storage medium
CN110059739B (en) * 2019-04-12 2022-03-04 北京字节跳动网络技术有限公司 Image synthesis method, image synthesis device, electronic equipment and computer-readable storage medium
CN110309721B (en) * 2019-05-31 2021-06-29 百度在线网络技术(北京)有限公司 Video processing method, terminal and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107832725A (en) * 2017-11-17 2018-03-23 北京奇虎科技有限公司 Video front cover extracting method and device based on evaluation index
CN107945175A (en) * 2017-12-12 2018-04-20 百度在线网络技术(北京)有限公司 Evaluation method, device, server and the storage medium of image
CN110149532A (en) * 2019-06-24 2019-08-20 北京奇艺世纪科技有限公司 A kind of cover choosing method and relevant device
CN110381368A (en) * 2019-07-11 2019-10-25 北京字节跳动网络技术有限公司 Video cover generation method, device and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
NIMA: Neural image assessment; Talebi H et al.; IEEE Trans. Image Processing; 2018-08-31; Vol. 27, No. 8; Abstract, Sections 1-4 *

Also Published As

Publication number Publication date
CN111062314A (en) 2020-04-24

Similar Documents

Publication Publication Date Title
CN111062314B (en) Image selection method and device, computer readable storage medium and electronic equipment
Kim et al. Fully deep blind image quality predictor
US11830230B2 (en) Living body detection method based on facial recognition, and electronic device and storage medium
CN108446390B (en) Method and device for pushing information
Kim et al. Spatiotemporal saliency detection and its applications in static and dynamic scenes
Gygli et al. The interestingness of images
WO2021139324A1 (en) Image recognition method and apparatus, computer-readable storage medium and electronic device
Xu et al. Visual quality assessment by machine learning
US20130094756A1 (en) Method and system for personalized advertisement push based on user interest learning
CN114913565A (en) Face image detection method, model training method, device and storage medium
JP2003523587A (en) Visual attention system
CN111814620A (en) Face image quality evaluation model establishing method, optimization method, medium and device
CN106709419B (en) Video human behavior recognition method based on significant trajectory spatial information
CN113705290A (en) Image processing method, image processing device, computer equipment and storage medium
dos Santos et al. CV-C3D: action recognition on compressed videos with convolutional 3d networks
Yu et al. Deep forgery discriminator via image degradation analysis
CN116261009B (en) Video detection method, device, equipment and medium for intelligently converting video audience
CN105893967B (en) Human behavior classification detection method and system based on time sequence retention space-time characteristics
CN115294162B (en) Target identification method, device, equipment and storage medium
CN114449362B (en) Video cover selection method, device, equipment and storage medium
CN112183333B (en) Human screen interaction method, system and device based on micro-expressions
CN111818364B (en) Video fusion method, system, device and medium
CN114202723A (en) Intelligent editing application method, device, equipment and medium through picture recognition
Kar et al. What makes a video memorable?
Zhang et al. No-reference image quality assessment using independent component analysis and convolutional neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40022599

Country of ref document: HK

GR01 Patent grant