CN114898419A - Method, device, medium and computing equipment for extracting key images in image sequence - Google Patents


Publication number
CN114898419A
CN114898419A
Authority
CN
China
Prior art keywords
image
face
frame candidate
images
face detection
Prior art date
Legal status
Pending
Application number
CN202210303514.9A
Other languages
Chinese (zh)
Inventor
虞勇波
黄安麒
赵剑
赵翔宇
刘华平
Current Assignee
Hangzhou Netease Cloud Music Technology Co Ltd
Original Assignee
Hangzhou Netease Cloud Music Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Netease Cloud Music Technology Co Ltd
Priority to CN202210303514.9A
Publication of CN114898419A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/22: Matching criteria, e.g. proximity measures
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning

Abstract

The embodiment of the invention provides a method, a device, a medium and a computing device for extracting key images in an image sequence. The method includes: iteratively executing the following image sampling detection process, and determining, among a plurality of frame candidate images extracted from an image sequence, the candidate images whose face detection results meet a preset rule as key images in the image sequence. The image sampling detection process includes: acquiring a face detection result, output by a face detection model, corresponding to the current frame candidate image; determining an image sampling interval corresponding to the current frame candidate image based on the face detection result; extracting from the image sequence a next frame candidate image whose spacing from the current frame candidate image, measured in number of images, matches the image sampling interval; and taking the next frame candidate image as the new current frame candidate image and inputting it into the face detection model, so that the face detection model performs face detection on it. The method and the device can improve the extraction efficiency and accuracy of the key images.

Description

Method, device, medium and computing equipment for extracting key images in image sequence
Technical Field
The embodiment of the invention relates to the technical field of computer application, in particular to a method, a device, a medium and a computing device for extracting a key image in an image sequence.
Background
This section is intended to provide a background or context to the embodiments of the invention. The description herein is not admitted to be prior art by inclusion in this section.
With the continuous development of multimedia technology, image sequences containing multiple frames of images, such as videos, are becoming ever more numerous. An image sequence is generally a series of images of a shooting target acquired in order at different times and from different orientations; for example, a video may be a temporally continuous series of images acquired over a period of time for a shooting target in motion.
Image sequences can provide rich information, for example about the shooting target itself and about its motion state and motion trajectory; however, an image sequence also contains much redundant information. In practical applications, it is therefore generally desirable to extract one or more of the more important frames from an image sequence as key images (also referred to as key frames), so that only these key images need to be processed. A key image is generally an image whose content has some prominent feature compared with the other images in the sequence, or an image whose content readily draws attention.
Disclosure of Invention
In this context, embodiments of the present invention are intended to provide a method, an apparatus, a medium, and a computing device for extracting a key image in an image sequence.
In a first aspect of the embodiments of the present invention, there is provided a method for extracting a key image in an image sequence, the method including:
iteratively executing an image sampling detection process to extract a plurality of frame candidate images from an image sequence and carry out face detection on the plurality of frame candidate images;
determining candidate images of which the face detection results conform to a preset rule in the plurality of frames of candidate images as key images in the image sequence;
wherein, the image sampling detection process comprises:
acquiring a face detection result which is output by a face detection model and corresponds to the current frame candidate image;
determining an image sampling interval corresponding to the current frame candidate image based on the face detection result;
extracting a next frame candidate image corresponding to the current frame candidate image from the image sequence, wherein the number of images by which the next frame candidate image and the current frame candidate image are spaced in the image sequence matches the image sampling interval;
and taking the next frame candidate image as a new current frame candidate image, and inputting the next frame candidate image into the face detection model so that the face detection model performs face detection on the next frame candidate image.
In a second aspect of embodiments of the present invention, there is provided an apparatus for extracting a key image from an image sequence, the apparatus comprising:
the detection module is used for iteratively executing an image sampling detection process so as to extract a plurality of frame candidate images from an image sequence and carry out face detection on the plurality of frame candidate images;
the determining module is used for determining a candidate image of which the face detection result accords with a preset rule in the plurality of frames of candidate images as a key image in the image sequence;
wherein the detection module comprises:
the acquisition submodule is used for acquiring a face detection result which is output by the face detection model and corresponds to the current frame candidate image;
a determining sub-module for determining an image sampling interval corresponding to the current frame candidate image based on the face detection result;
an extraction submodule, configured to extract a next frame candidate image corresponding to the current frame candidate image from the image sequence, wherein the number of images by which the next frame candidate image and the current frame candidate image are spaced in the image sequence matches the image sampling interval;
and the detection sub-module is used for taking the next frame candidate image as a new current frame candidate image and inputting the new current frame candidate image into the face detection model so that the face detection model can carry out face detection on the next frame candidate image.
In a third aspect of the embodiments of the present disclosure, there is provided a medium having stored thereon a computer program which, when executed by a processor, implements any of the above-described methods for extracting a key image from an image sequence.
In a fourth aspect of embodiments of the present disclosure, there is provided a computing device comprising:
a processor;
a memory for storing a program executable by the processor;
wherein the processor executes the executable program to implement any of the above-described methods for extracting a key image from an image sequence.
According to the embodiments of the present disclosure, an image sampling detection process can be executed iteratively, and the candidate images whose face detection results meet a preset rule, among a plurality of frame candidate images extracted from an image sequence, are determined as key images in the image sequence. When the image sampling detection process is executed, an image sampling interval corresponding to the current frame candidate image may be determined based on the face detection result, output by the face detection model, corresponding to the current frame candidate image; based on this image sampling interval, a next frame candidate image corresponding to the current frame candidate image is extracted from the image sequence and input into the face detection model as the new current frame candidate image for face detection.
With this method, the number of images between the current frame candidate image and the next frame candidate image can be adjusted dynamically, so that computing over every frame of the image sequence, or over many frames extracted at equal intervals, can be avoided; this reduces the amount of calculation and improves the extraction efficiency of the key images.
Furthermore, since a key image is determined by checking whether its face detection result meets the preset rule, the extraction accuracy of the key images can be improved.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
FIG. 1 schematically illustrates a schematic diagram of a user interface according to an embodiment of the present disclosure;
FIG. 2 schematically illustrates a schematic diagram of an application scenario of image processing according to an embodiment of the present disclosure;
FIG. 3 schematically shows a flow chart of a method of image processing according to an embodiment of the present disclosure;
FIG. 4 schematically illustrates a diagram of 49 face key points, according to an embodiment of the present disclosure;
FIG. 5 schematically illustrates a diagram of 68 face key points, according to an embodiment of the present disclosure;
FIG. 6 schematically illustrates a schematic view of an image according to an embodiment of the present disclosure;
FIG. 7 schematically illustrates a schematic view of a medium according to an embodiment of the disclosure;
fig. 8 schematically shows a block diagram of an image processing apparatus according to an embodiment of the present disclosure;
FIG. 9 schematically shows a schematic diagram of a computing device in accordance with an embodiment of the present disclosure.
In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
Detailed Description
The principles and spirit of the present invention will be described with reference to a number of exemplary embodiments. It is understood that these embodiments are given solely for the purpose of enabling those skilled in the art to better understand and to practice the invention, and are not intended to limit the scope of the invention in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As will be appreciated by one skilled in the art, embodiments of the present invention may be embodied as a system, apparatus, device, method, or computer program product. Accordingly, the present disclosure may be embodied in the form of: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.
According to the embodiment of the invention, a method, a device, a medium and a computing device for extracting key images in an image sequence are provided.
In this document, it is to be understood that any number of elements in the figures are provided by way of illustration and not limitation, and any nomenclature is used for differentiation only and not in any limiting sense.
The principles and spirit of the present disclosure are explained in detail below with reference to several representative embodiments of the present disclosure.
Summary of The Invention
In practical applications, it is often the case that key images need to be extracted from an image sequence such as a video.
Taking as an example an application (APP), such as a video APP or a music APP, capable of providing videos such as movies, TV dramas, and MVs (Music Videos) to a user: in this type of application, information related to a plurality of videos is usually output through a user interface, so that the user can select an interesting video to watch based on the related information.
When outputting the related information of each video through the user interface, one frame of image related to the video is usually used as a video cover of the video, and the video cover and a video title of the video are displayed in the user interface, as shown in fig. 1. Wherein the video cover is typically a key image in the video. In this case, it is necessary to extract a key image for each video so that the extracted key image serves as a video cover of the video.
In the related art, for an image sequence, a key image can be extracted from the image sequence by analyzing a plurality of frames of images included in the image sequence.
Specifically, two adjacent images among the plurality of images included in the image sequence may be analyzed, the difference or similarity of the two images in a specific dimension (for example, the shooting target, the shooting background, and the like) may be calculated, and the calculated value may be compared with a preset threshold. If the difference is greater than the threshold (or, equivalently, the similarity falls below it), the two frames of images differ significantly; otherwise, the difference between the two frames is small. In this case, the images with smaller differences can be filtered out of the plurality of images, and one or more frames can be screened out of the images with larger differences as the key images in the image sequence.
The multi-frame images included in the image sequence may be extracted from the image sequence at equal intervals according to a preset image sampling interval. To ensure that a suitable key image can still be extracted from the image sequence, the image sampling interval generally cannot be set too large. In this case, two adjacent images among the plurality of images are the two images closest in order in the image sequence. For example: assuming that the image sequence contains a total of 5 frames of images acquired in sequence, i.e., image 1, image 2, image 3, image 4, and image 5, and the image sampling interval is 2, then 3 frames of images, i.e., image 1, image 3 (1+2=3), and image 5 (3+2=5), can be extracted from the image sequence at equal intervals; among these 3 frames, image 1 and image 3 may be regarded as two adjacent frames, and image 3 and image 5 may also be regarded as two adjacent frames.
Alternatively, the multi-frame images included in the image sequence may be all the images included in the image sequence. In this case, the two adjacent frames of the plurality of frames of images may be two sequentially adjacent frames of images in the image sequence. For example, assuming that the image sequence contains 3 frames of images, i.e., image 1, image 2, and image 3, which are successively acquired in sequence, in the 3 frames of images, image 1 and image 2 may be regarded as two adjacent frames of images, and image 2 and image 3 may also be regarded as two adjacent frames of images.
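To make the related-art flow above concrete, the following is a minimal sketch of equal-interval sampling followed by adjacent-frame comparison. The mean absolute grayscale difference and the threshold value are illustrative assumptions; the related art does not prescribe a specific difference measure.

import numpy as np

def related_art_key_frames(frames, interval=2, diff_threshold=10.0):
    """frames: list of equally sized grayscale images (numpy arrays)."""
    sampled = frames[::interval]  # e.g. images 1, 3, 5 when the interval is 2
    key_frames = []
    for prev, curr in zip(sampled, sampled[1:]):
        # difference of two adjacent sampled frames in a specific dimension;
        # here simply the mean absolute pixel difference (an assumption)
        diff = np.abs(curr.astype(np.float32) - prev.astype(np.float32)).mean()
        if diff > diff_threshold:  # large difference -> candidate key image
            key_frames.append(curr)
    return key_frames

Note that every sampled frame is processed regardless of its content, which is precisely the computational drawback discussed next.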
However, in the related art, each frame of the image sequence, or each of the multiple frames extracted at equal intervals, needs to be processed separately in the course of extracting the key image, so the amount of calculation is usually large and the extraction efficiency of the key images is low. Further, since the key image is determined by comparing the calculated difference or similarity with a threshold that is usually set manually in advance, whether the manually set threshold is appropriate becomes crucial; an unsuitable threshold greatly degrades the extraction result.
In order to solve the above problem, the present disclosure provides a technical solution for extracting key images from an image sequence. In this technical solution, an image sampling detection process can be executed iteratively, and the candidate images whose face detection results meet a preset rule, among a plurality of frame candidate images extracted from an image sequence, are determined as key images in the image sequence. When the image sampling detection process is executed, an image sampling interval corresponding to the current frame candidate image may be determined based on the face detection result, output by the face detection model, corresponding to the current frame candidate image; based on this image sampling interval, a next frame candidate image corresponding to the current frame candidate image is extracted from the image sequence and input into the face detection model as the new current frame candidate image for face detection.
With this method, the number of images between the current frame candidate image and the next frame candidate image can be adjusted dynamically, so that computing over every frame of the image sequence, or over many frames extracted at equal intervals, can be avoided; this reduces the amount of calculation and improves the extraction efficiency of the key images.
Furthermore, since a key image is determined by checking whether its face detection result meets the preset rule, the extraction accuracy of the key images can be improved.
Having described the general principles of the invention, various non-limiting embodiments of the invention are described in detail below.
Application scene overview
Referring to fig. 2, fig. 2 schematically shows a schematic diagram of an application scenario of image processing according to an embodiment of the present disclosure.
As shown in fig. 2, in an application scenario of image processing, a server and at least one client (e.g., clients 1-N) accessing the server through any type of wired or wireless network may be included.
The server can be deployed on a server comprising an independent physical host or a server cluster consisting of a plurality of independent physical hosts; or, the server may be a server built based on a cloud computing service.
The client may correspond to an application installed in a device used by the user; the device may be a smart phone, a tablet computer, a notebook computer, a PC (Personal Computer), a PDA (Personal Digital Assistant), a wearable device (e.g., smart glasses, a smart watch, etc.), a smart vehicle-mounted device, or a game console.
The device carrying the server or the client can also be provided with shooting hardware for collecting images or videos, such as an embedded camera, an external camera and the like. In this case, the shooting hardware may be invoked to capture an image or video of a shooting target, and obtain an image sequence corresponding to the captured multi-frame image or video itself.
The server or the client may be equipped with a face detection model, so as to perform face detection on a specific image in the image sequence based on the face detection model, and determine a key image in the image sequence based on a face detection result corresponding to the specific image.
Exemplary method
A method for extraction of key images in an image sequence according to an exemplary embodiment of the present disclosure is described below with reference to fig. 3-6 in conjunction with the application scenario of fig. 2. It should be noted that the above application scenarios are only illustrated for the convenience of understanding the spirit and principles of the present disclosure, and the embodiments of the present disclosure are not limited in any way in this respect. Rather, embodiments of the present disclosure may be applied to any scenario where applicable.
Referring to fig. 3, fig. 3 schematically shows a flowchart of a method for extracting a key image in an image sequence according to an embodiment of the present disclosure.
The method for extracting the key images in the image sequence can be applied to the server or the client carrying a machine learning model (which may be called a face detection model) for face detection.
For the server and the client, generally, the hardware resources of the device where the server is located are richer than those of the device where the client is located. In this case, the computing power of the server is stronger than that of the client, and the complexity of the machine learning model that can be executed in the server is higher than that of the machine learning model that can be executed in the client.
Besides, for machine learning models that need to perform real-time computation tasks (e.g., real-time image processing task during video call, real-time audio processing task during voice call, etc.), the computation rate of such machine learning models is usually required to be high, i.e., such machine learning models are required to be able to output computation results in a short time.
Therefore, the face detection model running in the server may be a full-scale machine learning model, while the face detection model running in the client may be a lightweight machine learning model. The full-scale machine learning model can be a machine learning model obtained by training a preset machine learning model for face detection; the lightweight machine learning model can be obtained by pruning repeated structures, such as convolutional layers, from the full-scale machine learning model.
Specifically, a preset machine learning model for face detection may be trained to obtain the full-scale machine learning model. Subsequently, the lightweight machine learning model can learn the knowledge of the trained full-scale machine learning model through Knowledge Distillation, so that its model effect approximates that of the full-scale machine learning model. That is, model parameters migrated from the full-scale machine learning model by knowledge distillation may be used as the model parameters of the lightweight machine learning model.
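As an illustration of the distillation step, the sketch below assumes a PyTorch teacher (the full-scale model) and student (the lightweight model) and uses the classic softened-logit KL objective, treating the model output as a score vector for simplicity; the text does not specify the distillation objective, so this particular loss is an assumption.

import torch
import torch.nn.functional as F

def distillation_loss(student_out, teacher_out, temperature=4.0):
    # Match the student's softened output distribution to the teacher's;
    # the temperature and the KL form are conventional choices, not
    # mandated by the text.
    log_p_student = F.log_softmax(student_out / temperature, dim=-1)
    p_teacher = F.softmax(teacher_out / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher,
                    reduction="batchmean") * temperature ** 2

def distill_step(student, teacher, images, optimizer):
    teacher.eval()
    with torch.no_grad():                 # the full-scale model is frozen
        teacher_out = teacher(images)
    loss = distillation_loss(student(images), teacher_out)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                      # only the student is updated
    return loss.item()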
The method for extracting the key image in the image sequence can comprise the following steps:
step 301: and iteratively executing an image sampling detection process to extract a plurality of frame candidate images from the image sequence and carry out face detection on the plurality of frame candidate images.
The image sampling detection process may include the following steps:
step 3011: and acquiring a face detection result which is output by the face detection model and corresponds to the current frame candidate image.
Step 3012: and determining an image sampling interval corresponding to the current frame candidate image based on the face detection result.
Step 3013: extracting a next frame candidate image corresponding to the current frame candidate image from the image sequence, wherein the number of images by which the next frame candidate image and the current frame candidate image are spaced in the image sequence matches the image sampling interval.
Step 3014: and taking the next frame candidate image as a new current frame candidate image, and inputting the next frame candidate image into the face detection model so that the face detection model performs face detection on the next frame candidate image.
In practical applications, a person is very often the shooting target for which an image sequence is acquired. For example, videos such as movies, TV dramas, and MVs are usually captured by taking an actor as the shooting target and acquiring an image sequence of that actor. Therefore, when extracting the key images of such an image sequence, images in which the presented human faces appear to good effect can be extracted from the image sequence as its key images.
In this embodiment, for an image sequence that needs to extract a key image, an image sampling detection procedure may be iteratively performed on the image sequence to extract several frames of images (which may be referred to as candidate images) from the image sequence, and perform face detection on the extracted several frames of candidate images.
An image sample detection flow will be described below.
Firstly, a frame of image can be extracted from the image sequence to be used as a current frame candidate image, and the current frame candidate image is input into the face detection model, so that the face detection model performs face detection on the current frame candidate image. In this case, the face detection result corresponding to the current frame candidate image output by the face detection model may be obtained.
Next, an image sampling interval corresponding to the current frame candidate image may be determined based on a face detection result corresponding to the current frame candidate image.
Then, in the image sequence, a frame of image whose number of images spaced from the current frame candidate image matches the image sampling interval may be determined, and the frame of image may be extracted as a next frame candidate image corresponding to the current frame candidate image. That is, the next frame candidate image corresponding to the current frame candidate image is actually determined by the face detection result corresponding to the current frame candidate image.
Finally, the next frame candidate image may be input into the face detection model, so that the face detection model performs face detection on the next frame candidate image. In this case, the next frame candidate image also becomes a new current frame candidate image.
It should be noted that, besides requiring that the number of images by which the next frame candidate image and the current frame candidate image are spaced in the image sequence matches the image sampling interval, the next frame candidate image may additionally be required to come after the current frame candidate image in the order of the image sequence, or to come before it. Alternatively, both the preceding and the following frame whose spacing from the current frame candidate image matches the image sampling interval may be used as next frame candidate images corresponding to the current frame candidate image; in that case, the two next frame candidate images may each be input into the face detection model for face detection, that is, each may serve as a new current frame candidate image. The present disclosure is not limited in this respect.
For example, assuming that the image sequence includes 5 frames of images acquired in sequence, i.e., image 1, image 2, image 3, image 4, and image 5, and that the image sampling interval corresponding to image 3, determined based on the face detection result corresponding to image 3 as the current frame candidate image, is 2, then image 5 (3+2=5) may be used as the next frame candidate image corresponding to image 3, or image 1 (3-2=1) may be used, or both image 1 and image 5 may be used.
It should be further noted that "the number of images by which the next frame candidate image and the current frame candidate image are spaced in the image sequence matches the image sampling interval" may mean either that the difference between the positions of the two images in the image sequence equals the image sampling interval, or that the number of images lying between them equals the image sampling interval. The present disclosure is not limited in this respect.
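Putting steps 3011 to 3014 together, the procedure can be sketched as the loop below. Here `face_detector` and `sampling_interval` stand in for the face detection model and for the interval rules described in the following sections; forward sampling from the first frame is assumed, although, as noted above, backward or bidirectional sampling is equally possible.

def extract_candidates(frames, face_detector, sampling_interval):
    """Iteratively sample candidate images and run face detection on them."""
    candidates = []   # (frame index, face detection result) pairs
    idx = 0           # first candidate: first image of the sequence (or random)
    while 0 <= idx < len(frames):
        result = face_detector(frames[idx])   # step 3011: detection result
        candidates.append((idx, result))
        step = sampling_interval(result)      # step 3012: interval from result
        if step <= 0:                         # guard against a stalled loop
            break
        idx += step                           # steps 3013/3014: next candidate
    return candidates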
In one embodiment shown, the sequence of images may be a video. Since different videos generally have different frame rates, in order to adapt to different video frame rates, when determining an image sampling interval corresponding to the current frame candidate image based on the face detection result corresponding to the current frame candidate image, specifically, an image sampling rate corresponding to the face detection result may be determined first, and then a product of the image sampling rate and the frame rate of the video may be determined as the image sampling interval corresponding to the current frame candidate image, as shown in the following formula:
X = fps × v

where X denotes the image sampling interval corresponding to the current frame candidate image, fps denotes the frame rate of the video, and v denotes the image sampling rate corresponding to the face detection result of the current frame candidate image. For example, assuming that the frame rate of the video is 60 frames per second and the image sampling rate corresponding to the face detection result of the current frame candidate image is 0.1, the image sampling interval corresponding to the current frame candidate image may be determined to be 6 (60 × 0.1 = 6) frames.
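In code, the frame-rate adaptation is a one-line product; rounding to a whole number of frames is an assumption, since the text does not state how fractional intervals are handled.

def interval_from_rate(fps, sample_rate):
    # X = fps x v, rounded to an integer number of frames (assumed)
    return round(fps * sample_rate)

assert interval_from_rate(60, 0.1) == 6   # the example from the text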
When the image sampling detection procedure is performed iteratively on the image sequence, the image in position 1 of the image sequence (i.e., the first frame image of the image sequence) may be extracted as the first frame candidate image. Alternatively, one frame may be extracted at random from the image sequence as the first frame candidate image. The present disclosure is not limited in this respect.
For example, assuming that the image sequence contains 5 frames of images acquired in sequence, i.e., image 1, image 2, image 3, image 4, and image 5, then image 1, which occupies position 1 in the image sequence, may be extracted as the first frame candidate image; alternatively, a frame such as image 2 may be extracted at random as the first frame candidate image.
Subsequently, the first frame candidate image may be input to the face detection model, so that a face detection result output by the face detection model and corresponding to the first frame candidate image may be obtained, an image sampling interval corresponding to the first frame candidate image may be determined based on the face detection result, and a second frame candidate image whose number of images spaced from the first frame candidate image matches the image sampling interval may be extracted from the image sequence.
When the second frame candidate image is extracted, the second frame candidate image may be input to the face detection model, so that a face detection result corresponding to the second frame candidate image output by the face detection model may be acquired, an image sampling interval corresponding to the second frame candidate image may be determined based on the face detection result, and a third frame candidate image whose number of images spaced from the second frame candidate image matches the image sampling interval may be extracted from the image sequence.
This continues until the last frame candidate image is extracted from the image sequence. For the last frame candidate image, no image exists in the image sequence whose spacing from it matches the image sampling interval corresponding to it.
For example, assume that the image sequence contains 5 frames of images acquired in sequence, namely image 1, image 2, image 3, image 4, and image 5, and that image 1 is extracted from the image sequence as the first frame candidate image. Image 1 may then be input into the face detection model, and the face detection result corresponding to image 1 output by the face detection model is obtained. Assuming that the image sampling interval corresponding to image 1, determined based on this face detection result, is 2, image 3 (1+2=3) can be extracted from the image sequence as the second frame candidate image, input into the face detection model, and its face detection result obtained. Assuming that the image sampling interval corresponding to image 3, determined based on that result, is 1, image 4 (3+1=4) can be extracted as the third frame candidate image, input into the face detection model, and its face detection result obtained. Assuming that the image sampling interval corresponding to image 4, determined based on that result, is 2, image 4 is then the last frame candidate image, since no image two positions beyond it exists in the sequence.
In summary, by iteratively performing the image sampling detection procedure on the image sequence, several frame candidate images can be extracted from the image sequence. In addition, in the process of extracting the plurality of frame candidate images from the image sequence, the face detection of any one frame candidate image in the plurality of frame candidate images is already finished, and a corresponding face detection result is obtained.
Step 302: and determining the candidate image of which the face detection result accords with a preset rule in the plurality of frames of candidate images as a key image in the image sequence.
In this embodiment, when face detection results respectively corresponding to the plurality of frame candidate images are obtained, a candidate image whose face detection result meets a preset rule in the plurality of frame candidate images may be determined as a key image in the image sequence.
According to the embodiments of the present disclosure, an image sampling detection process can be executed iteratively, and the candidate images whose face detection results meet a preset rule, among a plurality of frame candidate images extracted from an image sequence, are determined as key images in the image sequence. When the image sampling detection process is executed, an image sampling interval corresponding to the current frame candidate image may be determined based on the face detection result, output by the face detection model, corresponding to the current frame candidate image; based on this image sampling interval, a next frame candidate image corresponding to the current frame candidate image is extracted from the image sequence and input into the face detection model as the new current frame candidate image for face detection.
With this method, the number of images between the current frame candidate image and the next frame candidate image can be adjusted dynamically, so that computing over every frame of the image sequence, or over many frames extracted at equal intervals, can be avoided; this reduces the amount of calculation and improves the extraction efficiency of the key images.
Furthermore, since a key image is determined by checking whether its face detection result meets the preset rule, the extraction accuracy of the key images can be improved.
Two ways of determining the image sampling interval corresponding to the current frame candidate image based on the face detection result corresponding to the current frame candidate image in any image sampling detection process are described below.
(1) First way of determining the sampling interval of an image
In an illustrated embodiment, when the face detection model performs face detection on any frame of image, it may on one hand detect the number of faces contained in the frame of image, and on the other hand detect the key region of each face in the frame of image. In this case, the face detection result output by the face detection model for the frame of image may include: the number of faces and the face key regions corresponding to the frame of image.
Accordingly, when the image sampling interval corresponding to the current frame candidate image is determined based on the face detection result corresponding to the current frame candidate image, specifically, the face state type corresponding to the current frame candidate image may be determined based on the number of faces corresponding to the current frame candidate image and the size of the face key region corresponding to the current frame candidate image, and then the image sampling interval corresponding to the face state type may be used as the image sampling interval corresponding to the current frame candidate image.
Further, in the illustrated embodiment, since the shapes of a person's eyes and mouth usually change with facial expression, the eye region and the mouth region may be used as the above-mentioned face key regions.
Further, in an illustrated embodiment, the following manner may be adopted to determine the face state type corresponding to the current frame candidate image based on the number of faces corresponding to the current frame candidate image and the size of the face key region corresponding to the current frame candidate image:
if the number of the faces corresponding to the current frame candidate image is 0, it indicates that the current frame candidate image does not contain a face, and at this time, the face state type corresponding to the current frame candidate image may be determined as the first type face state.
If the number of faces corresponding to the current frame candidate image is greater than a preset threshold (which may be referred to as a first threshold), it indicates that the number of faces included in the current frame candidate image is greater, and at this time, the face state type corresponding to the current frame candidate image may be determined as the second type face state.
If the number of faces corresponding to the current frame candidate image is greater than 0 and less than or equal to the first threshold (i.e., 0 < number of faces ≤ first threshold), the face state type corresponding to the current frame candidate image may be further determined based on the relationship between the size of the eye region corresponding to the current frame candidate image and a preset threshold (which may be referred to as a second threshold), and the relationship between the size of the mouth region corresponding to the current frame candidate image and another preset threshold (which may be referred to as a third threshold).
Specifically, the face state type corresponding to the current frame candidate image may be determined to be the third type face state when the size of the eye region corresponding to the current frame candidate image is smaller than the second threshold.
The face state type corresponding to the current frame candidate image may be determined to be a fourth type face state in a case where the size of the eye region corresponding to the current frame candidate image is greater than or equal to the second threshold and the size of the mouth region corresponding to the current frame candidate image is greater than the third threshold.
The face state type corresponding to the current frame candidate image may be determined to be a fifth-type face state in a case where the size of the eye region corresponding to the current frame candidate image is greater than or equal to the second threshold and the size of the mouth region corresponding to the current frame candidate image is less than or equal to the third threshold.
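Collecting the five cases, a sketch of the classification might look as follows. The concrete threshold values and the returned labels are placeholders; only the branching structure comes from the text.

def face_state_type(num_faces, eye_size, mouth_size,
                    first_threshold=3,      # max acceptable face count (assumed)
                    second_threshold=0.02,  # min eye-region size (assumed)
                    third_threshold=0.05):  # max mouth-region size (assumed)
    if num_faces == 0:
        return "first"                  # no face in the candidate image
    if num_faces > first_threshold:
        return "second"                 # too many faces
    # 0 < num_faces <= first_threshold: inspect the face key regions
    if eye_size < second_threshold:
        return "third"                  # eyes closed or barely visible
    if mouth_size > third_threshold:
        return "fourth"                 # mouth wide open
    return "fifth"                      # eyes open, mouth closed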
In practical application, the face detection model can detect face key points of any face in any frame of image, and label the detected face key points on the image, so that the image with the face key points labeled can be output. Among them, face key points (Facial Landmark) are points for locating a face contour, each Facial region, and the like, and 49 face key points as shown in fig. 4, 68 face key points as shown in fig. 5, and 108 face key points are generally employed.
In addition, the face detection model can also establish a rectangular coordinate system for any frame of image, and output the coordinates of the vertex of the rectangular region where any face in the image is located and the coordinates of the face key points corresponding to the face. As shown in fig. 6, the face detection model may output coordinates of vertex 1, vertex 2, vertex 3, and vertex 4 of a rectangular region where the face 1 is located, and coordinates of a face key point corresponding to the face 1.
When determining the size of the face key region corresponding to the current frame candidate image, the size of the face key region may be specifically calculated based on the coordinates of the vertex of the face key region. Because the sizes of the rectangular areas where the faces are located in different images are usually different, in order to adapt to different face ratios, normalization processing can be performed on the sizes of the key areas of the faces based on the coordinates of the vertexes of the rectangular areas where the faces are located.
Continuing with the example of the image shown in fig. 6, assume that the coordinates of vertex 1 are (x1, y1), those of vertex 2 are (x2, y2), those of vertex 3 are (x3, y3), and those of vertex 4 are (x4, y4). When face key point detection is performed on face 1 using the 68 face key points shown in fig. 5, the size R of the right-eye region in face 1, which relates to the longitudinal length of the right-eye region, can be calculated from the coordinates of face key point No. 39 (assumed to be (xd39, yd39)) and face key point No. 41 (assumed to be (xd41, yd41)) on face 1, normalized using the coordinates of vertices 1 and 2 (or vertices 3 and 4), as in the following equation:

R = |yd39 - yd41| / |y1 - y2|
Similarly, the size L of the left-eye region in face 1, which relates to the longitudinal length of the left-eye region, can be calculated from the coordinates of face key point No. 44 (assumed to be (xd44, yd44)) and face key point No. 48 (assumed to be (xd48, yd48)) on face 1 and the coordinates of vertices 1 and 2:

L = |yd44 - yd48| / |y1 - y2|
Similarly, the size M of the mouth region in face 1, which relates to the longitudinal length of the mouth region, can be calculated from the coordinates of face key point No. 52 (assumed to be (xd52, yd52)) and face key point No. 58 (assumed to be (xd58, yd58)) on face 1 and the coordinates of vertices 1 and 2:

M = |yd52 - yd58| / |y1 - y2|
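A sketch of the three normalized region sizes, following the reconstructed formulas above (68-key-point numbering, 1-indexed in the text and converted to 0-indexed array positions here). Normalizing each longitudinal extent by the height of the face rectangle is part of that reconstruction and should be read as an assumption.

import numpy as np

def key_region_sizes(landmarks, y1, y2):
    """landmarks: (68, 2) array of face key points; y1, y2: y coordinates of
    two vertically adjacent vertices of the face rectangle (assumed layout)."""
    face_height = abs(y1 - y2)
    def longitudinal(i, j):
        # vertical distance between key points i and j (patent numbering),
        # normalized by the face rectangle height
        return abs(landmarks[i - 1, 1] - landmarks[j - 1, 1]) / face_height
    r_size = longitudinal(39, 41)   # size R of the right-eye region
    l_size = longitudinal(44, 48)   # size L of the left-eye region
    m_size = longitudinal(52, 58)   # size M of the mouth region
    return r_size, l_size, m_size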
(2) second way of determining the sampling interval of an image
In another illustrated embodiment, when the face detection model performs face detection on any frame of image, it may detect, for each face in the frame of image, the corresponding angle information, the corresponding face value information, and the rectangular region where the face is located (i.e., the face region). In this case, the face detection result output by the face detection model for the frame of image may include: the face angle information, the face value information, and the face regions corresponding to the frame of image.
Accordingly, when determining the image sampling interval corresponding to the current frame candidate image based on its face detection result, an image quality score corresponding to the current frame candidate image may first be calculated based on a preset scoring rule, the face angle information and face value information corresponding to the current frame candidate image, and the position information of the face region relative to the current frame candidate image; the image sampling interval corresponding to that image quality score may then be used as the image sampling interval corresponding to the current frame candidate image.
Further, in one illustrated embodiment, a plurality of image quality score sections may be set in advance, and a corresponding image sampling interval may be set for each section. In this case, when the image sampling interval corresponding to the image quality score of the current frame candidate image is determined, specifically, an image quality score section to which the image quality score corresponding to the current frame candidate image belongs may be determined based on a correspondence relationship between the image quality score section and the image sampling interval, and the image sampling interval corresponding to the image quality score section may be determined as the image sampling interval corresponding to the image quality score of the current frame candidate image.
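A sketch of the section lookup; the section boundaries, the interval values, and the direction of the mapping (here: low-quality frames are skipped over quickly, high-quality frames are sampled densely) are all assumptions for illustration.

import bisect

SCORE_SECTION_BOUNDS = [0.3, 0.6, 0.8]   # upper bounds of the first sections
SECTION_INTERVALS = [8, 4, 2, 1]          # one sampling interval per section

def interval_for_score(quality_score):
    # find the score section the quality score falls into, then return the
    # sampling interval preset for that section
    section = bisect.bisect_right(SCORE_SECTION_BOUNDS, quality_score)
    return SECTION_INTERVALS[section]

assert interval_for_score(0.2) == 8   # low score: jump ahead quickly
assert interval_for_score(0.9) == 1   # high score: sample densely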
In practical application, the angle information corresponding to a face represents the three rotational states of the face in Cartesian three-dimensional coordinates: pitch, yaw, and roll. Pitch is the up-and-down tilt angle, i.e., the rotation of the object about the X axis; yaw is the left-and-right turn angle, i.e., the rotation about the Y axis; roll is the in-plane rotation angle, i.e., the rotation about the Z axis. The face detection model can detect the angles of any face in any frame of image and output the corresponding pitch, yaw, and roll as the face angle information. In this case, a score FA corresponding to the face angle information may be calculated as shown in the following equation:
FA = para_pitch × score_pitch + para_yaw × score_yaw + para_roll × score_roll

wherein score_pitch, score_yaw, and score_roll denote the scores calculated from the pitch, yaw, and roll angles, respectively, and para_pitch, para_yaw, and para_roll are the weights set for pitch, yaw, and roll, respectively.
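The per-angle scores score_pitch, score_yaw, and score_roll were formula images in the source and could not be recovered; the linear falloff below (1 for a frontal pose, 0 at 90 degrees) is therefore purely an illustrative assumption, as are the weight values.

def angle_score(angle_deg, max_angle=90.0):
    # assumed per-angle score: 1 when frontal, decaying linearly to 0
    return max(0.0, 1.0 - abs(angle_deg) / max_angle)

def face_angle_score(pitch, yaw, roll,
                     para_pitch=0.3, para_yaw=0.5, para_roll=0.2):
    # FA = para_pitch*score_pitch + para_yaw*score_yaw + para_roll*score_roll
    return (para_pitch * angle_score(pitch)
            + para_yaw * angle_score(yaw)
            + para_roll * angle_score(roll))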
The face detection model can also detect the face value of any face in any frame of image, and output the face value score corresponding to the face together with the probability that it is detected as a face. In this case, a score FB corresponding to the face value information may be calculated from the face angle score FA, the face value score, and the detection probability,

where FA represents the score corresponding to the face angle information, F_beauty denotes the face value score, F_prob denotes the probability of being detected as a human face, and eps denotes a small positive number (a specific value, usually 0.000001, can be set according to the actual calculation conditions).
That is, for any one face, a score corresponding to face value information of the face may be calculated based on a score corresponding to face angle information of the face.
The face detection model can also establish a rectangular coordinate system for any frame of image and output the coordinates of the vertices of the rectangular region where any face in the image is located. In this case, a score FL corresponding to the position information of the face region relative to the image can be calculated from the center positions of the image and of the face region,

where (c_x0, c_y0) denotes the coordinates of the center position of the image, and (c_x1, c_y1) denotes the coordinates of the center position of the face region.
Continuing with the image shown in fig. 6 as an example, assume that a rectangular coordinate system is established with the lower-left vertex of the image as the origin, that the sides of the rectangular region where face 1 is located are parallel to the x axis or the y axis, that w denotes the lateral width and h the longitudinal length of the image in this coordinate system, and that the coordinates of vertices 1 to 4 are (x1, y1), (x2, y2), (x3, y3), and (x4, y4), respectively. The coordinates of the center position of the image and of the center position of the rectangular region where face 1 is located can then be calculated as shown in the following formulas:

c_x0 = w / 2
c_y0 = h / 2
c_x1 = (x1 + x2 + x3 + x4) / 4
c_y1 = (y1 + y2 + y3 + y4) / 4
In practical application, if a frame of image contains multiple faces, the average of the scores corresponding to the face angle information of those faces can be used as the score corresponding to the face angle information of the image; likewise, the average of the scores corresponding to the face value information of those faces can be used as the score corresponding to the face value information of the image, and the average of the scores corresponding to the position information of those faces relative to the image can be used as the score corresponding to the position information of the image.
Once the scores FA, FB, and FL, corresponding respectively to the face angle information, the face value information, and the position information, have been calculated for a frame of image, the image quality score F corresponding to the image may be further calculated from them. Since different images generally differ in size, the image quality score F may additionally be normalized based on the coordinates of the center position of the image, in order to adapt to different image sizes.
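The formulas for FB, FL, and the final score F were likewise formula images in the source. The sketch below only respects what the text does state: FB combines FA, the face value score F_beauty, and the detection probability F_prob (with eps guarding a division), FL decreases as the face center moves away from the image center, and F aggregates the three scores; every concrete functional form here is an assumption.

import math

EPS = 1e-6   # the small positive constant eps from the text

def face_value_score(fa, f_beauty, f_prob):
    # assumed combination; the text only fixes the inputs, not the form
    return (f_beauty * f_prob) / (fa + EPS)

def face_position_score(face_center, image_center, width, height):
    # 1.0 when the face sits at the image center, falling toward 0 at a corner
    dist = math.dist(face_center, image_center)
    half_diagonal = math.hypot(width, height) / 2
    return 1.0 - dist / half_diagonal

def image_quality_score(fa, fb, fl):
    # assumed aggregation of the three scores: a simple mean
    return (fa + fb + fl) / 3.0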
three ways of determining a candidate image, whose face detection result meets a preset rule, among a plurality of frame candidate images extracted from an image sequence as a key image in the image sequence are described below with respect to any one image sequence.
(1) First way of determining key images
In an embodiment shown, in a case that the face detection result includes the number of faces and a face key region, candidate images of which the number of faces and the size of the face key region both meet a preset rule in a plurality of frames of candidate images extracted from the image sequence may be determined as key images in the image sequence.
It should be noted that, the foregoing related contents may be referred to for calculating the size of the key region of the face, and the disclosure is not repeated herein.
In practical applications, the preset rule may include: the size of the eye region is greater than or equal to the second threshold value, and the size of the mouth region is less than or equal to the third threshold value. That is, the candidate image whose face state type is the aforementioned fifth type of face state may be determined as the key image in the image sequence.
(2) Second way of determining key images
In another illustrated embodiment, in a case where the face detection result includes the face angle information, the face value information, and the face region, an image quality score corresponding to any candidate image may be calculated based on the preset scoring rule, the face angle information, the face value information, and the position information of the face region relative to the candidate image; a candidate image whose image quality score is greater than or equal to a preset threshold (which may be referred to as a fourth threshold) may then be determined as a key image in the image sequence. Alternatively, all candidate images may be ranked in order of decreasing image quality score, and a preset number of top-ranked candidate images may be determined as the key images in the image sequence, as shown in the sketch below.
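A sketch of the two selection variants just described: thresholding against the fourth threshold, or ranking by image quality score and keeping a preset number of top candidates. `scored_candidates` is assumed to hold (image, quality score) pairs produced during the sampling procedure.

def select_key_images(scored_candidates, fourth_threshold=None, top_n=None):
    if fourth_threshold is not None:
        # variant 1: every candidate at or above the threshold is a key image
        return [img for img, score in scored_candidates
                if score >= fourth_threshold]
    # variant 2: sort by decreasing quality score and keep the first top_n
    ranked = sorted(scored_candidates, key=lambda pair: pair[1], reverse=True)
    return [img for img, _ in ranked[:top_n]]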
It should be noted that, the aforementioned related contents may be referred to for the calculation of the image quality score, and the details of the disclosure are not repeated herein.
(3) Third way of determining key images
In yet another illustrated embodiment, in a case where the face detection result includes the number of faces, the face key regions, the face angle information, the face value information, and the face regions, the candidate images, among the plurality of frame candidate images extracted from the image sequence, in which both the number of faces and the size of the face key regions meet the preset rule may first be determined as target images. Then, for any target image, an image quality score corresponding to the target image may be calculated based on the preset scoring rule, the face angle information, the face value information, and the position information of the face region relative to the target image, so that the target images whose image quality scores are greater than or equal to the fourth threshold may be determined as the key images in the image sequence. Alternatively, all the target images may be sorted in order of decreasing image quality score, and a preset number of top-ranked target images may be determined as the key images in the image sequence.
It should be noted that, both the calculation of the size of the key region of the face and the calculation of the image quality score can refer to the related contents, and the details of the disclosure are not repeated herein.
Exemplary Medium
Having described the method of the exemplary embodiment of the present disclosure, next, a medium for extraction of a key image in an image sequence of the exemplary embodiment of the present disclosure will be described with reference to fig. 7.
In the present exemplary embodiment, the above-described method may be implemented by a program product that includes program code and is stored on, for example, a portable compact disc read-only memory (CD-ROM), and may be executed on a device such as a personal computer. However, the program product of the present disclosure is not limited thereto; in this document, a readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium.
A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++ as well as conventional procedural programming languages such as the C language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
Exemplary devices
Having described the medium of the exemplary embodiments of the present disclosure, an apparatus for extracting key images in an image sequence according to the exemplary embodiments of the present disclosure will next be described with reference to fig. 8.
For the implementation of the functions and actions of each module in the following apparatus, reference may be made to the implementation of the corresponding steps in the above method; they are not repeated in this disclosure. Since the apparatus embodiments substantially correspond to the method embodiments, the description of the method embodiments may be consulted for the relevant points.
Referring to fig. 8, fig. 8 schematically illustrates an apparatus for extracting key images in an image sequence according to an embodiment of the present disclosure.
The apparatus for extracting key images in an image sequence may include:
a detection module 801, configured to iteratively perform an image sampling detection procedure to extract a plurality of frame candidate images from an image sequence, and perform face detection on the plurality of frame candidate images;
a determining module 802, configured to determine a candidate image in which a face detection result in the plurality of frames of candidate images meets a preset rule as a key image in the image sequence;
wherein the detection module 801 comprises:
an obtaining sub-module 8011, configured to obtain a face detection result, which is output by the face detection model and corresponds to the current frame candidate image;
a determining sub-module 8012 configured to determine, based on the face detection result, an image sampling interval corresponding to the current frame candidate image;
an extracting sub-module 8013, configured to extract, from the image sequence, a next frame candidate image corresponding to the current frame candidate image, wherein the number of images separating the next frame candidate image from the current frame candidate image in the image sequence matches the image sampling interval;
a detecting sub-module 8014, configured to take the next frame candidate image as the new current frame candidate image and input it into the face detection model, so that the face detection model performs face detection on the next frame candidate image.
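To make the cooperation of these sub-modules concrete, here is a minimal, assumption-laden Python sketch of the iterative sampling detection procedure. The callables detect_faces, interval_for, and matches_rule stand in for the face detection model, the interval policy of the determining sub-module, and the preset key-image rule; none of them are specified by this sketch, and the loop structure is only one possible realization.

def iterative_sampling_detection(frames, detect_faces, interval_for, matches_rule):
    key_images = []
    index = 0  # index of the current frame candidate image
    while index < len(frames):
        result = detect_faces(frames[index])   # face detection result
        if matches_rule(result):               # preset rule -> key image
            key_images.append(frames[index])
        # The next frame candidate image is separated from the current
        # one by `interval_for(result)` images in the sequence.
        index += interval_for(result) + 1
    return key_images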
Optionally, the image sequence is a video;
the determination sub-module 8012 is specifically configured to:
determining an image sampling rate corresponding to the face detection result;
and determining the product of the image sampling rate and the frame rate of the video as the image sampling interval corresponding to the current frame candidate image.
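For instance, assuming the image sampling rate derived from a detection result is expressed as a time span in seconds (this unit is an assumption), the frame-domain sampling interval is simply its product with the video frame rate; all values below are invented for illustration:

fps = 25.0               # assumed frame rate of the video
sampling_rate_s = 0.4    # assumed image sampling rate (in seconds) for this result
sampling_interval = int(sampling_rate_s * fps)   # 10 images between candidates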
Optionally, the face detection result includes the number of faces and a face key region;
the determination sub-module 8012 is specifically configured to:
determining a face state type corresponding to the current frame candidate image based on the number of faces and the size of the face key region;
and determining an image sampling interval corresponding to the face state type.
Optionally, the face key region includes an eye region and a mouth region.
Optionally, the determining sub-module 8012 is specifically configured to:
if the number of faces is 0, determining that the face state type corresponding to the current frame candidate image is a first-class face state;
if the number of faces is greater than a preset first threshold, determining that the face state type corresponding to the current frame candidate image is a second-class face state;
and if the number of faces is greater than 0 and less than or equal to the first threshold, determining the face state type corresponding to the current frame candidate image based on the relationship between the size of the eye region and a preset second threshold and the relationship between the size of the mouth region and a preset third threshold.
Optionally, the determining sub-module 8012 is specifically configured to:
if the size of the eye region is smaller than the preset second threshold, determining that the face state type corresponding to the current frame candidate image is a third-class face state;
if the size of the eye region is greater than or equal to the second threshold and the size of the mouth region is greater than the preset third threshold, determining that the face state type corresponding to the current frame candidate image is a fourth-class face state;
and if the size of the eye region is greater than or equal to the second threshold and the size of the mouth region is less than or equal to the third threshold, determining that the face state type corresponding to the current frame candidate image is a fifth-class face state.
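A compact Python sketch of this five-way classification follows; the threshold values and returned labels are invented for illustration only:

def classify_face_state(num_faces, eye_size, mouth_size,
                        first_threshold=3, second_threshold=0.02,
                        third_threshold=0.01):
    if num_faces == 0:
        return "first-class"       # no face in the candidate image
    if num_faces > first_threshold:
        return "second-class"      # face count above the first threshold
    # 0 < num_faces <= first_threshold: decide by eye and mouth regions.
    if eye_size < second_threshold:
        return "third-class"       # eye region below the second threshold
    if mouth_size > third_threshold:
        return "fourth-class"      # mouth region above the third threshold
    return "fifth-class"           # eye region large enough, mouth region small enough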
Optionally, the face detection result includes face angle information, face value information, and a face region;
the determination sub-module 8012 is specifically configured to:
calculating an image quality score corresponding to the current frame candidate image based on a preset scoring rule, the face angle information, the face value information and the position information of the face region relative to the current frame candidate image;
and determining an image sampling interval corresponding to the image quality score.
Optionally, the determining sub-module 8012 is specifically configured to:
the method comprises the steps of determining an image quality score interval to which an image quality score belongs based on a corresponding relation between a preset image quality score interval and an image sampling interval, and determining the image sampling interval corresponding to the image quality score interval as the image sampling interval corresponding to the image quality score.
Optionally, the face detection result includes the number of faces and a face key region;
the determining module 802 is specifically configured to:
determining candidate images among the plurality of frame candidate images in which both the number of faces and the size of the face key region meet preset rules as key images in the image sequence.
Optionally, the face detection result further includes face angle information, face value information, and a face region;
the determining module 802 is specifically configured to:
determining candidate images among the plurality of frame candidate images in which both the number of faces and the size of the face key region meet preset rules as target images;
calculating, for any target image, an image quality score corresponding to the target image based on a preset scoring rule, the face angle information, the face value information, and the position information of the face region relative to the target image;
and determining target images whose image quality score is greater than or equal to a preset fourth threshold as key images in the image sequence.
Optionally, the face detection result includes face angle information, face value information, and a face region;
the determining module 802 is specifically configured to:
calculating, for any candidate image, an image quality score corresponding to the candidate image based on a preset scoring rule, the face angle information, the face value information, and the position information of the face region relative to the candidate image;
and determining candidate images among the plurality of frame candidate images whose image quality score is greater than or equal to a preset fifth threshold as key images in the image sequence.
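Each of these variants leans on the preset scoring rule, which the disclosure does not pin down. Purely as one conceivable form, the sketch below combines a frontality term from the face angle information, the face value information, and a centrality term from the face region position; the weights, normalizations, and detection-result field names are all assumptions:

def compute_quality_score(det, image_width, image_height,
                          w_angle=0.4, w_value=0.4, w_position=0.2):
    # Frontality term: a frontal face (zero yaw/pitch/roll) scores near 1.
    yaw, pitch, roll = det["angles"]   # degrees; field name is an assumption
    angle_score = max(0.0, 1.0 - (abs(yaw) + abs(pitch) + abs(roll)) / 180.0)
    # Face value term, assumed already normalized to [0, 1] by the detector.
    value_score = det["face_value"]
    # Position term: how far the face region center sits from the image center.
    x, y, w, h = det["face_box"]       # pixel box; field name is an assumption
    dx = (x + w / 2 - image_width / 2) / image_width
    dy = (y + h / 2 - image_height / 2) / image_height
    position_score = max(0.0, 1.0 - 2.0 * (dx * dx + dy * dy) ** 0.5)
    return (w_angle * angle_score + w_value * value_score
            + w_position * position_score)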
Exemplary computing device
Having described the methods, media, and apparatus of the exemplary embodiments of the present disclosure, a computing device for extraction of key images in a sequence of images of the exemplary embodiments of the present disclosure is described next with reference to fig. 9.
The computing device 900 shown in fig. 9 is only one example and should not impose any limitations on the functionality or scope of use of embodiments of the disclosure.
As shown in fig. 9, computing device 900 takes the form of a general-purpose computing device. Components of computing device 900 may include, but are not limited to: at least one processing unit 901, at least one storage unit 902, and a bus 903 connecting the various system components (including the processing unit 901 and the storage unit 902).
The bus 903 includes a data bus, a control bus, and an address bus.
The storage unit 902 may include readable media in the form of volatile memory, such as a random access memory (RAM) 9021 and/or a cache memory 9022, and may further include readable media in the form of non-volatile memory, such as a read-only memory (ROM) 9023.
Storage unit 902 may also include a program/utility 9025 having a set (at least one) of program modules 9024, such program modules 9024 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Computing device 900 may also communicate with one or more external devices 904, such as a keyboard, pointing device, etc.
Such communication may occur via input/output (I/O) interfaces 905. Moreover, computing device 900 may also communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the internet) via network adapter 906. As shown in fig. 9, the network adapter 906 communicates with the other modules of the computing device 900 over the bus 903. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with computing device 900, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, to name a few.
It should be noted that although several units/modules or sub-units/modules of the apparatus for extracting key images in an image sequence are mentioned in the above detailed description, such division is merely exemplary and not mandatory. Indeed, according to embodiments of the invention, the features and functions of two or more of the units/modules described above may be embodied in one unit/module; conversely, the features and functions of one unit/module described above may be further divided so as to be embodied by a plurality of units/modules.
Moreover, while the operations of the method of the invention are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in that particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and/or one step may be decomposed into multiple steps.
While the spirit and principles of the invention have been described with reference to several particular embodiments, it is to be understood that the invention is not limited to the disclosed embodiments. Nor does the division into aspects imply that features in those aspects cannot be combined to benefit; such division is for convenience of presentation only. The invention is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (10)

1. A method of extracting key images in an image sequence, the method comprising:
iteratively executing an image sampling detection process to extract a plurality of frame candidate images from an image sequence and carry out face detection on the plurality of frame candidate images;
determining candidate images of which the face detection results conform to a preset rule in the plurality of frames of candidate images as key images in the image sequence;
wherein the image sampling detection process comprises:
acquiring a face detection result which is output by a face detection model and corresponds to the current frame candidate image;
determining an image sampling interval corresponding to the current frame candidate image based on the face detection result;
extracting a next frame candidate image corresponding to the current frame candidate image from the image sequence, wherein the number of images separating the next frame candidate image from the current frame candidate image in the image sequence matches the image sampling interval;
and taking the next frame candidate image as a new current frame candidate image, and inputting the next frame candidate image into the face detection model so that the face detection model performs face detection on the next frame candidate image.
2. The method of claim 1, wherein the image sequence is a video;
the determining an image sampling interval corresponding to the current frame candidate image based on the face detection result includes:
determining an image sampling rate corresponding to the face detection result;
and determining the product of the image sampling rate and the frame rate of the video as the image sampling interval corresponding to the current frame candidate image.
3. The method of claim 1, wherein the face detection result comprises the number of faces and face key regions;
the determining an image sampling interval corresponding to the current frame candidate image based on the face detection result includes:
determining a face state type corresponding to the current frame candidate image based on the number of faces and the size of the face key region;
and determining an image sampling interval corresponding to the face state type.
4. The method of claim 1, wherein the face detection result comprises face angle information, face value information, and a face region;
the determining an image sampling interval corresponding to the current frame candidate image based on the face detection result includes:
calculating an image quality score corresponding to the current frame candidate image based on a preset scoring rule, the face angle information, the face value information and the position information of the face region relative to the current frame candidate image;
and determining an image sampling interval corresponding to the image quality score.
5. The method of claim 1, wherein the face detection result comprises the number of faces and a face key region;
wherein the determining, as key images in the image sequence, candidate images of the plurality of frame candidate images whose face detection results meet a preset rule comprises:
determining candidate images among the plurality of frame candidate images in which both the number of faces and the size of the face key region meet preset rules as key images in the image sequence.
6. The method of claim 5, wherein the face detection result further comprises face angle information, face value information, and a face region;
wherein the determining, as key images in the image sequence, candidate images in which both the number of faces and the size of the face key region meet preset rules comprises:
determining candidate images among the plurality of frame candidate images in which both the number of faces and the size of the face key region meet preset rules as target images;
calculating, for any target image, an image quality score corresponding to the target image based on a preset scoring rule, the face angle information, the face value information, and the position information of the face region relative to the target image;
and determining target images whose image quality score is greater than or equal to a preset fourth threshold as key images in the image sequence.
7. The method of claim 1, wherein the face detection result comprises face angle information, face value information, and a face region;
wherein the determining, as key images in the image sequence, candidate images of the plurality of frame candidate images whose face detection results meet a preset rule comprises:
calculating, for any candidate image, an image quality score corresponding to the candidate image based on a preset scoring rule, the face angle information, the face value information, and the position information of the face region relative to the candidate image;
and determining candidate images among the plurality of frame candidate images whose image quality score is greater than or equal to a preset fifth threshold as key images in the image sequence.
8. An apparatus for extracting a key image from an image sequence, the apparatus comprising:
the detection module is used for iteratively executing an image sampling detection process so as to extract a plurality of frame candidate images from an image sequence and carry out face detection on the plurality of frame candidate images;
the determining module is used for determining a candidate image of which the face detection result accords with a preset rule in the plurality of frames of candidate images as a key image in the image sequence;
wherein the detection module comprises:
the acquisition sub-module is used for acquiring a face detection result which is output by the face detection model and corresponds to the current frame candidate image;
a determining sub-module for determining an image sampling interval corresponding to the current frame candidate image based on the face detection result;
an extraction sub-module, configured to extract, from the image sequence, a next frame candidate image corresponding to the current frame candidate image, wherein the number of images separating the next frame candidate image from the current frame candidate image in the image sequence matches the image sampling interval;
and a detection sub-module, configured to take the next frame candidate image as the new current frame candidate image and input it into the face detection model, so that the face detection model performs face detection on the next frame candidate image.
9. A medium having stored thereon a computer program which, when executed by a processor, carries out the method of any one of claims 1-7.
10. A computing device, comprising:
a processor;
a memory for storing a processor-executable program;
wherein the processor implements the method of any one of claims 1-7 by running the executable program.
CN202210303514.9A 2022-03-24 2022-03-24 Method, device, medium and computing equipment for extracting key images in image sequence Pending CN114898419A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210303514.9A CN114898419A (en) 2022-03-24 2022-03-24 Method, device, medium and computing equipment for extracting key images in image sequence

Publications (1)

Publication Number Publication Date
CN114898419A

Family

ID=82714756

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210303514.9A Pending CN114898419A (en) 2022-03-24 2022-03-24 Method, device, medium and computing equipment for extracting key images in image sequence

Country Status (1)

Country Link
CN (1) CN114898419A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination