CN114187557A - Method, device, readable medium and electronic equipment for determining key frame - Google Patents


Info

Publication number
CN114187557A
CN114187557A
Authority
CN
China
Prior art keywords
image
determined
images
frame
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111539658.6A
Other languages
Chinese (zh)
Inventor
陈维识
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing ByteDance Network Technology Co Ltd
Original Assignee
Beijing ByteDance Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Beijing ByteDance Network Technology Co Ltd
Priority: CN202111539658.6A
Publication: CN114187557A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/23: Clustering techniques
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/08: Learning methods

Abstract

The disclosure relates to a method, an apparatus, a readable medium and an electronic device for determining key frames. The method comprises: acquiring multiple frames of images to be determined; inputting the multiple frames of images to be determined into a pre-trained feature vector acquisition model to acquire the image feature vector, output by the feature vector acquisition model, corresponding to each frame of image to be determined; dividing the multiple frames of images to be determined into at least one image group according to the image feature vector corresponding to each frame of image to be determined; and determining the key frames corresponding to the multiple frames of images to be determined from the images to be determined in each image group. In other words, according to the image feature vector corresponding to each frame of image to be determined, the multiple frames of images to be determined can be divided into at least one image group, each corresponding to a different image scene, and the key frames are then determined from each image group. Key frames can thus be extracted more quickly and accurately, improving the efficiency of subsequent image analysis or image content identification.

Description

Method, device, readable medium and electronic equipment for determining key frame
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to a method and an apparatus for determining a key frame, a readable medium, and an electronic device.
Background
With the development of network technology, massive image and video resources are uploaded to networks and widely disseminated. The traditional approach of identifying and classifying image and video content through manual review involves a heavy workload and is time-consuming, and realizing automatic content identification for long videos remains a major challenge.
In the related art, automatic identification of video content is realized through Artificial Intelligence (AI). However, this approach needs to analyze every video frame in the video, which makes the identification process computationally intensive and inefficient.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In a first aspect, the present disclosure provides a method of determining a key frame, the method comprising:
acquiring a plurality of frames of images to be determined;
inputting a plurality of frames of images to be determined into a pre-trained feature vector acquisition model so as to acquire image feature vectors corresponding to each frame of images to be determined output by the feature vector acquisition model;
dividing the multiple frames of images to be determined into at least one image group according to the image feature vector corresponding to each frame of image to be determined, wherein the image scenes corresponding to different image groups are different;
and determining key frames corresponding to the multiple frames of images to be determined from the images to be determined in each image group.
In a second aspect, the present disclosure provides an apparatus for determining a key frame, the apparatus comprising:
the image acquisition module is used for acquiring a plurality of frames of images to be determined;
the feature vector acquisition module is used for inputting multiple frames of images to be determined into a pre-trained feature vector acquisition model so as to acquire the image feature vector corresponding to each frame of image to be determined output by the feature vector acquisition model;
the image dividing module is used for dividing the multiple frames of images to be determined into at least one image group according to the image feature vector corresponding to each frame of image to be determined, wherein the image scenes corresponding to different image groups are different;
and the key frame determining module is used for determining the key frames corresponding to the plurality of frames of images to be determined from the images to be determined in each image group.
In a third aspect, the present disclosure provides a computer readable medium having stored thereon a computer program which, when executed by a processing apparatus, performs the steps of the method of the first aspect of the present disclosure.
In a fourth aspect, the present disclosure provides an electronic device comprising:
a storage device having a computer program stored thereon;
processing means for executing the computer program in the storage means to implement the steps of the method of the first aspect of the present disclosure.
According to the technical scheme, multiple frames of images to be determined are obtained; the multiple frames of images to be determined are input into a pre-trained feature vector acquisition model to acquire the image feature vector, output by the feature vector acquisition model, corresponding to each frame of image to be determined; the multiple frames of images to be determined are divided into at least one image group according to the image feature vector corresponding to each frame of image to be determined, wherein the image scenes corresponding to different image groups are different; and the key frames corresponding to the multiple frames of images to be determined are determined from the images to be determined in each image group. In other words, according to the image feature vector corresponding to each frame of image to be determined, the multiple frames of images to be determined can be divided into at least one image group of different image scenes, and the key frames corresponding to the multiple frames of images to be determined are then determined from each image group, so that key frames can be extracted more quickly and accurately, improving the efficiency of subsequent image analysis or image content identification.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale. In the drawings:
FIG. 1 is a flow chart illustrating a method of determining key frames according to an exemplary embodiment of the present disclosure;
FIG. 2 is a flow chart illustrating another method of determining key frames according to an exemplary embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a network architecture according to an exemplary embodiment of the present disclosure;
FIG. 4 is a block diagram illustrating an apparatus for determining key frames in accordance with an exemplary embodiment of the present disclosure;
FIG. 5 is a block diagram illustrating a second apparatus for determining key frames according to an exemplary embodiment of the present disclosure;
FIG. 6 is a block diagram illustrating a third apparatus for determining key frames according to an exemplary embodiment of the present disclosure;
fig. 7 is a block diagram illustrating an electronic device according to an exemplary embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that references to "a", "an", and "the" modifications in this disclosure are intended to be illustrative rather than limiting, and that those skilled in the art will recognize that "one or more" may be used unless the context clearly dictates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
First, an application scenario of the present disclosure will be explained. Identifying key frames is a critical step in the machine understanding and analysis of video content, and two types of key frames are commonly used at present: frames immediately before and after a scene change, and center frames that can represent an entire scene. In the related art, one method determines whether a scene change has occurred according to the difference between adjacent frames (the sum of point-to-point Euclidean distances). However, this method requires a threshold to be determined in advance, and since there is no mature standard for selecting this threshold, the accuracy of the determination is affected; moreover, the recognition performs poorly in the presence of inserted clips (inter-cutting) and camera movement. Another method performs model training on manually labeled data, that is, some key frames are labeled manually and a CNN (Convolutional Neural Network) model is trained to imitate the labeling behavior. However, this method incurs high labor costs and demands strong technical skills from annotators, so labeled key-frame data is scarce, which makes model training difficult.
In order to solve the existing problems, the present disclosure provides a method, an apparatus, a readable medium and an electronic device for determining a key frame, which may divide a plurality of frames of images to be determined into at least one image group of different image scenes according to an image feature vector corresponding to each frame of image to be determined, and then determine a key frame corresponding to the plurality of frames of images to be determined from each image group, so that the key frame can be extracted more quickly and accurately, thereby improving the efficiency of subsequent image analysis or image content identification.
The present disclosure is described below with reference to specific examples.
Fig. 1 is a flowchart illustrating a method of determining a key frame according to an exemplary embodiment of the present disclosure, which may include, as shown in fig. 1:
s101, acquiring multiple frames of images to be determined.
In this step, a target video whose key frames are to be determined may be obtained first, and then each frame of video image in the target video is extracted to obtain multiple frames of images to be determined. The target video may be a user-generated video (long video, short video, etc.) published on a video platform, a movie, a sports event, an animation, and so on; the present disclosure does not limit the type of the target video.
S102, inputting a plurality of frames of images to be determined into a pre-trained feature vector acquisition model so as to acquire image feature vectors corresponding to each frame of images to be determined output by the feature vector acquisition model.
The image feature vector is used for representing the spatial distance between the images to be determined of different frames.
In this step, after obtaining multiple frames of images to be determined, the multiple frames of images to be determined may be input into the feature vector acquisition model, and each frame of image to be determined is processed by the feature vector acquisition model to obtain the image feature vector corresponding to each frame of image to be determined. For example, the image feature vectors corresponding to the multiple frames of images to be determined may be represented as {Z0, Z1, Z2, …, Zn}, where n + 1 is the number of images to be determined.
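As an illustrative sketch (not the disclosed model itself), the per-frame feature extraction of S102 can be simulated with a stand-in `toy_model`; the resulting vectors {Z0, …, Zn} admit pairwise Euclidean distances, which is the "spatial distance" the image feature vectors are said to represent. All names here are hypothetical:

```python
import numpy as np

def extract_feature_vectors(frames, feature_model):
    """Run each frame through a feature model; stacks {Z0, ..., Zn} into an array."""
    return np.stack([feature_model(f) for f in frames])

def pairwise_distances(Z):
    """Euclidean distance matrix between per-frame feature vectors."""
    diff = Z[:, None, :] - Z[None, :, :]
    return np.linalg.norm(diff, axis=-1)

# Stand-in for the trained feature vector acquisition model:
# mean-pool each HxWxC frame into a C-dimensional vector.
toy_model = lambda frame: frame.mean(axis=(0, 1))

frames = [np.full((4, 4, 3), i, dtype=float) for i in range(5)]
Z = extract_feature_vectors(frames, toy_model)   # shape (5, 3)
D = pairwise_distances(Z)                        # D[i, j] = ||Zi - Zj||
```

A real system would replace `toy_model` with the trained network's forward pass; the distance matrix `D` is what the later grouping step operates on.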
The feature vector acquisition model can be obtained by training in the following way: and obtaining a plurality of sample sets, wherein the sample sets comprise a positive sample image, a simple sample image and a difficult sample image, and training the target neural network model through the plurality of sample sets to obtain the characteristic vector obtaining model.
S103, dividing the multiple frames of images to be determined into at least one image group according to the image feature vectors corresponding to the multiple frames of images to be determined.
Wherein, the image scenes corresponding to different image groups can be different.
In this step, after the image feature vector corresponding to each frame of image to be determined is obtained, the multiple frames of images to be determined may be divided, according to the plurality of image feature vectors, into at least one image group by image scene.
And S104, determining key frames corresponding to the multiple frames of images to be determined from the images to be determined in each image group.
The key frames may include transition frames before and after a scene change and a center frame of the whole scene. For example, if the 5th frame among the images to be determined shows two handbags and the 6th frame shows several flowerpots, the 5th and 6th frames are transition frames; if the 1st to 5th frames all show the two handbags, the 3rd frame is the center frame of that scene. Thus the 3rd, 5th and 6th frames are the key frames.
In this step, for each image group, the image group has a different image scene from that of an adjacent image group, a first frame to-be-determined image and a last frame to-be-determined image in the image group may be used as transition frames, and a middle frame to-be-determined image in the image group may be used as a central frame, so as to obtain a key frame corresponding to the image group.
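The selection rule of this step (first and last frames of a group as transition frames, middle frame as center frame) can be sketched as follows; `key_frames_for_group` and the frame-index groups are hypothetical names used only for illustration:

```python
def key_frames_for_group(group_frame_indices):
    """First and last frames are transition frames; the middle frame serves as
    the center frame (a simple rule; S205 later refines it via cluster centers)."""
    first, last = group_frame_indices[0], group_frame_indices[-1]
    center = group_frame_indices[len(group_frame_indices) // 2]
    return {first, last, center}

# Two image groups, e.g. the handbag scene (frames 0-4) and the flowerpot scene (5-6).
groups = [[0, 1, 2, 3, 4], [5, 6]]
key_frames = sorted(set().union(*(key_frames_for_group(g) for g in groups)))
```

For the two groups above this yields frames 0, 2 and 4 (transitions plus center of the first scene) and frames 5 and 6 (the second, two-frame scene).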
By adopting the method, the multi-frame image to be determined can be divided into at least one image group of different image scenes according to the image characteristic vector corresponding to each frame of image to be determined, and the key frame corresponding to the multi-frame image to be determined is determined from each image group, so that the key frame can be extracted more quickly and accurately, and the efficiency of subsequent image analysis or image content identification is improved.
Fig. 2 is a flowchart illustrating another method of determining a key frame according to an exemplary embodiment of the present disclosure, which may include, as shown in fig. 2:
s201, acquiring multiple frames of images to be determined.
S202, inputting a plurality of frames of images to be determined into a pre-trained feature vector acquisition model so as to acquire image feature vectors corresponding to each frame of images to be determined output by the feature vector acquisition model.
The image feature vector is used for representing the spatial distance between the images to be determined of different frames. The feature vector acquisition model can be obtained by training through the following method:
and S1, acquiring a plurality of sample sets.
The sample set may include a positive sample image, a simple sample image and a difficult sample image. The positive sample image may include two sample images with a small difference; for example, two adjacent sample images. The simple sample image may include two sample images with a large difference; for example, two sample images taken from two different videos. The difficult sample image may include two sample images with an intermediate difference; for example, two sample images that are far apart within the same video. A plurality of reference sample images may be acquired first, and for each reference sample image, the positive sample image, the simple sample image and the difficult sample image corresponding to that reference sample image may be acquired.
In a possible implementation manner, the reference sample image and an adjacent sample image may be taken as the positive sample image, where the adjacent sample image is any frame sample image adjacent to the reference sample image in the first sample video to which the reference sample image belongs; and taking any one frame sample image of the reference sample image and at least one second sample video as the simple sample image, wherein the second sample video is different from the first sample video. For example, the positive sample image may include x and x +, x being the reference sample image, and x + being the next frame sample image adjacent to the reference sample image.
In addition, a scene type corresponding to the first sample video may be determined according to the reference sample image, and the difficult sample image may be determined according to the reference sample image and the scene type. The scene type may include a long video scene and a short video scene, and the present disclosure may identify the reference sample image by an image identification method in the related art, determine an image content corresponding to the reference sample image, and determine a scene type corresponding to the first sample video according to the image content.
After the scene type corresponding to the first sample video is determined, a target interval may be determined according to the scene type, and the reference sample image together with any frame sample image in the first sample video whose interval from the reference sample image is greater than or equal to the target interval may be used as the difficult sample image. For example, the target interval corresponding to the scene type may be determined through a pre-established interval association relationship, which may include correspondences between different scene types and intervals. For long video scenes, a longer target interval may be set, for example 30 s; for short video scenes, a shorter target interval may be set, for example 15 s. The above setting of the target interval is merely an example, and the present disclosure does not limit this. For example, the difficult sample image may include x and x−, where x is the reference sample image and x− is any frame sample image in the first sample video whose interval from the reference sample image is greater than or equal to the target interval.
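The construction of one sample set, combining a positive pair (adjacent frame), an easy negative (frame from a different video) and a hard negative (distant frame of the same video, gated by a scene-dependent target interval), can be sketched as below. `TARGET_INTERVAL`, `build_sample_set` and the frame-id lists are hypothetical, and the intervals are expressed in frames rather than seconds for simplicity:

```python
import random

# Hypothetical interval association relationship: scene type -> target interval
# (the disclosure expresses these in seconds, e.g. 30 s vs. 15 s).
TARGET_INTERVAL = {"long_video": 30, "short_video": 15}

def build_sample_set(video_frames, other_video_frames, ref_idx, scene_type, rng):
    """Build (x, x+, easy negative, hard negative) around a reference frame."""
    x = video_frames[ref_idx]
    x_pos = video_frames[ref_idx + 1]          # adjacent frame -> positive pair
    x_easy = rng.choice(other_video_frames)    # frame from a second sample video
    gap = TARGET_INTERVAL[scene_type]
    far = [f for i, f in enumerate(video_frames) if abs(i - ref_idx) >= gap]
    x_hard = rng.choice(far)                   # distant frame of the same video
    return x, x_pos, x_easy, x_hard

rng = random.Random(0)
video = list(range(100))          # frame ids of the first sample video
other = list(range(1000, 1100))   # frame ids of a second sample video
x, x_pos, x_easy, x_hard = build_sample_set(video, other, 10, "long_video", rng)
```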
It should be noted that, for each sample set, based on the reference sample image x in the sample set, x+ may be used as a positive sample and x− as a negative sample; the similarity between x and x+ is relatively high, while the difference between x− and x is relatively large.
And S2, training the target neural network model through a plurality of sample sets to obtain the feature vector acquisition model.
The target neural network model may be a ResNet-style network. Fig. 3 is a schematic diagram of a network structure according to an exemplary embodiment of the present disclosure; as shown in fig. 3, the left side is an overall schematic of the network structure, and the right side is a schematic of one convolution process. After the sample set is input into the target neural network model, two convolution operations may first be performed on the sample images through two convolution layers; the result of these convolution operations, together with the sample set, is then processed through three convolution processes (convolution process 1, convolution process 2 and convolution process 3); finally, the output is reduced in dimension by a flatten layer and classified by a fully connected layer. The three convolution processes share the same network structure, each including a convolution layer (Conv2D), a pooling layer, layer normalization (LayerNorm) and batch normalization (BatchNorm).
In this step, the target neural network model may be trained according to a debiased contrastive loss function, with reference to model training methods in the prior art, which are not described herein again. The expression of the debiased loss function is:

L = E[ −log( e^{f(x)·f(x⁺)} / ( e^{f(x)·f(x⁺)} + Q·g ) ) ]

where the debiased negative term g is

g = max{ (1/τ⁻)·( E_{x⁻}[ e^{f(x)·f(x⁻)} ] − τ⁺·E_v[ e^{f(x)·f(v)} ] ), e^{−1} }

wherein E[·] denotes expectation; f(x), f(x⁺), f(x⁻) and f(v) are image feature vectors; x is a sample image; ρ = 1/N is the probability that a selected sample image is a positive sample, N being the number of sample images in the sample set; τ⁺ is the probability of selecting a positive sample (1/M, where M is the number of sample videos to which the sample images belong); τ⁻ is the probability of selecting a negative sample ((M − 1)/M); Q is the total number of negative samples; and v denotes a sample drawn from the positive-sample distribution.
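A minimal NumPy sketch of a debiased contrastive loss for a single anchor is given below. The function name and the single-sample estimate of the positive expectation (reusing the positive pair) are illustrative simplifications under stated assumptions, not the patent's implementation:

```python
import numpy as np

def debiased_contrastive_loss(z, z_pos, z_neg, tau_plus, t=1.0):
    """Debiased contrastive loss for a single anchor.

    z        : anchor feature f(x), shape (d,)
    z_pos    : positive feature f(x+), shape (d,)
    z_neg    : candidate negatives, shape (Q, d) -- may contain false negatives
    tau_plus : prior probability that a sampled "negative" is actually positive
    """
    pos = np.exp(np.dot(z, z_pos) / t)
    neg = np.exp(z_neg @ z / t)
    Q = len(z_neg)
    tau_minus = 1.0 - tau_plus
    # Debiased estimate of the true-negative similarity, clipped at its
    # theoretical floor e^{-1/t}; the positive expectation E_v is estimated
    # here from the single positive pair (a simplification).
    g = max((neg.mean() - tau_plus * pos) / tau_minus, np.exp(-1.0 / t))
    return -np.log(pos / (pos + Q * g))

rng = np.random.default_rng(0)
z = rng.normal(size=8); z /= np.linalg.norm(z)
z_pos = z + 0.05 * rng.normal(size=8); z_pos /= np.linalg.norm(z_pos)
z_neg = rng.normal(size=(16, 8))
z_neg /= np.linalg.norm(z_neg, axis=1, keepdims=True)
loss = debiased_contrastive_loss(z, z_pos, z_neg, tau_plus=1 / 5)
```

Because the denominator always exceeds the numerator, the loss is strictly positive; it shrinks as the anchor-positive similarity grows relative to the debiased negative term.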
S203, clustering the image characteristic vectors corresponding to the multiple frames of images to be determined to obtain at least one clustering category.
Wherein different cluster categories may correspond to different image scenes.
In this step, a hierarchical clustering method in the prior art may be used to cluster the plurality of image feature vectors corresponding to the multiple frames of images to be determined, so as to obtain at least one cluster category, where different cluster categories correspond to different cluster sets and each cluster set includes at least one image feature vector.
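A minimal pure-NumPy sketch of threshold-based single-linkage agglomerative clustering, one common form of hierarchical clustering, is shown below; the function name, threshold and toy feature vectors are illustrative, not the disclosed implementation:

```python
import numpy as np

def agglomerative_cluster(Z, threshold):
    """Greedy single-linkage agglomerative clustering over feature vectors.

    Repeatedly merges the two closest clusters until the smallest
    inter-cluster distance exceeds `threshold`. Returns index lists,
    one per cluster category."""
    clusters = [[i] for i in range(len(Z))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(Z[i] - Z[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > threshold:
            break
        clusters[a].extend(clusters[b])
        del clusters[b]
    return [sorted(c) for c in clusters]

# Two well-separated scene groups in a 1-D feature space.
Z = np.array([[0.0], [0.1], [0.2], [5.0], [5.1]])
groups = agglomerative_cluster(Z, threshold=1.0)
```

With the toy vectors above, frames 0-2 and frames 3-4 end up in separate cluster categories, mirroring the division into image groups of S204. A production system would more likely use an optimized library routine than this O(n³) loop.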
And S204, taking the image to be determined corresponding to each cluster type as an image group.
In this step, after clustering the plurality of image feature vectors to obtain at least one cluster category, for each cluster category, an image to be determined corresponding to the cluster category may be determined according to at least one target image feature vector in a cluster set corresponding to the cluster category, and the image to be determined corresponding to the cluster category is used as an image group. For example, after clustering a plurality of image feature vectors, 5 clustering categories are obtained: the image to be determined in the cluster set corresponding to the cluster category a may be used as an image group a, the image to be determined in the cluster set corresponding to the cluster category B may be used as an image group B, the image to be determined in the cluster set corresponding to the cluster category C may be used as an image group C, the image to be determined in the cluster set corresponding to the cluster category D may be used as an image group D, and the image to be determined in the cluster set corresponding to the cluster category E may be used as an image group E.
S205, determining a cluster center corresponding to each cluster type, and taking an image to be determined corresponding to the cluster center as a center frame image of the image group corresponding to the cluster type.
In this step, after clustering the plurality of image feature vectors to obtain at least one cluster type, for each cluster type, the image feature vector of the cluster center corresponding to the cluster type can be determined, and the image to be determined corresponding to the image feature vector of the cluster center is used as the center frame image of the image group corresponding to the cluster type.
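The rule of this step can be sketched as picking, within each cluster category, the frame whose feature vector lies nearest the cluster centroid; the helper name and toy vectors are hypothetical:

```python
import numpy as np

def center_frame(Z, cluster_indices):
    """Return the frame index whose feature vector is nearest the centroid
    of the given cluster -- the center frame image of that image group."""
    members = Z[cluster_indices]
    centroid = members.mean(axis=0)
    dists = np.linalg.norm(members - centroid, axis=1)
    return cluster_indices[int(np.argmin(dists))]

Z = np.array([[0.0], [0.1], [0.4]])
cf = center_frame(Z, [0, 1, 2])
```

Here the centroid is about 0.167, so frame 1 (feature 0.1) is selected as the center frame of the group.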
S206, regarding each image group, taking the first frame image to be determined, the last frame image to be determined and the center frame image in the image group as the key frames corresponding to the multiple frames of images to be determined.
In this step, the image scenes corresponding to different image groups are different, and after the central frame image of each image group is determined, for each image group, the first frame image to be determined and the last frame image to be determined in the image group may be determined, and the first frame image to be determined, the last frame image to be determined, and the central frame image in the image group are used as the key frames corresponding to the multiple frames of images to be determined.
By adopting the above method, the image feature vector corresponding to each frame of image to be determined is obtained through the feature vector acquisition model, and these vectors represent the spatial distances between different images to be determined. The plurality of image feature vectors are then clustered so that the multiple frames of images to be determined are divided into at least one image group, where different image groups correspond to different cluster categories and different cluster categories correspond to different image scenes. The key frames corresponding to the multiple frames of images to be determined can therefore be determined from the images to be determined of each image group, so that key frames can be extracted more quickly and accurately, improving the efficiency of subsequent image analysis or image content identification.
Fig. 4 is a block diagram illustrating an apparatus for determining a key frame according to an exemplary embodiment of the present disclosure, and as shown in fig. 4, the apparatus may include:
an image obtaining module 401, configured to obtain multiple frames of images to be determined;
a feature vector obtaining module 402, configured to input multiple frames of the image to be determined into a pre-trained feature vector obtaining model, so as to obtain an image feature vector corresponding to each frame of the image to be determined, where the image feature vector is output by the feature vector obtaining model;
an image dividing module 403, configured to divide multiple frames of the image to be determined into at least one image group according to an image feature vector corresponding to each frame of the image to be determined, where image scenes corresponding to different image groups are different;
a key frame determining module 404, configured to determine, from the to-be-determined images in each image group, a key frame corresponding to the multiple frames of to-be-determined images.
Optionally, fig. 5 is a block diagram illustrating a second apparatus for determining a key frame according to an exemplary embodiment of the disclosure, and as shown in fig. 5, the apparatus may further include:
a model training module 405, configured to obtain a plurality of sample sets, where the sample sets include a positive sample image, a simple sample image, and a difficult sample image; and train the target neural network model through the plurality of sample sets to obtain the feature vector acquisition model.
Optionally, the model training module 405 is further configured to:
acquiring a plurality of reference sample images;
and acquiring the positive sample image, the simple sample image and the difficult sample image corresponding to the reference sample image for each reference sample image.
Optionally, the model training module 405 is further configured to:
taking the reference sample image and an adjacent sample image as the positive sample image, wherein the adjacent sample image is any frame of sample image adjacent to the reference sample image in the first sample video to which the reference sample image belongs;
and taking the reference sample image and any frame sample image of at least one second sample video as the simple sample image, wherein the second sample video is different from the first sample video.
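As an illustration of this sampling rule, the sketch below builds the positive and simple sample pairs from plain frame lists. The function name, the "next neighbor if available, else previous" choice of adjacent frame, and the random choice among the second videos are assumptions made for the example:

```python
import random

def build_positive_and_simple(first_video, second_videos, ref_index):
    # first_video holds the frames of the first sample video; second_videos
    # holds the frame lists of the other sample videos; ref_index locates
    # the reference sample image inside first_video.
    ref = first_video[ref_index]
    # Positive pair: the reference image plus an adjacent frame of the
    # first sample video.
    neighbor = ref_index + 1 if ref_index + 1 < len(first_video) else ref_index - 1
    positive = (ref, first_video[neighbor])
    # Simple pair: the reference image plus any frame of a second sample
    # video -- a different video, so the pair is an easy negative.
    simple = (ref, random.choice(random.choice(second_videos)))
    return positive, simple
```

Because adjacent frames of the same video are almost always near-duplicates, the positive pair needs no manual labeling, which is what makes this construction self-supervised.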
Optionally, the model training module 405 is further configured to:
determining a scene type corresponding to the first sample video according to the reference sample image;
the difficult sample image is determined based on the reference sample image and the scene type.
Optionally, the model training module 405 is further configured to:
determining a target interval according to the scene type;
and taking, as the difficult sample image, the reference sample image together with any frame sample image of the first sample video whose interval from the reference sample image is greater than or equal to the target interval.
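A minimal sketch of this selection rule follows. The scene types and interval values in `TARGET_INTERVAL` are hypothetical placeholders — the patent leaves the mapping from scene type to target interval to the implementation:

```python
# Hypothetical scene-type -> target interval table (in frames): the faster
# a scene changes, the smaller the interval after which two frames of the
# same video stop looking alike, so fast scenes get a smaller interval.
TARGET_INTERVAL = {"action": 10, "dialogue": 50, "landscape": 200}

def pick_difficult_sample(first_video, ref_index, scene_type):
    interval = TARGET_INTERVAL[scene_type]
    # Any frame whose distance from the reference frame is at least the
    # target interval qualifies as the difficult sample image.
    for i, frame in enumerate(first_video):
        if abs(i - ref_index) >= interval:
            return frame
    return None  # video too short for this scene type
```

Such a frame is "difficult" because it comes from the same video as the reference (so it may share lighting, characters, and backdrop) while being far enough away to belong to different content.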
Optionally, the image dividing module 403 is further configured to:
clustering the image feature vectors corresponding to the multiple frames of images to be determined to obtain at least one cluster category, where different cluster categories correspond to different image scenes;
and taking the images to be determined corresponding to each cluster category as one image group.
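The clustering algorithm is not fixed by the patent. The sketch below uses a deliberately minimal k-means (a real system would likely use a library implementation such as scikit-learn's) to show how frames sharing a cluster category become one image group; it assumes at least `k` input vectors:

```python
def kmeans_labels(vectors, k, iters=20):
    # Minimal k-means: seed centres from evenly spaced input vectors,
    # then alternate nearest-centre assignment and centre updates.
    step = len(vectors) // k
    centers = [list(vectors[i * step]) for i in range(k)]
    labels = [0] * len(vectors)
    for _ in range(iters):
        labels = [min(range(k),
                      key=lambda c: sum((a - b) ** 2
                                        for a, b in zip(v, centers[c])))
                  for v in vectors]
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def group_frames(frame_ids, vectors, k):
    # One image group per cluster category.
    groups = {}
    for fid, label in zip(frame_ids, kmeans_labels(vectors, k)):
        groups.setdefault(label, []).append(fid)
    return list(groups.values())
```

Because frames of the same scene produce nearby feature vectors under the trained model, each resulting group approximates one image scene.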
Optionally, fig. 6 is a block diagram illustrating a third apparatus for determining a key frame according to an exemplary embodiment of the disclosure, and as shown in fig. 6, the apparatus may further include:
a center frame image determining module 406, configured to determine, for each cluster category, the cluster center corresponding to that cluster category, and to take the image to be determined corresponding to the cluster center as the center frame image of the image group corresponding to that cluster category;
the key frame determination module 404 is further configured to:
and regarding each image group, taking the first frame image to be determined, the last frame image to be determined and the center frame image in the image group as the key frames corresponding to the multiple frames of images to be determined.
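For one image group, the selection above can be sketched as follows (feature vectors indexed by frame number; the cluster centre would come from the clustering step, and the helper name is an assumption):

```python
def select_key_frames(group_indices, vectors, cluster_center):
    # Key frames of one image group: its first frame, its last frame, and
    # the centre frame -- the group member whose feature vector lies
    # nearest the cluster centre.
    def dist2(i):
        return sum((a - b) ** 2 for a, b in zip(vectors[i], cluster_center))
    center_frame = min(group_indices, key=dist2)
    # A set removes duplicates when the centre frame coincides with the
    # first or last frame of the group.
    return sorted({group_indices[0], group_indices[-1], center_frame})
```

The first and last frames bracket the scene boundaries, while the centre frame is the group's most representative image, which is why these three suffice as key frames.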
With this apparatus, the multiple frames of images to be determined can be divided, according to the image feature vector corresponding to each frame, into at least one image group each corresponding to a different image scene, and the key frames corresponding to the multiple frames of images to be determined can be determined from each image group, so that key frames are extracted more quickly and accurately and the efficiency of subsequent image analysis or image content recognition is improved.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Referring now to fig. 7, a schematic diagram of an electronic device (e.g., a terminal device or server) 700 suitable for use in implementing embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a stationary terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 7 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 7, the electronic device 700 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 701, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 702 or a program loaded from a storage device 708 into a random access memory (RAM) 703. The RAM 703 also stores various programs and data necessary for the operation of the electronic device 700. The processing device 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
Generally, the following devices may be connected to the I/O interface 705: input devices 706 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 707 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 708 including, for example, magnetic tape, hard disk, etc.; and a communication device 709. The communication means 709 may allow the electronic device 700 to communicate wirelessly or by wire with other devices to exchange data. While fig. 7 illustrates an electronic device 700 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such embodiments, the computer program may be downloaded and installed from a network via the communication means 709, or may be installed from the storage means 708, or may be installed from the ROM 702. The computer program, when executed by the processing device 701, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some implementations, the electronic devices may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communications network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquiring a plurality of frames of images to be determined; inputting a plurality of frames of images to be determined into a pre-trained feature vector acquisition model so as to acquire image feature vectors corresponding to each frame of images to be determined output by the feature vector acquisition model; dividing a plurality of frames of images to be determined into at least one image group according to the image characteristic vector corresponding to each frame of images to be determined, wherein the image scenes corresponding to different image groups are different; and determining key frames corresponding to the multiple frames of images to be determined from the images to be determined in each image group.
Computer program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" programming language or similar languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented by software or hardware. The name of a module does not in some cases constitute a limitation of the module itself, and for example, an image acquisition module may also be described as a "module that acquires a plurality of images to be determined".
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Example 1 provides, in accordance with one or more embodiments of the present disclosure, a method of determining a key frame, the method comprising: acquiring a plurality of frames of images to be determined; inputting a plurality of frames of images to be determined into a pre-trained feature vector acquisition model so as to acquire image feature vectors corresponding to each frame of images to be determined output by the feature vector acquisition model; dividing a plurality of frames of images to be determined into at least one image group according to the image characteristic vector corresponding to each frame of images to be determined, wherein the image scenes corresponding to different image groups are different; and determining key frames corresponding to the multiple frames of images to be determined from the images to be determined in each image group.
Example 2 provides the method of example 1, the feature vector acquisition model being trained by: obtaining a plurality of sample sets, wherein the sample sets comprise a positive sample image, a simple sample image and a difficult sample image; and training a target neural network model through a plurality of sample sets to obtain the feature vector acquisition model.
Example 3 provides the method of example 2, the obtaining a plurality of sample sets comprising: acquiring a plurality of reference sample images; and acquiring the positive sample image, the simple sample image and the difficult sample image corresponding to the reference sample image for each reference sample image.
Example 4 provides the method of example 3, wherein the obtaining of the positive sample image and the simple sample image corresponding to the reference sample image includes: taking the reference sample image and an adjacent sample image as the positive sample image, wherein the adjacent sample image is any frame sample image adjacent to the reference sample image in the first sample video to which the reference sample image belongs; and taking the reference sample image and any frame sample image of at least one second sample video as the simple sample image, wherein the second sample video is different from the first sample video.
Example 5 provides the method of example 4, the obtaining the difficult sample image corresponding to the reference sample image including: determining a scene type corresponding to the first sample video according to the reference sample image; and determining the difficult sample image according to the reference sample image and the scene type.
Example 6 provides the method of example 5, the determining the difficult sample image from the reference sample image and the scene type including: determining a target interval according to the scene type; and taking any frame sample image of the reference sample image and the first sample video, the interval between which and the reference sample image is larger than or equal to the target interval, as the difficult sample image.
Example 7 provides the method of any one of examples 1 to 6, wherein dividing the plurality of frames of the image to be determined into at least one image group according to the image feature vector corresponding to the image to be determined of each frame includes: clustering the image characteristic vectors corresponding to the multiple frames of images to be determined to obtain at least one clustering category, wherein different clustering categories correspond to different image scenes; and taking the image to be determined corresponding to each cluster category as an image group.
Example 8 provides, in accordance with one or more embodiments of the present disclosure, the method of example 7, wherein before determining, from the images to be determined in each of the image groups, the key frames corresponding to the plurality of frames of images to be determined, the method further includes: determining, for each cluster type, a cluster center corresponding to the cluster type, and taking the image to be determined corresponding to the cluster center as the center frame image of the image group corresponding to the cluster type; and the determining, from the images to be determined in each of the image groups, the key frames corresponding to the plurality of frames of images to be determined includes: for each image group, taking the first frame image to be determined, the last frame image to be determined, and the center frame image in the image group as the key frames corresponding to the plurality of frames of images to be determined.
Example 9 provides, in accordance with one or more embodiments of the present disclosure, an apparatus to determine a key frame, the apparatus comprising: the image acquisition module is used for acquiring a plurality of frames of images to be determined; the characteristic vector acquisition module is used for inputting a plurality of frames of images to be determined into a pre-trained characteristic vector acquisition model so as to acquire image characteristic vectors corresponding to each frame of images to be determined output by the characteristic vector acquisition model; the image dividing module is used for dividing the multi-frame image to be determined into at least one image group according to the image characteristic vector corresponding to each frame of image to be determined, wherein the image scenes corresponding to different image groups are different; and the key frame determining module is used for determining the key frames corresponding to the plurality of frames of images to be determined from the images to be determined in each image group.
Example 10 provides a computer-readable medium having stored thereon a computer program that, when executed by a processing device, performs the steps of the method of any of examples 1-8, in accordance with one or more embodiments of the present disclosure.
Example 11 provides, in accordance with one or more embodiments of the present disclosure, an electronic device, comprising: a storage device having a computer program stored thereon; processing means for executing said computer program in said storage means to carry out the steps of the method of any of examples 1-8.
The foregoing description is merely a description of preferred embodiments of the present disclosure and of the technical principles employed. Those skilled in the art will appreciate that the scope of the disclosure is not limited to technical solutions formed by the particular combination of features described above, but also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the concept of the disclosure — for example, a technical solution formed by replacing the above features with (but not limited to) features having similar functions disclosed in the present disclosure.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims (11)

1. A method of determining a key frame, the method comprising:
acquiring a plurality of frames of images to be determined;
inputting a plurality of frames of images to be determined into a pre-trained feature vector acquisition model so as to acquire image feature vectors corresponding to each frame of images to be determined output by the feature vector acquisition model;
dividing a plurality of frames of images to be determined into at least one image group according to the image characteristic vector corresponding to each frame of images to be determined, wherein the image scenes corresponding to different image groups are different;
and determining key frames corresponding to the multiple frames of images to be determined from the images to be determined in each image group.
2. The method of claim 1, wherein the feature vector acquisition model is trained by:
obtaining a plurality of sample sets, wherein the sample sets comprise a positive sample image, a simple sample image and a difficult sample image;
and training a target neural network model through a plurality of sample sets to obtain the feature vector acquisition model.
3. The method of claim 2, wherein the obtaining a plurality of sample sets comprises:
acquiring a plurality of reference sample images;
and acquiring the positive sample image, the simple sample image and the difficult sample image corresponding to the reference sample image for each reference sample image.
4. The method according to claim 3, wherein the acquiring the positive sample image and the simple sample image corresponding to the reference sample image comprises:
taking the reference sample image and an adjacent sample image as the positive sample image, wherein the adjacent sample image is any frame sample image adjacent to the reference sample image in the first sample video to which the reference sample image belongs;
and taking any one frame sample image of the reference sample image and at least one second sample video as the simple sample image, wherein the second sample video is different from the first sample video.
5. The method of claim 4, wherein obtaining the difficult sample image corresponding to the reference sample image comprises:
determining a scene type corresponding to the first sample video according to the reference sample image;
and determining the difficult sample image according to the reference sample image and the scene type.
6. The method of claim 5, wherein determining the difficult sample image from the reference sample image and the scene type comprises:
determining a target interval according to the scene type;
and taking any frame sample image of the reference sample image and the first sample video, the interval between which and the reference sample image is larger than or equal to the target interval, as the difficult sample image.
7. The method according to any one of claims 1 to 6, wherein the dividing the plurality of frames of the image to be determined into at least one image group according to the image feature vector corresponding to each frame of the image to be determined comprises:
clustering the image characteristic vectors corresponding to the multiple frames of images to be determined to obtain at least one clustering category, wherein different clustering categories correspond to different image scenes;
and taking the image to be determined corresponding to each cluster category as an image group.
8. The method according to claim 7, wherein before determining, from the images to be determined in each of the image groups, the key frame corresponding to the plurality of frames of images to be determined, the method further comprises:
determining a cluster center corresponding to each cluster type, and taking an image to be determined corresponding to the cluster center as a center frame image of the image group corresponding to the cluster type;
the determining, from the to-be-determined images in each of the image groups, key frames corresponding to the plurality of frames of to-be-determined images includes:
and regarding each image group, taking the first frame image to be determined, the last frame image to be determined and the center frame image in the image group as the key frames corresponding to the multiple frames of images to be determined.
9. An apparatus for determining a key frame, the apparatus comprising:
the image acquisition module is used for acquiring a plurality of frames of images to be determined;
the characteristic vector acquisition module is used for inputting a plurality of frames of images to be determined into a pre-trained characteristic vector acquisition model so as to acquire image characteristic vectors corresponding to each frame of images to be determined output by the characteristic vector acquisition model;
the image dividing module is used for dividing the multi-frame image to be determined into at least one image group according to the image characteristic vector corresponding to each frame of image to be determined, wherein the image scenes corresponding to different image groups are different;
and the key frame determining module is used for determining the key frames corresponding to the plurality of frames of images to be determined from the images to be determined in each image group.
10. A computer-readable medium, on which a computer program is stored, characterized in that the program, when being executed by processing means, carries out the steps of the method of any one of claims 1 to 8.
11. An electronic device, comprising:
a storage device having a computer program stored thereon;
processing means for executing the computer program in the storage means to carry out the steps of the method according to any one of claims 1 to 8.
CN202111539658.6A 2021-12-15 2021-12-15 Method, device, readable medium and electronic equipment for determining key frame Pending CN114187557A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111539658.6A CN114187557A (en) 2021-12-15 2021-12-15 Method, device, readable medium and electronic equipment for determining key frame

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111539658.6A CN114187557A (en) 2021-12-15 2021-12-15 Method, device, readable medium and electronic equipment for determining key frame

Publications (1)

Publication Number Publication Date
CN114187557A true CN114187557A (en) 2022-03-15

Family

ID=80544077

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111539658.6A Pending CN114187557A (en) 2021-12-15 2021-12-15 Method, device, readable medium and electronic equipment for determining key frame

Country Status (1)

Country Link
CN (1) CN114187557A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115115822A (en) * 2022-06-30 2022-09-27 小米汽车科技有限公司 Vehicle-end image processing method and device, vehicle, storage medium and chip
CN115115822B (en) * 2022-06-30 2023-10-31 小米汽车科技有限公司 Vehicle-end image processing method and device, vehicle, storage medium and chip

Similar Documents

Publication Publication Date Title
CN111476309B (en) Image processing method, model training method, device, equipment and readable medium
CN111340131B (en) Image labeling method and device, readable medium and electronic equipment
CN112184738B (en) Image segmentation method, device, equipment and storage medium
CN112364829B (en) Face recognition method, device, equipment and storage medium
CN113222983A (en) Image processing method, image processing device, readable medium and electronic equipment
CN113449070A (en) Multimodal data retrieval method, device, medium and electronic equipment
CN114282581A (en) Training sample obtaining method and device based on data enhancement and electronic equipment
CN112330788A (en) Image processing method, image processing device, readable medium and electronic equipment
CN115294501A (en) Video identification method, video identification model training method, medium and electronic device
CN113033707B (en) Video classification method and device, readable medium and electronic equipment
CN114494709A (en) Feature extraction model generation method, image feature extraction method and device
CN114187557A (en) Method, device, readable medium and electronic equipment for determining key frame
CN112949430A (en) Video processing method and device, storage medium and electronic equipment
CN110348369B (en) Video scene classification method and device, mobile terminal and storage medium
CN114004229A (en) Text recognition method and device, readable medium and electronic equipment
CN112488204A (en) Training sample generation method, image segmentation method, device, equipment and medium
CN113255812A (en) Video frame detection method and device and electronic equipment
CN111353536B (en) Image labeling method and device, readable medium and electronic equipment
CN114612909A (en) Character recognition method and device, readable medium and electronic equipment
CN110348374B (en) Vehicle detection method and device, electronic equipment and storage medium
CN113971402A (en) Content identification method, device, medium and electronic equipment
CN113033680A (en) Video classification method and device, readable medium and electronic equipment
CN113420723A (en) Method and device for acquiring video hotspot, readable medium and electronic equipment
CN113033682B (en) Video classification method, device, readable medium and electronic equipment
CN112434644A (en) Vehicle image processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination