WO2022247406A1

WO2022247406A1 - Systems and methods for determining key frame images of video data

Info

Publication number: WO2022247406A1
Application number: PCT/CN2022/081557
Authority: WO
Inventors: Qiuchen SUN; Heqing Li; Xiaobiao CHEN; Jianchao Li; Hongxiang QIU; Jinlong Zhang
Original assignee: Zhejiang Dahua Technology Co., Ltd.
Priority date: 2021-05-26
Filing date: 2022-03-17
Publication date: 2022-12-01
Also published as: CN113542868A

Abstract

A method and system for obtaining a key frame may be provided. A target frame image of video data may be obtained. A motion amplitude of one or more target subjects in the target frame image may be determined based on the target frame image and a determined key frame adjacent to the target frame image. The target frame image may be designated as a key fame image based on the motion amplitude of one or more target subjects in the target frame image.

Description

SYSTEMS AND METHODS FOR DETERMINING KEY FRAME IMAGES OF VIDEO DATA

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to Chinese Application No. 202110580563.2 filed on May 26, 2021, the entire contents of which are hereby incorporated by reference.

TECHNICAL FIELD

The present disclosure relates to image processing, and in particular, to systems and methods for determining one or more key frame images of video data.

BACKGROUND

A video stream may include multiple key frames and multiple non-key frames (e.g, . P-frames, B-frames, etc. ) . Key frames are the least compressible but don't require other frames to decode. Non-key frames may be decoded based on the key frames. Therefore, key frames of a video stream are vital to the video stream. It is desirable to provide systems and methods for accurately determining key frames of a video stream.

SUMMARY

According to an aspect of the present disclosure, a system for obtaining a key frame may be provided. The system may include at least one storage device and at least one processor configured to communicate with the at least one storage device. The at least one storage device may store a set of instructions. When the at least one processor execute the set of instructions, the at least one processor may be directed to cause the system to perform one or more of the following operations. The system may obtain a target frame image of video data. The system may determine a motion amplitude of one or more target subjects in the target frame image based on the target frame image and a determined key frame adjacent to the target frame image. The system may designate the target frame image as a key fame image based on the motion amplitude of one or more target subjects in the target frame image.

In some embodiments, to obtain a target frame image of video data, the system may determine whether a frame image to be determined of the video data is the last frame image of the video data. In response to determining that the frame image to be determined of the video data is not the last frame image of the video data, the system may further designate the frame image to be determined as the target frame image.

In some embodiments, the at least one processor may be directed to cause the system to further perform the following operations. The system may determine a clarity of the target frame image. The system may further designate the target frame image as the key fame image based on the clarity of the target frame image and the motion amplitude of one or more target subjects in the target frame image.

In some embodiments, to determine a clarity of the target frame image, the system may determine a single-channel gray image of the target frame image. The system may further determine the clarity of the target frame image based on the single-channel gray image of the target frame image.

In some embodiments, to determine a clarity of the target frame image, the system may determine the clarity of the target frame image according to a Laplace gradient function algorithm.

In some embodiments, to designate the target frame image as a key fame image based on the clarity of the target frame image and the motion amplitude of one or more target subjects in the target frame image, the system may compare the clarity of the target frame image with a clarity threshold. The system may also compare the motion amplitude of the one or more target subjects with an amplitude threshold. In response to determining that the clarity of the target frame image is greater than the clarity threshold and the motion amplitude of the one or more target subjects is greater than the amplitude threshold, the system may further designate the target frame image as the key fame image.

In some embodiments, the amplitude threshold may be associated with at least one of a size of the video data, a size of the one or more target subjects, a count of the one or more target subjects, or a count of one or more frame images between the target frame image and the key fame image adjacent to the target frame image.

In some embodiments, to designate the target frame image as a key fame image based on the clarity of the target frame image and the motion amplitude of one or more target subjects in the target frame image, the system may determine a clarity of each frame image of the video data. The system may also determine an average of the clarities of all frame images of the video data based on the clarity of each frame image of the video data. The system may further designate the average as the clarity threshold.

In some embodiments, to determine a motion amplitude of one or more target subjects in the target frame image based on the target frame image and a determined key frame adjacent to the target frame image, the system may extract one or more first motion target regions of the target frame image. The one or more first motion target regions may include the one or more target subjects in the target frame image. The system may also extract one or more second motion target regions of the determined key frame image adjacent to the target frame image. Each of the one or more second motion target regions may correspond one of the one or more first motion target regions. The system may determine an optical flow value between the target frame image and the determined key frame image adjacent to the target frame image based on the one or more first motion target regions and the one or more second motion target regions. The system may further determine the motion amplitude of the one or more target subjects in the target frame image based on the optical flow value between the target frame image and the determined key frame image adjacent to the target frame image.

In some embodiments, to extract one or more first motion target regions of the target frame image or extracting one or more second motion target regions of the determined key frame image adjacent to the target frame image, the system may determine a difference image of the target frame image or the determined key frame image using a background-difference algorithm. The system may also determine a binary image of the target frame image or the determined key frame image by performing a binarization operation on the difference image. The system may determine one or more connected regions of the binary image by performing a morphological filtering operation on the binary image. The system may also determine a first bounding box corresponding to each of the one or more connected regions based on the one or more connected regions. The system may further extract the one or more first motion target regions or the one or more second motion target regions based on the first bounding box corresponding to each connected region.

In some embodiments, to extract the one or more first motion target regions or the one or more second motion target regions based on the first bounding box corresponding to each connected region, for each of the one or more connected regions in the target frame image or the determined key frame image, the system may determine a second bounding box of the connected region by extending the first bounding box by one or more pixels in a first direction and a second direction. The system may further extract the one or more first motion target regions or the one or more second motion target regions based on the second bounding box corresponding to each connected region.

In some embodiments, to determine an optical flow value between the target frame image and the determined key frame image adjacent to the target frame image, based on the one or more first motion target regions and the one or more second motion target regions, the system may determine an optical flow value between each of the one or more first motion target regions and a second motion target region corresponding to the first motion target region. The system may further determine the optical flow value between the target frame image and the determined key frame image adjacent to the target frame image based on the optical flow value between each of the one or more first motion target regions and the second motion target region corresponding to the first motion target region.

In some embodiments, to determine the optical flow value between the target frame image and the determined key frame image adjacent to the target frame image based on the optical flow value between each of the one or more first motion target regions and the second motion target region corresponding to the first motion target region, the system may determine the optical flow value between the target frame image and the determined key frame image adjacent to the target frame image by summing the optical flow value between each of the one or more first motion target regions and the second motion target region corresponding to the first motion target region.

In some embodiments, to determine an optical flow value between each of the one or more first motion target regions and a second motion target region corresponding to the first motion target region, the system may perform the following operations. For each first pixel in the first motion target region, the system may determine a first optical flow value between the first pixel and a second pixel corresponding to the first pixel in the corresponding second motion target region in a third direction. For each first pixel in the first motion target region, the system may determine a second optical flow value between the first pixel and the second pixel corresponding to the first pixel in the corresponding second motion target region in a fourth direction. The system may further determine the optical flow value between the first motion target region and the second motion target region corresponding to the first motion target region based on the first optical flow value and the second optical flow value between each first pixel and the second pixel corresponding to the first pixel.

According to another aspect of the present disclosure, a method for obtaining a key frame may be provided. The method may include obtaining a target frame image of video data. The method may also include determining a motion amplitude of one or more target subjects in the target frame image based on the target frame image and a determined key frame adjacent to the target frame image. The method may further include designating the target frame image as a key fame image based on the motion amplitude of one or more target subjects in the target frame image.

According to yet another aspect of the present disclosure, a system for obtaining a key frame may be provided. The system may include an acquisition module and a determination module. The acquisition module may be configured to obtain a target frame image of video data. The determination module may be configured to determine a motion amplitude of one or more target subjects in the target frame image based on the target frame image and a determined key frame adjacent to the target frame image. The determination module may also be configured to designate, based on the motion amplitude of one or more target subjects in the target frame image, the target frame image as a key fame image.

According to yet another aspect of the present disclosure, a non-transitory computer readable medium may be provided. The non-transitory computer readable medium may include a set of instruction for obtaining a key frame. When executed by at least one processor of a computing device, the set of instructions may cause the computing device to perform a method. The method may include obtaining a target frame image of video data. The method may also include determining a motion amplitude of one or more target subjects in the target frame image based on the target frame image and a determined key frame adjacent to the target frame image. The method may further include designating the target frame image as a key fame image based on the motion amplitude of one or more target subjects in the target frame image.

Additional features will be set forth in part in the description which follows, and in part will become apparent to those skilled in the art upon examination of the following and the accompanying drawings or may be learned by production or operation of the examples. The features of the present disclosure may be realized and attained by practice or use of various aspects of the methodologies, instrumentalities, and combinations set forth in the detailed examples discussed below.

BRIEF DESCRIPTION OF THE DRAWINGS

The present disclosure is further described in terms of exemplary embodiments. These exemplary embodiments are described in detail with reference to the drawings. These embodiments are non-limiting exemplary embodiments, in which like reference numerals represent similar structures throughout the several views of the drawings, and wherein:

FIG. 1 is a schematic diagram illustrating an exemplary image processing system according to some embodiments of the present disclosure;

FIG. 2 is a schematic diagram illustrating exemplary hardware and/or software components of an exemplary computing device according to some embodiments of the present disclosure;

FIG. 3 is a schematic diagram illustrating exemplary hardware and/or software components of an exemplary mobile device according to some embodiments of the present disclosure;

FIG. 4A is a block diagram illustrating exemplary processing device 112A for determining a key frame image of video data according to some embodiments of the present disclosure.

FIG. 4B is a block diagram illustrating exemplary processing device 112B for training a preliminary model according to some embodiments of the present disclosure;

FIG. 5 is a flowchart illustrating an exemplary process for determining a key frame image of video data according to some embodiments of the present disclosure;

FIG. 6 is a flowchart illustrating an exemplary process for determining a motion amplitude of one or more target subjects in a target frame image according to some embodiments of the present disclosure; and

FIG. 7 is a flowchart illustrating an exemplary process for determining a key frame image of video data according to some embodiments of the present disclosure.

DETAILED DESCRIPTION

In the following detailed description, numerous specific details are set forth by way of examples in order to provide a thorough understanding of the relevant disclosure. However, it should be apparent to those skilled in the art that the present disclosure may be practiced without such details. In other instances, well-known methods, procedures, systems, components, and/or circuitry have been described at a relatively high-level, without detail, in order to avoid unnecessarily obscuring aspects of the present disclosure. Various modifications to the disclosed embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the present disclosure. Thus, the present disclosure is not limited to the embodiments shown, but to be accorded the widest scope consistent with the claims.

It will be understood that the terms “system, ” “engine, ” “unit, ” “module, ” and/or “block” used herein are one method to distinguish different components, elements, parts, sections, or assemblies of different levels in ascending order. However, the terms may be displaced by other expressions if they may achieve the same purpose.

Generally, the words “module, ” “unit, ” or “block” used herein, refer to logic embodied in hardware or firmware, or to a collection of software instructions. A module, a unit, or a block described herein may be implemented as software and/or hardware and may be stored in any type of non-transitory computer-readable medium or other storage devices. In some embodiments, a software module/unit/block may be compiled and linked into an executable program. It will be appreciated that software modules can be callable from other modules/units/blocks or from themselves, and/or may be invoked in response to detected events or interrupts. Software modules/units/blocks configured for execution on computing devices (e.g., processor 220 illustrated in FIG. 2) may be provided on a computer-readable medium, such as a compact disc, a digital video disc, a flash drive, a magnetic disc, or any other tangible medium, or as a digital download (and can be originally stored in a compressed or installable format that needs installation, decompression, or decryption prior to execution) . Such software code may be stored, partially or fully, on a storage device of the executing computing device, for execution by the computing device. Software instructions may be embedded in firmware, such as an EPROM. It will be further appreciated that hardware modules (or units or blocks) may be included in connected logic components, such as gates and flip-flops, and/or can be included in programmable units, such as programmable gate arrays or processors. The modules (or units or blocks) or computing device functionality described herein may be implemented as software modules (or units or blocks) , but may be represented in hardware or firmware. In general, the modules (or units or blocks) described herein refer to logical modules (or units or blocks) that may be combined with other modules (or units or blocks) or divided into sub-modules (or sub-units or sub-blocks) despite their physical organization or storage.

It will be understood that when a unit, an engine, a module, or a block is referred to as being “on, ” “connected to, ” or “coupled to” another unit, engine, module, or block, it may be directly on, connected or coupled to, or communicate with the other unit, engine, module, or block, or an intervening unit, engine, module, or block may be present, unless the context clearly indicates otherwise. As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items.

The terminology used herein is for the purposes of describing particular examples and embodiments only and is not intended to be limiting. As used herein, the singular forms “a, ” “an, ” and “the” may be intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “include” and/or “comprise, ” when used in this disclosure, specify the presence of integers, devices, behaviors, stated features, steps, elements, operations, and/or components, but do not exclude the presence or addition of one or more other integers, devices, behaviors, features, steps, elements, operations, components, and/or groups thereof.

In addition, it should be understood that in the description of the present disclosure, the terms “first” , “second” , or the like, are only used for the purpose of differentiation, and cannot be interpreted as indicating or implying relative importance, nor can be understood as indicating or implying the order.

The flowcharts used in the present disclosure illustrate operations that systems implement according to some embodiments of the present disclosure. It is to be expressly understood, the operations of the flowcharts may be implemented not in order. Conversely, the operations may be implemented in an inverted order, or simultaneously. Moreover, one or more other operations may be added to the flowcharts. One or more operations may be removed from the flowcharts.

As used in the present disclosure, terms “frame image” and “frame” can be used interchangeably, indicating an image of a video stream.

An aspect of the present disclosure relates to systems and methods for determining a key frame image of video data. The systems may obtain a target frame image of video data. The systems may determine a motion amplitude of one or more target subjects in the target frame image based on the target frame image and a determined key frame adjacent to the target frame image. Then, the systems may determine whether the motion amplitude of one or more target subjects in the target frame image is greater than an amplitude threshold. In response to determining that the motion amplitude of one or more target subjects in the target frame image is greater than an amplitude threshold, the systems may designate the target frame image as a key fame image of the video data.

Conventionally, whether a frame image is a key frame image is determined using a Lucas-Kanade algorithm. An optical flow field obtained using the field Lucas-Kanade algorithm is a sparse optical flow field. However, only whether one or more subjects in the frame image are moving can be determined and the motion amplitude of one or more subjects is not accurately determined according to the sparse optical flow field. Compared with the conventional approach for determining key frame images, systems and methods of the present disclosure may obtain more accurate key frame images to reduce the redundancy of the key frame images.

Moreover, according to some embodiments of the present disclosure, before the motion amplitude of one or more target subjects in the target frame image is determined, the systems may determine whether the clarity of the target frame image is greater than a clarity threshold. In response to determining that the clarity of the target frame image is greater than the clarity threshold, the systems may further determine the motion amplitude of one or more target subjects in the target frame image, and designate the target frame image as the key fame image based on the motion amplitude of one or more target subjects in the target frame image. In this way, if the clarity of the target frame image is not greater than the clarity threshold, the target frame image may be directly determined a non-key frame image, and the systems does not need to further determine the motion amplitude of one or more target subjects, which may further improve the accuracy and efficiency of the determining of the key frame image.

FIG. 1 is a schematic diagram illustrating an exemplary image processing system 100 according to some embodiments of the present disclosure. In some embodiments, the image processing system 100 may be applied in various application scenarios, for example, video data storage, video data transmission, etc. As shown in FIG. 1, the image processing system 100 may include a server 110, a network 120, a capturing device 130, a terminal 140, and a storage device 150.

The server 110 may be a single server or a server group. The server group may be centralized or distributed (e.g., the server 110 may be a distributed system) . In some embodiments, the server 110 may be local or remote. For example, the server 110 may access information and/or data stored in the capturing device 130, the terminal 140, and/or the storage device 150 via the network 120. As another example, the server 110 may be directly connected to the capturing device 130, the terminal 140, and/or the storage device 150 to access stored information and/or data. In some embodiments, the server 110 may be implemented on a cloud platform. Merely by way of example, the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an inter-cloud, a multi-cloud, or the like, or any combination thereof. In some embodiments, the server 110 may be implemented on a computing device 200 including one or more components illustrated in FIG. 2 of the present disclosure.

In some embodiments, the server 110 may include a processing device 112. The processing device 112 may process information and/or data relating to video data to perform one or more functions described in the present disclosure. For example, the processing device 112 may obtain a frame image of video data captured by the capturing device 130. The processing device 112 may determine a motion amplitude of one or more subjects in the frame image based on the frame image and a determined key frame adjacent to the frame image. Further, the processing device 112 may determine whether the frame image is a key fame image based on the motion amplitude of one or more subjects in the frame image. In some embodiments, the processing device 112 may include one or more processing devices (e.g., single-core processing device (s) or multi-core processor (s) ) . Merely by way of example, the processing device 112 may include a central processing unit (CPU) , an application-specific integrated circuit (ASIC) , an application-specific instruction-set processor (ASIP) , a graphics processing unit (GPU) , a physics processing unit (PPU) , a digital signal processor (DSP) , a field-programmable gate array (FPGA) , a programmable logic device (PLD) , a controller, a microcontroller unit, a reduced instruction set computer (RISC) , a microprocessor, or the like, or any combination thereof.

In some embodiments, the server 110 may be unnecessary and all or part of the functions of the server 110 may be implemented by other components (e.g., the capturing device 130, the terminal 140) of the image processing system 100. For example, the processing device 112 may be integrated into the capturing device 130 or the terminal 140 and the functions of the processing device 112 may be implemented by the capturing device 130 or the terminal 140.

The network 120 may facilitate exchange of information and/or data for the image processing system 100. In some embodiments, one or more components (e.g., the server 110, the capturing device 130, the terminal 140, the storage device 150) of the image processing system 100 may transmit information and/or data to other component (s) of the image processing system 100 via the network 120. For example, the server 110 may obtain the video data from the capturing device 130 via the network 120. As another example, the server 110 may transmit the video data to the terminal 140 via the network 120. In some embodiments, the network 120 may be any type of wired or wireless network, or combination thereof. Merely by way of example, the network 120 may include a cable network (e.g., a coaxial cable network) , a wireline network, an optical fiber network, a telecommunications network, an intranet, an Internet, a local area network (LAN) , a wide area network (WAN) , a wireless local area network (WLAN) , a metropolitan area network (MAN) , a public telephone switched network (PSTN) , a Bluetooth network, a ZigBee network, a near field communication (NFC) network, or the like, or any combination thereof.

The capturing device 130 may be configured to acquire video data. In some embodiments, the capturing device 130 may include a camera 130-1, a video recorder 130-2, an image sensor 130-3, etc. The camera 130-1 may include a gun camera, a dome camera, an integrated camera, a monocular camera, a binocular camera, a multi-view camera, or the like, or any combination thereof. The video recorder 130-2 may include a PC Digital Video Recorder (DVR) , an embedded DVR, or the like, or any combination thereof. The image sensor 130-3 may include a Charge Coupled Device (CCD) image sensor, a Complementary Metal Oxide Semiconductor (CMOS) image sensor, or the like, or any combination thereof. In some embodiments, the capturing device 130 may include a plurality of components each of which can acquire video data. For example, the capturing device 130 may include a plurality of sub-cameras that can capture videos simultaneously. In some embodiments, the capturing device 130 may transmit the acquired videos to one or more components (e.g., the server 110, the terminal 140, the storage device 150) of the image processing system 100 via the network 120.

The terminal 140 may be configured to receive information and/or data from the server 110, the capturing device 130, and/or the storage device 150, via the network 120. For example, the terminal 140 may receive the video data from the server 110. In some embodiments, the terminal 140 may process information and/or data received from the server 110, the capturing device 130, and/or the storage device 150, via the network 120. In some embodiments, the terminal 140 may provide a user interface via which a user may view information and/or input data and/or instructions to the image processing system 100. For example, the user may view the video data via the user interface. In some embodiments, the terminal 140 may include a mobile phone 140-1, a computer 140-2, a wearable device 140-3, or the like, or any combination thereof. In some embodiments, the terminal 140 may include a display that can display information in a human-readable form, such as text, image, audio, video, graph, animation, or the like, or any combination thereof. The display of the terminal 140 may include a cathode ray tube (CRT) display, a liquid crystal display (LCD) , a light-emitting diode (LED) display, a plasma display panel (PDP) , a three-dimensional (3D) display, or the like, or a combination thereof.

The storage device 150 may be configured to store data and/or instructions. The data and/or instructions may be obtained from, for example, the server 110, the capturing device 130, and/or any other component of the image processing system 100. In some embodiments, the storage device 150 may store data and/or instructions that the server 110 may execute or use to perform exemplary methods described in the present disclosure. In some embodiments, the storage device 150 may include a mass storage, a removable storage, a volatile read-and-write memory, a read-only memory (ROM) , or the like, or any combination thereof. Exemplary mass storage may include a magnetic disk, an optical disk, a solid-state drive, etc. Exemplary removable storage may include a flash drive, a floppy disk, an optical disk, a memory card, a zip disk, a magnetic tape, etc. Exemplary volatile read-and-write memory may include a random access memory (RAM) . Exemplary RAM may include a dynamic RAM (DRAM) , a double date rate synchronous dynamic RAM (DDR SDRAM) , a static RAM (SRAM) , a thyristor RAM (T-RAM) , and a zero-capacitor RAM (Z-RAM) , etc. Exemplary ROM may include a mask ROM (MROM) , a programmable ROM (PROM) , an erasable programmable ROM (EPROM) , an electrically erasable programmable ROM (EEPROM) , a compact disk ROM (CD-ROM) , and a digital versatile disk ROM, etc. In some embodiments, the storage device 150 may be implemented on a cloud platform. Merely by way of example, the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an inter-cloud, a multi-cloud, or the like, or any combination thereof.

In some embodiments, the storage device 150 may be connected to the network 120 to communicate with one or more components (e.g., the server 110, the capturing device 130, the terminal 140) of the image processing system 100. One or more components of the image processing system 100 may access the data or instructions stored in the storage device 150 via the network 120. In some embodiments, the storage device 150 may be directly connected to or communicate with one or more components (e.g., the server 110, the capturing device 130, the terminal 140) of the image processing system 100. In some embodiments, the storage device 150 may be part of other components of the image processing system 100, such as the server 110, the capturing device 130, or the terminal 140.

It should be noted that the above description is merely provided for the purposes of illustration, and not intended to limit the scope of the present disclosure. For persons having ordinary skills in the art, multiple variations and modifications may be made under the teachings of the present disclosure. However, those variations and modifications do not depart from the scope of the present disclosure.

FIG. 2 is a schematic diagram illustrating exemplary hardware and/or software components of an exemplary computing device 200 according to some embodiments of the present disclosure. In some embodiments, the server 110 and/or the capturing device 130 may be implemented on the computing device 200. For example, the processing device 112 may be implemented on the computing device 200 and configured to perform functions of the processing device 112 disclosed in this disclosure.

The computing device 200 may be used to implement any component of the image processing system 100 as described herein. For example, the processing device 112 may be implemented on the computing device 200, via its hardware, software program, firmware, or a combination thereof. Although only one such computer is shown, for convenience, the computer functions relating to object measurement as described herein may be implemented in a distributed fashion on a number of similar platforms to distribute the processing load.

The computing device 200, for example, may include communication (COM) ports 250 connected to and from a network connected thereto to facilitate data communications. The computing device 200 may also include a processor (e.g., a processor 220) , in the form of one or more processors (e.g., logic circuits) , for executing program instructions. For example, the processor 220 may include interface circuits and processing circuits therein. The interface circuits may be configured to receive electronic signals from a bus 210, wherein the electronic signals encode structured data and/or instructions for the processing circuits to process. The processing circuits may conduct logic calculations, and then determine a conclusion, a result, and/or an instruction encoded as electronic signals. Then the interface circuits may send out the electronic signals from the processing circuits via the bus 210.

The computing device 200 may further include program storage and data storage of different forms including, for example, a disk 270, a read-only memory (ROM) 230, or a random-access memory (RAM) 240, for storing various data files to be processed and/or transmitted by the computing device 200. The computing device 200 may also include program instructions stored in the ROM 230, RAM 240, and/or another type of non-transitory storage medium to be executed by the processor 220. The methods and/or processes of the present disclosure may be implemented as the program instructions. The computing device 200 may also include an I/O component 260, supporting input/output between the computing device 200 and other components. The computing device 200 may also receive programming and data via network communications.

Merely for illustration, only one processor is illustrated in FIG. 2. Multiple processors 220 are also contemplated; thus, operations and/or method steps performed by one processor 220 as described in the present disclosure may also be jointly or separately performed by the multiple processors. For example, if in the present disclosure the processor 220 of the computing device 200 executes both step A and step B, it should be understood that step A and step B may also be performed by two different processors 220 jointly or separately in the computing device 200 (e.g., a first processor executes step A and a second processor executes step B, or the first and second processors jointly execute steps A and B) .

FIG. 3 is a schematic diagram illustrating exemplary hardware and/or software components of an exemplary mobile device 300 according to some embodiments of the present disclosure. In some embodiments, the terminal 140 may be implemented on the mobile device 300 shown in FIG. 3.

As illustrated in FIG. 3, the mobile device 300 may include a communication platform 310, a display 320, a graphic processing unit (GPU) 330, a central processing unit (CPU) 340, an I/O 350, a memory 360, and a storage 390. In some embodiments, any other suitable component, including but not limited to a system bus or a controller (not shown) , may also be included in the mobile device 300.

In some embodiments, an operating system 370 (e.g., iOS ^TM, Android ^TM, Windows Phone ^TM) and one or more applications (Apps) 380 may be loaded into the memory 360 from the storage 390 in order to be executed by the CPU 340. The applications 380 may include a browser or any other suitable mobile apps for receiving and rendering information relating to video data or other information from the processing device 112. User interactions may be achieved via the I/O 350 and provided to the processing device 112 and/or other components of the image processing system 100 via the network 120.

FIG. 4A is a block diagram illustrating exemplary processing device 400A for determining a key frame image of video data according to some embodiments of the present disclosure. FIG. 4B is a block diagram illustrating exemplary processing device 400B for training a preliminary model according to some embodiments of the present disclosure. In some embodiments, the processing device 400A may be configured to determine key frames of video data. The processing device 400B may be configured to generate one or more machine learning models (e.g., an optical flow model) for determining key frames of the video data. In some embodiments, the processing device 400A and/or the processing devices 400B may be implemented on the server 110 (e.g., the processing device 112) , the capturing device 130, or the terminal 140. In some embodiments, the processing devices 400B may be implemented on an external device of the image processing system 100. In some embodiments, the

processing devices

400A and 400B may be respectively implemented on a processing unit (e.g., the processor 220 illustrated in FIG. 2, the GPU 330, or the CPU 340 as illustrated in FIG. 3) . Merely by way of example, the processing devices 400A may be implemented on the CPU 340 of a terminal device, and the processing device 400B may be implemented on the computing device 200. Alternatively, both of the

processing devices

400A and 400B may be implemented on the computing device 200 or the CPU 340.

As shown in FIG. 4A, the processing device 400A may include an acquisition module 402 and a determination module 404.

The acquisition module 402 may be configured to obtain information relating to the image processing system 100. For example, the acquisition module 402 may obtain a target frame image of video data. In some embodiments, the processing device 400A may designate one of a plurality of frame images of the video data as the target frame image. In some embodiments, the processing device 400A may obtain the video data from one or more components of the image processing system 100. More descriptions regarding the obtaining of the target frame image of video data may be found elsewhere in the present disclosure. See, e.g., operation 510 in FIG. 5, operation 710 in FIG. 7, and relevant descriptions thereof.

The determination module 404 may be configured to determine a motion amplitude of one or more target subjects in the target frame image based on the target frame image and a determined key frame adjacent to the target frame image. In some embodiments, the determined key frame adjacent to the target frame image may be a determined key frame prior to the target frame image. More descriptions regarding the determining of a motion amplitude of one or more target subjects may be found elsewhere in the present disclosure. See, e.g., operation 520 in FIG. 5, operation 740 in FIG. 7, and relevant descriptions thereof. In some embodiments, the determination module 404 may be configured to determine a clarity of the target frame image and whether the clarity of the target frame image is greater than a clarity threshold. More descriptions regarding the determining of the clarity of the target frame image and whether the clarity of the target frame image is greater than a clarity threshold may be found elsewhere in the present disclosure. See, e.g., operation 720 and operation 730 in FIG. 7, and relevant descriptions thereof.

In some embodiments, the determination module 404 may be configured to designate the target frame image as a key fame image based on the motion amplitude of one or more target subjects in the target frame image. In some embodiments, the processing device 400A may compare the motion amplitude of the one or more target subjects with an amplitude threshold. In response to determining that the motion amplitude of the one or more target subjects is greater than the amplitude threshold, the processing device 400A may designate the target frame image as a key fame image. In response to determining that the motion amplitude of the one or more target subjects is not greater the amplitude threshold, the processing device 400A may designate the target frame image as a non-key frame image. More descriptions regarding the designating the target frame image as a key fame image may be found elsewhere in the present disclosure. See, e.g., operation 530 in FIG. 5, operation 750 in FIG. 7, and relevant descriptions thereof.

As shown in FIG. 4B, the processing device 400B may include an acquisition module 406 and a module generation module 408.

The acquisition module 406 may be configured to obtain one or more training samples and a preliminary model. Each training sample may include a pair of sample images and a ground truth optical flow field. The preliminary model to be trained may include one or more model parameters, such as the number (or count) of layers, the number (or count) of nodes, a loss function, or the like, or any combination thereof. Before training, the preliminary model may have one or more initial parameter values of the model parameter (s) . The module generation module 408 may be configured to train the preliminary model based on the one or more training samples to obtain the optical flow model. In some embodiments, the preliminary model may be generated according to a machine learning algorithm. The machine learning algorithm may include but not be limited to an artificial neural network algorithm, a deep learning algorithm, a decision tree algorithm, an association rule algorithm, an inductive logic programming algorithm, a support vector machine algorithm, a clustering algorithm, a Bayesian network algorithm, a reinforcement learning algorithm, a representation learning algorithm, a similarity and metric learning algorithm, a sparse dictionary learning algorithm, a genetic algorithm, a rule-based machine learning algorithm, or the like, or any combination thereof. The machine learning algorithm used to generate the one or more machine learning models may be a supervised learning algorithm, a semi-supervised learning algorithm, an unsupervised learning algorithm, or the like. More descriptions regarding the generation of the optical flow model may be found elsewhere in the present disclosure. See, e.g., operation 630 in FIG. 6, and relevant descriptions thereof.

It should be noted that the above description is merely provided for the purposes of illustration, and not intended to limit the scope of the present disclosure. For persons having ordinary skills in the art, multiple variations and modifications may be made under the teachings of the present disclosure. However, those variations and modifications do not depart from the scope of the present disclosure. In some embodiments, the processing device 400A as described in FIG. 4A and/or the processing device 400B as described in FIG. 4B may share two or more of the modules, and any one of the modules may be divided into two or more units. For instance, the processing device 400A as described in FIG. 4A and the processing device 400B as described in FIG. 4B may share a same acquisition module; that is, the acquisition module 402 and the acquisition module 406 are a same module. In some embodiments, the processing device 400A as described in FIG. 4A and/or the processing device 400B as described in FIG. 4B may include one or more additional modules, such as a storage module (not shown) for storing data. In some embodiments, the processing device 400A as described in FIG. 4A and the processing device 400B as described in FIG. 4B may be integrated into one processing device 112.

FIG. 5 is a flowchart illustrating an exemplary process 500 for determining a key frame image of video data according to some embodiments of the present disclosure. In some embodiments, the process 500 may be executed by the image processing system 100. For example, the process 500 may be implemented as a set of instructions (e.g., an application) stored in a storage device (e.g., the storage device 150, the ROM 230, the RAM 240, and/or the storage 390) . The operations of the illustrated process presented below are intended to be illustrative. In some embodiments, the process 500 may be accomplished with one or more additional operations not described and/or without one or more of the operations discussed. Additionally, the order of the operations of process 500 illustrated in FIG. 5 and described below is not intended to be limiting.

In some embodiments, the process 500 may be executed by the server 110 (e.g., the processing device 112) , the capturing device 130, or the terminal device 140 (e.g., the processor 220, the CPU 340, the GUP 330, and/or one or more modules illustrated in FIG. 4A) . For example, the capturing device 130 may be a front-end device and capture video data in real time. The capturing device 130 may perform the process 500 to determine key frames of the video data and code the video data based on the key frames. The capturing device 130 may transmit the coded video data to the server 110, the storage device 150, or the terminal 140 that is a back-end device. As another example, when transmitting video data to other devices, e.g., the storage device 150 or an external device, the server 110 or the terminal 140 may perform the process 500 to determine key frames of the video data, code the video data based on the key frames, and transmit the coded video data.

In order to facilitate transmission and storage of video data, frame images of the video data may be compressed or coded. According to the compression types, coded video data may include multiple key frame images and multiple non-key frame images. A key frame image of video data refers to a frame image of the video data that is least compressible and does not require other frame images to decode. A non-key frame image of video data refers to a frame image that is compressed to hold only changes in the frame image compared to a frame image adjacent to the frame image. In some embodiments, the non-key frames images (e.g., P-frames, B-frames) may be more compressible than the key frame images, and decoded based on the key frame images.

In 510, the processing device 400A (e.g., the acquisition module 402) may obtain a target frame image of video data. In some embodiments, the processing device 400A may designate one of a plurality of frame images of the video data as the target frame image.

In some embodiments, the processing device 400A may obtain the video data from one or more components of the image processing system 100, such as the capturing device 130, the terminal 140, a storage device (e.g., the storage device 150, the ROM 230, the RAM 240, and/or the storage 390) , etc. Alternatively, or additionally, the processing device 400A may obtain the video data from an external source (e.g., a cloud disk) via the network 120.

In some embodiments, the processing device 400A may directly obtain the target frame image of the video data. For example, the processing device 400A may perform a framing operation on the video data to obtain a plurality of frame images in the video data. The processing device 400A may designate a frame image in the plurality of frame images as the target frame image. In some embodiments, the processing device 400A may transmit the video data to another computing device. The computing device may obtain a plurality of frame images in the video data from the video data and transmit the plurality of frame images in the video data back to the processing device 400A. The processing device 400A may designate a frame image in the plurality of frame images as the target frame image.

In some embodiments, the processing device 400A may directly designate the first frame image and the last frame image of the video data as key frame images of the video data. In some embodiments, the processing device 400A may determine whether a frame image to be determined of the video data is the first frame image or the last frame image of the video data. In response to determining that the frame image to be determined of the video data is not the first frame image or the last frame image of the video data, the processing device 400A may designate the frame image to be determined as the target frame image.

In 520, the processing device 400A (e.g., the determination module 404) may determine a motion amplitude of one or more target subjects in the target frame image based on the target frame image and a determined key frame adjacent to the target frame image. In some embodiments, the determined key frame adjacent to the target frame image may be a determined key frame prior to the target frame image.

In some embodiments, the one or more target subjects may be one or more moving subjects. In some embodiments, the processing device 400A may extract one or more first motion target regions of the target frame image. The one or more first motion target regions may include the one or more target subjects in the target frame image. For example, a first motion target region may include at least one of the one or more target subjects. The processing device 400A may also extract one or more second motion target regions of the determined key frame image adjacent to the target frame image. Each of the one or more second motion target regions may correspond one of the one or more first motion target regions. For example, a second motion target region may include the same target subject (s) as the corresponding first motion target region. The processing device 400A may determine an optical flow value between the target frame image and the determined key frame image adjacent to the target frame image based on the one or more first motion target regions and the one or more second motion target regions. In some embodiments, the processing device 400A may determine the motion amplitude of the one or more target subjects in the target frame image based on the optical flow value between the target frame image and the determined key frame image adjacent to the target frame image. Optical flow refers to a distribution of apparent velocities of movement of a subject in an image. An optical flow value between two images refers to instantaneous velocities of movements of pixels on moving subjects in the two images. A corresponding relationship between two images may be determined based on changes of pixels in the two images in a time domain and a correlation between the two images. Motion information of the moving subjects in the two images may be further determined based on the corresponding relationship between the two images. Therefore, the optical flow value between the two images may be used for representing a motion amplitude of the one or more moving subjects in the two images. More descriptions for the determining of the motion amplitude of the one or more target subjects in the target frame image may be found elsewhere in the present disclosure. See, e.g., FIG. 6 and relevant descriptions thereof.

In 530, the processing device 400A (e.g., the determination module 404) may designate, based on the motion amplitude of one or more target subjects in the target frame image, the target frame image as a key fame image.

In some embodiments, the processing device 400A may compare the motion amplitude of the one or more target subjects with an amplitude threshold. The amplitude threshold may be determined based on a size of the video data, a size of the one or more target subjects, a count of the one or more target subjects, a count of one or more frame images between the target frame image and the key fame image adjacent to the target frame image, or the like, or any combination thereof. For example, the more the count of the one or more target subjects is, the greater the amplitude threshold may be. As another example, the more the count of one or more frame images between the target frame image and the key fame image is, the greater the amplitude threshold may be. In some embodiments, the amplitude threshold may be set manually by a user (e.g., an engineer) according to an experience value or a default setting of the image processing system 100, or determined by the processing device 400A according to an actual need.

In response to determining that the motion amplitude of the one or more target subjects is greater than the amplitude threshold, the processing device 400A may designate the target frame image as a key fame image. In response to determining that the motion amplitude of the one or more target subjects is not greater the amplitude threshold, the processing device 400A may designate the target frame image as a non-key frame image. The processing device 400A may obtain another frame image (e.g., a frame image immediately after the target frame image) , and perform the process 500 to determine whether to designate the another frame image as a key fame image.

It should be noted that the above description regarding the process 500 is merely provided for the purposes of illustration, and not intended to limit the scope of the present disclosure. For persons having ordinary skills in the art, multiple variations and modifications may be made under the teachings of the present disclosure. However, those variations and modifications do not depart from the scope of the present disclosure. In some embodiments, the process 500 may be accomplished with one or more additional operations not described and/or without one or more of the operations discussed above. For example, the process 500 may include an additional storing operation to store the obtained key frame images in a storage device (e.g., the storage device 150) disclosed elsewhere in the present disclosure.

FIG. 6 is a flowchart illustrating an exemplary process 600 for determining a motion amplitude of one or more target subjects in a target frame image according to some embodiments of the present disclosure. In some embodiments, the process 600 may be executed by the image processing system 100. For example, the process 600 may be implemented as a set of instructions (e.g., an application) stored in a storage device (e.g., the storage device 150, the ROM 230, the RAM 240, and/or the storage 390) . In some embodiments, the process 600 may be executed by the server 110 (e.g., the processing device 112) , the capturing device 130, or the terminal device 140 (e.g., the processor 220, the CPU 340, the GUP 330, and/or one or more modules illustrated in FIG. 4A) . The operations of the illustrated process presented below are intended to be illustrative. In some embodiments, the process 600 may be accomplished with one or more additional operations not described and/or without one or more of the operations discussed. Additionally, the order of the operations of process 600 illustrated in FIG. 6 and described below is not intended to be limiting. In some embodiments, one or more operations of the process 600 may be performed to achieve at least part of operation 520 as described in connection with FIG. 5.

In 610, the processing device 400A (e.g., the determination module 404) may extract one or more first motion target regions of the target frame image. The one or more first motion target regions may include the one or more target subjects in the target frame image. For example, a first motion target region may include at least one of the one or more target subjects.

In some embodiments, the processing device 400A may determine a first difference image of the target frame image using a background-difference algorithm (i.e., a background subtraction algorithm) . A background-difference algorithm may be an algorithm for detecting a moving subject. Merely by way of example, the video data relates to a scene where one or more target subjects (moving subjects) move across a stationary background. The processing device 400A may obtain a background image of the stationary background. The processing device 400A may determine one or more regions in the target frame image that are different from the background image by performing a differentiate operation on the background image and the target frame image using the background-difference algorithm. The one or more regions in the target frame image that are different from the background image may include the one or more target subjects in the target frame image. Further, the processing device 400A may generate the first difference image of the target frame image based on the one or more regions in the target frame image that are different from the background image.

In some embodiments, the processing device 400A may determine a first binary image of the target frame image by performing a binarization operation on the first difference image. A gray value of each pixel (or each voxel) in the one or more regions in the target frame image that are different from the background image may be set to a first value, while a gray value of each pixel (or each voxel) in other regions in the target frame image may be set to a second value. For example, the first value may be 0 and the second value may be 255, so that pixels (or voxels) corresponding to the one or more regions in the target frame image that are different from the background image may be displayed in white, while pixels (or voxels) corresponding to other regions in the target frame image may be displayed in black in the first binary image of the target frame image.

In some embodiments, the processing device 400A may determine one or more first initial connected regions of the first binary image by performing a morphological filtering operation on the first binary image. Each of the one or more first initial connected regions may correspond one of the one or more regions in the target frame image that are different from the background image. The first initial connected region may include a four connected region or an eight connected region. Exemplary morphological filtering operations may include a denoising operation, an enhancing operation, a corrosion operation, a dilating operation, or the like, or any combination thereof. The processing device 400A may determine one or more of the one or more first initial connected regions whose area exceed an area threshold as one or more first connected regions. The one or more first connected regions may include the one or more target subjects in the target frame image. In some embodiments, a first connected region may include at least one of the one or more target subjects. For example, each of the one or more first initial connected regions may correspond to one of the one or more target subjects. The processing device 400A may combine, as a single connected region, at least two first initial connected regions of which the overlapping portion is larger than an overlapping threshold. If the area of the single connected region is larger than the area threshold, the single connected region may be determined as a first connected region and include the target subjects corresponding to the at least two combined first initial connected regions. The area threshold may be set manually by a user (e.g., an engineer) according to an experience value or a default setting of the image processing system 100, or determined by the processing device 400A according to an actual need.

In some embodiments, for each of the one or more first connected regions, the processing device 400A may determine a first bounding box enclosing the first connected region. Merely by way of example, the processing device 400A may obtain coordinates of the first connected region. The processing device 400A may determine the first bounding box corresponding to the first connected region based on the coordinates of the first connected region. In some embodiments, the first bounding box of each of the one or more first connected regions may be the minimum bounding box of the each of the one or more first connected regions. In some embodiments, the processing device 400A may determine a second bounding box corresponding to the first connected region by extending the first bounding box by one or more pixels (e.g., 5 pixels) in a first direction (e.g., the vertical direction) and a second direction (e.g., the horizontal direction) , which may avoid or reduce missing pixels of one or more target subjects corresponding to the first connected region. A count of pixels expanded in the first direction may be the same as or different from a count of pixels expanded in the second direction. In some embodiments, if a distance between two bounding boxes (e.g., two first bounding boxes or two second bounding boxes) is less than a distance threshold, the two bounding boxes may be merged into a single bounding box. That the distance between two bounding boxes is less than a distance threshold may include that a distance between the closest points of the two bounding boxes is less than the distance threshold, or that the two bounding boxes partially overlap.

In some embodiments, a bounding box (e.g., the first bounding box or the second bounding box) may have the shape of a square, a rectangle, a triangle, a polygon, a circle, an ellipse, an irregular shape, or the like. Optionally, the bounding box may be annotated in the first binary image. In some embodiments, information relating to the bounding box may be determined. Exemplary information relating to the bounding box may include a shape, a size, a position (e.g., a coordinate of the center point of the bounding box) , or the like, or any combination thereof.

In some embodiments, the processing device 400A may extract the one or more first motion target regions based on the first bounding box or the second bounding box corresponding to each first connected region. For example, the processing device 400A may determine one or more regions enclosed by the one or more first bounding boxes or the second bounding boxes corresponding to the one or more first connected regions as the one or more first motion target regions. The processing device 400A may segment the one or more first motion target regions from the first binary image based on the information relating to the one or more first bounding boxes or the second bounding boxes.

In 620, the processing device 400A (e.g., the determination module 404) may extract one or more second motion target regions of the determined key frame image adjacent to the target frame image. Each of the one or more second motion target regions may correspond one of the one or more first motion target regions. For example, a second motion target region may include the same target subject (s) as the corresponding first motion target region.

In some embodiments, the extracting of the one or more second motion target regions may be performed in a similar manner as that of the one or more first motion target regions as described elsewhere in this disclosure, and the descriptions of which are not repeated here.

In 630, the processing device 400A (e.g., the determination module 404) may determine, based on the one or more first motion target regions and the one or more second motion target regions, an optical flow value between the target frame image and the determined key frame image adjacent to the target frame image.

In some embodiments, the processing device 400A may determine one or more pairs of motion target regions each pair of which includes a first motion target region and a second motion target region corresponding to the first motion target region. In some embodiments, the processing device 400A may determine the one or more pairs of motion target regions based on positions of the one or more first motion target regions and the one or more second motion target regions. Merely by way of example, the processing device 400A may obtain coordinates of the one or more first motion target regions and the one or more second motion target regions in a same coordinate system. In some embodiments, the processing device 400A may determine the one or more pairs of motion target regions based on the coordinates of the one or more first motion target regions and the one or more second motion target regions. For example, if a distance between a coordinate of a center point of a first motion target region and a coordinate of a center point of a second motion target region is the smallest, the processing device 400A may determine the first motion target region and the second motion target region as a pair of motion target regions. As another example, if an overlapping area of a first motion target region and a second motion target region is the largest, the processing device 400A may determine the first motion target region and the second motion target region as a pair of motion target regions.

In some embodiments, the processing device 400A may determine an optical flow field corresponding to each pair of motion target regions using an optical flow model based on the one or more pairs of motion target regions. An optical flow field corresponding to a pair of motion target regions refers to a vector formed by movements between pixels in the pair of motion target regions. The optical flow model may be a trained model (e.g., a Flow Net) used for determining an optical flow field. In some embodiments, the processing device 400A may obtain the optical flow model from one or more components of the image processing system 100 (e.g., the storage device 150) or an external source via a network (e.g., the network 120) . For example, the optical flow model may be previously trained by a computing device (e.g., the processing device 400B) , and stored in a storage device (e.g., the storage device 150, the storage 220, and/or the storage 390) of the image processing system 100. The processing device 400A may access the storage device and retrieve the optical flow model. In some embodiments, the optical flow model may be generated according to a machine learning algorithm as described elsewhere in this disclosure (e.g., FIG. 4B and the relevant descriptions) .

Merely by way of example, the optical flow model may be trained according to a supervised learning algorithm by the processing device 400B or another computing device (e.g., a computing device of a vendor of the optical flow model) . The processing device 400B may obtain one or more training samples and a preliminary model. Each training sample may include a pair of sample images and a ground truth optical flow field. The preliminary model to be trained may include one or more model parameters, such as the number (or count) of layers, the number (or count) of nodes, a loss function, or the like, or any combination thereof. Before training, the preliminary model may have one or more initial parameter values of the model parameter (s) .

The training of the preliminary model may include one or more iterations to iteratively update the model parameters of the preliminary model based on the training sample (s) until a termination condition is satisfied in a certain iteration. Exemplary termination conditions may be that the value of a loss function obtained in the certain iteration is less than a threshold value, that a certain count of iterations has been performed, that the loss function converges such that the difference of the values of the loss function obtained in a previous iteration and the current iteration is within a threshold value, etc. The loss function may be used to measure a discrepancy between an optical flow field predicted by the preliminary model in an iteration and the ground truth optical flow field. For example, the pair of sample images of each training sample may be input into the preliminary model, and the preliminary model may output an optical flow field. The loss function may be used to measure a difference between the predicted optical flow field and the ground truth optical flow field of each training sample. Exemplary loss functions may include a focal loss function, a log loss function, a cross-entropy loss, a Dice ratio, or the like. If the termination condition is not satisfied in the current iteration, the processing device 400B may further update the preliminary model to be used in a next iteration according to, for example, a backpropagation algorithm. If the termination condition is satisfied in the current iteration, the processing device 400B may designate the preliminary model in the current iteration as the optical flow model.

In some embodiments, each pair of motion target regions may be input into the optical flow model, and the optical flow model may output an optical flow field corresponding to the pair of motion target regions. Merely by way of example, the optical flow model may include an input block, one or more down blocks, one or more up blocks, and an output block. A pair of motion target regions S including a first motion target region S _1i and a second motion target region S _2i may be input into the optical flow model. Each of the first motion target region S _1i and the second motion target region S _2i may have a size of 384×512×3, wherein, 384, 512, and 3 denote a height (e.g., a count of pixels in the vertical direction) , a width (e.g., a count of pixels in the horizontal direction) , and a count of channels, respectively. The input block of the optical flow model may connect the first motion target region S _1i and the second motion target region S _2i to obtain a connected motion target region S _i with a size of 384×512×6, and output the connected motion target region S _i to a down block connected to the input block. The one or more down blocks may receive the connected motion target region S _i, perform one or more convolution operations for the connected motion target region S _i, and output multiple feature maps with a size of 6×8. The one or more up blocks may receive and process the multiple feature maps with a size of 6×8, and output multiple feature maps with a size of 96×128. Specifically, each up block may receive an input including multiple feature maps from a block immediately upstream to and connected to the up block, and process the received input. For example, each up block may perform a deconvolution operation on the received input, and obtain multiple predicted optical flow fields each of which corresponds to one of the multiple feature maps in the received input. Then, the up block may perform a bilinear interpolation operation on each predicted optical flow field to obtain a bilinear interpolation result. Further, the up block may connect each feature map obtained by performing deconvolution operation and a bilinear interpolation result corresponding to the feature map, and output the connected result to the next block connected to the up block. The output block may receive the multiple feature maps with a size of 96×128, and perform one or more deconvolution operations and bilinear interpolation operations to obtain a predicted optical flow field corresponding to the pair of motion target regions S. The predicted optical flow field corresponding to the pair of motion target regions S may be an image with the same resolution as the pair of motion target regions S.

In some embodiments, a plurality of pairs of motion target regions may be input into the optical flow model, and the optical flow model may output an optical flow field corresponding to each pair of motion target regions. In some embodiments, a plurality of first motion target regions in a plurality of pairs of motion target regions may be connected to a connected first motion target region. A plurality of first motion target regions the plurality of pairs of motion target regions may be connected to a connected first motion target region in a same manner as how the plurality of first motion target regions are connected. The connected first motion target region and the connected second motion target region may form a pair of connected motion target regions. The pair of connected motion target regions may be input into the optical flow model, and the optical flow model may output an optical flow field corresponding to the pair of connected motion target regions. The processing device 400A may determine an optical flow field corresponding to each pair of motion target regions in the plurality of pairs of motion target regions based on the optical flow field corresponding to the pair of connected motion target regions.

In some embodiments, the processing device 400A may determine an optical flow value corresponding to each pair of motion target regions (i.e., an optical flow value between each of the one or more first motion target regions and a second motion target region corresponding to the first motion target region) based on the optical flow field corresponding to the pair of motion target regions. Merely by way of example, for each first pixel in the first motion target region in the pair of motion target regions, the processing device 400A may determine an optical flow value between the first pixel and a second pixel corresponding to the first pixel in the corresponding second motion target region based on the optical flow field corresponding to the pair of motion target regions. Further, the processing device 400A may determine the optical flow value corresponding to the pair of motion target regions based on the optical flow value between each first pixel and the second pixel corresponding to the first pixel. For example, the processing device 400A may determine the optical flow value corresponding to the pair of motion target regions by summing the optical flow value between each first pixel and the second pixel corresponding to the first pixel. The optical flow value between a first pixel and a corresponding second pixel may indicate a location difference between the first pixel and the corresponding second pixel.

In some embodiments, the optical flow field corresponding to the pair of motion target regions may have two channels, e.g., the optical flow field corresponding to the pair of motion target regions may include a first component in a third direction (e.g., the vertical direction) and a second component in a fourth direction (e.g., the horizontal direction) . In some embodiments, for each first pixel in the first motion target region in a pair of motion target regions, the processing device 400A may determine a first optical flow value between the first pixel and a second pixel corresponding to the first pixel in the corresponding second motion target region in the third direction based on the first component of the optical flow field corresponding to the pair of motion target regions. The processing device 400A may also determine a second optical flow value between the first pixel and the second pixel corresponding to the first pixel in the corresponding second motion target region in the fourth direction based on the second component of the optical flow field corresponding to the pair of motion target regions. Further, the processing device 400A may determine the optical flow value corresponding to the pair of motion target regions based on the first optical flow value and the second optical flow value between each first pixel and the second pixel corresponding to the first pixel. For example, the processing device 400A may determine the optical flow value corresponding to the pair of motion target regions by summing the first optical flow value and the second optical flow value between each first pixel and the second pixel corresponding to the first pixel.

In some embodiments, the processing device 400A may determine the optical flow value between the target frame image and the determined key frame image adjacent to the target frame image based on the optical flow value between each of the one or more first motion target regions and the second motion target region corresponding to the first motion target region. For example, the processing device 400A may determine the optical flow value between the target frame image and the determined key frame image adjacent to the target frame image by summing the optical flow value between each of the one or more first motion target regions and the second motion target region corresponding to the first motion target region. As another example, the processing device 400A may determine an average of the one or more optical flow values corresponding to the one or more pairs of motion target regions, and designate the average of the one or more optical flow values as the optical flow value between the target frame image and the determined key frame image adjacent to the target frame image. As still another example, the processing device 400A may designate the maximum among the one or more optical flow values corresponding to the one or more pairs of motion target regions as the optical flow value between the target frame image and the determined key frame image adjacent to the target frame image.

In 640, the processing device 400A (e.g., the determination module 404) may determine, based on the optical flow value between the target frame image and the determined key frame image adjacent to the target frame image, the motion amplitude of the one or more target subjects in the target frame image.

In some embodiments, the processing device 400A may designate the optical flow value between the target frame image and the determined key frame image adjacent as the motion amplitude of the one or more target subjects in the target frame image.

Conventionally, whether a frame image is designated as a key frame image is determined using a Lucas-Kanade algorithm. An optical flow field obtained using the field Lucas-Kanade algorithm is a sparse optical flow field. However, only whether one or more subjects in the frame image are moving can be determined and the motion amplitude of one or more subjects is not accurately determined according to the sparse optical flow field. Moreover, the sparse optical flow has a low accuracy in pixel registration, and for continuous motions of the one or more subjects, the optical flow tracking accuracy is low.

Compared with the conventional approach for determining key frame images, in the process 500 and process 600, the processing device 400A may determine the motion amplitude of the one or more target subjects in the target frame image based on the optical flow value between the target frame image and the determined key frame image adjacent to the target frame image, and determine whether the target frame image is designated as a key fame image based on the motion amplitude of one or more target subjects in the target frame image, which may obtain more accurate key frame images to reduce the redundancy of the key frame images. In addition, the one or more first motion target regions and the one or more second motion target regions may be input into the optical flow model to determine the optical flow fields, and the optical flow value between the target frame image and the determined key frame image may be further determined based on the obtained optical flow fields, which may reduce the amount of data processed by the optical flow model, thereby improving the efficiency of the determining of the key frame images.

FIG. 7 is a flowchart illustrating an exemplary process 700 for determining a key frame image of video data according to some embodiments of the present disclosure. In some embodiments, a process 700 may be executed by the image processing system 100. For example, the process 700 may be implemented as a set of instructions (e.g., an application) stored in a storage device (e.g., the storage device 150, the ROM 230, the RAM 240, and/or the storage 390) . In some embodiments, the process 700 may be executed by the server 110 (e.g., the processing device 112) , the capturing device 130, or the terminal device 140 (e.g., the processor 220, the CPU 340, the GUP 330, and/or one or more modules illustrated in FIG. 4A) . The operations of the illustrated process presented below are intended to be illustrative. In some embodiments, the process 700 may be accomplished with one or more additional operations not described and/or without one or more of the operations discussed. Additionally, the order of the operations of process 700 illustrated in FIG. 7 and described below is not intended to be limiting.

In 710, the processing device 400A (e.g., the acquisition module 402) may obtain a target frame image of video data.

In some embodiments, the processing device 400A may obtain the video data from one or more components of the image processing system 100. In some embodiments, the processing device 400A may directly obtain the target frame image of the video data. In some embodiments, the operation 710 may be similar to or the same as the operation 510 of the process 500 as illustrated in FIG. 5.

In some embodiments, the processing device 400A may determine a clarity of the target frame image. The processing device 400A may determine a motion amplitude of one or more target subjects in the target frame image based on the target frame image and a determined key frame adjacent to the target frame image. The processing device 400A may determine whether to designate the target frame image as a key frame image based on the clarity and the motion amplitude. In some embodiments, in response to determining that the clarity of the target frame image is greater than a clarity threshold, and the motion amplitude of the one or more target subjects is greater than the amplitude threshold, the processing device 400A may designate the target frame image as a key fame image. In response to determining that the clarity of the target frame image is not greater than the clarity threshold or the motion amplitude of the one or more target subjects is not greater than the amplitude threshold, the processing device 400A may obtain another frame image (e.g., a frame image immediately after the target frame image) , and perform the process 700 to determine whether to designate the another frame image as a key frame image.

For example, as illustrated in operations 720-750, the processing device 400A may determine whether the clarity of the target frame image is greater than the clarity threshold. In response to determining that the clarity of the target frame image is greater than the clarity threshold, the processing device 400A may determine whether the motion amplitude of the one or more target subjects is greater than the amplitude threshold. In response to determining that the motion amplitude of the one or more target subjects is greater than the amplitude threshold, the processing device 400A may designate the target frame image as a key fame image. As another example, the processing device 400A may determine whether the motion amplitude of the one or more target subjects is greater than the amplitude threshold. In response to determining that the motion amplitude of the one or more target subjects is greater than the amplitude threshold, the processing device 400A may determine whether the clarity of the target frame image is greater than the clarity threshold. In response to determining that the clarity of the target frame image is greater than the clarity threshold, the processing device 400A may designate the target frame image as a key fame image.

As still another example, the determination of whether the motion amplitude of the one or more target subjects is greater than the amplitude threshold and whether the clarity of the target frame image is greater than the clarity threshold may be performed simultaneously.

In 720, the processing device 400A (e.g., the determination module 404) may determine a clarity of the target frame image.

In some embodiments, the processing device 400A may determine a single-channel gray image of the target frame image. The processing device 400A may determine the clarity of the target frame image based on the single-channel gray image of the target frame image. In some embodiments, the processing device 400A may determine the clarity of the target frame image according to a Laplace gradient function algorithm.

Merely by way of example, the processing device 400A may determine a gray value of a pixel in the single-channel gray image of the target frame image according to Equation (1) as below:

Img _i=0.299×I _i (R) +0.587×I _i (G) +0.114×I _i (B) , (1)

where I _i denotes the target frame image, Img _i denotes a gray value of a pixel i in the single-channel gray image, I _i (R) denotes a pixel value corresponding to R in RGB color mode, I _i (G) denotes a pixel value corresponding to G in RGB color mode, and I _i (B) denotes a pixel value corresponding to B in RGB color mode. Img ranges from 0 to 255. If the Img of a pixel equals 0, it indicates the pixel is black. If the Img of a pixel equals 255, it indicates the pixel is white.

For example, the single-channel gray image Img of the target frame image is a matrix (2) as below:

where each value in the matrix denotes a gray value of a pixel in the single-channel gray image Img.

Laplace operator Lap is a matrix (3) as below:

The clarity D (I _i) of the target frame image may be determined using a Laplace gradient function algorithm according to Equation (4) as below:

D (I _i) =∑ _y∑ _x|G (x, y) |, (4)

where G (x, y) denotes a value obtained by a convolution operation of the Laplace operator Lap at a pixel (x, y) of the single-channel gray image Img (0＜x＜9, 0＜y＜9) . The convolution operation of the Laplace operator Lap may include following operations. Laplace operator Lap moves line by line on the single-channel gray image Img, and multiply the pixel values in the Laplacian operator Lap that coincide with the single-channel gray image Img, and then sums them, and assign them to the pixels that coincide with the center point of Laplacian operator Lap. The remaining pixels in the single-channel gray image Img may be assigned a value of 0 directly. The single-channel gray image Img is performed the convolution operation of the Laplace operator Lap to a matrix (5) as blow:

The processing device 400A may determine that the clarity D (I _i) of the target frame image equals 242 based on the matrix (5) and the Equation (4) .

In 730, the processing device 400A (e.g., the determination module 404) may determine whether the clarity of the target frame image is greater than a clarity threshold.

In some embodiments, the processing device 400A may compare the clarity of the target frame image with the clarity threshold. In some embodiments, the processing device 400A may determine a clarity of each frame image of the video data. The processing device 400A may determine an average of the clarities of all frame images of the video data based on the clarity of each frame image of the video data. The processing device 400A may designate the average as the clarity threshold. In some embodiments, the clarity threshold may be set manually by a user (e.g., an engineer) according to an experience value or a default setting of the image processing system 100, or determined by the processing device 400A according to an actual need.

In response to determining that the clarity of the target frame image is greater than the clarity threshold, the processing device 400A may perform operation 740. In response to determining that the clarity of the target frame image is not greater than the clarity threshold, the processing device 400A may perform operation 710, that is, the processing device 400A may obtain another frame image, and perform the process 700 to determine whether the another frame image is designated as a key fame image.

In 740, the processing device 400A (e.g., the determination module 404) may determine a motion amplitude of one or more target subjects in the target frame image based on the target frame image and a determined key frame adjacent to the target frame image.

In some embodiments, the one or more target subjects may be one or more moving subjects. In some embodiments, the operation 740 may be similar to or the same as the operation 520 of the process 500 as illustrated in FIG. 5.

In 750, the processing device 400A (e.g., the determination module 404) may designate, based on the motion amplitude of one or more target subjects in the target frame image, the target frame image as the key fame image.

In some embodiments, the processing device 400A may determine whether the motion amplitude of the one or more target subjects is greater than an amplitude threshold. In some embodiments, the operation 750 may be similar to or the same as the operation 530 of the process 500 as illustrated in FIG. 5.

In the process 700, before the motion amplitude of one or more target subjects in the target frame image is determined, the processing device 400A may determine whether the clarity of the target frame image is greater than a clarity threshold. In response to determining that the clarity of the target frame image is greater than the clarity threshold, the processing device 400A may further determine the motion amplitude of one or more target subjects in the target frame image, and designate the target frame image as the key fame image based on the motion amplitude of one or more target subjects in the target frame image. In this way, if the clarity of the target frame image is not greater than the clarity threshold, the target frame image may be directly determined a non-key frame image, and the processing device 400A does not need to further determine the motion amplitude of one or more target subjects, which may improve the accuracy and efficiency of the determining of the key frame image.

It should be noted that the above description regarding the process 500 is merely provided for the purposes of illustration, and not intended to limit the scope of the present disclosure. For persons having ordinary skills in the art, multiple variations and modifications may be made under the teachings of the present disclosure. However, those variations and modifications do not depart from the scope of the present disclosure. In some embodiments, the processing device 400A may perform the process 500 and/or the process 700 on each of the frame images of the video data in temporal order starting from the first frame image or the last frame image. After designating key frames for the video data, the processing device 400A may code the video data. For example, the frame images that are designated as the key frames may be coded as a completed image, and the frame images that are not designated as the key frames may be compressed to hold only changes compared to their adjacent frame images.

It will be apparent to those skilled in the art that various changes and modifications can be made in the present disclosure without departing from the spirit and scope of the disclosure. In this manner, the present disclosure may be intended to include such modifications and variations if the modifications and variations of the present disclosure are within the scope of the appended claims and the equivalents thereof.

Having thus described the basic concepts, it may be rather apparent to those skilled in the art after reading this detailed disclosure that the foregoing detailed disclosure is intended to be presented by way of example only and is not limiting. Various alterations, improvements, and modifications may occur and are intended to those skilled in the art, though not expressly stated herein. These alterations, improvements, and modifications are intended to be suggested by this disclosure, and are within the spirit and scope of the exemplary embodiments of this disclosure.

Moreover, certain terminology has been used to describe embodiments of the present disclosure. For example, the terms “one embodiment, ” “an embodiment, ” and/or “some embodiments” mean that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure. Therefore, it is emphasized and should be appreciated that two or more references to “an embodiment” or “one embodiment” or “an alternative embodiment” in various portions of this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures or characteristics may be combined as suitable in one or more embodiments of the present disclosure.

Further, it will be appreciated by one skilled in the art, aspects of the present disclosure may be illustrated and described herein in any of a number of patentable classes or context including any new and useful process, machine, manufacture, or comlocation of matter, or any new and useful improvement thereof. Accordingly, aspects of the present disclosure may be implemented entirely hardware, entirely software (including firmware, resident software, micro-code, etc. ) or combining software and hardware implementation that may all generally be referred to herein as a “unit, ” “module, ” or “system. ” Furthermore, aspects of the present disclosure may take the form of a computer program product embodied in one or more computer readable media having computer-readable program code embodied thereon.

A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including electromagnetic, optical, or the like, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that may communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable signal medium may be transmitted using any appropriate medium, including wireless, wireline, optical fiber cable, RF, or the like, or any suitable combination of the foregoing.

Computer program code for carrying out operations for aspects of the present disclosure may be written in a combination of one or more programming languages, including an object oriented programming language such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB. NET, Python or the like, conventional procedural programming languages, such as the "C" programming language, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP, ABAP, dynamic programming languages such as Python, Ruby, and Groovy, or other programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN) , or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider) or in a cloud computing environment or offered as a service such as a Software as a Service (SaaS) .

Furthermore, the recited order of processing elements or sequences, or the use of numbers, letters, or other designations thereof, are not intended to limit the claimed processes and methods to any order except as may be specified in the claims. Although the above disclosure discusses through various examples what is currently considered to be a variety of useful embodiments of the disclosure, it is to be understood that such detail is solely for that purpose, and that the appended claims are not limited to the disclosed embodiments, but, on the contrary, are intended to cover modifications and equivalent arrangements that are within the spirit and scope of the disclosed embodiments. For example, although the implementation of various components described above may be embodied in a hardware device, it may also be implemented as a software only solution, e.g., an installation on an existing server or mobile device.

Similarly, it should be appreciated that in the foregoing description of embodiments of the present disclosure, various features are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure aiding in the understanding of one or more of the various embodiments. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed subject matter requires more features than are expressly recited in each claim. Rather, claimed subject matter may lie in less than all features of a single foregoing disclosed embodiment.

Claims

A system for obtaining a key frame, comprising:

at least one storage device including a set of instructions; and

at least one processor in communication with the at least one storage device, wherein when executing the set of instructions, the at least one processor is directed to cause the system to perform operations including:

obtaining a target frame image of video data;

determining a motion amplitude of one or more target subjects in the target frame image based on the target frame image and a determined key frame adjacent to the target frame image; and

designating, based on the motion amplitude of one or more target subjects in the target frame image, the target frame image as a key fame image.
The system of claim 1, wherein the obtaining a target frame image of video data comprises:

determining whether a frame image to be determined of the video data is the last frame image of the video data; and

in response to determining that the frame image to be determined of the video data is not the last frame image of the video data, designating the frame image to be determined as the target frame image.
The system of claim 1 or claim 2, wherein the at least one processor is directed to cause the system to further perform operations including:

determining a clarity of the target frame image; and

designating, based on the clarity of the target frame image and the motion amplitude of one or more target subjects in the target frame image, the target frame image as the key fame image.
The system of claim 3, wherein the determining a clarity of the target frame image comprises:

determining a single-channel gray image of the target frame image; and

determining, based on the single-channel gray image of the target frame image, the clarity of the target frame image.
The system of claim 3 or claim 4, wherein the determining a clarity of the target frame image comprises:

determining the clarity of the target frame image according to a Laplace gradient function algorithm.
The system of any one of claims 3-5, wherein the designating, based on the clarity of the target frame image and the motion amplitude of one or more target subjects in the target frame image, the target frame image as a key fame image comprises:

comparing the clarity of the target frame image with a clarity threshold;

comparing the motion amplitude of the one or more target subjects with an amplitude threshold; and

in response to determining that the clarity of the target frame image is greater than the clarity threshold and the motion amplitude of the one or more target subjects is greater than the amplitude threshold, designating the target frame image as the key fame image.
The system of claim 6, wherein the amplitude threshold is associated with at least one of a size of the video data, a size of the one or more target subjects, a count of the one or more target subjects, or a count of one or more frame images between the target frame image and the key fame image adjacent to the target frame image.
The system of any one of claims 3-7, wherein the designating, based on the clarity of the target frame image and the motion amplitude of one or more target subjects in the target frame image, the target frame image as a key fame image comprises:

determining a clarity of each frame image of the video data;

determining, based on the clarity of each frame image of the video data, an average of the clarities of all frame images of the video data; and

designating the average as the clarity threshold.
The system of any one of claims 1-8, wherein the determining a motion amplitude of one or more target subjects in the target frame image based on the target frame image and a determined key frame adjacent to the target frame image comprises:

extracting one or more first motion target regions of the target frame image, wherein the one or more first motion target regions include the one or more target subjects in the target frame image;

extracting one or more second motion target regions of the determined key frame image adjacent to the target frame image, wherein each of the one or more second motion target regions corresponds one of the one or more first motion target regions; and

determining, based on the one or more first motion target regions and the one or more second motion target regions, an optical flow value between the target frame image and the determined key frame image adjacent to the target frame image; and

determining, based on the optical flow value between the target frame image and the determined key frame image adjacent to the target frame image, the motion amplitude of the one or more target subjects in the target frame image.
The system of claim 9, wherein the extracting one or more first motion target regions of the target frame image or extracting one or more second motion target regions of the determined key frame image adjacent to the target frame image comprises:

determining a difference image of the target frame image or the determined key frame image using a background-difference algorithm;

determining a binary image of the target frame image or the determined key frame image by performing a binarization operation on the difference image;

determining one or more connected regions of the binary image by performing a morphological filtering operation on the binary image;

determining a first bounding box corresponding to each of the one or more connected regions based on the one or more connected regions;

extracting the one or more first motion target regions or the one or more second motion target regions based on the first bounding box corresponding to each connected region.
The system of claim 10, wherein the extracting the one or more first motion target regions or the one or more second motion target regions based on the first bounding box corresponding to each connected region comprises:

for each of the one or more connected regions in the target frame image or the determined key frame image,

determining a second bounding box of the connected region by extending the first bounding box by one or more pixels in a first direction and a second direction;

extracting the one or more first motion target regions or the one or more second motion target regions based on the second bounding box corresponding to each connected region.
The system of any one of claims 9-11, wherein the determining, based on the one or more first motion target regions and the one or more second motion target regions, an optical flow value between the target frame image and the determined key frame image adjacent to the target frame image comprises:

determining an optical flow value between each of the one or more first motion target regions and a second motion target region corresponding to the first motion target region; and

determining, based on the optical flow value between each of the one or more first motion target regions and the second motion target region corresponding to the first motion target region, the optical flow value between the target frame image and the determined key frame image adjacent to the target frame image.
The system of claim 12, wherein the determining, based on the optical flow value between each of the one or more first motion target regions and the second motion target region corresponding to the first motion target region, the optical flow value between the target frame image and the determined key frame image adjacent to the target frame image comprises:

determining the optical flow value between the target frame image and the determined key frame image adjacent to the target frame image by summing the optical flow value between each of the one or more first motion target regions and the second motion target region corresponding to the first motion target region.
The system of any one of claims 12-13, wherein the determining an optical flow value between each of the one or more first motion target regions and a second motion target region corresponding to the first motion target region comprises:

for each first pixel in the first motion target region,

determining, in a third direction, a first optical flow value between the first pixel and a second pixel corresponding to the first pixel in the corresponding second motion target region,

determining, in a fourth direction, a second optical flow value between the first pixel and the second pixel corresponding to the first pixel in the corresponding second motion target region; and

determining, based on the first optical flow value and the second optical flow value between each first pixel and the second pixel corresponding to the first pixel, the optical flow value between the first motion target region and the second motion target region corresponding to the first motion target region.
A method for obtaining a key frame, being implemented on a computing device having at least one storage device and at least one processor, the method comprising:

obtaining a target frame image of video data;

determining a motion amplitude of one or more target subjects in the target frame image based on the target frame image and a determined key frame adjacent to the target frame image; and

designating, based on the motion amplitude of one or more target subjects in the target frame image, the target frame image as a key fame image.
The method of claim 15, wherein the obtaining a target frame image of video data comprises:

determining whether a frame image to be determined of the video data is the last frame image of the video data; and

in response to determining that the frame image to be determined of the video data is not the last frame image of the video data, designating the frame image to be determined as the target frame image.
The method of claim 15 or claim 15, wherein the method further comprises:

determining a clarity of the target frame image; and

designating, based on the clarity of the target frame image and the motion amplitude of one or more target subjects in the target frame image, the target frame image as the key fame image.
The method of claim 17, wherein the determining a clarity of the target frame image comprises:

determining a single-channel gray image of the target frame image; and

determining, based on the single-channel gray image of the target frame image, the clarity of the target frame image.
The method of claim 17 or claim 18, wherein the determining a clarity of the target frame image comprises:

determining the clarity of the target frame image according to a Laplace gradient function algorithm.
The method of any one of claims 17-19, wherein the designating, based on the clarity of the target frame image and the motion amplitude of one or more target subjects in the target frame image, the target frame image as a key fame image comprises:

comparing the clarity of the target frame image with a clarity threshold;

comparing the motion amplitude of the one or more target subjects with an amplitude threshold; and

in response to determining that the clarity of the target frame image is greater than the clarity threshold and the motion amplitude of the one or more target subjects is greater than the amplitude threshold, designating the target frame image as the key fame image.
The method of claim 20, wherein the amplitude threshold is associated with at least one of a size of the video data, a size of the one or more target subjects, a count of the one or more target subjects, or a count of one or more frame images between the target frame image and the key fame image adjacent to the target frame image.
The method of any one of claims 17-21, wherein the designating, based on the clarity of the target frame image and the motion amplitude of one or more target subjects in the target frame image, the target frame image as a key fame image comprises:

determining a clarity of each frame image of the video data;

determining, based on the clarity of each frame image of the video data, an average of the clarities of all frame images of the video data; and

designating the average as the clarity threshold.
The method of any one of claims 15-22, wherein the determining a motion amplitude of one or more target subjects in the target frame image based on the target frame image and a determined key frame adjacent to the target frame image comprises:

extracting one or more first motion target regions of the target frame image, wherein the one or more first motion target regions include the one or more target subjects in the target frame image;

extracting one or more second motion target regions of the determined key frame image adjacent to the target frame image, wherein each of the one or more second motion target regions corresponds one of the one or more first motion target regions; and

determining, based on the one or more first motion target regions and the one or more second motion target regions, an optical flow value between the target frame image and the determined key frame image adjacent to the target frame image; and

determining, based on the optical flow value between the target frame image and the determined key frame image adjacent to the target frame image, the motion amplitude of the one or more target subjects in the target frame image.
The method of claim 23, wherein the extracting one or more first motion target regions of the target frame image or extracting one or more second motion target regions of the determined key frame image adjacent to the target frame image comprises:

determining a difference image of the target frame image or the determined key frame image using a background-difference algorithm;

determining a binary image of the target frame image or the determined key frame image by performing a binarization operation on the difference image;

determining one or more connected regions of the binary image by performing a morphological filtering operation on the binary image;

determining a first bounding box corresponding to each of the one or more connected regions based on the one or more connected regions;

extracting the one or more first motion target regions or the one or more second motion target regions based on the first bounding box corresponding to each connected region.
The method of claim 24, wherein the extracting the one or more first motion target regions or the one or more second motion target regions based on the first bounding box corresponding to each connected region comprises:

for each of the one or more connected regions in the target frame image or the determined key frame image,

determining a second bounding box of the connected region by extending the first bounding box by one or more pixels in a first direction and a second direction;

extracting the one or more first motion target regions or the one or more second motion target regions based on the second bounding box corresponding to each connected region.
The method of any one of claims 23-25, wherein the determining, based on the one or more first motion target regions and the one or more second motion target regions, an optical flow value between the target frame image and the determined key frame image adjacent to the target frame image comprises:

determining an optical flow value between each of the one or more first motion target regions and a second motion target region corresponding to the first motion target region; and

determining, based on the optical flow value between each of the one or more first motion target regions and the second motion target region corresponding to the first motion target region, the optical flow value between the target frame image and the determined key frame image adjacent to the target frame image.
The method of claim 26, wherein the determining, based on the optical flow value between each of the one or more first motion target regions and the second motion target region corresponding to the first motion target region, the optical flow value between the target frame image and the determined key frame image adjacent to the target frame image comprises:

determining the optical flow value between the target frame image and the determined key frame image adjacent to the target frame image by summing the optical flow value between each of the one or more first motion target regions and the second motion target region corresponding to the first motion target region.
The method of any one of claims 26-27, wherein the determining an optical flow value between each of the one or more first motion target regions and a second motion target region corresponding to the first motion target region comprises:

for each first pixel in the first motion target region,

determining, in a third direction, a first optical flow value between the first pixel and a second pixel corresponding to the first pixel in the corresponding second motion target region,

determining, in a fourth direction, a second optical flow value between the first pixel and the second pixel corresponding to the first pixel in the corresponding second motion target region; and

determining, based on the first optical flow value and the second optical flow value between each first pixel and the second pixel corresponding to the first pixel, the optical flow value between the first motion target region and the second motion target region corresponding to the first motion target region.
A system for obtaining a key frame, comprising:

an acquisition module, configured to obtain a target frame image of video data; and

a determination module, configured to determine a motion amplitude of one or more target subjects in the target frame image based on the target frame image and a determined key frame adjacent to the target frame image, and designate, based on the motion amplitude of one or more target subjects in the target frame image, the target frame image as a key fame image.
A non-transitory computer readable medium, comprising a set of instruction for obtaining a key frame, wherein when executed by at least one processor of a computing device, the set of instructions causes the computing device to perform a method, the method comprising:

obtaining a target frame image of video data;

determining a motion amplitude of one or more target subjects in the target frame image based on the target frame image and a determined key frame adjacent to the target frame image; and

designating, based on the motion amplitude of one or more target subjects in the target frame image, the target frame image as a key fame image.