CN116320536B - Video processing method, device, computer equipment and computer readable storage medium - Google Patents
- Publication number
- CN116320536B (application number CN202310550986.9A)
- Authority
- CN
- China
- Prior art keywords
- frame
- video
- image
- image frame
- information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- H—ELECTRICITY; H04—ELECTRIC COMMUNICATION TECHNIQUE; H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION; H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/23418—Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics
- H04N21/234309—Reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements by transcoding between formats or standards, e.g. from MPEG-2 to MPEG-4 or from Quicktime to Realvideo
- H04N21/234381—Reformatting operations by altering the temporal resolution, e.g. decreasing the frame rate by frame skipping
- H04N21/2387—Stream processing in response to a playback request from an end-user, e.g. for trick-play
- H04N21/440218—Reformatting operations of video signals for household redistribution, storage or real-time display by transcoding between formats or standards, e.g. from MPEG-2 to MPEG-4
- H04N21/440281—Reformatting operations for household redistribution, storage or real-time display by altering the temporal resolution, e.g. by frame skipping
- H04N21/816—Monomedia components involving special video data, e.g. 3D video
Abstract
The present disclosure provides a video processing method, apparatus, computer device, and computer-readable storage medium. The scheme is implemented as follows: acquire a plurality of consecutive image frames corresponding to a first video; determine a specific frame among the plurality of consecutive image frames; for each image frame other than the specific frame: determine the correlation between the image frame and the specific frame; determine the image frame as a retained image frame in response to the correlation being less than a predetermined correlation threshold; and determine the image frame as a discarded image frame in response to the correlation being greater than or equal to the predetermined correlation threshold, saving the frame index information of the discarded image frame and the difference information between the discarded image frame and the specific frame in image characteristics as encoding information; and generate a second video based on the retained image frames and the specific frame. The second video includes the encoding information, which contains information related to the time of the discarded image frames in the first video, so that the terminal device can restore the positions of the discarded image frames in the first video.
Description
Technical Field
The present disclosure relates to the field of video processing technology, and in particular, to the field of video codec technology and deep learning, and more particularly, to a video processing method, apparatus, computer device, computer readable storage medium, and computer program product.
Background
Uploading and downloading video involves decoding, encoding, and re-decoding operations. Generally, a video must be decoded and then re-encoded on a cloud device (a process also called transcoding) before it can be transmitted to a terminal device. After receiving the video re-encoded by the cloud device, the terminal device must decode it for playback. As video quality continues to increase, the transmission of video between cloud devices and terminal devices places significant demands on bandwidth. How to reduce the size of the files transmitted between the cloud device and the terminal device while ensuring video quality, so as to lighten the bandwidth load and thus reduce cost, remains one of the research hotspots and difficulties in the industry.
Disclosure of Invention
The present disclosure provides a video processing method, apparatus, computer device, computer readable storage medium, and computer program product.
According to an aspect of the present disclosure, there is provided a video processing method applied to a cloud device, the method including: acquiring a plurality of consecutive image frames corresponding to a first video; determining a specific frame among the plurality of consecutive image frames; for each image frame of the plurality of consecutive image frames other than the specific frame: determining a correlation between the image frame and the specific frame; determining the image frame as a retained image frame in response to the correlation being less than a predetermined correlation threshold; and determining the image frame as a discarded image frame in response to the correlation being greater than or equal to the predetermined correlation threshold, wherein frame index information of the discarded image frame and difference information between the discarded image frame and the specific frame in image characteristics are saved as encoding information; and generating a second video based on the retained image frames and the specific frame, wherein the second video includes the encoding information, and the encoding information includes information related to the time of the discarded image frames in the first video, such that the terminal device restores the positions of the discarded image frames in the first video based on the time-related information and the specific frame.
According to another aspect of the present disclosure, there is provided a video processing method applied to a terminal device, the method including: receiving a second video obtained by video processing of a first video, wherein the second video is generated based on a subset of the plurality of consecutive image frames corresponding to the first video, the subset including a specific frame determined in the first video and retained image frames whose correlation with the specific frame is less than a predetermined correlation threshold, and wherein the second video includes encoding information including frame index information of the discarded image frames determined in the first video and difference information between the discarded image frames and the specific frame in image characteristics, the correlation between each discarded image frame and the specific frame being greater than or equal to the predetermined correlation threshold, and wherein the encoding information includes information related to the time of the discarded image frames in the first video, such that the terminal device restores the positions of the discarded image frames in the first video based on the time-related information and the specific frame; and decoding the second video based on the encoding information, wherein the discarded image frames are restored among the plurality of consecutive image frames.
According to another aspect of the present disclosure, there is provided a video processing apparatus applied to a cloud device, the video processing apparatus including: an image frame acquisition module configured to acquire a plurality of consecutive image frames corresponding to a first video; a specific frame determination module configured to determine a specific frame among the plurality of consecutive image frames; an image frame processing module configured to process each image frame of the plurality of consecutive image frames other than the specific frame, the image frame processing module including: a correlation determination module configured to determine a correlation between the image frame and the specific frame; an image frame retention module configured to determine the image frame as a retained image frame in response to the correlation being less than a predetermined correlation threshold; and an image frame discarding module configured to determine the image frame as a discarded image frame in response to the correlation being greater than or equal to the predetermined correlation threshold, wherein frame index information of the discarded image frame and difference information between the discarded image frame and the specific frame in image characteristics are saved as encoding information; and a video generation module configured to generate a second video based on the retained image frames and the specific frame, wherein the second video includes the encoding information, and the encoding information includes information related to the time of the discarded image frames in the first video, such that the terminal device restores the positions of the discarded image frames in the first video based on the time-related information and the specific frame.
According to another aspect of the present disclosure, there is provided a video processing apparatus applied to a terminal device, the video processing apparatus including: a video receiving module configured to receive a second video obtained by video processing of a first video, wherein the second video is generated based on a subset of the plurality of consecutive image frames corresponding to the first video, the subset including a specific frame determined in the first video and retained image frames whose correlation with the specific frame is less than a predetermined correlation threshold, and wherein the second video includes encoding information including frame index information of the discarded image frames determined in the first video and difference information between the discarded image frames and the specific frame in image characteristics, the correlation between each discarded image frame and the specific frame being greater than or equal to the predetermined correlation threshold, and wherein the encoding information includes information related to the time of the discarded image frames in the first video, such that the terminal device restores the positions of the discarded image frames in the first video based on the time-related information and the specific frame; and a video decoding module configured to decode the second video based on the encoding information, wherein the discarded image frames are restored among the plurality of consecutive image frames.
According to another aspect of the present disclosure, there is provided a computer device comprising: at least one processor; and a memory storing a computer program which, when executed by the processor, causes the processor to perform the methods of the present disclosure provided above.
According to another aspect of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, causes the processor to perform the method of the present disclosure as provided above.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, causes the processor to perform the method of the present disclosure as provided above.
According to one or more embodiments of the present disclosure, the size of the files transmitted between the cloud device and the terminal device can be reduced while ensuring video quality, lightening the bandwidth load and thus reducing cost.
These and other aspects of the disclosure will be apparent from and elucidated with reference to the embodiments described hereinafter.
Drawings
The accompanying drawings illustrate exemplary embodiments and, together with the description, serve to explain exemplary implementations of the embodiments. The illustrated embodiments are for exemplary purposes only and do not limit the scope of the claims. Throughout the drawings, identical reference numerals designate similar, but not necessarily identical, elements.
FIG. 1 is a schematic diagram of an example system in which various methods described herein may be implemented, according to an example embodiment;
FIG. 2 is a flowchart of a video processing method according to an exemplary embodiment;
FIG. 3 is a schematic diagram of time domain processing of an image frame according to an exemplary embodiment;
FIG. 4 is a schematic diagram of a process of extracting image features according to an exemplary embodiment;
FIG. 5 is a flowchart of a video processing method according to another exemplary embodiment;
FIG. 6 is a schematic diagram of a video processing method according to another exemplary embodiment;
FIG. 7 is a schematic block diagram of a video processing apparatus according to an exemplary embodiment;
FIG. 8 is a schematic block diagram of a video processing apparatus according to another exemplary embodiment;
FIG. 9 is a block diagram of an exemplary computer device that can be used in the exemplary embodiments.
Detailed Description
In the present disclosure, the use of the terms "first," "second," and the like to describe various elements is not intended to limit the positional relationship, timing relationship, or importance relationship of the elements, unless otherwise indicated, and such terms are merely used to distinguish one element from another. In some examples, a first element and a second element may refer to the same instance of the element, and in some cases, they may also refer to different instances based on the description of the context.
The terminology used in the description of the various illustrated examples in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, an element may be one or more if its number is not specifically limited. As used herein, the term "plurality" means two or more, and the term "based on" should be interpreted as "based at least in part on". Furthermore, the terms "and/or" and "at least one of" encompass any and all possible combinations of the listed items.
In the related art, uploading and downloading a video requires decoding, encoding, and re-decoding operations. Generally, a video must be decoded and then re-encoded on a cloud device (a process also called transcoding) before it can be transmitted to a terminal device. After receiving the video re-encoded by the cloud device, the terminal device must decode it for playback. This process typically involves lossy compression, which may lose part of the original image information. With the explosive growth of short video, online conferencing, live streaming, and other services, traffic cost control has become a key optimization target for every platform. Therefore, how to reduce the size of the files transmitted between the cloud device and the terminal device while ensuring video quality, so as to lighten the bandwidth load, has long been one of the research hotspots and difficulties in the industry.
One conventional approach is for the video platform to reduce the bitrate of the video encoding, which lowers transmission bandwidth but has a significant impact on video quality and noticeably degrades the viewing experience. Another conventional approach is for the video platform to dynamically adjust the bitrate of different time periods in the video through dynamic encoding technology so as to preserve video quality as much as possible; however, this method involves a complex analysis strategy, demands substantial computing power, exhibits obvious lag, and yields limited bandwidth savings. Yet another conventional approach is for the video platform to adopt newer coding standards such as H.266, VP9, or AV1; these standards are not yet widely supported, and they carry high licensing costs and place demands on client devices. None of these conventional methods can guarantee video quality and reduce bandwidth at the same time.
To reduce the size of files transmitted between the cloud device and the terminal device while ensuring video quality, thereby lightening the bandwidth load and reducing cost, the present disclosure provides a video processing method.
Exemplary embodiments of the present disclosure are described in detail below with reference to the attached drawings. Before describing in detail the video processing method according to embodiments of the present disclosure, an example system in which the present method may be implemented is first described.
FIG. 1 is a schematic diagram illustrating an example system 100 in which various methods described herein may be implemented, according to an example embodiment.
Referring to fig. 1, the system 100 includes a client device 110, a server 120, and a network 130 communicatively coupling the client device 110 with the server 120.
Client device 110 includes a display 114 and a client application (APP) 112 that can be displayed via the display 114. The client application 112 may be an application program that must be downloaded and installed before running, or an applet (lite app), i.e., a lightweight application program. In the former case, the client application 112 may be pre-installed on the client device 110 and activated. In the latter case, the user 102 may run the client application 112 directly on the client device 110, without installing it, by searching for it in a host application (e.g., by its name) or by scanning its graphical code (e.g., a barcode or QR code). In some embodiments, the client device 110 may be any type of mobile computing device, including a mobile computer, a mobile phone, or a wearable computing device (e.g., a smart watch or a head-mounted device, including smart glasses). In some embodiments, the client device 110 may instead be a stationary computing device, such as a desktop computer, a server computer, or another type of stationary device.
Server 120 is typically a server deployed by an Internet Service Provider (ISP) or Internet Content Provider (ICP). Server 120 may represent a single server, a cluster of multiple servers, a distributed system, or a cloud server providing basic cloud services (such as cloud databases, cloud computing, cloud storage, cloud communication). It will be appreciated that although server 120 is shown in fig. 1 as communicating with only one client device 110, server 120 may provide background services for multiple client devices simultaneously.
Examples of network 130 include a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), and/or a combination of communication networks such as the Internet. The network 130 may be a wired or wireless network. In some embodiments, the data exchanged over the network 130 is processed using techniques and/or formats including HyperText Markup Language (HTML), eXtensible Markup Language (XML), and the like. In addition, all or some of the links may be encrypted using technologies such as Secure Sockets Layer (SSL), Transport Layer Security (TLS), Virtual Private Network (VPN), and Internet Protocol Security (IPsec). In some embodiments, custom and/or dedicated data communication techniques may also be used in place of, or in addition to, the techniques described above.
For purposes of embodiments of the present disclosure, in the example of FIG. 1, the client application 112 may be a video processing application. Correspondingly, the server 120 may be a server serving that video processing application. The server 120 may provide video processing data to the client device 110, with video processing services provided by the client application 112 running on the client device 110.
Fig. 2 is a flowchart illustrating a video processing method 200 according to an exemplary embodiment. The video processing method 200 is applied to a cloud device, such as the server 120 shown in fig. 1.
Referring to fig. 2, in step S210, a plurality of consecutive image frames corresponding to a first video are acquired.
In an example, the first video may be a self-made or re-posted video uploaded to the cloud device by a video platform user. After the first video is uploaded, it may be decoded on the cloud device, allowing the cloud device to acquire the plurality of consecutive image frames corresponding to it. In general, an image frame is the smallest constituent unit of a video.
In step S220, a specific frame of a plurality of consecutive image frames is determined.
In an example, information about the specific frame may already be included in the original encoding information of the first video, so that which image frame or frames are specific frames can be learned when the first video is decoded into the plurality of consecutive image frames. Alternatively, feature extraction may be performed on each of the plurality of consecutive image frames to determine which image frame or frames are specific frames. Specific frames may include, for example, key frames, scene-change frames, or image frames with high key-information integrity. The specific frame is not compressed during video processing or during transmission between the cloud device and the terminal device, so all of its characteristics and information are retained.
In step S230, the following steps are performed for each image frame of the plurality of consecutive image frames other than the specific frame: step S231, determining the correlation between the image frame and the specific frame; step S232, determining the image frame as a retained image frame in response to the correlation being less than a predetermined correlation threshold; and step S233, determining the image frame as a discarded image frame in response to the correlation being greater than or equal to the predetermined correlation threshold. Here, the frame index information of the discarded image frame and the difference information between the discarded image frame and the specific frame in image characteristics are saved as encoding information, as sketched below.
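As an illustrative aid (not part of the original disclosure), the retain/discard decision of steps S231 to S233 can be sketched in Python as follows. The `correlation` argument is a hypothetical helper standing in for the optical-flow or LSTM-based measures described below, frames are assumed to be NumPy arrays decoded from the first video, and a raw pixel difference stands in for the richer motion and deformation descriptors described later.

```python
import numpy as np

def partition_frames(frames, specific_idx, threshold, correlation):
    """Sketch of steps S231-S233: split frames into retained frames and
    encoding information for discarded frames."""
    specific = frames[specific_idx]
    retained, encoding_info = [], []
    for idx, frame in enumerate(frames):
        if idx == specific_idx:
            continue  # the specific frame is always kept losslessly
        corr = correlation(frame, specific)
        if corr < threshold:
            retained.append((idx, frame))  # low correlation: retain and encode
        else:
            # high correlation: discard the frame; keep only its index and its
            # difference from the specific frame in image characteristics
            diff = frame.astype(np.int16) - specific.astype(np.int16)
            encoding_info.append({"frame_index": idx, "diff": diff})
    return retained, encoding_info
```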
In an example, the specific frame may be adjacent to the image frame, so the correlation can be determined between the image frame and an adjacent frame. In some examples, each image frame is correlated with the specific frame closest to it in time.
In an example, the correlation between the image frame and the specific frame may be determined by computing the optical flow between them, as in the sketch below.
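A minimal sketch of one such optical-flow-based measure, using OpenCV's Farneback dense flow, follows; treating small average motion as high correlation is an assumption of this sketch, not a formula prescribed by the disclosure.

```python
import cv2
import numpy as np

def flow_correlation(frame, specific_frame):
    """Map the mean optical-flow magnitude between two frames to a
    correlation score in (0, 1]; identical frames score 1.0."""
    g1 = cv2.cvtColor(specific_frame, cv2.COLOR_BGR2GRAY)
    g2 = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(
        g1, g2, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    mean_motion = np.linalg.norm(flow, axis=2).mean()  # mean displacement (pixels)
    return 1.0 / (1.0 + mean_motion)
```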
In an example, the correlation threshold may be flexibly configured based on the computing power, decoder version, network bandwidth, and other conditions of the terminal device that will play the video. The cloud device can obtain this information from the terminal device and use it to determine an appropriate correlation threshold, so that the transcoded video balances image quality against file size as far as possible, accounting for both user experience and transmission cost.
In an example, deep learning techniques may be used to analyze each image frame against the feature information of the specific frame. Based on a correlation threshold determined from the terminal device's performance, image frames with high correlation to the specific frame are discarded, while image frames with low correlation are retained for encoding.
Because the discarded image frames are highly correlated with the specific frame, their difference from the specific frame in image characteristics is smaller than that of the retained frames (which have low correlation), making them easy to restore. The size of the encoded video file can therefore be reduced by not encoding the discarded frames.
In an example, the difference information between a discarded image frame and the specific frame in image characteristics may include information about the motion, texture, and deformation of regions and objects in the image. That is, the difference information reflects the trend of change from the specific frame to the discarded image frame, so that the discarded frame can be restored by applying that trend to the specific frame.
In an example, time-domain processing may be performed based on the features of the plurality of image frames and the specific frame to determine the correlation between them. This time-domain processing may be performed using a long short-term memory (LSTM) network. Thereafter, image frames with high correlation to the specific frame may be discarded based on the correlation threshold determined from the terminal device's performance, and image frames with low correlation may be retained for encoding.
Fig. 3 is a schematic diagram illustrating time domain processing of an image frame according to an exemplary embodiment.
In an example, as shown in FIG. 3, a plurality of image frames may be input into the long short-term memory network 310 simultaneously, including image frame 321, image frame 322, image frame 323, image frame 324, and a number of other image frames not shown. At least one of these may be a specific frame; for example, image frame 321 may be the specific frame, and image frames 322, 323, 324, and the other image frames not shown may each be an image frame temporally adjacent to image frame 321 (for example, the preceding frame, the following frame, or a frame several positions before or after). The LSTM network 310 may analyze the degree of similarity between the feature information of image frame 321 and that of each of image frames 322, 323, 324, and the other image frames not shown, to determine the correlation between them.
It will be appreciated that FIG. 3 is only an example of performing time-domain processing with an LSTM based on the features of a plurality of image frames and the specific frame. Depending on the application, the step of determining the correlation may also be performed with other networks instead of an LSTM. A minimal sketch of such an LSTM-based module follows.
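The sketch below assumes PyTorch and per-frame feature vectors (e.g., from the multi-scale extractor of FIG. 4); the architecture and the use of cosine similarity as the correlation score are assumptions of this sketch rather than details fixed by the disclosure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalCorrelation(nn.Module):
    """LSTM over per-frame features; cosine similarity to the specific
    frame's temporal feature serves as the correlation score."""
    def __init__(self, feat_dim=224, hidden_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)

    def forward(self, frame_feats, specific_feat):
        # frame_feats: (1, T, feat_dim); specific_feat: (1, feat_dim)
        ctx, _ = self.lstm(frame_feats)                # (1, T, hidden_dim)
        spec_ctx, _ = self.lstm(specific_feat.unsqueeze(1))
        spec = spec_ctx[:, -1:, :]                     # (1, 1, hidden_dim)
        return F.cosine_similarity(ctx, spec, dim=-1)  # (1, T) correlation scores

# Usage sketch: scores = model(feats, spec_feat); discard frames where
# scores[0, t] >= threshold, retain the others.
```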
In step S240, a second video is generated based on the retained image frames and the specific frame. The second video includes the encoding information saved in step S230. The encoding information includes information related to the time of the discarded image frames in the first video, so that the terminal device can restore the positions of the discarded image frames in the first video based on the time-related information and the specific frame.
In an example, the second video including the encoding information may be transmitted to the terminal device over a lossless transmission channel.
Retaining the specific frame ensures that the image frame containing important information is never compressed during video processing or during transmission between the cloud device and the terminal device, so all of its characteristics and information are preserved losslessly. Thus, at the terminal device, each discarded image frame can be restored from the saved encoding information based on its difference from the specific frame in image characteristics, thereby recovering the complete set of image frames corresponding to the first video. One possible layout of this encoding information is sketched below.
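The field names and types below are illustrative assumptions, since the disclosure only requires that the frame index, the time information, and the image-characteristic differences be preserved.

```python
from dataclasses import dataclass, field

@dataclass
class DiscardedFrameInfo:
    frame_index: int           # position of the discarded frame in the first video
    timestamp_ms: int          # time of the discarded frame in the first video
    specific_frame_index: int  # the specific frame used as the restoration basis
    motion_diff: bytes         # e.g., compressed motion/optical-flow trend
    deformation_diff: bytes    # e.g., foreground/background deformation trend

@dataclass
class EncodingInfo:
    discarded: list[DiscardedFrameInfo] = field(default_factory=list)
```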
According to the video processing method of the embodiments of the present disclosure, discarding image frames that are highly correlated with the specific frame and retaining those with low correlation reduces the number of encoded image frames and thus the size of the second video to be transmitted. Meanwhile, by saving the frame index information of the discarded image frames and their differences from the specific frame in image characteristics as the encoding information of the second video, the terminal device can restore the discarded frames and thereby recover the quality of the first video. In this way, transmitting the second video does not place excessive pressure on bandwidth, and no video information is lost, achieving both low transmission bandwidth and high video quality.
According to some embodiments, the video processing method 200 may further include: receiving information indicating the performance of the decoding operation performed by a terminal device in communication with the cloud device; and determining the predetermined correlation threshold based on this information, the magnitude of the predetermined correlation threshold being inversely proportional to the decoding performance of the terminal device.
In an example, the above information about the decoding performance of the terminal device in communication with the cloud device may be received, and the correlation threshold determined, before step S232 shown in FIG. 2.
In an example, the terminal device in communication with the cloud device may be the device that will play the video, such as a notebook computer, smartphone, or tablet, e.g., the client device 110 shown in FIG. 1. The correlation threshold may be flexibly configured based on the device's computing power, decoder version, network bandwidth, and so on. For example, when the terminal device's decoding performance is relatively high, it has a large image-frame restoration capability, so the correlation threshold may be set small to minimize the number of retained image frames and allow transmission with a relatively small video size. Conversely, when the decoding performance is relatively low, the device's restoration capability is limited, so the correlation threshold may be set large to retain as many image frames as possible and reduce the impact on video quality.
In an example, the cloud device may send a request to the terminal device to obtain the above information of the terminal device, and determine an appropriate correlation threshold based on the information. The terminal device may also actively send the information of the terminal device to the cloud device, so that the cloud device may determine a suitable correlation threshold based on the information.
According to the embodiments of the present disclosure, determining the predetermined correlation threshold based on information indicating the decoding performance of the terminal device in communication with the cloud device makes the fullest use of the terminal device's capability, so that the number of encoded image frames, and hence the video size and transmission cost, can be reduced as much as possible. A minimal sketch of such a mapping follows.
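The capability score and the linear form below are assumptions of this sketch, since the disclosure only states that the threshold is inversely related to decoding performance.

```python
def correlation_threshold(capability: float,
                          lo: float = 0.85, hi: float = 0.99) -> float:
    """capability in [0, 1]: 0 = weak decoder, 1 = strong decoder.
    A stronger decoder can restore more frames, so it gets a lower
    threshold, which causes more frames to be discarded."""
    capability = min(max(capability, 0.0), 1.0)
    return hi - capability * (hi - lo)
```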
According to some embodiments, the difference information between the discarded image frame and the specific frame in image characteristics includes at least one of: the motion trend of the background portion, the deformation trend of the background portion, the motion trend of the foreground portion, the deformation trend of the foreground portion, the motion trend of the target, and the deformation trend of the target.
In an example, optical flow may be used to obtain at least one of: the motion trend of the background portion, the deformation trend of the background portion, the motion trend of the foreground portion, the deformation trend of the foreground portion, the motion trend of the target, and the deformation trend of the target in the video.
According to the embodiments of the present disclosure, by storing the motion trend of the background portion, the deformation trend of the background portion, the motion trend of the foreground portion, the deformation trend of the foreground portion, the motion trend of the target, the deformation trend of the target, and similar information as the encoding information of the second video, the terminal device can restore the discarded image frames based on that information, recovering the complete first video without losing any video information.
According to some embodiments, step S231 shown in FIG. 2 may include: extracting, at multiple scales, first image features of the image frame and second image features of the specific frame, respectively; and using the similarity between the first image features and the second image features to represent the correlation between the image frame and the specific frame.
Fig. 4 is a schematic diagram illustrating an extraction image feature process 400 according to an example embodiment.
In an example, as shown in fig. 4, the process 400 of extracting image features can be implemented using a U-Net network.
In an example, each image frame of a plurality of consecutive image frames can be sequentially input into a U-Net network to extract image features of the image frame.
Referring to FIG. 4, for example, an image frame 410 may be input to a U-Net network for feature extraction: a multi-channel feature map 421 with a predetermined scale is generated first, followed by multi-channel feature maps 422 and 423 with successively smaller scales. That is, the multi-channel feature maps 421, 422, and 423 correspond to the image frame 410 at different scales, where feature map 421 has the largest scale information and feature map 423 the smallest. The U-Net network then stitches the features of the multi-channel feature maps 421, 422, and 423 together; the stitched features form the image features 430 of image frame 410 as the output of the U-Net network.
In an example, if image frame 410 is an image frame other than the specific frame among the plurality of consecutive image frames, then the image features 430 output by the U-Net network are the first image features described above. If image frame 410 is the specific frame, then the output image features 430 are the second image features described above.
It will be appreciated that FIG. 4 is only an example, showing feature extraction for image frame 410 at three different scales followed by combination of the multi-scale features to generate and output the composite image features 430. Depending on the application, feature extraction may also be performed at fewer or more scales.
It will also be appreciated that FIG. 4 is only one example, showing the extraction of image information from an image frame using a U-Net network. Depending on the practical application, other backbone or pyramid networks may be used in place of the U-Net network to extract the image information of the image frames.
According to the embodiments of the present disclosure, extracting image features at multiple scales makes more effective use of the information in a single image frame, yielding more complete image features; the correlation determined from them between the image frame and the specific frame is therefore more accurate and reliable. A minimal sketch of such a multi-scale extractor follows.
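The sketch below illustrates the idea of FIG. 4 with a small PyTorch encoder: three scales are pooled and stitched into a single feature vector, and cosine similarity between two frames' features stands in for their correlation. The channel counts and pooling choices are assumptions; a full U-Net or another backbone could be substituted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFeatures(nn.Module):
    def __init__(self, in_ch=3, ch=32):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, ch, 3, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(ch, 2 * ch, 3, stride=2, padding=1), nn.ReLU())
        self.enc3 = nn.Sequential(nn.Conv2d(2 * ch, 4 * ch, 3, stride=2, padding=1), nn.ReLU())

    def forward(self, x):                      # x: (N, 3, H, W)
        f1 = self.enc1(x)                      # full resolution (cf. map 421)
        f2 = self.enc2(f1)                     # half resolution (cf. map 422)
        f3 = self.enc3(f2)                     # quarter resolution (cf. map 423)
        # global-average-pool each scale, then stitch the results together
        pooled = [F.adaptive_avg_pool2d(f, 1).flatten(1) for f in (f1, f2, f3)]
        return torch.cat(pooled, dim=1)        # (N, ch + 2ch + 4ch) = (N, 224)

def feature_similarity(extractor, frame_a, frame_b):
    """Cosine similarity of two frames' stitched multi-scale features."""
    with torch.no_grad():
        return F.cosine_similarity(extractor(frame_a), extractor(frame_b), dim=1)
```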
Fig. 5 is a flowchart illustrating a video processing method 500 according to another exemplary embodiment. The video processing method 500 is applied to a terminal device, such as the client device 110 shown in fig. 1.
Referring to FIG. 5, in step S510, a second video obtained by video processing of a first video is received. The second video is generated based on a subset of the plurality of consecutive image frames corresponding to the first video; the subset includes a specific frame determined in the first video and retained image frames whose correlation with the specific frame is less than a predetermined correlation threshold. The second video includes encoding information, which contains frame index information of the discarded image frames determined in the first video and difference information between the discarded image frames and the specific frame in image characteristics; the correlation between each discarded image frame and the specific frame is greater than or equal to the predetermined correlation threshold. The encoding information further includes information related to the time of the discarded image frames in the first video, so that the terminal device can restore the positions of the discarded image frames in the first video based on the time-related information and the specific frame.
In an example, the second video may be obtained by video processing the first video, for example, by the video processing method 200 shown in fig. 2.
In an example, the correlation threshold may be flexibly configured based on the computing power, decoder version, network bandwidth, and other conditions of the terminal device that will play the video. The cloud device can obtain this information from the terminal device and use it to determine an appropriate correlation threshold, so that the transcoded video balances image quality against file size as far as possible, accounting for both user experience and transmission cost.
In an example, because a discarded image frame is highly correlated with the specific frame, its difference from the specific frame in image characteristics is small and it is easy to restore, so the size of the encoded video file can be reduced by not encoding the discarded image frames.
In an example, the difference information between a discarded image frame and the specific frame in image characteristics may include information about the motion, texture, and deformation of regions and objects in the image.
In step S520, the second video is decoded based on the encoding information, and the discarded image frames are restored among the plurality of consecutive image frames.
In an example, the encoding information may include the motion trend of the background portion, the deformation trend of the background portion, the motion trend of the foreground portion, the deformation trend of the foreground portion, the motion trend of the target, and the deformation trend of the target; it may further include information about the motion, texture, and deformation of regions and objects in the image, as well as information related to the time of the discarded image frames in the first video. The terminal device can then restore all the information of a discarded image frame based on this information and the specific frame in the subset of image frames, and insert the discarded frame at its original position in the first video, thereby restoring the first video. The terminal device may be, for example, a notebook computer, a smartphone, or a tablet. The new video generated by decoding the second video may be output to a display device for playback. A sketch of this restoration step follows.
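The sketch below assumes the motion difference was stored as a dense optical-flow field (as in the flow sketch above); the discarded frame is re-synthesized by warping the specific frame and inserted back at its original index. The dictionary keys are the hypothetical ones from the earlier sketches.

```python
import cv2
import numpy as np

def restore_frame(specific_frame, flow):
    """Warp the specific frame by the stored flow to re-synthesize a
    discarded frame."""
    h, w = flow.shape[:2]
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)
    return cv2.remap(specific_frame, map_x, map_y, cv2.INTER_LINEAR)

def rebuild_sequence(decoded_frames, specific_frame, discarded_infos):
    """Insert each restored frame back at its original frame index."""
    frames = list(decoded_frames)
    for info in sorted(discarded_infos, key=lambda i: i["frame_index"]):
        restored = restore_frame(specific_frame, info["flow"])
        frames.insert(info["frame_index"], restored)
    return frames
```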
According to the embodiments of the present disclosure, by restoring the discarded image frames among the plurality of consecutive image frames based on the encoding information contained in the second video, the terminal device can recover the first video simply by inserting frames into the second video according to the encoding information, without losing any video information. This ensures that transmission can use a smaller video file without affecting video quality, achieving both low transmission bandwidth and high video quality.
According to some embodiments, the video processing method 500 may further include: transmitting, to a cloud device in communication with the terminal device, information indicating the performance of the decoding operation performed by the terminal device, the predetermined correlation threshold being determined based on this information and its magnitude being inversely proportional to the decoding performance of the terminal device.
In an example, the information indicating the decoding performance of the terminal device may be transmitted before step S510 shown in FIG. 5.
In an example, the terminal device may send this information in response to a request from the cloud device, or may send it proactively, enabling the cloud device to determine an appropriate correlation threshold based on it.
According to the embodiments of the present disclosure, by transmitting information indicating its decoding performance to the cloud device in communication with it, the terminal device enables the cloud device to determine the predetermined correlation threshold based on that information, making the fullest use of the terminal device's capability so that the number of encoded image frames, and hence the video size and transmission cost, can be reduced as much as possible.
Fig. 6 is a schematic diagram illustrating a video processing method 600 according to another exemplary embodiment.
As shown in FIG. 6, the video processing method 600 includes steps S601 to S607, which involve both the cloud device 620 and the terminal device 650, the two being communicatively connected. The cloud device 620 may receive the first video 610 and perform the video processing method 600 on it.
First, the cloud device 620 may perform step S601 to decode the first video 610. After decoding, the cloud device 620 may obtain a plurality of consecutive image frames corresponding to the first video 610.
Step S602, image frame information analysis, may then be performed. In step S602, feature extraction may be performed on each of the plurality of consecutive image frames to determine which image frame or frames are specific frames. During feature extraction, the first image features of each image frame and the second image features of the specific frame may be extracted at multiple scales, for example using a U-Net network.
Step S603, time-domain analysis, may then be performed. In step S603, time-domain processing may be performed based on the features of the plurality of image frames and the specific frame, for example using an LSTM operating on the first image features of each image frame and the second image features of the specific frame.
Step S604 may then be performed to retain or discard image frames. This operation may be based on a correlation threshold determined from information 630 indicating the performance of the decoding operation performed by the terminal device 650, the magnitude of the threshold being inversely proportional to that performance. For example, the computing power, decoder version, and network bandwidth of the terminal device 650 may determine the magnitude of the correlation threshold. The terminal device 650 may send the information 630 in response to a request from the cloud device 620, or may send it proactively, enabling the cloud device 620 to determine an appropriate correlation threshold. The cloud device 620 retains image frames whose correlation is less than the correlation threshold, discards image frames whose correlation is greater than or equal to the threshold, and saves the frame index information of the discarded frames and their difference from the specific frame in image features as encoding information.
Step S605 may then be performed to encode the video. The encoded second video 640 may include the retained image frames, the specific frame, and the encoding information. The second video 640 may be transmitted to the terminal device 650 through a lossless transmission channel.
Upon receiving the second video 640, the terminal device 650 may perform step S606 to decode it. The terminal device 650 can then obtain the retained image frames and the specific frame corresponding to the second video 640, as well as the encoding information.
The terminal device 650 may then perform step S607, restoring the image frames. The terminal device 650 may restore all the information of the previously discarded image frames based on the encoding information in the second video 640 and insert the restored frames at the appropriate positions, thereby recovering a playback video 660 consistent with the first video 610 for output to the display device.
According to another aspect of the present disclosure, there is also provided a video processing apparatus.
Fig. 7 is a schematic block diagram illustrating a video processing apparatus 700 according to an exemplary embodiment. The video processing apparatus 700 is applied to a cloud device.
As shown in FIG. 7, the video processing apparatus 700 includes: an image frame acquisition module 710 configured to acquire a plurality of consecutive image frames corresponding to a first video; a specific frame determination module 720 configured to determine a specific frame among the plurality of consecutive image frames; an image frame processing module 730 configured to process each image frame of the plurality of consecutive image frames other than the specific frame, the image frame processing module 730 including: a correlation determination module 731 configured to determine a correlation between the image frame and the specific frame; an image frame retention module 732 configured to determine the image frame as a retained image frame in response to the correlation being less than a predetermined correlation threshold; and an image frame discarding module 733 configured to determine the image frame as a discarded image frame in response to the correlation being greater than or equal to the predetermined correlation threshold, wherein frame index information of the discarded image frame and difference information between the discarded image frame and the specific frame in image characteristics are saved as encoding information; and a video generation module 740 configured to generate a second video based on the retained image frames and the specific frame, the second video including the encoding information, which contains information related to the time of the discarded image frames in the first video, such that the terminal device restores the positions of the discarded image frames in the first video based on the time-related information and the specific frame.
According to the embodiments of the present disclosure, by discarding image frames that are highly correlated with the specific frame and retaining image frames that are weakly correlated with it, the number of image frames to be encoded can be reduced, thereby reducing the size of the second video to be transmitted. Meanwhile, by saving the frame index information of the discarded image frames and their difference information relative to the specific frame in terms of image characteristics as the encoding information of the second video, the terminal device can restore the discarded image frames and thus the quality of the first video. In this way, transmitting the second video does not place excessive pressure on bandwidth, and no video information content is lost, thereby balancing transmission bandwidth against video quality.
It should be appreciated that the various modules of the apparatus 700 shown in fig. 7 may correspond to the various steps in the method 200 described with reference to fig. 2. Thus, the operations, features, and advantages described above with respect to method 200 apply equally to apparatus 700 and the modules that it comprises. For brevity, certain operations, features and advantages are not described in detail herein.
Fig. 8 is a schematic block diagram illustrating a video processing apparatus 800 according to another exemplary embodiment. The video processing apparatus 800 is applied to a terminal device.
As shown in fig. 8, the video processing apparatus 800 includes: a video receiving module 810 configured to receive a second video resulting from video processing of a first video, and a video decoding module 820 configured to decode the second video based on the encoding information, wherein the discarded image frames are restored among a plurality of consecutive image frames. The second video is generated based on a subset of the plurality of consecutive image frames corresponding to the first video, the subset including a specific frame determined in the first video and retained image frames, the correlation between each retained image frame and the specific frame being less than a predetermined correlation threshold. The second video includes encoding information comprising frame index information of the discarded image frames determined in the first video and their difference information relative to the specific frame in terms of image characteristics, the correlation between each discarded image frame and the specific frame being greater than or equal to the predetermined correlation threshold. The encoding information further includes information related to the times of the discarded image frames in the first video, so that the terminal device restores the positions of the discarded image frames in the first video based on the time-related information and the specific frame.
According to the embodiments of the present disclosure, by restoring the discarded image frames among the plurality of consecutive image frames based on the encoding information contained in the second video, the terminal device can recover the first video simply by inserting frames into the second video according to the encoding information, without any loss of video information content. Transmission can therefore proceed with a smaller video file without affecting video quality, balancing transmission bandwidth against video quality.
It should be appreciated that the various modules of the apparatus 800 shown in fig. 8 may correspond to the various steps in the method 500 described with reference to fig. 5. Thus, the operations, features, and advantages described above with respect to method 500 apply equally to apparatus 800 and the modules that it comprises. For brevity, certain operations, features and advantages are not described in detail herein.
Although specific functions are discussed above with reference to specific modules, it should be noted that the functions of the various modules discussed herein may be divided among multiple modules, and/or at least some functions of multiple modules may be combined into a single module. A particular module performing an action discussed herein includes that module performing the action itself, or alternatively invoking or otherwise accessing another component or module that performs the action (or that performs the action in conjunction with the particular module).
It should also be appreciated that various techniques may be described herein in the general context of software or program modules. The various modules described above with respect to fig. 7 and 8 may be implemented in hardware, or in hardware combined with software and/or firmware. For example, the modules may be implemented as computer program code/instructions configured to be executed in one or more processors and stored in a computer-readable storage medium. Alternatively, these modules may be implemented as hardware logic/circuitry. For example, in some embodiments, one or more of the image frame acquisition module 710, the specific frame determination module 720, the image frame processing module 730, the correlation determination module 731, the image frame retention module 732, the image frame discarding module 733, and the video generation module 740 shown in fig. 7, or one or both of the video receiving module 810 and the video decoding module 820 shown in fig. 8, may be implemented together in a System on Chip (SoC). The SoC may include an integrated circuit chip comprising one or more components of a processor (e.g., a Central Processing Unit (CPU), microcontroller, microprocessor, Digital Signal Processor (DSP), etc.), memory, one or more communication interfaces, and/or other circuitry, and may optionally execute received program code and/or include embedded firmware to perform functions.
According to an aspect of the present disclosure, a computer device is provided that includes a memory, a processor, and a computer program stored on the memory. The processor is configured to execute a computer program to implement the steps of any of the method embodiments described above.
According to an aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of any of the method embodiments described above.
According to an aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the steps of any of the method embodiments described above.
Illustrative examples of such computer devices, non-transitory computer readable storage media, and computer program products are described below in connection with fig. 9.
Fig. 9 illustrates an example configuration of a computer device 900 that may be used to implement the methods described herein. For example, the server 120 and/or client device 110 shown in fig. 1 may include an architecture similar to that of computer device 900. The video processing apparatus described above may also be implemented, in whole or at least in part, by computer device 900 or a similar device or system.
Computer device 900 may be any of a variety of different types of devices. Examples of computer device 900 include, but are not limited to: a desktop, server, notebook, or netbook computer; a mobile device (e.g., a tablet, a cellular or other wireless phone (e.g., a smartphone), a notepad computer, a mobile station); a wearable device (e.g., glasses, a watch); an entertainment device (e.g., an entertainment appliance, a set-top box communicatively coupled to a display device, a gaming machine); a television or other display device; an automotive computer; and so forth.
Computer device 900 may include at least one processor 902, memory 904, communication interface(s) 906, display device 908, other input/output (I/O) devices 910, and one or more mass storage devices 912, capable of communicating with each other, such as through a system bus 914 or other suitable connection.
The processor 902 may be a single processing unit or multiple processing units, all of which may include a single or multiple computing units or multiple cores. The processor 902 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. The processor 902 may be configured to, among other capabilities, obtain and execute computer-readable instructions stored in the memory 904, mass storage device 912, or other computer-readable medium, such as program code for the operating system 916, program code for the application programs 918, program code for other programs 920, and the like.
Memory 904 and mass storage device 912 are examples of computer-readable storage media for storing instructions that are executed by processor 902 to implement the various functions as previously described. For example, the memory 904 may generally include both volatile memory and nonvolatile memory (e.g., RAM, ROM, etc.). In addition, mass storage device 912 may generally include hard disk drives, solid state drives, removable media, including external and removable drives, memory cards, flash memory, floppy disks, optical disks (e.g., CDs, DVDs), storage arrays, network attached storage, storage area networks, and the like. The memory 904 and mass storage device 912 may both be referred to herein collectively as memory or computer-readable storage media, and may be non-transitory media capable of storing computer-readable, processor-executable program instructions as computer program code that may be executed by the processor 902 as a particular machine configured to implement the operations and functions described in the examples herein.
A number of programs may be stored on mass storage device 912. These programs include an operating system 916, one or more application programs 918, other programs 920, and program data 922, and they may be loaded into the memory 904 for execution. Examples of such application programs or program modules may include, for example, computer program logic (e.g., computer program code or instructions) for implementing client application 112, method 200, method 500, and/or further embodiments described herein.
Although illustrated in fig. 9 as being stored in memory 904 of computer device 900, modules 916, 918, 920, and 922, or portions thereof, may be implemented using any form of computer readable media accessible by computer device 900. As used herein, "computer-readable medium" includes at least two types of computer-readable media, namely computer-readable storage media and communication media.
Computer-readable storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. Computer-readable storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD), or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium which can be used to store information for access by a computer device. In contrast, communication media may embody computer readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism. Computer-readable storage media as defined herein do not include communication media.
One or more communication interfaces 906 are used to exchange data with other devices, such as via a network, direct connection, or the like. Such communication interfaces may be one or more of the following: any type of network interface (e.g., a Network Interface Card (NIC)), a wired or wireless interface (such as IEEE 802.11 Wireless LAN (WLAN)), a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth™ interface, a Near Field Communication (NFC) interface, etc. The communication interface 906 may facilitate communication over a variety of networks and protocol types, including wired networks (e.g., LAN, cable, etc.) and wireless networks (e.g., WLAN, cellular, satellite, etc.), the Internet, and so forth. The communication interface 906 may also provide for communication with external storage devices (not shown), such as in a storage array, network attached storage, storage area network, or the like.
In some examples, a display device 908, such as a monitor, may be included for displaying information and images to a user. Other I/O devices 910 may be devices that receive various inputs from a user and provide various outputs to the user, and may include touch input devices, gesture input devices, cameras, keyboards, remote controls, mice, printers, audio input/output devices, and so on.
The techniques described herein may be supported by these various configurations of computer device 900 and are not limited to the specific examples of techniques described herein. For example, this functionality may also be implemented in whole or in part on a "cloud" using a distributed system. The cloud includes and/or represents a platform for the resource. The platform abstracts underlying functionality of hardware (e.g., servers) and software resources of the cloud. Resources may include applications and/or data that may be used when performing computing processes on servers remote from computer device 900. Resources may also include services provided over the internet and/or over subscriber networks such as cellular or Wi-Fi networks. The platform may abstract resources and functions to connect computer device 900 with other computer devices. Thus, implementations of the functionality described herein may be distributed throughout the cloud. For example, the functionality may be implemented in part on computer device 900 and in part by a platform that abstracts the functionality of the cloud.
While the disclosure has been illustrated and described in detail in the drawings and the foregoing description, such illustration and description are to be considered illustrative and exemplary rather than restrictive; the present disclosure is not limited to the disclosed embodiments. Variations to the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed subject matter, from a study of the drawings, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other elements or steps than those listed, the indefinite article "a" or "an" does not exclude a plurality, the term "plurality" means two or more, and the term "based on" is to be interpreted as "based at least in part on". The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.
Claims (6)
1. A video processing method applied to a cloud device and a terminal device, the cloud device and the terminal device being communicatively connected, the method comprising:
at the cloud device:
acquiring a plurality of continuous image frames corresponding to a first video;
determining a particular frame of the plurality of consecutive image frames;
receiving, from the terminal device, information indicating a performance of the terminal device in performing a decoding operation;
determining a predetermined correlation threshold based on the information, wherein a magnitude of the predetermined correlation threshold is inversely proportional to a magnitude of the performance of the terminal device in performing the decoding operation;
for each image frame of the plurality of consecutive image frames other than the particular frame:
determining a correlation between the image frame and the particular frame;
determining the image frame as a retained image frame in response to the correlation being less than the predetermined correlation threshold; and
determining the image frame as a discarded image frame in response to the correlation being greater than or equal to the predetermined correlation threshold, wherein frame index information of the discarded image frame and difference information, in terms of image characteristics, between the discarded image frame and the particular frame are saved as encoding information; and
generating a second video based on the retained image frame and the particular frame, wherein the second video includes the encoding information, wherein the encoding information includes information related to a time of the discarded image frame in the first video, so that the terminal device restores a position of the discarded image frame in the first video based on the time-related information and the particular frame; and
at the terminal device:
sending, to the cloud device, the information indicating the performance of the terminal device in performing a decoding operation;
receiving the second video from the cloud device; and
decoding the second video based on the encoding information, wherein the discarded image frames are restored among the plurality of consecutive image frames.
2. The method of claim 1, wherein the difference information, in terms of image characteristics, between the discarded image frame and the particular frame comprises at least one of: a movement trend of a background portion, a deformation trend of the background portion, a movement trend of a foreground portion, a deformation trend of the foreground portion, a movement trend of a target, and a deformation trend of the target.
3. The method of claim 1 or 2, wherein determining the correlation between the image frame and the particular frame comprises:
extracting, at multiple scales, a first image feature of the image frame and a second image feature of the particular frame, respectively; and
representing the correlation of the image frame with the particular frame using a similarity between the first image feature and the second image feature.
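By way of illustration only, a crude form of the multi-scale feature extraction and similarity recited in claim 3 might look like the Python sketch below; block-mean pooling at a few downsampling factors stands in for whatever feature extractor an implementation would actually use, and all names are invented.

```python
import numpy as np


def multiscale_features(img: np.ndarray, scales=(1, 2, 4)) -> np.ndarray:
    """Concatenate block-mean-pooled views of a grayscale image at several
    downsampling factors; a stand-in for a learned feature extractor."""
    feats = []
    for s in scales:
        h, w = img.shape[0] // s, img.shape[1] // s
        pooled = (img[:h * s, :w * s].astype(float)
                  .reshape(h, s, w, s).mean(axis=(1, 3)))
        feats.append(pooled.ravel())
    return np.concatenate(feats)


def multiscale_correlation(frame: np.ndarray, particular: np.ndarray) -> float:
    """Cosine similarity between the two frames' multi-scale feature vectors."""
    f1, f2 = multiscale_features(frame), multiscale_features(particular)
    return float(f1 @ f2 / (np.linalg.norm(f1) * np.linalg.norm(f2) + 1e-9))
```

In this toy form, identical frames score 1.0 and the score decays as content diverges, which is the ordering the retain-or-discard threshold relies on.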
4. A video processing apparatus applied to a cloud device and a terminal device, the cloud device and the terminal device being communicatively connected, the video processing apparatus comprising, at the cloud device:
an image frame acquisition module configured to acquire a plurality of consecutive image frames corresponding to a first video;
a specific frame determination module configured to determine a specific frame of the plurality of consecutive image frames; an image frame processing module configured to: receive, from the terminal device, information indicating a performance of the terminal device in performing a decoding operation; determine a predetermined correlation threshold based on the information, wherein a magnitude of the predetermined correlation threshold is inversely proportional to a magnitude of the performance of the terminal device in performing the decoding operation; and process each image frame of the plurality of consecutive image frames except for the specific frame,
wherein the image frame processing module includes:
a correlation determination module configured to determine a correlation between the image frame and the specific frame;
an image frame retention module configured to determine the image frame as a retained image frame in response to the correlation being less than the predetermined correlation threshold; and
an image frame discarding module configured to determine the image frame as a discarded image frame in response to the correlation being greater than or equal to the predetermined correlation threshold, wherein frame index information of the discarded image frame and difference information, in terms of image characteristics, between the discarded image frame and the specific frame are saved as encoding information; and
a video generation module configured to generate a second video based on the retained image frame and the specific frame, wherein the second video includes the encoding information, wherein the encoding information includes information related to a time of the discarded image frame in the first video, so that the terminal device restores a position of the discarded image frame in the first video based on the time-related information and the specific frame;
and, at the terminal device, the video processing apparatus includes:
a video receiving module configured to receive the second video from the cloud device; and
a video decoding module configured to: send, to the cloud device, the information indicating the performance of the terminal device in performing a decoding operation; and decode the second video based on the encoding information, wherein the discarded image frames are restored among the plurality of consecutive image frames.
5. A computer device, comprising:
at least one processor; and
a memory on which a computer program is stored,
wherein the computer program, when executed by the processor, causes the processor to perform the method of any of claims 1-3.
6. A computer readable storage medium having stored thereon a computer program which, when executed by a processor, causes the processor to perform the method of any of claims 1-3.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310550986.9A (granted as CN116320536B) | 2023-05-16 | 2023-05-16 | Video processing method, device, computer equipment and computer readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116320536A (en) | 2023-06-23 |
CN116320536B (en) | 2023-08-18 |
Family
ID=86792766
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310550986.9A (granted as CN116320536B, Active) | Video processing method, device, computer equipment and computer readable storage medium | 2023-05-16 | 2023-05-16 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116320536B (en) |
Citations (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5510839A (en) * | 1991-07-04 | 1996-04-23 | Fujitsu Limited | Image encoding and transmitting system |
CN1649412A (en) * | 2005-02-03 | 2005-08-03 | 北京北大青鸟集成电路有限公司 | Separating and replaying method for video frequency image |
JP2005295426A (en) * | 2004-04-05 | 2005-10-20 | Mitsubishi Electric Corp | Video information recording system, and its method, video information reproduction device and its method |
CN1822678A (en) * | 2002-01-23 | 2006-08-23 | 诺基亚有限公司 | Grouping of image frames in video coding |
KR100824347B1 (en) * | 2006-11-06 | 2008-04-22 | 세종대학교산학협력단 | Apparatus and method for incoding and deconding multi-video |
CN101771869A (en) * | 2008-12-30 | 2010-07-07 | 深圳市万兴软件有限公司 | AV (audio/video) encoding and decoding device and method |
JP2016096398A (en) * | 2014-11-12 | 2016-05-26 | 富士通株式会社 | Device, program and method for video data processing |
CN106210725A (en) * | 2016-08-03 | 2016-12-07 | 广东技术师范学院 | A kind of video image compression perception coder/decoder system and decoding method thereof |
WO2017219896A1 (en) * | 2016-06-21 | 2017-12-28 | 中兴通讯股份有限公司 | Method and device for transmitting video stream |
CN109524015A (en) * | 2017-09-18 | 2019-03-26 | 杭州海康威视数字技术股份有限公司 | Audio coding method, coding/decoding method, device and audio coding and decoding system |
CN111898416A (en) * | 2020-06-17 | 2020-11-06 | 绍兴埃瓦科技有限公司 | Video stream processing method and device, computer equipment and storage medium |
CN111901598A (en) * | 2020-06-28 | 2020-11-06 | 华南理工大学 | Video decoding and encoding method, device, medium and electronic equipment |
CN112954456A (en) * | 2021-03-29 | 2021-06-11 | 深圳康佳电子科技有限公司 | Video data processing method, terminal and computer readable storage medium |
CN113453070A (en) * | 2021-06-18 | 2021-09-28 | 北京灵汐科技有限公司 | Video key frame compression method and device, storage medium and electronic equipment |
CN113596473A (en) * | 2021-07-28 | 2021-11-02 | 浙江大华技术股份有限公司 | Video compression method and device |
CN113923514A (en) * | 2021-09-23 | 2022-01-11 | 青岛信芯微电子科技股份有限公司 | Display device and MEMC (motion estimation and motion estimation) repeated frame discarding method |
CN115086779A (en) * | 2021-12-17 | 2022-09-20 | 浙江大华技术股份有限公司 | Video transmission system |
CN115941972A (en) * | 2022-12-23 | 2023-04-07 | 杭州海康威视系统技术有限公司 | Image transmission method, device, equipment and storage medium |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR100855466B1 (en) * | 2004-01-27 | 2008-09-01 | 삼성전자주식회사 | Method for video coding and decoding, and apparatus for the same |
GB0428156D0 (en) * | 2004-12-22 | 2005-01-26 | British Telecomm | Buffer overflow prevention |
US20120195356A1 (en) * | 2011-01-31 | 2012-08-02 | Apple Inc. | Resource usage control for real time video encoding |
US9786255B2 (en) * | 2014-05-30 | 2017-10-10 | Nvidia Corporation | Dynamic frame repetition in a variable refresh rate system |
US12087048B2 (en) * | 2019-10-07 | 2024-09-10 | Nec Corporation | Video analysis method and system, and information processing device, transmits image frame to cloud server based on difference between analysis result on the edge side and result predicted on a cloud server |
US20210250809A1 (en) * | 2020-02-10 | 2021-08-12 | Qualcomm Incorporated | Efficient bandwidth usage during video communications |
CN113347421B (en) * | 2021-06-02 | 2023-07-14 | 黑芝麻智能科技(上海)有限公司 | Video encoding and decoding method, device and computer equipment |
Non-Patent Citations (1)
Title |
---|
Multi-view video coding and decoding for WMSN based on DCS; Luo Hui; Qi Meili; Liu Jieli; Chu Hongliang; Wang Shichang; Computer Engineering and Design (07); full text *
Also Published As
Publication number | Publication date |
---|---|
CN116320536A (en) | 2023-06-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9961398B2 (en) | Method and device for switching video streams | |
EP2151970B1 (en) | Processing and supplying video data | |
NZ595843A (en) | System and method for multi-stream video compression using multiple encoding formats | |
CN112272327B (en) | Data processing method, device, storage medium and equipment | |
US11893007B2 (en) | Embedding codebooks for resource optimization | |
CN113906764B (en) | Method, apparatus and computer readable medium for transcoding video | |
WO2018089096A1 (en) | Compressed media with still images selected from a video stream | |
CN111031032A (en) | Cloud video transcoding method and device, decoding method and device, and electronic device | |
CN115989527A (en) | Method and apparatus for performing anchor point-based rendering of augmented reality media objects | |
CN110572673B (en) | Video encoding and decoding method and device, storage medium and electronic device | |
US10536726B2 (en) | Pixel patch collection for prediction in video coding system | |
KR102232899B1 (en) | System for cloud streaming service, method of cloud streaming service based on type of image and apparatus for the same | |
CN110891195B (en) | Method, device and equipment for generating screen image and storage medium | |
CN116320536B (en) | Video processing method, device, computer equipment and computer readable storage medium | |
CN112188285A (en) | Video transcoding method, device, system and storage medium | |
CN115243073B (en) | Video processing method, device, equipment and storage medium | |
WO2023053687A1 (en) | Image processing method, image processing system, image processing device, and server | |
CN114095781B (en) | Multimedia data processing method and device, electronic equipment and storage medium | |
CN112492333A (en) | Image generation method and apparatus, cover replacement method, medium, and device | |
US11790695B1 (en) | Enhanced video annotation using image analysis | |
CN116781912B (en) | Video transmission method, device, computer equipment and computer readable storage medium | |
CN110662060B (en) | Video encoding method and apparatus, video decoding method and apparatus, and storage medium | |
CN110572676B (en) | Video encoding method and apparatus, video decoding method and apparatus, and storage medium | |
CN117176979B (en) | Method, device, equipment and storage medium for extracting content frames of multi-source heterogeneous video | |
CN115695850B (en) | Video data processing method, device, electronic equipment and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||