WO2023019870A1 - Video processing method and apparatus, electronic device, storage medium, computer program, and computer program product - Google Patents

Video processing method and apparatus, electronic device, storage medium, computer program, and computer program product

Info

Publication number
WO2023019870A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
target
target frame
frame
attention area
Prior art date
Application number
PCT/CN2022/070177
Other languages
English (en)
French (fr)
Inventor
许通达
高宸健
王岩
袁涛
秦红伟
Original Assignee
上海商汤智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海商汤智能科技有限公司
Publication of WO2023019870A1 publication Critical patent/WO2023019870A1/zh

Links

Images

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N 21/44008 involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H04N 21/44012 involving rendering scenes according to scene graphs, e.g. MPEG-4 scene graphs
    • H04N 21/4402 involving reformatting operations of video signals for household redistribution, storage or real-time display
    • H04N 21/440245 the reformatting operation being performed only on part of the stream, e.g. a region of the image or a time segment

Definitions

  • the present disclosure relates to the field of computer technology, and in particular to a video processing method and device, electronic equipment, storage media, computer programs, and computer program products.
  • identifying the attention area in the video and adjusting the bit rate accordingly are very important for improving the subjective quality of the video and the visual experience of the audience.
  • how to quickly and accurately identify attention regions is a challenge.
  • simply increasing the code rate of the attention area and reducing the code rate of the non-attention area will introduce coding noise in the non-attention area at a low bit rate.
  • the present disclosure proposes a video processing method and device, electronic equipment, storage media, computer programs, and computer program products, aiming at quickly and accurately identifying attention regions in videos and reducing coding noise generated during video coding.
  • An embodiment of the present disclosure provides a video processing method, the method comprising:
  • determining a target frame sequence in the video to be processed according to the order of the time axis, the target frame sequence including: a target frame and at least one reference frame within a preset length range from the target frame;
  • performing attention area detection according to the target frame sequence to obtain a target detection image representing the attention area and the non-attention area in the target frame;
  • determining a corresponding background image and foreground image according to the target frame;
  • performing transparency fusion on the background image and the foreground image according to the target detection image to obtain a target replacement image, where the attention area of the target replacement image is the foreground image and the non-attention area of the target replacement image is at least part of the background image;
  • updating the target frame with the target replacement image.
  • An embodiment of the present disclosure provides a video processing device, the device comprising:
  • the sequence determination module is configured to determine a target frame sequence in the video to be processed according to the order of the time axis, and the target frame sequence includes: a target frame and at least one reference frame within a preset length range from the target frame;
  • the attention area detection module is configured to perform attention area detection according to the target frame sequence, and obtain a target detection image for representing the attention area and the non-attention area in the target frame;
  • An image determination module configured to determine a corresponding background image and foreground image according to the target frame
  • the image fusion module is configured to perform transparency fusion on the background image and the foreground image according to the target detection image to obtain a target replacement image, the attention area of the target replacement image is the foreground image, and the non-attention area of the target replacement image is at least part of the background image;
  • An image updating module configured to update the target frame with the target replacement image.
  • An embodiment of the present disclosure provides an electronic device, including: a processor; a memory configured to store processor-executable instructions; wherein the processor is configured to call the instructions stored in the memory to perform some or all of the steps of the above method.
  • An embodiment of the present disclosure provides a computer-readable storage medium, on which computer program instructions are stored, and when the computer program instructions are executed by a processor, some or all steps of the above method are implemented.
  • An embodiment of the present disclosure provides a computer program, the computer program includes computer-readable code, and when the computer-readable code is read and executed by a computer, some or all of the steps of the method in any embodiment of the present disclosure are implemented.
  • An embodiment of the present disclosure provides a computer program product, the computer program product includes a non-transitory computer-readable storage medium storing a computer program, and when the computer program is read and executed by a computer, some or all of the steps of the method in any embodiment of the present disclosure are implemented.
  • the background image and the foreground image of the target frame can be determined, and the target frame can be updated with a target replacement image that displays the foreground image in the attention area and the background image in the non-attention area, reducing the code rate of the entire frame to be processed and thereby reducing the encoding noise generated in the subsequent encoding process.
  • FIG. 1 is a flowchart of a video processing method provided by an embodiment of the present disclosure
  • FIG. 2 is a schematic diagram of determining a target frame sequence provided by an embodiment of the present disclosure
  • FIG. 3A is a flowchart of an attention region detection process provided by an embodiment of the present disclosure.
  • FIG. 3B is a schematic flow diagram of obtaining a first detection image provided by an embodiment of the present disclosure
  • FIG. 4 is a schematic diagram of a second image processing process provided by an embodiment of the present disclosure.
  • FIG. 5 is a schematic diagram of an attention region detection process provided by an embodiment of the present disclosure.
  • FIG. 6 is a schematic diagram of a target detection image provided by an embodiment of the present disclosure.
  • FIG. 7 is a schematic diagram of a process of determining a target substitute image provided by an embodiment of the present disclosure.
  • FIG. 8 is a schematic diagram of a transparency fusion process provided by an embodiment of the present disclosure.
  • FIG. 9 is a schematic diagram of a process of determining an adaptive quantization parameter provided by an embodiment of the present disclosure.
  • FIG. 10A is a schematic diagram of a data transmission process provided by an embodiment of the present disclosure.
  • FIG. 10B is a schematic diagram of another data transmission process provided by an embodiment of the present disclosure.
  • FIG. 11 is a schematic diagram of a video processing device provided by an embodiment of the present disclosure.
  • FIG. 12 is a block diagram of an electronic device provided by an embodiment of the present disclosure.
  • Fig. 13 is a block diagram of another electronic device provided by an embodiment of the present disclosure.
  • FIG. 1 is a flowchart of a video processing method provided by an embodiment of the present disclosure.
  • the video processing method may be performed by a terminal device or other processing device, where the terminal device may be a user equipment (User Equipment, UE), a mobile device, a user terminal, a terminal, a cellular phone, a cordless phone, a personal digital assistant (Personal Digital Assistant, PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, etc.
  • the video processing method may be implemented by a processor invoking computer-readable instructions stored in a memory.
  • each frame in the video to be processed can be updated to an image with different sharpness in the attention area and the non-attention area, the adaptive quantization parameter corresponding to each frame can be determined, and video encoding can be performed using each frame in the video to be processed and its corresponding adaptive quantization parameter.
  • the video processing method and video encoding can be completed by the same device, or the video processing method is first executed by a terminal device or other device, and the result is then transmitted to a video encoder for video encoding.
  • the video processing method of the embodiment of the present disclosure includes the following steps:
  • Step S10: determining the target frame sequence in the video to be processed according to the order of the time axis.
  • the embodiment of the present disclosure may execute the video processing method by processing each frame of the video to be processed separately. That is to say, each frame in the video to be processed may be used as a target frame for image processing, so as to determine a target replacement image for replacing that target frame. After the image processing of the current target frame is completed, an unprocessed frame in the video to be processed is determined as the new target frame, until the image processing of all frames in the video to be processed is completed and the video processing of the video to be processed is thereby finished.
  • the processing sequence of the target frames may be sequentially determined based on the sequence of the time axis.
  • the video to be processed is essentially a frame sequence composed of a plurality of frames, each of which records a piece of static image information. Therefore, in order to perform video processing on the target frame, it is necessary to obtain multiple frames within a preset length range from the target frame as reference frames, so that the attention area in the target frame can be determined for video processing by comparing the image content of the reference frames with that of the target frame.
  • the target frame sequence includes: the target frame and at least one reference frame within a preset length range from the target frame.
  • the "preset length" may be a preset number of frames. Among them, when there is no number of frames with a preset length before or after the target frame, that is, when the target frame is the first few frames at the beginning of the video or the last few frames at the end, the corresponding frame sequence can be determined, and the adjacent reference
  • the attention area identification result of the frame is used as the attention area identification result of the current target frame.
  • take a video to be processed including frames T1-T10 as an example.
  • when the preset length is 2, the target frame sequences can be sequentially determined according to the time axis as (T1, T2, T3, T4, T5), (T2, T3, T4, T5, T6), (T3, T4, T5, T6, T7), (T4, T5, T6, T7, T8), (T5, T6, T7, T8, T9) and (T6, T7, T8, T9, T10).
  • the target frames corresponding to each target frame sequence are T3, T4, T5, T6, T7 and T8 in sequence, and the recognition result of the attention area corresponding to the target frame can be obtained by sequentially performing attention area detection on each target frame sequence.
  • the recognition result of T3 may be used as the recognition result of T1 and T2
  • the recognition result of T8 may be used as the recognition result of T9 and T10.
  • the process of obtaining the target frame sequence in this embodiment of the present disclosure may include: sequentially adding each frame of the video to be processed to a preset first-in-first-out queue according to the time axis, and in response to every position in the queue being occupied, using the frame in the middle of the queue as the target frame and the frames in the other positions as the reference frames to determine the target frame sequence. That is to say, a fixed-length first-in-first-out queue is preset, and each frame is added to the queue in the order of the frames of the video to be processed on the time axis, with each frame occupying one position in the queue.
  • the middle position is the middlemost position in the queue, or a predetermined one of the two middlemost positions. For example, when the queue length is odd, the queue has only one middlemost position, and the frame at that position is determined to be the target frame; when the queue length is even, the queue has two middle positions, and the frame stored at the earlier of the two may be determined to be the target frame.
  • the length of the first-in-first-out queue can be the total number of the target frame and the reference frames, that is, the preset length multiplied by two plus one. The preset length indicated by the preset length range may be a preset number of frames; for example, when the preset length is 2, the length of the FIFO queue is 5.
  • after the current target frame is processed, the frame stored at the first position in the queue is popped, and the next frame of the video to be processed is pushed into the queue; see the sketch below.
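  • the following minimal Python sketch (with hypothetical names; not part of the original disclosure) illustrates this sliding-window queue logic:

    from collections import deque

    def frame_sequences(frames, preset_length=2):
        """Yield (window, target) pairs over a list of frames.

        A fixed-length FIFO queue of size 2 * preset_length + 1 slides
        along the time axis; once every position is occupied, the frame
        in the middle is the target and the others are reference frames.
        """
        queue_len = 2 * preset_length + 1      # e.g. 5 when preset_length is 2
        queue = deque(maxlen=queue_len)        # appending beyond maxlen pops the oldest frame
        for frame in frames:                   # frames arrive in time-axis order
            queue.append(frame)
            if len(queue) == queue_len:        # every position occupied
                yield tuple(queue), queue[preset_length]   # middlemost frame is the target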
  • FIG. 2 is a schematic diagram of determining a target frame sequence provided by an embodiment of the present disclosure. As shown in FIG. 2 , when the target frame sequence is determined, each frame in the video to be processed 20 is sequentially added to a preset first-in-first-out queue 22 along the time axis sequence 21 .
  • the frame T currently in the middle position of the queue 22 is used as the target frame, and the frames T-2, T-1, T+1 and T+2 at the other positions are used as reference frames to determine the target frame sequence (T-2, T-1, T, T+1, T+2), so that video processing such as attention region identification can be performed on the target frame T based on the target frame sequence (T-2, T-1, T, T+1, T+2).
  • the target frame sequence may be determined when the T-3 frame is ejected and the T+2 frame is pushed into the queue.
  • the foregoing manner of sequentially determining the target frame sequence can improve the efficiency of the entire video processing process.
  • the target frame sequence including the reference frame and the target frame is determined to detect the attention area based on the target frame sequence, which improves the accuracy of the recognition result of the attention area corresponding to the target frame.
  • Step S20: performing attention region detection according to the target frame sequence to obtain a target detection image for representing the attention area and the non-attention area in the target frame.
  • the target detection image can be determined by performing attention region detection on the target frame sequence; the target detection image is an image used to represent the attention area and the non-attention area in the target frame.
  • the attention area may be determined by comparing image content in the target frame and the reference frame in the target frame sequence.
  • the attention area is an area that humans will focus on in the target frame, for example, it may be a moving area in an image under a relatively static background or an area where a specific contour in the image is located.
  • the motion area under the relatively static background may be: the area where the football is located in the football game video, etc.
  • the area where the specific outline is located may be: the area where the face outline is located in the face recognition scene, etc.
  • the attention area may also be other areas than the motion area.
  • FIG. 3A is a flowchart of an attention region detection process provided by an embodiment of the present disclosure. As shown in FIG. 3A, in a possible implementation, the process in which the embodiment of the present disclosure performs attention area detection on the target frame sequence to obtain a target detection image representing the attention area and non-attention area of the target frame may include the following steps:
  • Step S21: performing the first image processing on the target frame sequence to obtain a feature tensor.
  • the first image processing is performed on the target frame sequence to obtain a feature tensor, which is used to characterize the image features of the target frame and each reference frame in the target frame sequence; each target frame sequence corresponds to one feature tensor.
  • the first image processing aims to convert each frame in the target frame sequence from a high-resolution image to a low-resolution image, so as to improve the speed and efficiency of the subsequent attention region detection.
  • the first image processing process may include: downsampling each frame in the target frame sequence by a predetermined multiple, and determining a feature tensor according to each downsampled frame. That is to say, a multiple is preset, and each frame in the target frame sequence is reduced by a predetermined multiple by downsampling, and then the feature tensor is determined according to each reduced frame.
  • the down-sampling method may be any method, such as nearest-neighbor interpolation, bilinear interpolation, mean interpolation, or median interpolation, which is not limited here.
  • the predetermined multiple may be set according to the macroblock size used in the encoding process. For example, when the macroblock size is 16×16, the predetermined multiple is set to 16, that is, each frame is reduced by a factor of 16 by down-sampling to obtain a macroblock-level frame.
  • the determined feature tensor is a four-dimensional feature tensor, where the four dimensions of the feature tensor are the timing, channel, height and width of the corresponding frame.
  • the timing can be determined according to the time axis sequence of each frame in the video to be processed
  • the channels can be determined according to the number of color channels of each frame
  • the height and width can be determined according to the resolution size of each frame.
  • the four-dimensional feature tensor can be applied to lightweight neural networks such as the MobileNetV3 neural network described later, and used as the input data of the neural network.
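  • a hedged sketch of this first image processing step follows (it assumes frames are NumPy arrays of shape (H, W, C) and uses OpenCV for downsampling; the disclosure does not mandate a particular interpolation method):

    import numpy as np
    import cv2

    def build_feature_tensor(frame_sequence, multiple=16):
        """Downsample each frame by `multiple` and stack the results into a
        four-dimensional (timing, channel, height, width) tensor."""
        downsampled = [
            cv2.resize(f, (f.shape[1] // multiple, f.shape[0] // multiple),
                       interpolation=cv2.INTER_LINEAR)   # bilinear; any method works
            for f in frame_sequence
        ]
        # (T, H, W, C) -> (T, C, H, W): timing, channel, height, width
        return np.stack(downsampled).transpose(0, 3, 1, 2)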
  • Step S22: inputting the feature tensor into the trained neural network to detect attention regions.
  • the feature tensor is input into the trained neural network to detect the attention area, so that the attention area in the target frame is determined by comparing the image content of the target frame with that of each reference frame, and a first detection image representing the attention area and the non-attention area is output.
  • the resolution of the first detection image is the same as the resolution of each frame after downsampling.
  • multiple object areas included in the target frame and the reference frames can be determined through object recognition, and the positions of the areas where the same object is located in the target frame and the reference frames can then be compared; the position in the target frame of the object area corresponding to an object whose position change distance is greater than a preset threshold is determined as the attention area.
  • the neural network for detecting attention regions is a lightweight neural network.
  • the neural network may be a MobileNetV3 neural network, which sequentially includes: an initial part, an intermediate part and a final part.
  • the initial part includes a convolutional layer with a kernel size of 3 ⁇ 3 for feature extraction
  • the middle part includes 11 or 15 bneck modules
  • the final part includes an average pooling layer and a convolution layer with a 1×1 kernel
  • the bneck module includes sequentially connected channel-separable convolution and a channel attention mechanism, and reduces data loss during convolution through a residual connection; a simplified sketch follows.
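  • the following simplified PyTorch sketch keeps only the elements named above (channel-separable convolution, channel attention, residual connection) and omits the batch normalization, hard-swish activations and stride/expansion variants of the real MobileNetV3; it is an illustration, not the patented network:

    import torch
    import torch.nn as nn

    class Bneck(nn.Module):
        """Pointwise expand -> depthwise (channel-separable) conv ->
        squeeze-and-excitation channel attention -> pointwise project,
        with a residual connection to reduce information loss."""
        def __init__(self, channels, expand=4):
            super().__init__()
            hidden = channels * expand
            self.expand = nn.Conv2d(channels, hidden, 1, bias=False)
            self.depthwise = nn.Conv2d(hidden, hidden, 3, padding=1,
                                       groups=hidden, bias=False)
            self.se = nn.Sequential(              # channel attention mechanism
                nn.AdaptiveAvgPool2d(1),
                nn.Conv2d(hidden, hidden // 4, 1), nn.ReLU(),
                nn.Conv2d(hidden // 4, hidden, 1), nn.Sigmoid(),
            )
            self.project = nn.Conv2d(hidden, channels, 1, bias=False)

        def forward(self, x):
            y = torch.relu(self.expand(x))
            y = torch.relu(self.depthwise(y))
            y = y * self.se(y)                    # reweight channels
            return x + self.project(y)            # residual connection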
  • FIG. 3B is a schematic flow chart of obtaining a first detection image provided by an embodiment of the present disclosure.
  • the frame T in the video to be processed can be pushed into the FIFO queue 401; when frame T is pushed in, frame T-5 is popped, so the FIFO queue 401 stores frames T, T-1, T-2, T-3 and T-4.
  • each frame in the first-in-first-out queue 401 can be down-sampled separately, the feature vector 402 is obtained from the down-sampled frames, and the feature vector 402 is input into the MobileNetV3 neural network 403, which outputs the first detection image 404.
  • the MobileNetV3 neural network can reduce the amount of computation while improving the accuracy of the calculation results.
  • the embodiments of the present disclosure can perform real-time attention region detection with low-resolution input, improving the detection speed while also improving the accuracy of the detection results.
  • Step S23: performing the second image processing on the first detection image to obtain a target detection image with the same resolution as the target frame.
  • a second image processing is performed on the first detection image to obtain a target detection image with the same resolution as the target frame.
  • the second image processing process is used to restore the size of the first detection image to the original size of the target frame, so as to perform image processing and update on the target frame based on the obtained target detection image.
  • the process of performing the second image processing on the first detection image includes: upsampling the first detection image by the predetermined multiple to obtain a second detection image with the same resolution as the target frame, and performing maximum pooling on the second detection image with a preset window size and step size to obtain the target detection image.
  • the resolution of the first detection image can be restored to the same resolution as the target frame to obtain the second detection image.
  • the manner of upsampling the first detection image may be any method, for example nearest-neighbor interpolation, bilinear interpolation, mean interpolation, or median interpolation, which is not limited here.
  • the bicubic interpolation method can also be used for upsampling to improve the final image effect.
  • the size of the window for performing maximum pooling on the second detection image may be determined according to the upsampling ratio, that is, the same as the aforementioned predetermined multiple. For example, when the predetermined multiple is 16, the maximum pooling window size may be determined to be 16×16.
  • the step size of the maximum pooling process can be set to 1 in advance, as in the sketch below.
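  • a hedged sketch of this second image processing step (bicubic upsampling followed by a stride-1 sliding-window maximum; scipy's maximum_filter is used here as a stand-in for stride-1 max pooling, with its default border padding as an implementation assumption):

    import cv2
    from scipy.ndimage import maximum_filter

    def second_image_processing(first_detection, multiple=16):
        """Restore the low-resolution detection map to full resolution,
        then max-pool it with a window equal to the upsampling ratio."""
        h, w = first_detection.shape
        second_detection = cv2.resize(first_detection,
                                      (w * multiple, h * multiple),
                                      interpolation=cv2.INTER_CUBIC)  # bicubic upsampling
        # Window size = predetermined multiple (e.g. 16x16), step size 1.
        return maximum_filter(second_detection, size=multiple)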
  • FIG. 4 is a schematic diagram of a second image processing process provided by an embodiment of the present disclosure.
  • the first detection image 40 is obtained by performing attention area detection on the feature tensor through the neural network
  • the resolution of the first detection image is restored to that of the target frame by upsampling, yielding the second detection image 41
  • texture features are then extracted from the second detection image 41 by max pooling, and a target detection image 42 that clearly distinguishes the attention area from the non-attention area is obtained, which is convenient for subsequent image processing.
  • Fig. 5 is a schematic diagram of an attention region detection process provided by an embodiment of the present disclosure.
  • the process of detecting the attention area of the target frame in the embodiment of the present disclosure is as follows: first determine the target frame sequence 50 corresponding to the target frame, and then down-sample each frame in the target frame sequence 50 through the first image processing to obtain the feature vector 51.
  • Inputting the low-resolution feature vector 51 into the trained neural network 52 can quickly and accurately obtain the low-resolution first detection image 53 .
  • through the second image processing, a target detection image 54 with clear texture features and high resolution is obtained.
  • the attention region detection process improves detection efficiency and improves the accuracy of detection results.
  • FIG. 6 is a schematic diagram of a target detection image provided by an embodiment of the present disclosure.
  • the target detection image corresponding to the target frame has the same resolution as the target frame, and each pixel takes a value between 0 and 1.
  • each value represents the probability that the corresponding pixel belongs to the attention area; for example, a pixel with a value of 1 is a pixel in the attention area, and a pixel with a value of 0 is a pixel in a non-attention area.
  • Step S30: determining the corresponding background image and foreground image according to the target frame.
  • image processing is performed on the target frame through different image processing methods, so as to obtain background images and foreground images with different visual effects.
  • the target frame is blurred to obtain a background image
  • the target frame is sharpened to obtain a foreground image.
  • the method of blurring the target frame in the embodiments of the present disclosure may include any image blurring method such as Gaussian blur, salt and pepper blur, motion blur, and occlusion blur, which is not limited here.
  • the method of sharpening the target frame in the embodiments of the present disclosure may include any image sharpening method, such as Sobel operator sharpening, Laplacian operator sharpening, Prewitt operator sharpening, and Canny operator sharpening, which is not limited here.
  • different processing methods can be used to determine the foreground image and the background image respectively, so that the foreground image and the background image can be fused based on the attention area, enhancing the image outline and clarity of the attention area and reducing the clarity of the non-attention area, to improve the visual experience of the finally processed image; a sketch follows.
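  • a minimal sketch of one such choice of processing methods (Gaussian blur for the background and a Laplacian-style sharpening kernel for the foreground; both kernels are illustrative assumptions, since the disclosure allows any blur or sharpening method):

    import cv2
    import numpy as np

    def background_and_foreground(target_frame):
        """Blur the target frame to get the background image and sharpen
        it to get the foreground image."""
        background = cv2.GaussianBlur(target_frame, (9, 9), 0)   # any blur works
        sharpen_kernel = np.array([[0, -1, 0],
                                   [-1, 5, -1],
                                   [0, -1, 0]], dtype=np.float32)
        foreground = cv2.filter2D(target_frame, -1, sharpen_kernel)
        return background, foreground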
  • Step S40: performing transparency fusion on the background image and the foreground image according to the target detection image to obtain a target replacement image.
  • the attention area of the target substitute image is a foreground image
  • the non-attention area is a background image.
  • the method of obtaining the target replacement image includes: determining a transparency channel according to the target detection image, and performing transparency fusion on the background image and the foreground image according to the transparency channel, to obtain a target replacement image that displays the foreground image at the position of the attention area and displays all or part of the background image at the position of the non-attention area.
  • the value of each pixel in the target detection image is remapped to a range of 0-1 by normalizing the target detection image to obtain a corresponding transparency channel.
  • the area with a pixel value of 1 is an attention area
  • the area with a pixel value other than 1 is a non-attention area.
  • a pixel value of 1 represents a position of 0% transparency
  • a pixel value of 0 represents a position of 100% transparency
  • a pixel value between 0 and 1 represents a corresponding degree of opacity.
  • the manner of performing transparency fusion on the background image and the foreground image according to the transparency channel may include: adjusting the transparency of each pixel in the foreground image according to the value of the corresponding pixel in the transparency channel, and then fusing the adjusted foreground image with the background image to obtain the target replacement image.
  • the target substitute image shows an opaque foreground image at the location of the attention region, and the background image is covered.
  • in the non-attention area, since the transparency of the foreground image is between 0 and 100%, the background image can be fully or partially displayed.
  • where the pixel value is 0, the transparency of the foreground image is 100% and the background image is displayed directly; a pixel value between 0 and 1 adjusts the transparency of the corresponding foreground pixel so that part of the foreground image and part of the background image are both shown at that location, as in the sketch below.
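  • transparency fusion of this kind is standard alpha blending; a minimal sketch, assuming the target detection image and both images are NumPy arrays of matching resolution:

    import numpy as np

    def transparency_fusion(foreground, background, detection_image):
        """Blend foreground and background using the normalized detection
        map as the transparency channel: 1 keeps the sharp foreground
        (attention area), 0 shows the blurred background, and values in
        between mix the two."""
        alpha = detection_image.astype(np.float32)
        alpha = (alpha - alpha.min()) / max(float(alpha.max() - alpha.min()), 1e-8)
        if foreground.ndim == 3:                 # broadcast over colour channels
            alpha = alpha[..., np.newaxis]
        fused = alpha * foreground + (1.0 - alpha) * background
        return fused.astype(foreground.dtype)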
  • Fig. 7 is a schematic diagram of a process of determining a substitute image of a target provided by an embodiment of the present disclosure.
  • the background image 71 is obtained by blurring the target frame 70
  • the foreground image 72 is obtained by sharpening the target frame 70.
  • the transparency channel 74 is obtained by normalizing the target detection image 73.
  • FIG. 8 is a schematic diagram of a transparency fusion process provided by an embodiment of the present disclosure. As shown in Figure 8, when performing transparency fusion on the foreground image 80, the background image 81 and the transparency channel 82, the foreground image 80 is used as the top layer of the image and the background image 81 as the bottom layer, and the two layers are superimposed according to the transparency channel 82.
  • the transparency of the attention area (value 1) in the foreground image 80 is adjusted to 0% (that is, opaque), so that the foreground image 80 on the top layer is displayed in the attention area of the target replacement image 83; the transparency of the non-attention area (value 0) in the foreground image 80 is adjusted to 100%, so that the background image 81 on the bottom layer is displayed in the non-attention area of the target replacement image 83.
  • the embodiment of the present disclosure can display a clear foreground image in the attention area and a blurred background image in the non-attention area through transparency fusion, so as to improve the subjective visual experience of the obtained target replacement image.
  • Step S50: updating the target frame by using the target replacement image.
  • the target frame in the video to be processed is updated by the target replacement image.
  • the updated target frame may be used as an input frame and input to a video encoder for video encoding.
  • in response to the target frame being updated, the frame stored at the first position in the queue is popped and the next frame of the video to be processed is pushed into the queue. That is to say, after the target frame in the video to be processed is updated, it is judged that the processing of the current target frame is completed; by popping the frame stored at the first position in the queue and pushing the next frame into the queue, the frame following the previous target frame on the time axis is re-determined as the new target frame.
  • each frame in the queue is reacquired to determine the target frame sequence corresponding to the new target frame.
  • the updated target frame becomes the reference frame in the new target sequence.
  • the video processing method of the embodiment of the present disclosure is applied to a video coding scenario.
  • the process of inputting the updated target frame and the corresponding adaptive quantization parameter into the video encoder may be: inputting the updated target frame into the video encoder as the input frame, and inputting the adaptive quantization parameter into the adaptive quantization interface of the video encoder.
  • the feature vectors obtained after downsampling can be processed by the MobileNetV3-based lightweight neural network, so that real-time saliency detection is performed on the frame sequence (video) downsampled to the macroblock level to obtain target detection images.
  • after the target detection image is obtained, the target frame sequence (original video) is post-processed based on the target detection image and the adaptive quantization parameters are output, which can improve the subjective clarity of the video while reducing the bit rate.
  • the process of determining the adaptive quantization parameter corresponding to the target detection image includes: performing histogram statistics on the target detection image to obtain a corresponding histogram mapping table. Map the target detection image according to the histogram mapping table to obtain the corresponding preliminary quantization parameters.
  • the mapping process can be: initialize a blank image with the same size as the target detection image, determine the corresponding value in the histogram mapping table for each pixel value in the target detection image, and store each value at the same position in the blank image as the corresponding pixel, to obtain the corresponding preliminary quantization parameters.
  • alternatively, the corresponding value in the histogram mapping table can be determined for each pixel value in the target detection image, and the corresponding pixel value in the target detection image is replaced with that value to obtain the preliminary quantization parameters.
  • the adaptive quantization parameters are obtained by down-sampling the preliminary quantization parameters.
  • the adaptive quantization parameter is used for performing video encoding on the updated target frame during the video encoding process. This downsampling process is used to convert the preliminary quantization parameters to an image size suitable for video encoding.
  • the process of downsampling the preliminary quantization parameters is the same as the process of downsampling each frame in the target frame sequence, and the scaling factor used for the preliminary quantization parameters is the same as that used for each frame in the target frame sequence, so it will not be repeated here; an end-to-end sketch follows.
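  • a hedged sketch of this parameter determination (the disclosure does not specify the mapping table; this example assumes an 8-bit detection image and an H.264-style QP range of 0-51, with lower QP, i.e. finer quantization, where attention probability is high):

    import numpy as np
    import cv2

    def adaptive_quantization_parameter(detection_image, multiple=16):
        """Histogram statistics -> mapping table -> per-pixel lookup
        (preliminary QP) -> downsampling (adaptive QP)."""
        hist = np.bincount(detection_image.ravel(), minlength=256)  # histogram statistics
        cdf = hist.cumsum() / hist.sum()
        mapping = np.round((1.0 - cdf) * 51).astype(np.uint8)       # assumed mapping table
        preliminary_qp = mapping[detection_image]                   # map every pixel
        h, w = preliminary_qp.shape
        # Downsample with the same factor used on the frames themselves.
        return cv2.resize(preliminary_qp, (w // multiple, h // multiple),
                          interpolation=cv2.INTER_NEAREST)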
  • FIG. 9 is a schematic diagram of a process of determining an adaptive quantization parameter provided by an embodiment of the present disclosure.
  • the preliminary quantization parameter 91 corresponding to the target frame can be obtained through histogram mapping.
  • the histogram mapping process includes: performing histogram statistics on the target detection image 90 to obtain a corresponding histogram mapping table, and then obtaining preliminary quantization parameters 91 by mapping the target detection image through the histogram mapping table.
  • the adaptive quantization parameter 92 is obtained by downsampling the preliminary quantization parameter by the same predetermined multiple as the downsampling process of each frame in the target frame sequence.
  • FIG. 10A is a schematic diagram of a data transmission process provided by an embodiment of the present disclosure.
  • the target replacement image 100 is input into the video encoder 102 as an input frame of the video encoder.
  • the adaptive quantization parameter 101 determined based on the target detection image is also input into the adaptive quantization interface of the video encoder 102 as a parameter for video coding the target replacement image 100 .
  • FIG. 10B is a schematic diagram of another data transmission process provided by an embodiment of the present disclosure.
  • the background image 1002 can be obtained by blurring the target frame 1001 .
  • the target detection image 1004 is normalized to obtain a transparency channel 1005 .
  • the preliminary quantization parameter 1007 corresponding to the target frame 1001 can be obtained through histogram mapping, and then the preliminary quantization parameter 1007 is down-sampled to obtain an adaptive quantization parameter 1008 .
  • the target replacement image 1006 is input into the video encoder 1009 as an input frame of the video encoder.
  • the adaptive quantization parameter 1008 determined based on the target detection image 1004 is also input into the adaptive quantization interface of the video encoder 1009 as a parameter for video coding the target replacement image 1006 .
  • the embodiments of the present disclosure may determine corresponding adaptive quantization parameters based on the attention region detection result of the target frame, so as to perform adaptive quantization adjustment and improve the efficiency of the video coding process.
  • the embodiments of the present disclosure determine the background image and the foreground image of the target frame, and update the target frame by displaying the foreground image in the attention area, and displaying the target replacement image of the background image in the non-attention area, thereby reducing the code rate of the entire video frame to be processed, Coding noise generated during the subsequent encoding process is reduced.
  • the embodiments of the present disclosure perform attention region detection after each frame in the frame sequence is down-sampled, thereby improving the efficiency of the attention region detection process and realizing real-time attention region detection.
  • with the embodiments of the present disclosure, it is possible to identify the area of interest to the human eye in real time, and to use the limited bit rate to protect the quality of the attention area.
  • even when the total bit rate of the video decreases, the subjective quality can be kept unchanged, thereby saving network bandwidth. From the user's point of view, it can also save the traffic required to download videos and reduce video delays, thereby improving user experience. From the perspective of video service providers, it can save video storage space and transmission bandwidth, thereby reducing server costs.
  • FIG. 11 is a schematic diagram of a video processing device provided by an embodiment of the present disclosure.
  • the device includes: a sequence determination module 110 configured to determine a target frame sequence in the video to be processed according to the order of the time axis, the target frame sequence including: a target frame and at least one reference frame within a preset length range from the target frame; an attention area detection module 111 configured to perform attention area detection according to the target frame sequence and obtain a target detection image for characterizing the attention area and non-attention area in the target frame; an image determination module 112 configured to determine a corresponding background image and foreground image according to the target frame; an image fusion module 113 configured to perform transparency fusion on the background image and the foreground image according to the target detection image to obtain a target replacement image, the attention area of the target replacement image being the foreground image and the non-attention area of the target replacement image being at least part of the background image; and an image updating module 114 configured to update the target frame with the target replacement image.
  • the attention area detection module includes: a first processing submodule configured to perform the first image processing on the target frame sequence to obtain a feature tensor, the feature tensor being used to characterize the image features of the target frame and each reference frame in the target frame sequence; a detection submodule configured to input the feature tensor into the trained neural network for attention region detection, determine the attention area in the target frame by comparing the target frame with each of the reference frames, and output the first detection image for characterizing the attention area and the non-attention area in the target frame, the non-attention area being the area other than the attention area; and a second processing submodule configured to perform the second image processing on the first detection image to obtain a target detection image with the same resolution as the target frame.
  • the first processing submodule includes: a downsampling unit configured to downsample each frame in the target frame sequence by a predetermined multiple; a feature tensor determination unit configured to For each frame after downsampling, the feature tensor is determined.
  • the feature tensor includes a four-dimensional feature tensor, and the four dimensions of the feature tensor are timing, channel, height, and width of a corresponding frame, respectively.
  • the second processing submodule includes: an upsampling unit configured to upsample the first detection image by the predetermined multiple to obtain a second detection image with the same resolution as the target frame; a pooling unit configured to perform maximum pooling on the second detection image with a window of a preset size and step size to obtain the target detection image.
  • the neural network is a MobileNetV3 neural network.
  • the image determination module includes: a background determination submodule configured to blur the target frame to obtain a background image; a foreground determination submodule configured to sharpen the target frame to obtain a foreground image.
  • the image fusion module includes: a channel determination submodule configured to determine a transparency channel according to the target detection image; an image fusion submodule configured to perform transparency fusion on the background image and the foreground image according to the transparency channel, to obtain a target replacement image in which the foreground image is displayed at the position of the attention area and the background image is displayed at positions other than the attention area.
  • the sequence determination module includes: a queue insertion submodule configured to sequentially add each frame in the video to be processed to a preset first-in-first-out queue according to the time axis; a sequence determination submodule configured to, in response to each position in the queue being occupied, determine the target frame sequence by using the frame in the middle of the queue as the target frame and the frames at other positions as reference frames.
  • the device further includes: a queue update module configured to, in response to the target frame being updated, pop the frame stored at the first position in the queue and push the next frame of the video to be processed into the queue.
  • the device further includes: a parameter determination module configured to determine an adaptive quantization parameter corresponding to the target detection image; a data transmission module configured to input the updated target frame and the corresponding adaptive quantization parameter into a video encoder, so that video coding is performed on the target frame based on the corresponding adaptive quantization parameter.
  • the parameter determination module includes: a histogram statistics submodule configured to perform histogram statistics on the target detection image to obtain a corresponding histogram mapping table; the first parameter determination submodule is configured to map the target detection image according to the histogram mapping table to obtain corresponding preliminary quantization parameters; the second parameter determination submodule is configured to down-sample the preliminary quantization parameters to obtain adaptive quantization parameters.
  • the data transmission module includes: a data transmission submodule configured to input the updated target frame into the video encoder as an input frame, and input the adaptive quantization parameter into the adaptive quantization interface of the video encoder.
  • the functions or modules included in the device provided by the embodiments of the present disclosure can be configured to execute the methods described in the above method embodiments; for specific implementation, refer to the descriptions of the above method embodiments, which are not repeated here for brevity.
  • Embodiments of the present disclosure also provide a computer-readable storage medium, on which computer program instructions are stored, and the above-mentioned method is implemented when the computer program instructions are executed by a processor.
  • Computer readable storage media may be volatile or nonvolatile computer readable storage media.
  • An embodiment of the present disclosure also proposes an electronic device, including: a processor; a memory configured to store processor-executable instructions; wherein the processor is configured to call the instructions stored in the memory to execute some or all of the steps of the above method.
  • An embodiment of the present disclosure also proposes a computer program, the computer program includes computer-readable code, and when the computer-readable code is read and executed by a computer, some or all of the steps of the method in any embodiment of the present disclosure are implemented.
  • An embodiment of the present disclosure also provides a computer program product, including computer-readable code or a non-volatile computer-readable storage medium carrying computer-readable code; when the computer-readable code runs in a processor of an electronic device, the processor in the electronic device executes some or all of the steps of the above method.
  • Electronic devices may be provided as terminals, servers, or other forms of devices.
  • Fig. 12 is a block diagram of an electronic device provided by an embodiment of the present disclosure.
  • the electronic device 1200 may be a terminal such as a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, or a personal digital assistant.
  • electronic device 1200 may include one or more of the following components: processing component 1202, memory 1204, power supply component 1206, multimedia component 1208, audio component 1210, input/output (I/O) interface 1212, sensor component 1214, and communication component 1216.
  • the processing component 1202 generally controls the overall operations of the electronic device 1200, such as those associated with display, telephone calls, data communications, camera operations, and recording operations.
  • the processing component 1202 may include one or more processors 1220 for executing instructions to complete all or part of the steps of the above-mentioned method.
  • processing component 1202 may include one or more modules that facilitate interaction between processing component 1202 and other components.
  • processing component 1202 may include a multimedia module to facilitate interaction between multimedia component 1208 and processing component 1202 .
  • the memory 1204 is configured to store various types of data to support operations at the electronic device 1200 . Examples of such data include instructions for any application or method operating on the electronic device 1200, such as contact data, phonebook data, messages, pictures, videos, and the like.
  • the memory 1204 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable Programmable Read Only Memory (EPROM), Programmable Read Only Memory (PROM), Read Only Memory (ROM), Magnetic Memory, Flash Memory, Magnetic or Optical Disk.
  • the power supply component 1206 provides power to various components of the electronic device 1200 . Power supply components 1206 may include a power management system, one or more power supplies, and other components associated with managing and distributing power generated for electronic device 1200 .
  • the multimedia component 1208 includes a screen providing an output interface between the electronic device 1200 and the user.
  • the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user.
  • the touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor can not only sense the boundary of the touch or slide action, but also detect the duration and pressure associated with the touch or slide operation.
  • multimedia component 1208 includes a front camera and/or a rear camera. When the electronic device 1200 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and each rear camera can be a fixed optical lens system or have focal length and optical zoom capability.
  • the audio component 1210 is configured to output and/or input audio signals.
  • the audio component 1210 includes a microphone (MIC), which is configured to receive an external audio signal when the electronic device 1200 is in an operation mode, such as a call mode, a recording mode and a voice recognition mode. Received audio signals may be stored in memory 1204 or sent via communication component 1216 .
  • the audio component 1210 also includes a speaker for outputting audio signals.
  • the I/O interface 1212 provides an interface between the processing component 1202 and a peripheral interface module, which may be a keyboard, a click wheel, a button, and the like. These buttons may include, but are not limited to: a home button, volume buttons, start button, and lock button.
  • Sensor assembly 1214 includes one or more sensors for providing various aspects of status assessment for electronic device 1200 .
  • the sensor component 1214 can detect the open/closed state of the electronic device 1200 and the relative positioning of components, such as the display and keypad of the electronic device 1200; the sensor component 1214 can also detect a change in the position of the electronic device 1200 or one of its components, the presence or absence of user contact with the electronic device 1200, and the orientation, acceleration or deceleration of the electronic device 1200 or temperature changes of the electronic device 1200.
  • Sensor assembly 1214 may include a proximity sensor configured to detect the presence of nearby objects in the absence of any physical contact.
  • the sensor assembly 1214 may also include an optical sensor, such as a complementary metal-oxide-semiconductor (CMOS) or charge-coupled device (CCD) image sensor, for use in imaging applications.
  • the sensor component 1214 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor or a temperature sensor.
  • the communication component 1216 is configured to facilitate wired or wireless communication between the electronic device 1200 and other devices.
  • the electronic device 1200 can access a wireless network based on a communication standard, such as a wireless network (WiFi), a fourth generation mobile communication technology (4G) or a fifth generation mobile communication technology (5G), or a combination thereof.
  • the communication component 1216 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel.
  • the communication component 1216 also includes a near field communication (NFC) module to facilitate short-range communication.
  • the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wide Band (UWB) technology, Bluetooth (BT) technology and other technologies.
  • electronic device 1200 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors or other electronic components for performing the methods described above.
  • a non-volatile computer-readable storage medium, such as a memory 1204 including computer program instructions, which can be executed by the processor 1220 of the electronic device 1200 to complete some or all of the steps of the above method.
  • Fig. 13 is a block diagram of another electronic device provided by an embodiment of the present disclosure.
  • The electronic device 1300 may be provided as a server.
  • The electronic device 1300 includes a processing component 1322, which includes one or more processors, and memory resources represented by a memory 1332 for storing instructions executable by the processing component 1322, such as application programs.
  • The application programs stored in the memory 1332 may include one or more modules each corresponding to a set of instructions.
  • The processing component 1322 is configured to execute the instructions to perform the above method.
  • The electronic device 1300 may also include a power component 1326 configured to perform power management of the electronic device 1300, a wired or wireless network interface 1350 configured to connect the electronic device 1300 to a network, and an input/output (I/O) interface 1358.
  • The electronic device 1300 can operate based on an operating system stored in the memory 1332, such as Microsoft's server operating system (Windows Server™), Apple's graphical user interface-based operating system (Mac OS X™), the multi-user, multi-process computer operating system (Unix™), a free and open-source Unix-like operating system (Linux™), an open-source Unix-like operating system (FreeBSD™), or the like.
  • A non-volatile computer-readable storage medium is also provided, such as the memory 1332 including computer program instructions, which can be executed by the processing component 1322 of the electronic device 1300 to implement the above method.
  • The present disclosure can be a system, a method, and/or a computer program product.
  • A computer program product may include a computer-readable storage medium having computer-readable program instructions thereon for causing a processor to implement various aspects of the present disclosure.
  • A computer-readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device, and may be a volatile or non-volatile storage medium.
  • A computer-readable storage medium may be, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • Examples (a non-exhaustive list) of computer-readable storage media include: portable computer diskettes, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), digital versatile discs (DVD), memory sticks, floppy disks, mechanically encoded devices such as punched cards or raised structures in grooves with instructions stored thereon, and any suitable combination of the foregoing.
  • Computer-readable storage media as used herein are not to be interpreted as transient signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., light pulses through fiber-optic cables), or electrical signals transmitted through wires.
  • The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to respective computing/processing devices, or downloaded to an external computer or external storage device over a network, such as the Internet, a local area network, a wide area network, and/or a wireless network.
  • The network may include copper transmission cables, optical fiber transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers.
  • A network adapter card or network interface in each computing/processing device receives the computer-readable program instructions from the network and forwards them for storage in a computer-readable storage medium within the respective computing/processing device.
  • Computer program instructions for performing the operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages.
  • Computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server.
  • The remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, through the Internet using an Internet service provider).
  • An electronic circuit, such as a programmable logic circuit, a field-programmable gate array (FPGA), or a programmable logic array (PLA), may be personalized with state information of the computer-readable program instructions, and the electronic circuit may execute the computer-readable program instructions to implement various aspects of the present disclosure.
  • These computer-readable program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing apparatus, produce an apparatus for implementing the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
  • These computer-readable program instructions can also be stored in a computer-readable storage medium; these instructions cause computers, programmable data processing devices, and/or other devices to work in a specific way, so that the computer-readable medium storing the instructions includes an article of manufacture comprising instructions for implementing various aspects of the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
  • Each block in a flowchart or block diagram may represent a module, a program segment, or a portion of instructions, which contains one or more executable instructions for implementing the specified logical function.
  • The functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks in succession may, in fact, be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved.
  • Each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions.
  • The computer program product can be implemented by means of hardware, software, or a combination thereof.
  • The computer program product is embodied as a computer storage medium in one optional embodiment, and in another optional embodiment, the computer program product is embodied as a software product, such as a software development kit (SDK).

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure relates to a video processing method and apparatus, an electronic device, a storage medium, a computer program, and a computer program product. The method includes: determining, in time-axis order, a target frame sequence in a video to be processed, the sequence including a target frame and at least one adjacent reference frame; performing attention area detection on the target frame sequence to obtain a target detection image that distinguishes the attention area from the non-attention area of the target frame; performing transparency fusion, based on the target detection image, on a background image and a foreground image determined from the target frame, to obtain a target substitute image that displays the foreground image in the attention area and at least part of the background image in the non-attention area; and updating the target frame with the target substitute image. Updating the target frame with such a substitute image reduces the bit rate of the whole video to be processed and thereby reduces the coding noise generated in the subsequent encoding process.

Description

Video processing method and apparatus, electronic device, storage medium, computer program, and computer program product
CROSS-REFERENCE TO RELATED APPLICATION
The present disclosure is based on, and claims priority to, Chinese Patent Application No. 202110963126.9, filed on August 20, 2021 and entitled "Video processing method and apparatus, electronic device and storage medium", the entire contents of which are incorporated herein by reference.
TECHNICAL FIELD
The present disclosure relates to the field of computer technology, and in particular to a video processing method and apparatus, an electronic device, a storage medium, a computer program, and a computer program product.
BACKGROUND
In the field of video processing, identifying the attention areas in a video and adjusting the bit rate accordingly is crucial for improving the subjective quality of the video and the viewer's visual experience. However, identifying attention areas quickly and accurately is challenging. Moreover, during video encoding, simply raising the bit rate of the attention areas while lowering the bit rate of the non-attention areas introduces coding noise in the non-attention areas at low bit rates.
SUMMARY
The present disclosure provides a video processing method and apparatus, an electronic device, a storage medium, a computer program, and a computer program product, aiming to identify the attention areas in a video quickly and accurately and to reduce the coding noise generated during video encoding.
An embodiment of the present disclosure provides a video processing method, the method including:
determining, in time-axis order, a target frame sequence in a video to be processed, the target frame sequence including: a target frame and at least one reference frame within a preset length from the target frame;
performing attention area detection according to the target frame sequence to obtain a target detection image characterizing the attention area and the non-attention area in the target frame;
determining a corresponding background image and foreground image according to the target frame;
performing transparency fusion on the background image and the foreground image according to the target detection image to obtain a target substitute image, the attention area of the target substitute image being the foreground image and the non-attention area of the target substitute image being at least part of the background image; and
updating the target frame with the target substitute image.
An embodiment of the present disclosure provides a video processing apparatus, the apparatus including:
a sequence determination module configured to determine, in time-axis order, a target frame sequence in a video to be processed, the target frame sequence including: a target frame and at least one reference frame within a preset length from the target frame;
an attention area detection module configured to perform attention area detection according to the target frame sequence to obtain a target detection image characterizing the attention area and the non-attention area in the target frame;
an image determination module configured to determine a corresponding background image and foreground image according to the target frame;
an image fusion module configured to perform transparency fusion on the background image and the foreground image according to the target detection image to obtain a target substitute image, the attention area of the target substitute image being the foreground image and the non-attention area of the target substitute image being at least part of the background image; and
an image update module configured to update the target frame with the target substitute image.
An embodiment of the present disclosure provides an electronic device, including: a processor; and a memory configured to store processor-executable instructions; wherein the processor is configured to invoke the instructions stored in the memory to perform some or all of the steps of the above method.
An embodiment of the present disclosure provides a computer-readable storage medium having computer program instructions stored thereon, which, when executed by a processor, implement some or all of the steps of the above method.
An embodiment of the present disclosure provides a computer program including computer-readable code which, when read and executed by a computer, implements some or all of the steps of the method in any embodiment of the present disclosure.
An embodiment of the present disclosure provides a computer program product including a non-transitory computer-readable storage medium storing a computer program which, when read and executed by a computer, implements some or all of the steps of the method in any embodiment of the present disclosure.
In the embodiments of the present disclosure, the background image and the foreground image of the target frame can be determined, and the target frame is updated with a target substitute image that displays the foreground image in the attention area and the background image in the non-attention area, which reduces the bit rate of the whole video to be processed and thereby reduces the coding noise generated in the subsequent encoding process.
It should be understood that the above general description and the following detailed description are exemplary and explanatory only and do not limit the present disclosure. Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments with reference to the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the technical solutions of the present disclosure.
Fig. 1 is a flowchart of a video processing method provided by an embodiment of the present disclosure;
Fig. 2 is a schematic diagram of determining a target frame sequence provided by an embodiment of the present disclosure;
Fig. 3A is a flowchart of an attention area detection process provided by an embodiment of the present disclosure;
Fig. 3B is a schematic flowchart of obtaining a first detection image provided by an embodiment of the present disclosure;
Fig. 4 is a schematic diagram of a second image processing process provided by an embodiment of the present disclosure;
Fig. 5 is a schematic diagram of an attention area detection process provided by an embodiment of the present disclosure;
Fig. 6 is a schematic diagram of a target detection image provided by an embodiment of the present disclosure;
Fig. 7 is a schematic diagram of a process of determining a target substitute image provided by an embodiment of the present disclosure;
Fig. 8 is a schematic diagram of a transparency fusion process provided by an embodiment of the present disclosure;
Fig. 9 is a schematic diagram of a process of determining adaptive quantization parameters provided by an embodiment of the present disclosure;
Fig. 10A is a schematic diagram of a data transmission process provided by an embodiment of the present disclosure;
Fig. 10B is a schematic diagram of another data transmission process provided by an embodiment of the present disclosure;
Fig. 11 is a schematic diagram of a video processing apparatus provided by an embodiment of the present disclosure;
Fig. 12 is a block diagram of an electronic device provided by an embodiment of the present disclosure;
Fig. 13 is a block diagram of another electronic device provided by an embodiment of the present disclosure.
DETAILED DESCRIPTION
Various exemplary embodiments, features, and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. The same reference numerals in the drawings denote elements with the same or similar functions. Although various aspects of the embodiments are shown in the drawings, the drawings are not necessarily drawn to scale unless otherwise indicated.
The word "exemplary" as used herein means "serving as an example, embodiment, or illustration". Any embodiment described herein as "exemplary" is not necessarily to be construed as superior to or better than other embodiments.
The term "and/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate: A alone, both A and B, or B alone. In addition, the term "at least one" herein indicates any one of multiple items or any combination of at least two of multiple items; for example, including at least one of A, B, and C may indicate including any one or more elements selected from the set consisting of A, B, and C.
In addition, numerous specific details are given in the following detailed description to better explain the present disclosure. Those skilled in the art will understand that the present disclosure can be practiced without certain specific details. In some instances, methods, means, elements, and circuits well known to those skilled in the art are not described in detail, so as to highlight the gist of the present disclosure.
Fig. 1 is a flowchart of a video processing method provided by an embodiment of the present disclosure. The video processing method may be performed by a terminal device or another processing device, where the terminal device may be user equipment (UE), a mobile device, a user terminal, a terminal, a cellular phone, a cordless phone, a personal digital assistant (PDA), a handheld device, a computing device, an in-vehicle device, a wearable device, or the like. In some possible implementations, the video processing method may be implemented by a processor invoking computer-readable instructions stored in a memory.
In an exemplary application scenario, the video processing method of the embodiments of the present disclosure may be performed on a predetermined video to be processed, so that each frame therein is updated into an image whose sharpness differs between the attention area and the non-attention area, an adaptive quantization parameter corresponding to each frame is determined, and video encoding is performed with the frames of the video to be processed and the corresponding adaptive quantization parameters. In some embodiments, the video processing method and the video encoding may be performed by the same device, or the video processing method may first be performed by a terminal device or another device and the result then transmitted to a video encoder for video encoding.
As shown in Fig. 1, the video processing method of the embodiment of the present disclosure includes the following steps:
Step S10: determining, in time-axis order, a target frame sequence in a video to be processed.
In a possible implementation, the embodiment of the present disclosure may perform the video processing method by processing each frame of the video to be processed separately. That is, each frame of the video to be processed may be taken as a target frame for image processing, so as to determine a target substitute image for replacing that target frame. After the image processing of the current target frame is completed, an unprocessed frame in the video to be processed is determined as a new target frame, until the image processing of all frames of the video to be processed is completed, thereby completing the video processing of the video to be processed. In some embodiments, to improve video processing efficiency, the processing order of the target frames may be determined sequentially in time-axis order.
In some embodiments, the video to be processed is essentially a sequence of frames, each of which records a piece of static image information. Therefore, to perform video processing on the target frame, multiple frames within a preset length from the target frame need to be acquired as reference frames, so that attention area detection can be performed by comparing the image content of the reference frames with that of the target frame, the attention area in the target frame can be determined, and video processing can be carried out.
That is, when the video processing method of the embodiment of the present disclosure is performed, target frame sequences are first determined in the video to be processed, one after another in time-axis order; a target frame sequence includes a target frame and at least one reference frame within a preset length from the target frame. The "preset length" may be a preset number of frames. Where there are not enough frames of the preset length before or after the target frame, i.e., when the target frame is among the first few frames at the beginning of the video or the last few frames at its end, the corresponding frame sequence need not be determined; instead, the attention area recognition result of an adjacent reference frame may be directly taken as the attention area recognition result of the current target frame.
Take a video to be processed that includes frames T1-T10 as an example. With a preset length of 2, the target frame sequences may be determined in time-axis order as (T1, T2, T3, T4, T5), (T2, T3, T4, T5, T6), (T3, T4, T5, T6, T7), (T4, T5, T6, T7, T8), (T5, T6, T7, T8, T9), and (T6, T7, T8, T9, T10). The target frames corresponding to these sequences are T3, T4, T5, T6, T7, and T8 in turn, and attention area detection may be performed on each target frame sequence in turn to obtain the attention area recognition result of the corresponding target frame. In some embodiments, the recognition result of T3 may be used as the recognition result of T1 and T2, and the recognition result of T8 may be used as the recognition result of T9 and T10.
In a possible implementation, the process of acquiring the target frame sequence in the embodiment of the present disclosure may include: adding the frames of the video to be processed one by one, in time-axis order, to a preset first-in-first-out queue; and, in response to all positions in the queue being occupied, taking the frame at the middle position of the queue as the target frame of the target frame sequence and the frames at the other positions as the reference frames of the target frame sequence, thereby determining the target frame sequence. That is, a first-in-first-out queue of fixed length is preset, and the frames are added to the queue one by one according to their order on the time axis of the video to be processed, each frame occupying one position in the queue. When all positions in the queue are occupied, i.e., each position stores one frame of the video to be processed, the frame at the middle position of the queue is taken as the target frame, the frames at the other positions are taken as reference frames, and the target frame sequence is determined from the target frame and the corresponding reference frames. The middle position denotes the single middlemost position of the queue, or a predetermined one of the two middlemost positions. For example, when the queue length is odd, the queue has exactly one middlemost position, and the frame at that position is determined as the target frame. When the queue length is even, the queue has two middle positions, and the frame stored at the earlier of the two may be determined as the target frame.
In some embodiments, the length of the first-in-first-out queue may be the total number of target and reference frames, i.e., the preset length multiplied by two, plus one. Here, the preset length is the preset length indicated by the preset length range and may be a preset number of frames. For example, when the preset length is 2, the length of the first-in-first-out queue is 5. In some embodiments, after the processing of the target frame based on the current target frame sequence is completed, the frame stored at the first position of the queue is popped, and the next frame of the video to be processed is pushed into the queue.
Fig. 2 is a schematic diagram of determining a target frame sequence provided by an embodiment of the present disclosure. As shown in Fig. 2, when determining a target frame sequence, the frames of the video to be processed 20 are added one by one, in time-axis order 21, to a preset first-in-first-out queue 22. In a possible implementation, when the positions in the queue 22 are occupied in turn by frames T-2, T-1, T, T+1, and T+2, the frame T currently at the middle position of the queue 22 is taken as the target frame and the frames T-2, T-1, T+1, and T+2 at the other positions as reference frames, determining the target frame sequence (T-2, T-1, T, T+1, T+2), so that video processing such as attention area recognition can be performed on the target frame T based on the target frame sequence (T-2, T-1, T, T+1, T+2). The target frame sequence may be determined when frame T-3 is popped and frame T+2 is pushed into the queue. In some embodiments, after the processing of the target frame T corresponding to the current target frame sequence (T-2, T-1, T, T+1, T+2) is completed, the frame T-2 that entered the queue 22 earliest is popped from the first position of the current queue 22, and the frame T+3, which follows the frame T+2 at the last position of the current queue 22 on the time axis, is pushed into the queue 22, so that the frames at the other positions of the queue 22 move forward by one position.
In an optional implementation, the above sequential way of determining target frame sequences can improve the efficiency of the whole video processing process. Meanwhile, determining a target frame sequence that includes both reference frames and the target frame, and performing attention area detection based on the target frame sequence, improves the accuracy of the attention area recognition result of the target frame.
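For illustration only (an editor's sketch, not part of the original disclosure), the sliding-window logic described above can be expressed in a few lines of Python; the function name frame_sequences and the preset length of 2 are assumptions of the example:

    from collections import deque

    def frame_sequences(frames, preset_len=2):
        """Yield (target_frame, sequence) pairs using a fixed-length FIFO queue.

        The queue holds 2 * preset_len + 1 frames; once every position is
        occupied, the frame at the middle position is the target frame and
        the frames at the other positions are the reference frames.
        """
        window = deque(maxlen=2 * preset_len + 1)  # oldest frame is evicted automatically
        for frame in frames:
            window.append(frame)
            if len(window) == window.maxlen:
                sequence = list(window)
                yield sequence[preset_len], sequence  # middle frame is the target

For frames T1-T10 with preset_len=2, this yields targets T3 through T8, matching the sequences (T1, ..., T5) through (T6, ..., T10) in the example above.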
Step S20: performing attention area detection according to the target frame sequence to obtain a target detection image characterizing the attention area and the non-attention area in the target frame.
In a possible implementation, the target detection image may be determined by performing attention area detection on the target frame sequence; the target detection image is an image characterizing the attention area and the non-attention area in the target frame. In some embodiments, the attention area may be determined by comparing the image content of the target frame with that of the reference frames in the target frame sequence. In some embodiments, the attention area is the area of the target frame on which a human would focus, for example, a moving area against a relatively static background, or an area containing a specific contour. The moving area against a relatively static background may be, for example, the area where the ball is located in a football match video, and the area containing a specific contour may be, for example, the area where a face contour is located in a face recognition scenario. Alternatively, when the background is of primary interest, the attention area may also be an area other than the moving area.
Fig. 3A is a flowchart of an attention area detection process provided by an embodiment of the present disclosure. As shown in Fig. 3A, in a possible implementation, the process of performing attention area detection on the target frame sequence to obtain a target detection image characterizing the attention area and the non-attention area of the target frame may include the following steps:
Step S21: performing first image processing on the target frame sequence to obtain a feature tensor.
In a possible implementation, first image processing is performed on the target frame sequence to obtain a feature tensor that characterizes the image features of the target frame and of each reference frame in the target frame sequence, each target frame sequence corresponding to one feature tensor. In some embodiments, this first image processing converts each frame of the target frame sequence from a high-resolution image into a low-resolution image, which helps improve the speed and efficiency of the subsequent attention area detection.
In some embodiments, the first image processing may include: downsampling each frame of the target frame sequence by a predetermined factor, and determining the feature tensor from the downsampled frames. That is, a factor is preset, each frame of the target frame sequence is reduced by the predetermined factor by downsampling, and the feature tensor is then determined from the reduced frames. In some embodiments, any downsampling method may be used, such as nearest-neighbor interpolation, bilinear interpolation, mean interpolation, or median interpolation, which is not limited here.
In a possible implementation, when the embodiment of the present disclosure is applied to a video encoding scenario, the predetermined factor may be set according to the macroblock size used in the encoding process, in order to improve the efficiency of the subsequent video encoding. For example, when the macroblock size is 16×16, the predetermined factor is set to 16, i.e., each frame is reduced 16-fold by downsampling to obtain macroblock-level frames.
In a possible implementation, the feature tensor determined from the downsampled frames is a four-dimensional feature tensor whose four dimensions are the temporal order, channels, height, and width of the corresponding frames. In some embodiments, the temporal order may be determined from the time-axis order of the frames in the video to be processed, the channels from the number of color channels of each frame, and the height and width from the resolution of each frame. A four-dimensional feature tensor is suitable as input data for lightweight neural networks such as the MobileNetV3 neural network described below.
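As a hedged sketch of this first image processing (the factor of 16 and the use of bilinear interpolation are merely one of the options named above, and build_feature_tensor is an illustrative name, not from the disclosure):

    import cv2
    import numpy as np

    def build_feature_tensor(sequence, factor=16):
        """Downsample each frame of the target frame sequence by `factor` and
        stack the results into a 4-D tensor (time, channels, height, width)."""
        downsampled = []
        for frame in sequence:  # frame: H x W x C uint8 image
            h, w = frame.shape[:2]
            small = cv2.resize(frame, (w // factor, h // factor),
                               interpolation=cv2.INTER_LINEAR)
            downsampled.append(small.transpose(2, 0, 1))  # HWC -> CHW
        return np.stack(downsampled).astype(np.float32) / 255.0  # T x C x H x W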
Step S22: inputting the feature tensor into a trained neural network for attention area detection.
In a possible implementation, the feature tensor is input into a trained neural network for attention area detection, so that the attention area in the target frame is determined by comparing the image content of the target frame with that of each reference frame, and a first detection image characterizing the attention area and the non-attention area is output. In some embodiments, the resolution of the first detection image is the same as that of the downsampled frames. For example, when a moving area is determined as the attention area, object recognition may first determine multiple object areas contained in the target frame and the reference frames; the positions of the object areas of the same object in the target frame and the reference frames are then compared, and the attention area is determined from the positions, in the target frame, of the object areas of objects whose positional change exceeds a preset threshold.
In a possible implementation, the neural network performing the attention area detection is a lightweight neural network. In some embodiments, the neural network may be a MobileNetV3 neural network, which in turn includes: a starting part, a middle part, and a final part. The starting part includes a convolutional layer with a 3×3 kernel for feature extraction, the middle part includes 11 or 15 bneck modules, and the final part includes an average pooling layer and a convolutional layer with a 1×1 kernel. A bneck module includes a channel-separable convolution and a channel attention mechanism connected in sequence, and reduces data loss during convolution by means of residual connections.
Fig. 3B is a schematic flowchart of obtaining a first detection image provided by an embodiment of the present disclosure. As shown in Fig. 3B, taking a first-in-first-out queue of length 5 as an example, frame T of the video to be processed may be pushed into the first-in-first-out queue 401; when frame T is pushed into the queue 401, frame T-5 is popped, so that the queue 401 stores frames T, T-1, T-2, T-3, and T-4. In implementation, each frame in the queue 401 may be downsampled, a feature vector 402 is obtained from the downsampled frames, the feature vector 402 is input into the MobileNetV3 neural network 403, and the MobileNetV3 neural network 403 outputs a first detection image 404.
Because the structural characteristics of the MobileNetV3 neural network allow it to reduce the amount of computation while improving the accuracy of the results, the embodiment of the present disclosure, based on this neural network, can perform attention area detection in real time on low-resolution input, improving both the detection speed and the accuracy of the detection results.
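The disclosure does not give the exact network weights or layer list, but a simplified PyTorch sketch of a single bneck-style block (pointwise expansion, depthwise convolution, channel attention, pointwise projection, residual connection) conveys the structure described above; all names and hyperparameters here are illustrative assumptions:

    import torch.nn as nn

    class Bneck(nn.Module):
        """Simplified MobileNetV3-style block: expand -> depthwise conv ->
        squeeze-and-excitation (channel attention) -> project, with a
        residual connection since input and output shapes match."""
        def __init__(self, channels, expand=4):
            super().__init__()
            mid = channels * expand
            self.expand = nn.Sequential(nn.Conv2d(channels, mid, 1, bias=False),
                                        nn.BatchNorm2d(mid), nn.Hardswish())
            self.depthwise = nn.Sequential(
                nn.Conv2d(mid, mid, 3, padding=1, groups=mid, bias=False),
                nn.BatchNorm2d(mid), nn.Hardswish())
            self.se = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                    nn.Conv2d(mid, mid // 4, 1), nn.ReLU(),
                                    nn.Conv2d(mid // 4, mid, 1), nn.Hardsigmoid())
            self.project = nn.Sequential(nn.Conv2d(mid, channels, 1, bias=False),
                                         nn.BatchNorm2d(channels))

        def forward(self, x):
            y = self.depthwise(self.expand(x))
            y = y * self.se(y)          # channel attention reweights channels
            return x + self.project(y)  # residual connection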
Step S23: performing second image processing on the first detection image to obtain a target detection image with the same resolution as the target frame.
In a possible implementation, second image processing is performed on the first detection image to obtain a target detection image with the same resolution as the target frame. The second image processing restores the first detection image to the original size of the target frame, so that the target frame can be processed and updated based on the resulting target detection image. In a possible implementation, the second image processing of the first detection image includes: upsampling the first detection image by the predetermined factor to obtain a second detection image with the same resolution as the target frame, and performing max pooling on the second detection image with a window of preset size and a preset stride to obtain the target detection image.
In some embodiments, upsampling the first detection image by the same predetermined factor as the above downsampling restores the resolution of the first detection image to that of the target frame, yielding the second detection image. Any upsampling method may be used for the first detection image, such as nearest-neighbor interpolation, bilinear interpolation, mean interpolation, or median interpolation, which is not limited here. Bicubic interpolation may of course also be chosen for the upsampling, to improve the quality of the final image.
In a possible implementation, the window size for max pooling of the second detection image may be determined from the upsampling ratio, i.e., it is the same as the above predetermined factor. For example, when the predetermined factor is 16, the max pooling window size may be determined as 16×16. Meanwhile, so that the size of the target detection image obtained after max pooling does not change, the stride of the max pooling may be preset to 1. When the embodiment of the present disclosure is applied to a video encoding scenario, this way of determining the max pooling window size can improve the efficiency of the subsequent video encoding.
Fig. 4 is a schematic diagram of a second image processing process provided by an embodiment of the present disclosure. As shown in Fig. 4, after the neural network performs attention area detection on the feature tensor to obtain the first detection image 40, the embodiment of the present disclosure first restores, by upsampling, the resolution of the first detection image to that of the target frame, obtaining the second detection image 41. Meanwhile, the texture features of the second detection image 41 are extracted by max pooling, obtaining a target detection image 42 that clearly distinguishes the attention area from the non-attention area, which facilitates subsequent image processing.
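A hedged sketch of the second image processing: stride-1 max pooling over a k×k window that keeps the image size unchanged is equivalent to grayscale dilation with a k×k structuring element, which OpenCV provides directly (the factor of 16 and the bicubic upsampling are assumptions consistent with the examples above):

    import cv2
    import numpy as np

    def second_image_processing(first_detection, factor=16, window=16):
        """Upsample the low-resolution detection map back to the target
        frame's resolution, then apply stride-1 max pooling (realized as
        dilation) so the output resolution is unchanged."""
        h, w = first_detection.shape
        second = cv2.resize(first_detection, (w * factor, h * factor),
                            interpolation=cv2.INTER_CUBIC)  # bicubic upsampling
        kernel = np.ones((window, window), np.uint8)
        return cv2.dilate(second, kernel)  # max over each window x window patch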
Fig. 5 is a schematic diagram of an attention area detection process provided by an embodiment of the present disclosure. As shown in Fig. 5, the process of performing attention area detection on the target frame in the embodiment of the present disclosure is: first determine the target frame sequence 50 corresponding to the target frame, and downsample each frame of the target frame sequence 50 in the first image processing to obtain the feature vector 51. Inputting the low-resolution feature vector 51 into the trained neural network 52 quickly and accurately yields the low-resolution first detection image 53. Performing the second image processing on the first detection image 53 yields the target detection image 54, which has clear texture features and high resolution. This attention area detection process improves both the detection efficiency and the accuracy of the detection results.
Fig. 6 is a schematic diagram of a target detection image provided by an embodiment of the present disclosure. As shown in Fig. 6, the target detection image corresponding to the target frame has the same resolution as the target frame, and the value of each pixel is a number between 0 and 1. Each value characterizes the probability that the corresponding pixel belongs to the attention area; for example, a pixel with value 1 is a pixel of the attention area, and a pixel with value 0 is a pixel of the non-attention area.
Step S30: determining a corresponding background image and foreground image according to the target frame.
In a possible implementation, image processing is performed on the target frame in two different ways to obtain a background image and a foreground image with different visual effects: blurring the target frame yields the background image, and sharpening the target frame yields the foreground image. In some embodiments, the blurring of the target frame may use any image blurring method, such as Gaussian blur, salt-and-pepper blur, motion blur, or occlusion blur, which is not limited here.
In some embodiments, the sharpening of the target frame may use any image sharpening method, such as Sobel operator sharpening, Laplacian operator sharpening, Prewitt operator sharpening, or Canny operator sharpening, which is not limited here. The embodiments of the present disclosure may determine the foreground image and the background image by different processing methods, so that the foreground image and the background image are fused based on the attention area; this enhances the image contours of the attention area to improve its sharpness, lowers the image sharpness of the non-attention area, and improves the visual experience of the finally processed image.
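As a sketch of step S30, assuming Gaussian blur for the background and Laplacian-operator sharpening for the foreground (two of the options listed above; the kernel sizes are illustrative):

    import cv2
    import numpy as np

    def background_foreground(target_frame):
        """Blur the target frame to obtain the background image and sharpen
        it to obtain the foreground image."""
        background = cv2.GaussianBlur(target_frame, (9, 9), 0)  # Gaussian blur
        lap = cv2.Laplacian(target_frame, cv2.CV_32F, ksize=3)  # Laplacian sharpening
        foreground = np.clip(target_frame.astype(np.float32) - lap,
                             0, 255).astype(np.uint8)
        return background, foreground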
Step S40: performing transparency fusion on the background image and the foreground image according to the target detection image to obtain a target substitute image.
In a possible implementation, the attention area of the target substitute image is the foreground image, and its non-attention area is the background image. Performing transparency fusion on the background image and the foreground image according to the target detection image to obtain the target substitute image includes: determining a transparency channel according to the target detection image, and performing transparency fusion on the background image and the foreground image according to the transparency channel, to obtain a target substitute image that displays the foreground image at the attention area and all or part of the background image at the non-attention area.
In some embodiments, the pixel values of the target detection image are remapped to the range 0-1 by normalizing the target detection image, yielding the corresponding transparency channel. Areas with pixel value 1 are attention areas, and areas with pixel values other than 1 are non-attention areas. In some embodiments, a pixel value of 1 characterizes a position with 0% transparency, a pixel value of 0 characterizes a position with 100% transparency, and pixel values between 0 and 1 characterize the probability of opacity.
In some embodiments, performing transparency fusion on the background image and the foreground image according to the transparency channel may include: adjusting the transparency of each pixel of the foreground image according to the probability characterized by the corresponding pixel value of the transparency channel, and then fusing the adjusted foreground image with the background image to obtain the target substitute image. The target substitute image displays the opaque foreground image at the attention area, where the background image is covered. In the non-attention area, since the transparency of the foreground image lies between 0 and 100%, all or part of the background image can be displayed. In some embodiments, in non-attention areas with pixel value 0, the transparency of the foreground image is 100% and the background image can be displayed directly; at pixel positions whose values are neither 0 nor 1, the transparency of the foreground image is adjusted according to the pixel value at that position, so that part of the foreground image and part of the background image are displayed simultaneously at that position.
Fig. 7 is a schematic diagram of a process of determining a target substitute image provided by an embodiment of the present disclosure. As shown in Fig. 7, blurring the target frame 70 yields the background image 71, and sharpening the target frame 70 yields the foreground image 72. Meanwhile, normalizing the target detection image 73 yields the transparency channel 74. By performing transparency fusion on the background image 71, the foreground image 72, and the transparency channel 74, the target substitute image 75 for replacing the target frame can be determined.
Fig. 8 is a schematic diagram of a transparency fusion process provided by an embodiment of the present disclosure. As shown in Fig. 8, when transparency fusion is performed on the foreground image 80, the background image 81, and the transparency channel 82, the foreground image 80 is taken as the top image layer and the background image 81 as the bottom image layer, and the foreground image 80 on the top layer is superimposed on the background image 81 on the bottom layer. In some embodiments, according to the transparency channel 82, the transparency of the attention area of the foreground image 80 is adjusted to 0% (i.e., adjusted to opaque), so that the attention area of the target substitute image 83 displays the foreground image 80 on the top layer; the transparency of the non-attention area of the foreground image 80 whose value is 0 is adjusted to 100%, so that the non-attention area of the target substitute image 83 displays the background image 81 on the bottom layer. For non-attention areas whose pixel values in the target detection image lie between 0 and 1, the transparency of the foreground image at each pixel position is adjusted according to the corresponding pixel value, so that the foreground image 80 and part of the background image 81 are displayed simultaneously; for example, when the pixel value is 0.8, the transparency of the foreground image 80 is adjusted to 20%.
In the embodiments of the present disclosure, transparency fusion allows a sharp foreground image to be displayed in the attention area and a blurred background image in the non-attention area, improving the subjective visual experience of the resulting target substitute image.
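The transparency fusion itself is ordinary alpha blending with the normalized detection image as the alpha channel; a minimal sketch, assuming min-max normalization:

    import numpy as np

    def transparency_fusion(foreground, background, detection_map):
        """Blend the foreground over the background: alpha 1 shows the opaque
        foreground (attention area), alpha 0 shows only the background, and
        intermediate values mix the two."""
        alpha = detection_map.astype(np.float32)
        alpha = (alpha - alpha.min()) / max(float(alpha.max() - alpha.min()), 1e-6)
        alpha = alpha[..., None]  # broadcast over the color channels
        fused = alpha * foreground.astype(np.float32) + (1.0 - alpha) * background
        return fused.astype(np.uint8)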
Step S50: updating the target frame with the target substitute image.
In an optional implementation, after the target substitute image, which displays the foreground image in the attention area and the background image in the non-attention area, is obtained, the target frame of the video to be processed is updated with the target substitute image. In some embodiments, in a video encoding scenario, the updated target frame may be input into a video encoder as an input frame for video encoding.
In some embodiments, in response to the target frame being updated, the frame stored at the first position of the queue is popped, and the next frame of the video to be processed is pushed into the queue. That is, after the target frame in the video to be processed is updated, the processing of the current target frame is judged complete; by popping the frame stored at the first position of the queue and pushing the next frame into the queue, the next frame after the previous target frame in time-axis order is determined as the new target frame. Meanwhile, the frames in the queue are re-acquired to determine the target frame sequence corresponding to the new target frame. At this point, the updated target frame becomes a reference frame in the new target sequence.
In an optional implementation, the video processing method of the embodiment of the present disclosure is applied to a video encoding scenario. To improve the effect of the video encoding process, the adaptive quantization parameter corresponding to the target detection image also needs to be determined; the updated target frame and the corresponding adaptive quantization parameter are input into the video encoder, and the target frame is video-encoded based on the corresponding adaptive quantization parameter. Inputting the updated target frame and the corresponding adaptive quantization parameter into the video encoder may be done by: inputting the updated target frame into the video encoder as an input frame, and inputting the adaptive quantization parameter into the adaptive quantization interface of the video encoder.
Taking the MobileNetV3 neural network as an example, in the embodiments of the present disclosure, a MobileNetV3-based lightweight neural network can process the feature vector obtained after downsampling, enabling real-time saliency detection on the frame sequence (video) downsampled to the macroblock level to obtain the target detection image. After the target detection image is obtained, the target frame sequence (original video) is post-processed based on the target detection image and the adaptive quantization parameter is output, which improves the subjective sharpness of the video while reducing the bit rate.
In some embodiments, the process of determining the adaptive quantization parameter corresponding to the target detection image includes: performing histogram statistics on the target detection image to obtain a corresponding histogram mapping table, and mapping the target detection image through the histogram mapping table to obtain the corresponding preliminary quantization parameter. In some embodiments, this mapping may be: initialize a blank image with the same size as the target detection image; for each pixel value of the target detection image, determine the corresponding value in the histogram mapping table, and store each value into the blank image at the same position as the corresponding pixel, obtaining the corresponding preliminary quantization parameter. Alternatively, determine the value corresponding to each pixel value of the target detection image in the histogram mapping table, and replace the corresponding pixel values of the target detection image with those values, obtaining the preliminary quantization parameter.
In some embodiments, the adaptive quantization parameter is obtained by downsampling the preliminary quantization parameter. The adaptive quantization parameter is used to video-encode the updated target frame during video encoding. This downsampling converts the preliminary quantization parameter to an image size suitable for video encoding. In an optional implementation, the downsampling of the preliminary quantization parameter is the same as the downsampling of the frames of the target frame sequence, and the scaling factor for the preliminary quantization parameter is also the same as that for the frames of the target frame sequence, which is not repeated here.
Fig. 9 is a schematic diagram of a process of determining adaptive quantization parameters provided by an embodiment of the present disclosure. As shown in Fig. 9, in a video encoding application scenario, after determining the target detection image 90, the embodiment of the present disclosure may obtain the preliminary quantization parameter 91 corresponding to the target frame by histogram mapping. The histogram mapping process includes: performing histogram statistics on the target detection image 90 to obtain the corresponding histogram mapping table, and mapping the target detection image through the histogram mapping table to obtain the preliminary quantization parameter 91. In some embodiments, the preliminary quantization parameter is downsampled by the same predetermined factor as used for the frames of the target frame sequence, yielding the adaptive quantization parameter 92.
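A hedged sketch of this quantization-parameter pipeline: the disclosure fixes the histogram-statistics, mapping, and downsampling order but not the exact mapping table, so the equalization-style cumulative mapping below is only one plausible choice, and all names are illustrative:

    import cv2
    import numpy as np

    def adaptive_qp(detection_map, factor=16):
        """Map the target detection image through a histogram-derived lookup
        table to a preliminary quantization map, then downsample the result
        to macroblock resolution."""
        gray = (np.clip(detection_map, 0.0, 1.0) * 255).astype(np.uint8)
        hist = np.bincount(gray.ravel(), minlength=256)
        cdf = hist.cumsum() / gray.size                  # histogram mapping table
        preliminary = (cdf[gray] * 255).astype(np.uint8)
        h, w = preliminary.shape
        return cv2.resize(preliminary, (w // factor, h // factor),
                          interpolation=cv2.INTER_AREA)  # macroblock-level map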
Fig. 10A is a schematic diagram of a data transmission process provided by an embodiment of the present disclosure. As shown in Fig. 10A, after the target substitute image 100 is updated to the target frame position in the video to be processed, the target substitute image 100 is input into the video encoder 102 as the encoder's input frame. Meanwhile, the adaptive quantization parameter 101 determined based on the target detection image is input into the adaptive quantization interface of the video encoder 102 as the parameter used to video-encode the target substitute image 100.
Fig. 10B is a schematic diagram of another data transmission process provided by an embodiment of the present disclosure. As shown in Fig. 10B, blurring the target frame 1001 yields the background image 1002, and sharpening the target frame 1001 yields the foreground image 1003. Normalizing the target detection image 1004 yields the transparency channel 1005. Transparency fusion is then performed on the background image 1002, the foreground image 1003, and the transparency channel 1005 to determine the target substitute image 1006 for replacing the target frame 1001.
After the target detection image 1004 is determined, the preliminary quantization parameter 1007 corresponding to the target frame 1001 may be obtained by histogram mapping, and the preliminary quantization parameter 1007 is then downsampled to obtain the adaptive quantization parameter 1008.
After the target substitute image 1006 is updated to the target frame position in the video to be processed, the target substitute image 1006 is input into the video encoder 1009 as the encoder's input frame. Meanwhile, the adaptive quantization parameter 1008 determined based on the target detection image 1004 is input into the adaptive quantization interface of the video encoder 1009 as the parameter used to video-encode the target substitute image 1006.
In a video encoding scenario, the embodiments of the present disclosure can determine the corresponding adaptive quantization parameter based on the attention area detection result of the target frame, so as to perform adaptive quantization adjustment and improve the efficiency of the video encoding process.
The embodiments of the present disclosure determine the background image and the foreground image of the target frame and update the target frame with a target substitute image that displays the foreground image in the attention area and the background image in the non-attention area, which reduces the bit rate of the whole video to be processed and reduces the coding noise generated in the subsequent encoding process. In some embodiments, by downsampling the frames of the frame sequence before performing attention area detection, the embodiments of the present disclosure improve the efficiency of the attention area detection process and achieve real-time attention area detection.
Moreover, with the embodiments of the present disclosure, the areas of interest to the human eye can be identified in real time, and the limited bit rate can be devoted to protecting the quality of the attention areas, so that the subjective quality remains unchanged even as the total bit rate of the video decreases, saving network bandwidth. From the user's perspective, the traffic needed to download the video can be saved and video latency reduced, improving the user experience. From the video service provider's perspective, video storage space and transmission bandwidth can be saved, reducing server costs.
It can be understood that the method embodiments mentioned above in the present disclosure can be combined with one another to form combined embodiments without departing from principle and logic; for reasons of space, this is not repeated here. Those skilled in the art can understand that, in the above methods of the detailed description, the specific execution order of the steps should be determined by their functions and possible internal logic. In addition, the present disclosure also provides a video processing apparatus, an electronic device, a computer-readable storage medium, and a program, all of which can be used to implement any video processing method provided by the present disclosure; for the corresponding technical solutions and descriptions, see the corresponding records in the method section, which are not repeated here.
Fig. 11 is a schematic diagram of a video processing apparatus provided by an embodiment of the present disclosure. As shown in Fig. 11, the apparatus includes: a sequence determination module 110 configured to determine, in time-axis order, a target frame sequence in a video to be processed, the target frame sequence including: a target frame and at least one reference frame within a preset length from the target frame; an attention area detection module 111 configured to perform attention area detection according to the target frame sequence to obtain a target detection image characterizing the attention area and the non-attention area in the target frame; an image determination module 112 configured to determine a corresponding background image and foreground image according to the target frame; an image fusion module 113 configured to perform transparency fusion on the background image and the foreground image according to the target detection image to obtain a target substitute image, the attention area of the target substitute image being the foreground image and the non-attention area of the target substitute image being at least part of the background image; and an image update module 114 configured to update the target frame with the target substitute image.
In a possible implementation, the attention area detection module includes: a first processing submodule configured to perform first image processing on the target frame sequence to obtain a feature tensor characterizing the image features of the target frame and of each reference frame in the target frame sequence; a detection submodule configured to input the feature tensor into a trained neural network for attention area detection, determine the attention area in the target frame by comparing the target frame with each reference frame, and output a first detection image characterizing the attention area and the non-attention area in the target frame, the non-attention area being the area other than the attention area; and a second processing submodule configured to perform second image processing on the first detection image to obtain a target detection image with the same resolution as the target frame.
In a possible implementation, the first processing submodule includes: a downsampling unit configured to downsample each frame of the target frame sequence by a predetermined factor; and a feature tensor determination unit configured to determine the feature tensor from the downsampled frames. In a possible implementation, the feature tensor includes a four-dimensional feature tensor whose four dimensions are the temporal order, channels, height, and width of the corresponding frames.
In a possible implementation, the second processing submodule includes: an upsampling unit configured to upsample the first detection image by the predetermined factor to obtain a second detection image with the same resolution as the target frame; and a pooling unit configured to perform max pooling on the second detection image with a window of preset size and a preset stride to obtain the target detection image. In a possible implementation, the neural network is a MobileNetV3 neural network.
In a possible implementation, the image determination module includes: a background determination submodule configured to blur the target frame to obtain the background image; and a foreground determination submodule configured to sharpen the target frame to obtain the foreground image.
In a possible implementation, the image fusion module includes: a channel determination submodule configured to determine a transparency channel according to the target detection image; and an image fusion submodule configured to perform transparency fusion on the background image and the foreground image according to the transparency channel, to obtain a target substitute image that displays the foreground image at the attention area and the background image at the non-attention area.
In a possible implementation, the sequence determination module includes: a queue insertion submodule configured to add the frames of the video to be processed one by one, in time-axis order, to a preset first-in-first-out queue; and a sequence determination submodule configured to, in response to all positions in the queue being occupied, take the frame at the middle position of the queue as the target frame and the frames at the other positions as reference frames, and determine the target frame sequence.
In a possible implementation, the apparatus further includes: a queue update module configured to, in response to the target frame being updated, pop the frame stored at the first position of the queue and push the next frame of the video to be processed into the queue. In a possible implementation, the apparatus further includes: a parameter determination module configured to determine the adaptive quantization parameter corresponding to the target detection image; and a data transmission module configured to input the updated target frame and the corresponding adaptive quantization parameter into a video encoder, so that the target frame is video-encoded based on the corresponding adaptive quantization parameter.
In a possible implementation, the parameter determination module includes: a histogram statistics submodule configured to perform histogram statistics on the target detection image to obtain a corresponding histogram mapping table; a first parameter determination submodule configured to map the target detection image through the histogram mapping table to obtain a corresponding preliminary quantization parameter; and a second parameter determination submodule configured to downsample the preliminary quantization parameter to obtain the adaptive quantization parameter.
In a possible implementation, the data transmission module includes: a data transmission submodule configured to input the updated target frame into the video encoder as an input frame and to input the adaptive quantization parameter into the adaptive quantization interface of the video encoder.
In some embodiments, the functions or modules of the apparatus provided by the embodiments of the present disclosure may be configured to perform the methods described in the method embodiments above; for their specific implementation, reference may be made to the descriptions of the method embodiments above, which, for brevity, are not repeated here. An embodiment of the present disclosure also provides a computer-readable storage medium having computer program instructions stored thereon, which, when executed by a processor, implement the above method. The computer-readable storage medium may be a volatile or non-volatile computer-readable storage medium.
An embodiment of the present disclosure also provides an electronic device, including: a processor; and a memory configured to store processor-executable instructions; wherein the processor is configured to invoke the instructions stored in the memory to perform some or all of the steps of the above method. An embodiment of the present disclosure also provides a computer program including computer-readable code which, when read and executed by a computer, implements some or all of the steps of the method in any embodiment of the present disclosure. An embodiment of the present disclosure also provides a computer program product, including computer-readable code or a non-volatile computer-readable storage medium carrying computer-readable code; when the computer-readable code runs in a processor of an electronic device, the processor of the electronic device performs some or all of the steps of the above method.
The electronic device may be provided as a terminal, a server, or a device in another form.
Fig. 12 is a block diagram of an electronic device provided by an embodiment of the present disclosure. For example, the electronic device 1200 may be a terminal such as a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, or a personal digital assistant.
Referring to Fig. 12, the electronic device 1200 may include one or more of the following components: a processing component 1202, a memory 1204, a power component 1206, a multimedia component 1208, an audio component 1210, an input/output (I/O) interface 1212, a sensor component 1214, and a communication component 1216.
The processing component 1202 generally controls the overall operation of the electronic device 1200, such as operations associated with display, telephone calls, data communication, camera operation, and recording operations. The processing component 1202 may include one or more processors 1220 to execute instructions so as to complete all or some of the steps of the above method. In addition, the processing component 1202 may include one or more modules that facilitate interaction between the processing component 1202 and other components. For example, the processing component 1202 may include a multimedia module to facilitate interaction between the multimedia component 1208 and the processing component 1202.
The memory 1204 is configured to store various types of data to support operation on the electronic device 1200. Examples of such data include instructions for any application or method operating on the electronic device 1200, contact data, phonebook data, messages, pictures, videos, and so on. The memory 1204 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk. The power component 1206 provides power to the various components of the electronic device 1200. The power component 1206 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 1200.
The multimedia component 1208 includes a screen that provides an output interface between the electronic device 1200 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may not only sense the boundary of a touch or swipe action, but also detect the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 1208 includes a front camera and/or a rear camera. When the electronic device 1200 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and each rear camera may be a fixed optical lens system or have focal length and optical zoom capability.
The audio component 1210 is configured to output and/or input audio signals. For example, the audio component 1210 includes a microphone (MIC) configured to receive external audio signals when the electronic device 1200 is in an operation mode, such as a call mode, a recording mode, or a speech recognition mode. The received audio signals may be stored in the memory 1204 or transmitted via the communication component 1216. In some embodiments, the audio component 1210 also includes a speaker for outputting audio signals. The I/O interface 1212 provides an interface between the processing component 1202 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, and the like. These buttons may include, but are not limited to: a home button, volume buttons, a start button, and a lock button.
The sensor component 1214 includes one or more sensors for providing various aspects of status assessment for the electronic device 1200. For example, the sensor component 1214 can detect the open/closed state of the electronic device 1200 and the relative positioning of components, for example when the components are the display and the keypad of the electronic device 1200; the sensor component 1214 can also detect a change in position of the electronic device 1200 or of one of its components, the presence or absence of user contact with the electronic device 1200, and the orientation, acceleration, or deceleration of the electronic device 1200 or a change in its temperature. The sensor component 1214 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 1214 may also include an optical sensor, such as a complementary metal-oxide-semiconductor (CMOS) or charge-coupled device (CCD) image sensor, for use in imaging applications. In some embodiments, the sensor component 1214 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 1216 is configured to facilitate wired or wireless communication between the electronic device 1200 and other devices. The electronic device 1200 can access a wireless network based on a communication standard, such as WiFi, fourth-generation mobile communication technology (4G), fifth-generation mobile communication technology (5G), or a combination thereof. In an exemplary embodiment, the communication component 1216 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 1216 also includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 1200 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for performing the above method. In an exemplary embodiment, a non-volatile computer-readable storage medium is also provided, such as the memory 1204 including computer program instructions, which can be executed by the processor 1220 of the electronic device 1200 to complete some or all of the steps of the above method.
Fig. 13 is a block diagram of another electronic device provided by an embodiment of the present disclosure. For example, the electronic device 1300 may be provided as a server. Referring to Fig. 13, the electronic device 1300 includes a processing component 1322, which includes one or more processors, and memory resources represented by a memory 1332 for storing instructions executable by the processing component 1322, such as application programs. The application programs stored in the memory 1332 may include one or more modules each corresponding to a set of instructions. In addition, the processing component 1322 is configured to execute the instructions to perform the above method.
The electronic device 1300 may also include a power component 1326 configured to perform power management of the electronic device 1300, a wired or wireless network interface 1350 configured to connect the electronic device 1300 to a network, and an input/output (I/O) interface 1358. The electronic device 1300 can operate based on an operating system stored in the memory 1332, such as Microsoft's server operating system (Windows Server™), Apple's graphical user interface-based operating system (Mac OS X™), the multi-user, multi-process computer operating system (Unix™), a free and open-source Unix-like operating system (Linux™), an open-source Unix-like operating system (FreeBSD™), or the like. In an exemplary embodiment, a non-volatile computer-readable storage medium is also provided, such as the memory 1332 including computer program instructions, which can be executed by the processing component 1322 of the electronic device 1300 to complete the above method. The present disclosure may be a system, a method, and/or a computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions thereon for causing a processor to implement various aspects of the present disclosure.
A computer-readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device, and may be a volatile or non-volatile storage medium. The computer-readable storage medium may be, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. Examples (a non-exhaustive list) of computer-readable storage media include: portable computer diskettes, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), digital versatile discs (DVD), memory sticks, floppy disks, mechanically encoded devices such as punched cards or raised structures in grooves with instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein are not to be interpreted as transient signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., light pulses through fiber-optic cables), or electrical signals transmitted through wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to respective computing/processing devices, or downloaded to an external computer or external storage device over a network such as the Internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, optical fiber transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives the computer-readable program instructions from the network and forwards them for storage in a computer-readable storage medium within the respective computing/processing device.
Computer program instructions for performing the operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, an electronic circuit, such as a programmable logic circuit, a field-programmable gate array (FPGA), or a programmable logic array (PLA), may be personalized with state information of the computer-readable program instructions, and the electronic circuit may execute the computer-readable program instructions, thereby implementing various aspects of the present disclosure.
Various aspects of the present disclosure are described herein with reference to flowcharts and/or block diagrams of methods, devices, apparatuses (systems), and computer program products according to embodiments of the present disclosure. It should be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing apparatus, produce an apparatus for implementing the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams. These computer-readable program instructions may also be stored in a computer-readable storage medium; these instructions cause computers, programmable data processing apparatuses, and/or other devices to work in a specific way, so that the computer-readable medium storing the instructions includes an article of manufacture comprising instructions that implement various aspects of the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
The computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device, so that a series of operational steps are performed on the computer, other programmable data processing apparatus, or other device to produce a computer-implemented process, such that the instructions executed on the computer, other programmable data processing apparatus, or other device implement the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
The flowcharts and block diagrams in the accompanying drawings show possible architectures, functions, and operations of systems, methods, and computer program products according to multiple embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of instructions, which contains one or more executable instructions for implementing the specified logical function. In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two successive blocks may, in fact, be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending upon the functionality involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions.
The computer program product may be implemented by hardware, software, or a combination thereof. In one optional embodiment, the computer program product is embodied as a computer storage medium; in another optional embodiment, the computer program product is embodied as a software product, such as a software development kit (SDK). The embodiments of the present disclosure have been described above; the foregoing description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terms used herein are chosen to best explain the principles of the embodiments, the practical applications, or the improvements over the technology in the market, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (30)

  1. A video processing method, the method comprising:
    determining, in time-axis order, a target frame sequence in a video to be processed, the target frame sequence comprising: a target frame and at least one reference frame within a preset length from the target frame;
    performing attention area detection according to the target frame sequence to obtain a target detection image characterizing the attention area and the non-attention area in the target frame;
    determining a corresponding background image and foreground image according to the target frame;
    performing transparency fusion on the background image and the foreground image according to the target detection image to obtain a target substitute image, the attention area of the target substitute image being the foreground image and the non-attention area of the target substitute image being at least part of the background image; and
    updating the target frame with the target substitute image.
  2. The method according to claim 1, wherein the performing attention area detection according to the target frame sequence to obtain a target detection image characterizing the attention area and the non-attention area in the target frame comprises:
    performing first image processing on the target frame sequence to obtain a feature tensor, the feature tensor characterizing: image features of the target frame and of each of the reference frames in the target frame sequence;
    inputting the feature tensor into a trained neural network for attention area detection, determining the attention area in the target frame by comparing the target frame with each of the reference frames, and outputting a first detection image characterizing the attention area and the non-attention area in the target frame, the non-attention area being the area other than the attention area; and
    performing second image processing on the first detection image to obtain a target detection image with the same resolution as the target frame.
  3. The method according to claim 2, wherein the performing first image processing on the target frame sequence to obtain a feature tensor comprises:
    downsampling each frame of the target frame sequence by a predetermined factor; and
    determining the feature tensor from the downsampled frames.
  4. The method according to claim 2 or 3, wherein the feature tensor comprises a four-dimensional feature tensor, the four dimensions of which are the temporal order, channels, height, and width of the corresponding frames.
  5. The method according to any one of claims 2-4, wherein the performing second image processing on the first detection image to obtain a target detection image with the same resolution as the target frame comprises:
    upsampling the first detection image by a predetermined factor to obtain a second detection image with the same resolution as the target frame; and
    performing max pooling on the second detection image with a window of preset size and a preset stride to obtain the target detection image.
  6. The method according to any one of claims 2-5, wherein the neural network is a MobileNetV3 neural network.
  7. The method according to any one of claims 1-6, wherein the determining a corresponding background image and foreground image according to the target frame comprises:
    blurring the target frame to obtain the background image; and
    sharpening the target frame to obtain the foreground image.
  8. The method according to any one of claims 1-7, wherein the performing transparency fusion on the background image and the foreground image according to the target detection image to obtain a target substitute image comprises:
    determining a transparency channel according to the target detection image; and
    performing transparency fusion on the background image and the foreground image according to the transparency channel, to obtain a target substitute image displaying the foreground image at the attention area and the background image at the non-attention area.
  9. The method according to any one of claims 1-8, wherein the determining, in time-axis order, a target frame sequence in a video to be processed comprises:
    adding the frames of the video to be processed one by one, in time-axis order, to a preset first-in-first-out queue; and
    in response to all positions in the queue being occupied, taking the frame at the middle position of the queue as the target frame of the target frame sequence and the frames at the other positions as the reference frames of the target frame sequence, and determining the target frame sequence.
  10. The method according to claim 9, wherein the method further comprises:
    in response to the target frame being updated, popping the frame stored at the first position of the queue, and pushing the next frame of the video to be processed into the queue.
  11. The method according to any one of claims 1-10, wherein the method further comprises:
    determining an adaptive quantization parameter corresponding to the target detection image; and
    inputting the updated target frame and the corresponding adaptive quantization parameter into a video encoder, and video-encoding the updated target frame based on the corresponding adaptive quantization parameter.
  12. The method according to claim 11, wherein the determining an adaptive quantization parameter corresponding to the target detection image comprises:
    performing histogram statistics on the target detection image to obtain a corresponding histogram mapping table;
    mapping the target detection image through the histogram mapping table to obtain a corresponding preliminary quantization parameter; and
    downsampling the preliminary quantization parameter to obtain the adaptive quantization parameter.
  13. The method according to claim 11 or 12, wherein the inputting the updated target frame and the corresponding adaptive quantization parameter into a video encoder comprises:
    inputting the updated target frame into the video encoder as an input frame, and inputting the adaptive quantization parameter into an adaptive quantization interface of the video encoder.
  14. A video processing apparatus, the apparatus comprising:
    a sequence determination module configured to determine, in time-axis order, a target frame sequence in a video to be processed, the target frame sequence comprising: a target frame and at least one reference frame within a preset length from the target frame;
    an attention area detection module configured to perform attention area detection according to the target frame sequence to obtain a target detection image characterizing the attention area and the non-attention area in the target frame;
    an image determination module configured to determine a corresponding background image and foreground image according to the target frame;
    an image fusion module configured to perform transparency fusion on the background image and the foreground image according to the target detection image to obtain a target substitute image, the attention area of the target substitute image being the foreground image and the non-attention area of the target substitute image being at least part of the background image; and
    an image update module configured to update the target frame with the target substitute image.
  15. The apparatus according to claim 14, wherein the attention area detection module comprises:
    a first processing submodule configured to perform first image processing on the target frame sequence to obtain a feature tensor, the feature tensor characterizing: image features of the target frame and of each of the reference frames in the target frame sequence;
    a detection submodule configured to input the feature tensor into a trained neural network for attention area detection, determine the attention area in the target frame by comparing the target frame with each of the reference frames, and output a first detection image characterizing the attention area and the non-attention area in the target frame, the non-attention area being the area other than the attention area; and
    a second processing submodule configured to perform second image processing on the first detection image to obtain a target detection image with the same resolution as the target frame.
  16. The apparatus according to claim 15, wherein the first processing submodule comprises:
    a downsampling unit configured to downsample each frame of the target frame sequence by a predetermined factor; and
    a feature tensor determination unit configured to determine the feature tensor from the downsampled frames.
  17. The apparatus according to claim 15 or 16, wherein the feature tensor comprises a four-dimensional feature tensor, the four dimensions of which are the temporal order, channels, height, and width of the corresponding frames.
  18. The apparatus according to any one of claims 15-17, wherein the second processing submodule comprises:
    an upsampling unit configured to upsample the first detection image by a predetermined factor to obtain a second detection image with the same resolution as the target frame; and
    a pooling unit configured to perform max pooling on the second detection image with a window of preset size and a preset stride to obtain the target detection image.
  19. The apparatus according to any one of claims 15-18, wherein the neural network is a MobileNetV3 neural network.
  20. The apparatus according to any one of claims 14-19, wherein the image determination module comprises:
    a background determination submodule configured to blur the target frame to obtain the background image; and
    a foreground determination submodule configured to sharpen the target frame to obtain the foreground image.
  21. The apparatus according to any one of claims 14-20, wherein the image fusion module comprises:
    a channel determination submodule configured to determine a transparency channel according to the target detection image; and
    an image fusion submodule configured to perform transparency fusion on the background image and the foreground image according to the transparency channel, to obtain a target substitute image displaying the foreground image at the attention area and the background image at the non-attention area.
  22. The apparatus according to any one of claims 14-21, wherein the sequence determination module comprises:
    a queue insertion submodule configured to add the frames of the video to be processed one by one, in time-axis order, to a preset first-in-first-out queue; and
    a sequence determination submodule configured to, in response to all positions in the queue being occupied, take the frame at the middle position of the queue as the target frame of the target frame sequence and the frames at the other positions as the reference frames of the target frame sequence, and determine the target frame sequence.
  23. The apparatus according to claim 22, wherein the apparatus further comprises:
    a queue update module configured to, in response to the target frame being updated, pop the frame stored at the first position of the queue and push the next frame of the video to be processed into the queue.
  24. The apparatus according to any one of claims 14-23, wherein the apparatus further comprises:
    a parameter determination module configured to determine an adaptive quantization parameter corresponding to the target detection image; and
    a data transmission module configured to input the updated target frame and the corresponding adaptive quantization parameter into a video encoder, so that the target frame is video-encoded based on the corresponding adaptive quantization parameter.
  25. The apparatus according to claim 24, wherein the parameter determination module comprises:
    a histogram statistics submodule configured to perform histogram statistics on the target detection image to obtain a corresponding histogram mapping table;
    a first parameter determination submodule configured to map the target detection image through the histogram mapping table to obtain a corresponding preliminary quantization parameter; and
    a second parameter determination submodule configured to downsample the preliminary quantization parameter to obtain the adaptive quantization parameter.
  26. The apparatus according to claim 24 or 25, wherein the data transmission module comprises:
    a data transmission submodule configured to input the updated target frame into the video encoder as an input frame, and to input the adaptive quantization parameter into an adaptive quantization interface of the video encoder.
  27. An electronic device, comprising:
    a processor; and
    a memory for storing processor-executable instructions;
    wherein the processor is configured to invoke the instructions stored in the memory to perform the method according to any one of claims 1 to 13.
  28. A computer-readable storage medium having computer program instructions stored thereon, wherein the computer program instructions, when executed by a processor, implement the method according to any one of claims 1 to 13.
  29. A computer program, comprising computer-readable code, wherein, when the computer-readable code runs on a device, a processor in the device executes instructions for implementing the method according to any one of claims 1 to 13.
  30. A computer program product configured to store computer-readable instructions, wherein the computer-readable instructions, when executed, cause a computer to perform the method according to any one of claims 1 to 13.
PCT/CN2022/070177 2021-08-20 2022-01-04 Video processing method and apparatus, electronic device, storage medium, computer program, and computer program product WO2023019870A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110963126.9 2021-08-20
CN202110963126.9A CN113660531B (zh) 2021-08-20 2021-08-20 Video processing method and apparatus, electronic device and storage medium

Publications (1)

Publication Number Publication Date
WO2023019870A1 true WO2023019870A1 (zh) 2023-02-23

Family

ID=78491865

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/070177 WO2023019870A1 (zh) 2021-08-20 2022-01-04 视频处理方法及装置、电子设备、存储介质、计算机程序、计算机程序产品

Country Status (2)

Country Link
CN (1) CN113660531B (zh)
WO (1) WO2023019870A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113660531B (zh) 2021-08-20 2024-05-17 Beijing SenseTime Technology Development Co., Ltd. Video processing method and apparatus, electronic device and storage medium
CN115984944A (zh) 2023-01-20 2023-04-18 Beijing Zitiao Network Technology Co., Ltd. Expression information recognition method, apparatus, device, readable storage medium, and product

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120327172A1 (en) * 2011-06-22 2012-12-27 Microsoft Corporation Modifying video regions using mobile device input
CN104715451A (zh) 2015-03-11 2015-06-17 Xi'an Jiaotong University Seamless image fusion method based on consistent optimization of color and transparency
US20170244908A1 (en) * 2016-02-22 2017-08-24 GenMe Inc. Video background replacement system
CN107369145A (zh) 2017-06-16 2017-11-21 Guangdong OPPO Mobile Telecommunications Corp., Ltd. Image processing method and apparatus, and terminal device
CN113068034A (zh) 2021-03-25 2021-07-02 OPPO Guangdong Mobile Telecommunications Corp., Ltd. Video encoding method and apparatus, encoder, device, and storage medium
CN113255685A (zh) 2021-07-13 2021-08-13 Tencent Technology (Shenzhen) Co., Ltd. Image processing method and apparatus, computer device, and storage medium
CN113660531A (zh) 2021-08-20 2021-11-16 Beijing SenseTime Technology Development Co., Ltd. Video processing method and apparatus, electronic device and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9142010B2 (en) * 2012-01-04 2015-09-22 Audience, Inc. Image enhancement based on combining images from multiple cameras

Also Published As

Publication number Publication date
CN113660531A (zh) 2021-11-16
CN113660531B (zh) 2024-05-17

Similar Documents

Publication Publication Date Title
JP7262659B2 (ja) Target object matching method and apparatus, electronic device, and storage medium
CN109118430B (zh) Super-resolution image reconstruction method and apparatus, electronic device, and storage medium
CN113766313B (zh) Video data processing method and apparatus, electronic device, and storage medium
CN110060215B (zh) Image processing method and apparatus, electronic device, and storage medium
TWI706379B (zh) Image processing method and apparatus, electronic device, and storage medium
CN109087238B (zh) Image processing method and apparatus, electronic device, and computer-readable storage medium
WO2023019870A1 (zh) Video processing method and apparatus, electronic device, storage medium, computer program, and computer program product
WO2020042826A1 (zh) Video stream noise reduction method and apparatus, electronic device, and storage medium
CN111445414B (zh) Image processing method and apparatus, electronic device, and storage medium
WO2016192325A1 (zh) Identifier processing method and apparatus for video files
CN109784164B (zh) Foreground recognition method and apparatus, electronic device, and storage medium
WO2022227394A1 (zh) Image processing method, apparatus, device, storage medium, and program
WO2023071167A1 (zh) Image processing method and apparatus, electronic device, storage medium, and program product
CN112634160A (zh) Photographing method and apparatus, terminal, and storage medium
CN111369482B (zh) Image processing method and apparatus, electronic device, and storage medium
CN109840890B (zh) Image processing method and apparatus, electronic device, and storage medium
CN110874809A (zh) Image processing method and apparatus, electronic device, and storage medium
US20220188982A1 (en) Image reconstruction method and device, electronic device, and storage medium
CN110458771B (zh) Image processing method and apparatus, electronic device, and storage medium
CN113706421B (zh) Image processing method and apparatus, electronic device, and storage medium
CN111583142A (zh) Image noise reduction method and apparatus, electronic device, and storage medium
KR20210053121A (ko) Training method, apparatus, and medium for image processing model
WO2022141969A1 (zh) Image segmentation method and apparatus, electronic device, storage medium, and program
CN109816620B (зh) Image processing method and apparatus, electronic device, and storage medium
CN113177890A (зh) Image processing method and apparatus, electronic device, and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22857194

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE