WO2021217829A1 - Method and apparatus for transcoding video - Google Patents

Method and apparatus for transcoding video

Info

Publication number
WO2021217829A1
WO2021217829A1 (PCT/CN2020/097172)
Authority
WO
WIPO (PCT)
Prior art keywords
transcoding
video
frame image
target
target frame
Prior art date
Application number
PCT/CN2020/097172
Other languages
English (en)
French (fr)
Inventor
刘安捷
Original Assignee
网宿科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 网宿科技股份有限公司
Priority to EP20792515.7A (EP3923585A1)
Priority to US17/088,442 (US11166035B1)
Publication of WO2021217829A1

Classifications

    • H — ELECTRICITY
    • H04 — ELECTRIC COMMUNICATION TECHNIQUE
    • H04N — PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 — Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 — Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23 — Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234 — Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs
    • H04N21/2343 — Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
    • H — ELECTRICITY
    • H04 — ELECTRIC COMMUNICATION TECHNIQUE
    • H04N — PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 — Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/40 — Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using video transcoding, i.e. partial or full decoding of a coded input stream followed by re-encoding of the decoded output stream
    • H — ELECTRICITY
    • H04 — ELECTRIC COMMUNICATION TECHNIQUE
    • H04N — PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 — Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 — Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134 — Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/167 — Position within a video image, e.g. region of interest [ROI]
    • H — ELECTRICITY
    • H04 — ELECTRIC COMMUNICATION TECHNIQUE
    • H04N — PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 — Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/70 — Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by syntax aspects related to video coding, e.g. related to compression standards
    • H — ELECTRICITY
    • H04 — ELECTRIC COMMUNICATION TECHNIQUE
    • H04N — PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 — Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 — Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23 — Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234 — Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs
    • H04N21/23418 — Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics
    • H — ELECTRICITY
    • H04 — ELECTRIC COMMUNICATION TECHNIQUE
    • H04N — PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 — Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 — Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23 — Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234 — Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs
    • H04N21/2343 — Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
    • H04N21/234345 — Processing of video elementary streams, the reformatting operation being performed only on part of the stream, e.g. a region of the image or a time segment
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 — Computing arrangements based on biological models
    • G06N3/02 — Neural networks
    • G06N3/04 — Architecture, e.g. interconnection topology
    • G06N3/045 — Combinations of networks
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 — Computing arrangements based on biological models
    • G06N3/02 — Neural networks
    • G06N3/08 — Learning methods

Definitions

  • This application relates to the technical field of video processing, and in particular to a method and device for transcoding video.
  • The background server of a video service platform may first obtain a packaged initial video (which may be called the input video), generate multiple transcoding tasks according to different transcoding rules, and create a corresponding transcoding process for each transcoding task; the transcoding threads within each transcoding process then carry out the steps of parsing, transcoding, and encapsulating the video data of the initial video.
  • After the video data has been transcoded, upon receiving a user terminal's request for the video data under a certain transcoding rule, the background server may push the generated transcoded video data to the user terminal.
  • To solve the problems of the prior art, embodiments of the present application provide a method and device for transcoding a video. The technical solution is as follows.
  • An embodiment of the present application provides a method for transcoding a video, the method including:
  • obtaining a target frame image of a video to be transcoded, and generating a global feature map of the target frame image based on a feature extraction module of a semantic segmentation model;
  • performing feature segmentation on the global feature map based on a feature segmentation module of the semantic segmentation model, and determining multi-level ROI regions of the target frame image;
  • transcoding the multi-level ROI regions and the other regions of the target frame image at different transcoding rates.
  • An embodiment of the present application provides a device for transcoding a video, the device including:
  • a semantic segmentation module, configured to obtain a target frame image of a video to be transcoded, generate a global feature map of the target frame image based on a feature extraction module of a semantic segmentation model, perform feature segmentation on the global feature map based on a feature segmentation module of the semantic segmentation model, and determine multi-level ROI regions of the target frame image;
  • a video transcoding module, configured to transcode the multi-level ROI regions and the other regions of the target frame image at different transcoding rates.
  • An embodiment of the present application provides a background server, the background server including a processor and a memory, where the memory stores at least one instruction, at least one program, a code set, or an instruction set that is loaded and executed by the processor to implement the method for transcoding a video as described above.
  • An embodiment of the present application provides a computer-readable storage medium that stores at least one instruction, at least one program, a code set, or an instruction set that is loaded and executed by a processor to implement the method for transcoding a video as described above.
  • In the embodiments of the present application, the background server obtains the target frame image of the video to be transcoded and generates a global feature map of the target frame image based on the feature extraction module of the semantic segmentation model; the feature segmentation module of the semantic segmentation model performs feature segmentation on the global feature map to determine the multi-level ROI regions of the target frame image; and the multi-level ROI regions and the other regions of the target frame image are transcoded at different transcoding rates.
  • In this way, when transcoding a video, the background server uses the semantic segmentation model to segment multi-level ROI regions out of the video frame image, so that the extent of each ROI region can be determined flexibly and accurately for different scenes, maximally preserving the area of effective features within the ROI regions. Each ROI region is then transcoded at its own transcoding rate, so that every level of content in the video frame image occupies a reasonable share of the bit rate, which both guarantees the picture quality of the video and reduces the bandwidth consumed in video transmission.
  • FIG. 1 is a flowchart of a method for transcoding a video provided by an embodiment of the present application;
  • FIG. 2 is a schematic flowchart of semantic segmentation provided by an embodiment of the present application;
  • FIG. 3 is a schematic structural diagram of the pyramid pooling module of a conventional PSPNet model provided by an embodiment of the present application;
  • FIG. 4 is a schematic structural diagram of the pyramid pooling module of a reconstructed PSPNet model provided by an embodiment of the present application;
  • FIG. 5 is a schematic structural diagram of the pyramid pooling module of another reconstructed PSPNet model provided by an embodiment of the present application;
  • FIG. 6 is an output diagram of semantic segmentation of a video frame image provided by an embodiment of the present application;
  • FIG. 7 is a schematic structural diagram of a device for transcoding a video provided by an embodiment of the present application;
  • FIG. 8 is a schematic structural diagram of a background server provided by an embodiment of the present application.
  • An embodiment of the present application provides a method for transcoding a video, which may be executed by a background server of a video service platform.
  • The background server may have a video transcoding function: after receiving the data stream of a video to be transcoded, it may transcode the video frame images and then provide the transcoded video data to the outside.
  • The background server may also have an image analysis function and may adjust the transcoding of a video according to the image analysis results.
  • The video service platform may be provided with multiple background servers; each background server may execute transcoding tasks for multiple videos, and each video may correspond to multiple transcoding tasks depending on the transcoding requirements.
  • The above background server may include a processor, a memory, and a transceiver: the processor may perform the video transcoding in the following flow, the memory may store the data required and generated in the following processing, and the transceiver may receive and send the related data in the following processing.
  • The method for transcoding a video disclosed in this embodiment can be applied to live video as well as to video on demand.
  • Step 101: Obtain a target frame image of a video to be transcoded, and generate a global feature map of the target frame image based on a feature extraction module of a semantic segmentation model.
  • In implementation, after receiving the data stream of a certain video, the background server can determine whether the video is a video to be transcoded (that is, whether there is a transcoding requirement) according to the video transcoding rules preset by the video service platform. If so, the background server can transcode the frame images of the video to be transcoded in sequence. Specifically, the background server may first obtain a target frame image of the video to be transcoded and then input the target frame image into the feature extraction module of the semantic segmentation model to generate a global feature map of the target frame image. Here, the target frame image can be any frame image of the video to be transcoded, any key frame image of the video to be transcoded, or any frame image within a specified time period of the video to be transcoded, depending on the transcoding requirements.
  • Step 102: Perform feature segmentation on the global feature map based on a feature segmentation module of the semantic segmentation model, and determine the multi-level ROI regions of the target frame image.
  • In implementation, after the background server generates the global feature map of the target frame image, it can input the global feature map into the feature segmentation module of the semantic segmentation model for feature segmentation, and then obtain the multi-level ROI regions of the target frame image according to the segmentation results.
  • The semantic segmentation model in this embodiment can be a PSPNet model; Figure 2 shows the internal structure of the PSPNet model.
  • The target frame image first passes through the feature extraction module (a ResNet50 model is chosen in this embodiment) to obtain an initial feature map (i.e., the global feature map). The feature map is then pooled at multiple sizes, with pooling kernels of 1*1, 2*2, 4*4, and 8*8, yielding intermediate maps of four sizes. Bilinear interpolation or deconvolution is applied to unify the sizes of all the intermediate maps, after which a concat operation (i.e., a splicing operation that stitches several feature maps together along the channel dimension via a concatenate function) combines the features of the intermediate maps. Finally, after convolution and upsampling, a pixel-level segmentation of the target frame image is completed.
  • Once image semantic segmentation is achieved, the background server can select some of the segmented image regions as the multi-level ROI regions of the target frame image.
  • Here, the PSPNet model can use the TensorRT engine for accelerated computation to increase the video transcoding rate.
  • Specifically, the adaptive pooling layers in the conventional PSPNet model can be replaced with a set of pooling layers of different fixed sizes, achieving an effect similar to adaptive pooling while enabling TensorRT acceleration.
  • Likewise, the bilinear upsampling layers in the conventional PSPNet model can be replaced with transposed convolutional layers, achieving a comparable upsampling effect while enabling TensorRT acceleration.
  • For ease of understanding, the following takes a feature extraction module output of 2048x64x64 (that is, 2048 channels and a 64x64 image size) as an example and walks through the pyramid pooling module of the conventional PSPNet model and that of the reconstructed PSPNet model (both illustrated with four-scale feature fusion; the PSPNet model can also support any other number of scales).
  • 1. The pyramid pooling module of the conventional PSPNet model, as shown in Figure 3: (1) adaptive pooling layers generate feature maps at 4 different scales, and 1x1 convolutional layers reduce the channel count, giving 512x1x1, 512x2x2, 512x4x4, and 512x8x8 feature maps; (2) the 4 feature maps are each passed through a bilinear upsampling layer that enlarges them by different factors, giving 4 feature maps of 512x64x64; (3) the 4 feature maps of 512x64x64 are concatenated along the channel dimension into one 2048x64x64 feature map; (4) the 2048x64x64 feature map is concatenated with the input feature map, giving a 4096x64x64 feature map as the output.
  • 2. The pyramid pooling module of the reconstructed PSPNet model, as shown in Figure 4: (1) four pooling layers of different sizes (kernel sizes 64, 32, 16, and 8, with stride equal to kernel size) downsample the input, and 1x1 convolutional layers reduce the channel count, giving 512x1x1, 512x2x2, 512x4x4, and 512x8x8 feature maps; (2) the 4 feature maps are each enlarged by different factors through transposed convolutions, giving 4 feature maps of 512x64x64, where a transposed convolutional layer achieves a fixed upsampling factor n by setting kernel size = 2n, stride = n, and padding = floor(n/2); (3) the 4 feature maps of 512x64x64 are concatenated along the channel dimension into one 2048x64x64 feature map; (4) the 2048x64x64 feature map is concatenated with the input feature map, giving a 4096x64x64 feature map as the output.
  • In one embodiment, the scales can share one and the same transposed convolutional layer to reduce the parameter count and the amount of computation. The corresponding processing can be as follows: replace the multi-scale bilinear upsampling layers in the conventional PSPNet model with nearest-neighbour upsampling layers and a shared transposed convolutional layer.
  • In implementation, when the background server replaces the bilinear upsampling layers of the conventional PSPNet model with transposed convolutional layers, the upsampling layers of the multiple scales can be replaced by parameter-free nearest-neighbour upsampling layers followed by a shared transposed convolutional layer: the multi-scale feature maps are first upsampled to a common size by the nearest-neighbour layers, and the same shared transposed convolutional layer then uniformly enlarges them to the specified size.
  • Specifically, as shown in Figure 5 and still taking a 2048x64x64 feature map as the input, the processing of the pyramid pooling module of this reconstructed PSPNet model can be as follows: (1) four pooling layers of different sizes (kernel sizes 64, 32, 16, and 8, with stride equal to kernel size) downsample the input, and 1x1 convolutional layers reduce the channel count, giving 512x1x1, 512x2x2, 512x4x4, and 512x8x8 feature maps; (2) the 4 feature maps are each upsampled to 512x16x16 by nearest-neighbour interpolation; (3) a shared transposed convolutional layer enlarges the 512x16x16 feature maps by a factor of 4, giving 512x64x64 feature maps; (4) the four 512x64x64 feature maps are concatenated along the channel dimension into one 2048x64x64 feature map; (5) the 2048x64x64 feature map is concatenated with the input feature map, giving a 4096x64x64 feature map as the output.
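  • The following is a minimal PyTorch sketch of this shared-layer variant under the 2048x64x64 example above; the class and layer names are illustrative, and average pooling is assumed for the fixed-size pooling layers (the embodiment does not specify the pooling type):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedDeconvPyramidPooling(nn.Module):
    """Pyramid pooling rebuilt for TensorRT: fixed-size pooling layers
    stand in for adaptive pooling, and one shared transposed convolution
    replaces the per-scale bilinear upsampling layers (sketch)."""

    def __init__(self, in_channels=2048, branch_channels=512):
        super().__init__()
        # Fixed pooling kernels 64/32/16/8 with stride equal to kernel
        # size, matching step (1) for a 64x64 input feature map.
        self.pools = nn.ModuleList(
            [nn.AvgPool2d(kernel_size=k, stride=k) for k in (64, 32, 16, 8)])
        # 1x1 convolutions reduce 2048 channels to 512 per branch.
        self.conv1x1 = nn.ModuleList(
            [nn.Conv2d(in_channels, branch_channels, 1) for _ in range(4)])
        # One shared transposed convolution enlarges 16x16 -> 64x64 (x4):
        # kernel = 2n, stride = n, padding = n // 2 with n = 4.
        self.shared_deconv = nn.ConvTranspose2d(
            branch_channels, branch_channels, kernel_size=8, stride=4,
            padding=2)

    def forward(self, x):                        # x: (N, 2048, 64, 64)
        branches = []
        for pool, conv in zip(self.pools, self.conv1x1):
            y = conv(pool(x))                    # 512 x {1,2,4,8}^2
            # Parameter-free nearest-neighbour upsampling to a common 16x16.
            y = F.interpolate(y, size=(16, 16), mode="nearest")
            branches.append(self.shared_deconv(y))   # 512x64x64 each
        # Concatenate the four branches and the input along the channels.
        return torch.cat(branches + [x], dim=1)      # (N, 4096, 64, 64)

print(SharedDeconvPyramidPooling()(torch.randn(1, 2048, 64, 64)).shape)
```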
  • Of course, the semantic segmentation model can also be another model such as a U-Net, LinkNet, PSANet, HRNet, OCNet, DeepLabv3, or DeepLabv3+ model, with the model adapted accordingly before use.
  • In one embodiment, different levels of ROI regions can correspond to different components of a real object. Accordingly, the processing of step 102 can be as follows: performing feature segmentation on the global feature map based on the feature segmentation module of the semantic segmentation model to determine multiple components of at least one target object in the target frame image; and determining the multi-level ROI regions of the target frame image based on the transcoding priority corresponding to each component.
  • In implementation, after the background server inputs the global feature map of the target frame image into the feature segmentation module, feature segmentation of the target frame image is realized, yielding multiple components of at least one target object in the target frame image.
  • It should be understood that the segmentation rules for target objects and their components can be learned by the semantic segmentation model from a large volume of training images; that is, through model training, one can define which objects need to be segmented out of the video frame image, and which components of an object need to be segmented as independent pieces.
  • Figure 6 exemplarily shows the result of semantic segmentation, where the whole image contains two large pieces, the background and a human body, and the human body is further subdivided into five parts: head, upper body, lower body, arms, and legs.
  • Afterwards, the background server can determine the multi-level ROI regions of the target frame image based on the transcoding priority corresponding to each component, where the higher the transcoding priority corresponding to an ROI region, the higher its transcoding rate; the other regions are transcoded at the lowest transcoding rate.
  • It is worth mentioning that the transcoding priorities of the different components of different target objects can be preset in the background server; for example, the human head has the highest transcoding priority, and the priorities of the upper body, lower body, limbs, and other parts decrease in sequence.
  • In addition, components of different objects can share the same transcoding priority; for example, the human head and a car logo in one image can both have the highest priority.
  • That is to say, each level of ROI region in the target frame image can contain multiple components of multiple objects.
  • In one embodiment, the background server may regularly train the semantic segmentation model based on the image materials in a preset training material set, where the image materials are annotated with the contour lines of the components of different physical objects.
  • In implementation, the background server can regularly retrain the semantic segmentation model to improve its accuracy in image semantic segmentation. To this end, the technical personnel of the video service platform can select various kinds of frame images from various types of videos and use an open-source labeling tool (such as labelme) to mark the contour lines of the components of the different physical objects in the frame images, and the annotated frame images can then be stored in the preset training material set as the image materials for training the semantic segmentation model.
  • When marking the contour lines, the technician can selectively mark part of the physical objects in an image as required and can divide a physical object arbitrarily as needed. In this way, the background server can regularly train the semantic segmentation model based on the image materials in the preset training material set.
  • In one embodiment, different semantic segmentation models can be selected for frame images of different types. Accordingly, before feature extraction is performed on the target frame image, the following processing can take place: calling the target semantic segmentation model corresponding to the target video type of the video to be transcoded and/or the target image type of the target frame image.
  • In implementation, the background server can train dedicated semantic segmentation models for the frame images of videos of different video types, one semantic segmentation model being dedicated to the semantic segmentation of frame images in videos of one video type.
  • In this way, before semantically segmenting the video to be transcoded, the background server may first detect the target video type to which the video to be transcoded belongs and then call the target semantic segmentation model corresponding to that video type. It follows that when the background server trains the semantic segmentation models, it can set up training material sets corresponding to multiple video types, where one training material set contains image materials of the same video type, so that the background server can use each training material set to train the semantic segmentation model for the corresponding video type.
  • Of course, the background server can also train dedicated semantic segmentation models for frame images of different image types, one semantic segmentation model being dedicated to the semantic segmentation of frame images of one image type.
  • In this way, after obtaining the target frame image, the background server can first detect the target image type to which the target frame image belongs and then call the target semantic segmentation model corresponding to that image type.
  • Likewise, when the background server trains the semantic segmentation models, it can set up training material sets corresponding to multiple image types, where one training material set contains image materials of the same image type, so that the background server can use each training material set to train the semantic segmentation model for the corresponding image type.
  • In this embodiment, dedicated semantic segmentation models can also be trained for frame images of different image types under different video types, one semantic segmentation model being dedicated to the semantic segmentation of frame images of one image type under one video type.
  • In this way, after obtaining the target frame image, the background server can first detect the target video type to which the video to be transcoded belongs and the target image type to which the target frame image belongs, and then call the target semantic segmentation model corresponding to both the target video type and the target image type.
  • Likewise, when the background server trains the semantic segmentation models, it can set up training material sets corresponding to multiple image types under multiple video types, where one training material set contains image materials of the same image type under the same video type, so that the background server can use each training material set to train the semantic segmentation model for the image type under the corresponding video type.
  • Step 103: Transcode the multi-level ROI regions and the other regions of the target frame image at different transcoding rates.
  • In implementation, after the background server determines the multi-level ROI regions of the target frame image, it can transcode the ROI regions of each level and the regions of the target frame image other than the ROI regions at different transcoding rates.
  • Here, the level of an ROI region can be determined by the image content within the region.
  • For example, if the target frame image is a full-length portrait of a person, the head can be set as the highest-level ROI region, the upper body as the second-highest, the lower body as the middle level, the limbs as the low level, and the background as the lowest level.
  • Different levels of ROI regions can be given different transcoding rates, and adjacent ROI regions can also share the same transcoding rate.
  • In one embodiment, a transcoding area can be used to set the transcoding rate of each region of the image. Accordingly, the processing of step 103 can be as follows: selecting a target-level transcoding rate in turn, in order of transcoding rate from high to low; and, according to the transcoding area corresponding to the target-level transcoding rate, selecting the transcoding region corresponding to the target-level transcoding rate from the as-yet-unselected ROI regions in order of transcoding priority from high to low.
  • Here, the target-level transcoding rate can be the transcoding rate of any level.
  • In implementation, the background server can support multiple levels of transcoding rate for video frame image transcoding. When transcoding the regions of the target frame image, it can first select a target-level transcoding rate in turn, in order of transcoding rate from high to low.
  • The background server can then obtain the preset transcoding area corresponding to the target-level transcoding rate and, in order of transcoding priority from high to low, select a first ROI region from all the unselected ROI regions, and compare the area of the first ROI region with the transcoding area. If the area of the first ROI region is smaller than the transcoding area, a second ROI region with the next-highest transcoding priority is selected, and the sum of the areas of the first and second ROI regions is compared with the transcoding area. If the sum is still smaller than the transcoding area, a third ROI region of the next priority level can be selected and the comparison continued, and so on, until the total area of all the selected ROI regions exceeds the transcoding area.
  • The background server may then use all the selected ROI regions as the transcoding region corresponding to the target-level transcoding rate.
  • In detail, with the above processing, each selected transcoding region will be larger than the preset transcoding area, which may put load pressure on device performance and line bandwidth. Therefore, within the lowest-priority ROI region selected at each transcoding rate level, an excess-area region can be carved out and reassigned to the transcoding region of the next-lower transcoding rate; for example, the edge area of that ROI region can be chosen as the excess-area region.
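  • The following Python sketch illustrates this greedy selection rule, assuming the region areas and per-level area budgets are known; the names and numbers are illustrative, and the excess-area reassignment is omitted for brevity:

```python
def assign_transcoding_regions(roi_regions, area_budgets):
    """Greedy sketch of the selection rule described above.

    roi_regions: (name, area) pairs already sorted by transcoding
                 priority from high to low.
    area_budgets: preset transcoding areas per rate level, from the
                  highest transcoding rate down (illustrative units).
    """
    assignment, queue = {}, list(roi_regions)
    for level, budget in enumerate(area_budgets):
        picked, covered = [], 0
        # Keep taking the highest-priority unselected ROI until the
        # accumulated area reaches this level's transcoding area.
        while queue and covered < budget:
            name, area = queue.pop(0)
            picked.append(name)
            covered += area
        assignment[level] = picked
    return assignment

# Example with a full-length portrait and three rate levels:
print(assign_transcoding_regions(
    [("head", 5000), ("upper_body", 12000), ("lower_body", 15000),
     ("limbs", 9000), ("background", 60000)],
    area_budgets=[6000, 25000, 80000]))
# -> {0: ['head', 'upper_body'],
#     1: ['lower_body', 'limbs', 'background'], 2: []}
```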
  • In one embodiment, the background server may periodically adjust the transcoding area corresponding to each level of transcoding rate according to the current device performance load and line bandwidth load, as well as the attribute information of the videos to be transcoded.
  • In implementation, the background server can periodically detect the current device performance load and line bandwidth load and adjust the transcoding of each video to be transcoded on the current device according to the detection results.
  • Specifically, the background server may first determine, according to the attribute information of all the videos to be transcoded, one or more videos whose transcoding needs adjusting; it can then, for each such video, adjust the transcoding area corresponding to each level of transcoding rate according to the video's attribute information and the aforementioned detection results. For example, if both loads are low, videos with higher popularity can be selected and the transcoding areas corresponding to the high transcoding rates increased; if both loads are high, videos with lower popularity can be selected and those areas reduced.
  • Of course, besides video popularity, the background server can also select the videos to be adjusted based on multi-dimensional attribute information such as the owning party, release time, video type, and video duration.
  • In one embodiment, in different types of videos, the components of different types of physical objects have different transcoding priorities. Accordingly, after the multiple components of at least one target object in the target frame image are determined, the following processing can take place: adjusting the transcoding priority corresponding to the components of each target object according to the video type of the video to be transcoded and the object type of each target object.
  • In implementation, after the background server performs semantic segmentation on the target frame image and obtains the multiple components of the target objects in the target frame image, it can obtain the video type of the video to be transcoded and the object type of each target object, and then adjust the transcoding priority corresponding to each component of each target object based on these two kinds of type information. It is understandable that different types of videos emphasize different on-screen content.
  • For example, in dance videos, the torso and limbs of the human body draw more attention than the head, so the transcoding priorities of the torso and limbs can be raised and that of the head lowered; in talk-show videos, the transcoding priority of the human head can be higher than those of the torso and limbs; and in travel videos, compared with portrait videos, the transcoding priority of the human body can be lower than that of the scenery.
  • In this embodiment, the background server can also adjust the transcoding priority corresponding to the components of each physical object based on the image type of each frame image and the object types of the physical objects in the frame image.
  • In one embodiment, the transcoding rate of a designated region in the video frame image can be adjusted according to actual needs. Accordingly, the processing of step 103 can be as follows: receiving feature sharpening/blurring information for the video to be transcoded, and determining a target feature region in the target frame image according to the feature sharpening/blurring information; and transcoding the multi-level ROI regions, the target feature region, and the other regions of the target frame image at different transcoding rates.
  • In implementation, the video provider or the technical personnel of the video service platform can set feature sharpening/blurring information for the video to be transcoded at the background server, so as to raise or lower the transcoding clarity of one or more pieces of feature content in the video frame images.
  • In this way, when transcoding the video, the background server can receive the feature sharpening/blurring information of the video to be transcoded and then determine, according to that information, the target feature region in the target frame image, the target feature region being the region that contains the feature content the sharpening/blurring information points to.
  • Afterwards, the background server can transcode the multi-level ROI regions, the target feature region, and the other regions of the target frame image at different transcoding rates.
  • Here, the feature sharpening/blurring information may directly contain a specific value for the transcoding rate of the target feature region, or it may contain an adjustment range for that transcoding rate.
  • In the embodiments of the present application, the background server obtains the target frame image of the video to be transcoded and generates a global feature map of the target frame image based on the feature extraction module of the semantic segmentation model; the feature segmentation module of the semantic segmentation model performs feature segmentation on the global feature map to determine the multi-level ROI regions of the target frame image; and the multi-level ROI regions and the other regions of the target frame image are transcoded at different transcoding rates.
  • In this way, when transcoding a video, the background server uses the semantic segmentation model to segment multi-level ROI regions out of the video frame image, so that the extent of each ROI region can be determined flexibly and accurately for different scenes, maximally preserving the area of effective features within the ROI regions. Each ROI region is then transcoded at its own transcoding rate, so that every level of content in the video frame image occupies a reasonable share of the bit rate, which both guarantees the picture quality of the video and reduces the bandwidth consumed in video transmission.
  • Based on the same technical concept, an embodiment of the present application also provides a device for transcoding a video. As shown in FIG. 7, the device includes:
  • a semantic segmentation module 701, configured to obtain a target frame image of a video to be transcoded, generate a global feature map of the target frame image based on a feature extraction module of a semantic segmentation model, perform feature segmentation on the global feature map based on a feature segmentation module of the semantic segmentation model, and determine multi-level ROI regions of the target frame image;
  • a video transcoding module 702, configured to transcode the multi-level ROI regions and the other regions of the target frame image at different transcoding rates.
  • In one embodiment, the semantic segmentation module 701 is specifically configured to: perform feature segmentation on the global feature map based on the feature segmentation module of the semantic segmentation model to determine multiple components of at least one target object in the target frame image; and determine the multi-level ROI regions of the target frame image based on the transcoding priority corresponding to each component.
  • In one embodiment, the video transcoding module 702 is specifically configured to: select a target-level transcoding rate in turn, in order of transcoding rate from high to low; and, according to the transcoding area corresponding to the target-level transcoding rate, select the transcoding region corresponding to the target-level transcoding rate from the unselected ROI regions in order of transcoding priority from high to low.
  • In the embodiments of the present application, the background server obtains the target frame image of the video to be transcoded and generates a global feature map of the target frame image based on the feature extraction module of the semantic segmentation model; the feature segmentation module of the semantic segmentation model performs feature segmentation on the global feature map to determine the multi-level ROI regions of the target frame image; and the multi-level ROI regions and the other regions of the target frame image are transcoded at different transcoding rates.
  • In this way, when transcoding a video, the background server uses the semantic segmentation model to segment multi-level ROI regions out of the video frame image, so that the extent of each ROI region can be determined flexibly and accurately for different scenes, maximally preserving the area of effective features within the ROI regions. Each ROI region is then transcoded at its own transcoding rate, so that every level of content in the video frame image occupies a reasonable share of the bit rate, which both guarantees the picture quality of the video and reduces the bandwidth consumed in video transmission.
  • FIG. 8 is a schematic structural diagram of a background server provided by an embodiment of the present application.
  • The background server 800 may vary considerably in configuration or performance, and may include one or more central processing units 822 (for example, one or more processors), a memory 832, and one or more storage media 830 (for example, one or more mass storage devices) storing application programs 842 or data 844.
  • The memory 832 and the storage medium 830 may be transient storage or persistent storage.
  • The program stored in the storage medium 830 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the background server 800.
  • The central processing unit 822 may be configured to communicate with the storage medium 830 and execute, on the background server 800, the series of instruction operations in the storage medium 830.
  • The background server 800 may also include one or more power supplies 829, one or more wired or wireless network interfaces 850, one or more input/output interfaces 858, one or more keyboards 856, and/or one or more operating systems 841, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and so on.
  • The background server 800 may include a memory and one or more programs, where the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs containing instructions for carrying out the above transcoding of the video.
  • A person of ordinary skill in the art can understand that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware, where the program can be stored in a computer-readable storage medium, and the storage medium mentioned can be a read-only memory, a magnetic disk, an optical disc, or the like.

Abstract

The present application discloses a method and apparatus for transcoding a video, and belongs to the technical field of video processing. The method includes: obtaining a target frame image of a video to be transcoded, and generating a global feature map of the target frame image based on a feature extraction module of a semantic segmentation model (101); performing feature segmentation on the global feature map based on a feature segmentation module of the semantic segmentation model, and determining multi-level ROI regions of the target frame image (102); and transcoding the multi-level ROI regions and the other regions of the target frame image at different transcoding rates (103).

Description

Method and Apparatus for Transcoding Video
Cross-Reference
This application claims priority to Chinese Patent Application No. 202010365039.9, entitled "Method and apparatus for transcoding video" and filed on April 30, 2020, which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the technical field of video processing, and in particular to a method and apparatus for transcoding a video.
Background
With the development of Internet technology and the continual acceleration of modern bandwidth, the Internet has become ever more closely tied to people's lives, and more and more people turn to it for video to enrich their lives; today, high-quality video has become a staple of everyday demand. To adapt to different network bandwidths, terminal processing capabilities, and user needs, a video service platform usually has to transcode video data on the server side.
The background server of a video service platform may first obtain a packaged initial video (which may be called the input video), generate multiple transcoding tasks according to different transcoding rules, and create a corresponding transcoding process for each transcoding task; the transcoding threads within each transcoding process then carry out the steps of parsing, transcoding, and encapsulating the video data of the initial video. After the video data has been transcoded, upon receiving a user terminal's request for the video data under a certain transcoding rule, the background server may push the generated transcoded video data to the user terminal.
If the bit rate of the transcoded video is too low, the picture quality of the video will be poor, and distortion or mosaic artifacts may even appear; if the bit rate of the transcoded video is too high, bandwidth resources are wasted when the video is transmitted. Therefore, a video transcoding technique that both guarantees the picture quality of the video and reduces the bandwidth consumed in video transmission is urgently needed.
Summary
To solve the problems of the prior art, embodiments of the present application provide a method and apparatus for transcoding a video. The technical solution is as follows.
An embodiment of the present application provides a method for transcoding a video, the method including:
obtaining a target frame image of a video to be transcoded, and generating a global feature map of the target frame image based on a feature extraction module of a semantic segmentation model;
performing feature segmentation on the global feature map based on a feature segmentation module of the semantic segmentation model, and determining multi-level ROI regions of the target frame image;
transcoding the multi-level ROI regions and the other regions of the target frame image at different transcoding rates.
An embodiment of the present application provides an apparatus for transcoding a video, the apparatus including:
a semantic segmentation module, configured to obtain a target frame image of a video to be transcoded, generate a global feature map of the target frame image based on a feature extraction module of a semantic segmentation model, perform feature segmentation on the global feature map based on a feature segmentation module of the semantic segmentation model, and determine multi-level ROI regions of the target frame image;
a video transcoding module, configured to transcode the multi-level ROI regions and the other regions of the target frame image at different transcoding rates.
An embodiment of the present application provides a background server, the background server including a processor and a memory, where the memory stores at least one instruction, at least one program, a code set, or an instruction set that is loaded and executed by the processor to implement the method for transcoding a video as described above.
An embodiment of the present application provides a computer-readable storage medium that stores at least one instruction, at least one program, a code set, or an instruction set that is loaded and executed by a processor to implement the method for transcoding a video as described above.
The technical solution provided by the embodiments of the present application brings the following beneficial effects: in the embodiments of the present application, the background server obtains the target frame image of the video to be transcoded and generates a global feature map of the target frame image based on the feature extraction module of the semantic segmentation model; the feature segmentation module of the semantic segmentation model performs feature segmentation on the global feature map to determine the multi-level ROI regions of the target frame image; and the multi-level ROI regions and the other regions of the target frame image are transcoded at different transcoding rates. In this way, when transcoding a video, the background server uses the semantic segmentation model to segment multi-level ROI regions out of the video frame image, so that the extent of each ROI region can be determined flexibly and accurately for different scenes, maximally preserving the area of effective features within the ROI regions. Each ROI region is then transcoded at its own transcoding rate, so that every level of content in the video frame image occupies a reasonable share of the bit rate, which both guarantees the picture quality of the video and reduces the bandwidth consumed in video transmission.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present application more clearly, the following briefly introduces the accompanying drawings used in the description of the embodiments. Evidently, the drawings described below are only some embodiments of the present application, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a flowchart of a method for transcoding a video provided by an embodiment of the present application;
FIG. 2 is a schematic flowchart of semantic segmentation provided by an embodiment of the present application;
FIG. 3 is a schematic structural diagram of the pyramid pooling module of a conventional PSPNet model provided by an embodiment of the present application;
FIG. 4 is a schematic structural diagram of the pyramid pooling module of a reconstructed PSPNet model provided by an embodiment of the present application;
FIG. 5 is a schematic structural diagram of the pyramid pooling module of another reconstructed PSPNet model provided by an embodiment of the present application;
FIG. 6 is an output diagram of semantic segmentation of a video frame image provided by an embodiment of the present application;
FIG. 7 is a schematic structural diagram of an apparatus for transcoding a video provided by an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a background server provided by an embodiment of the present application.
Detailed Description
To make the objectives, technical solutions, and advantages of the present application clearer, the embodiments of the present application are described in detail below with reference to the accompanying drawings.
An embodiment of the present application provides a method for transcoding a video, which may be executed by a background server of a video service platform. The background server may have a video transcoding function: after receiving the data stream of a video to be transcoded, it may transcode the video frame images and then provide the transcoded video data to the outside. The background server may also have an image analysis function and may adjust the transcoding of a video according to the image analysis results. The video service platform may be provided with multiple background servers; each background server may execute transcoding tasks for multiple videos, and each video may correspond to multiple transcoding tasks depending on the transcoding requirements. The above background server may include a processor, a memory, and a transceiver: the processor may perform the video transcoding in the following flow, the memory may store the data required and generated in the following processing, and the transceiver may receive and send the related data in the following processing. The method for transcoding a video disclosed in this embodiment can be applied to live video as well as to video on demand.
The processing flow shown in FIG. 1 is described in detail below with reference to specific embodiments; the content may be as follows.
Step 101: Obtain a target frame image of a video to be transcoded, and generate a global feature map of the target frame image based on a feature extraction module of a semantic segmentation model.
In implementation, after receiving the data stream of a certain video, the background server can determine whether the video is a video to be transcoded (that is, whether there is a transcoding requirement) according to the video transcoding rules preset by the video service platform. If so, the background server can transcode the frame images of the video to be transcoded in sequence. Specifically, the background server may first obtain a target frame image of the video to be transcoded and then input the target frame image into the feature extraction module of the semantic segmentation model to generate a global feature map of the target frame image. Here, the target frame image can be any frame image of the video to be transcoded, any key frame image of the video to be transcoded, or any frame image within a specified time period of the video to be transcoded, depending on the transcoding requirements.
Step 102: Perform feature segmentation on the global feature map based on a feature segmentation module of the semantic segmentation model, and determine the multi-level ROI regions of the target frame image.
In implementation, after the background server generates the global feature map of the target frame image, it can input the global feature map into the feature segmentation module of the semantic segmentation model for feature segmentation, and then obtain the multi-level ROI regions of the target frame image according to the segmentation results. The semantic segmentation model in this embodiment can be a PSPNet model; FIG. 2 shows the internal structure of the PSPNet model. The target frame image first passes through the feature extraction module (a ResNet50 model is chosen in this embodiment) to obtain an initial feature map (i.e., the global feature map). The feature map is then pooled at multiple sizes, with pooling kernels of 1*1, 2*2, 4*4, and 8*8, yielding intermediate maps of four sizes. Bilinear interpolation or deconvolution is applied to unify the sizes of all the intermediate maps, after which a concat operation (i.e., a splicing operation that stitches several feature maps together along the channel dimension via a concatenate function) combines the features of the intermediate maps. Finally, after convolution and upsampling, a pixel-level segmentation of the target frame image is completed. Once image semantic segmentation is achieved, the background server can select some of the segmented image regions as the multi-level ROI regions of the target frame image.
Here, the PSPNet model can use the TensorRT engine for accelerated computation to increase the video transcoding rate. Specifically, the adaptive pooling layers in the conventional PSPNet model can be replaced with a set of pooling layers of different fixed sizes, achieving an effect similar to adaptive pooling while enabling TensorRT acceleration; and the bilinear upsampling layers in the conventional PSPNet model can be replaced with transposed convolutional layers, achieving a comparable upsampling effect while enabling TensorRT acceleration. For ease of understanding, the following takes a feature extraction module output of 2048x64x64 (that is, 2048 channels and a 64x64 image size) as an example and gives the processing of the pyramid pooling module of the conventional PSPNet model and of the reconstructed PSPNet model (both illustrated with four-scale feature fusion; the PSPNet model can also support any other number of scales).
1. The pyramid pooling module of the conventional PSPNet model, as shown in FIG. 3:
(1) Adaptive pooling layers generate feature maps at 4 different scales, and 1x1 convolutional layers reduce the channel count, giving 512x1x1, 512x2x2, 512x4x4, and 512x8x8 feature maps;
(2) the 4 feature maps of 512x1x1, 512x2x2, 512x4x4, and 512x8x8 are each passed through a bilinear upsampling layer that enlarges them by different factors, giving 4 feature maps of 512x64x64;
(3) the 4 feature maps of 512x64x64 are concatenated along the channel dimension into one 2048x64x64 feature map;
(4) the 2048x64x64 feature map is concatenated with the input feature map, giving a 4096x64x64 feature map as the output.
2. The pyramid pooling module of the reconstructed PSPNet model, as shown in FIG. 4 (a code sketch follows the list below):
(1) Four pooling layers of different sizes (kernel sizes 64, 32, 16, and 8, with stride equal to kernel size) downsample the input, giving feature maps at 4 different scales, and 1x1 convolutional layers reduce the channel count, giving 512x1x1, 512x2x2, 512x4x4, and 512x8x8 feature maps;
(2) the 4 feature maps of 512x1x1, 512x2x2, 512x4x4, and 512x8x8 are each enlarged by different factors through transposed convolutions, giving 4 feature maps of 512x64x64; here, a transposed convolutional layer achieves a fixed upsampling factor by setting its kernel size, stride, and padding: to enlarge by a factor of n, set kernel size = 2n, stride = n, and padding = floor(n/2);
(3) the 4 feature maps of 512x64x64 are concatenated along the channel dimension into one 2048x64x64 feature map;
(4) the 2048x64x64 feature map is concatenated with the input feature map, giving a 4096x64x64 feature map as the output.
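The following is a minimal PyTorch sketch of this rebuilt module under the assumptions above; the class name is illustrative and average pooling is assumed for the fixed-size pooling layers (the embodiment does not specify the pooling type). Each branch upsamples back to 64x64 with its own transposed convolution using the kernel = 2n, stride = n, padding = floor(n/2) rule:

```python
import torch
import torch.nn as nn

class DeconvPyramidPooling(nn.Module):
    """Steps (1)-(4) above: fixed-size pooling plus one transposed
    convolution per scale (no parameter sharing)."""

    def __init__(self, in_channels=2048, branch_channels=512):
        super().__init__()
        branches = []
        for k in (64, 32, 16, 8):           # stride equals kernel size
            n = k                           # upscale factor back to 64x64
            branches.append(nn.Sequential(
                nn.AvgPool2d(kernel_size=k, stride=k),
                nn.Conv2d(in_channels, branch_channels, 1),
                # Fixed-factor upsampling: kernel = 2n, stride = n,
                # padding = n // 2 enlarges the map exactly n times.
                nn.ConvTranspose2d(branch_channels, branch_channels,
                                   kernel_size=2 * n, stride=n,
                                   padding=n // 2)))
        self.branches = nn.ModuleList(branches)

    def forward(self, x):                   # x: (N, 2048, 64, 64)
        outs = [branch(x) for branch in self.branches]  # 4 x 512x64x64
        return torch.cat(outs + [x], dim=1)             # (N, 4096, 64, 64)

print(DeconvPyramidPooling()(torch.randn(1, 2048, 64, 64)).shape)
```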
In one embodiment, the scales can share one and the same transposed convolutional layer to reduce the parameter count and the amount of computation. The corresponding processing can be as follows: replace the multi-scale bilinear upsampling layers in the conventional PSPNet model with nearest-neighbour upsampling layers and a shared transposed convolutional layer.
In implementation, when the background server replaces the bilinear upsampling layers of the conventional PSPNet model with transposed convolutional layers, the upsampling layers of the multiple scales can be replaced by nearest-neighbour upsampling layers and a shared transposed convolutional layer: the multi-scale feature maps are first upsampled to a common size by the parameter-free nearest-neighbour layers, and the same shared transposed convolutional layer then uniformly enlarges them to the specified size. Specifically, as shown in FIG. 5 and continuing with a feature extraction module output of 2048x64x64 (that is, 2048 channels and a 64x64 image size) as the example, the processing of the pyramid pooling module of this reconstructed PSPNet model can be as follows:
(1) Four pooling layers of different sizes (kernel sizes 64, 32, 16, and 8, with stride equal to kernel size) downsample the input, giving feature maps at 4 different scales, and 1x1 convolutional layers reduce the channel count, giving 512x1x1, 512x2x2, 512x4x4, and 512x8x8 feature maps;
(2) the 4 feature maps of 512x1x1, 512x2x2, 512x4x4, and 512x8x8 are each upsampled to 512x16x16 by nearest-neighbour interpolation;
(3) a shared transposed convolutional layer enlarges the 512x16x16 feature maps by a factor of 4, giving 512x64x64 feature maps;
(4) the four 512x64x64 feature maps are concatenated along the channel dimension into one 2048x64x64 feature map;
(5) the 2048x64x64 feature map is concatenated with the input feature map, giving a 4096x64x64 feature map as the output.
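As for the TensorRT acceleration mentioned above, one common route (an assumption here, not something this embodiment prescribes) is to export the rebuilt model to ONNX and build a TensorRT engine from the ONNX file; the sketch below uses a tiny stand-in module, since any traceable model is exported the same way, and the input size 1x3x512x512 is illustrative:

```python
import torch
import torch.nn as nn

# Stand-in for the rebuilt PSPNet; any traceable nn.Module exports
# the same way.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(8, 2, 1)).eval()
dummy = torch.randn(1, 3, 512, 512)
torch.onnx.export(model, dummy, "pspnet_stub.onnx", opset_version=11,
                  input_names=["image"], output_names=["seg"])
# A TensorRT engine can then be built from the ONNX file, for example
# with the trtexec tool:
#   trtexec --onnx=pspnet_stub.onnx --saveEngine=pspnet.trt
```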
Of course, the semantic segmentation model can also be another model such as a U-Net, LinkNet, PSANet, HRNet, OCNet, DeepLabv3, or DeepLabv3+ model, with the model adapted accordingly before use.
In one embodiment, different levels of ROI regions can correspond to different components of a real object. Accordingly, the processing of step 102 can be as follows: performing feature segmentation on the global feature map based on the feature segmentation module of the semantic segmentation model to determine multiple components of at least one target object in the target frame image; and determining the multi-level ROI regions of the target frame image based on the transcoding priority corresponding to each component.
In implementation, after the background server inputs the global feature map of the target frame image into the feature segmentation module, feature segmentation of the target frame image is realized, yielding multiple components of at least one target object in the target frame image. It should be understood that the segmentation rules for target objects and their components can be learned by the semantic segmentation model from a large volume of training images; that is, through model training, one can define which objects need to be segmented out of the video frame image, and which components of an object need to be segmented as independent pieces. FIG. 6 exemplarily shows the result of semantic segmentation, where the whole image contains two large pieces, the background and a human body, and the human body is further subdivided into five parts: head, upper body, lower body, arms, and legs. Afterwards, the background server can determine the multi-level ROI regions of the target frame image based on the transcoding priority corresponding to each component, where the higher the transcoding priority corresponding to an ROI region, the higher its transcoding rate; the other regions are transcoded at the lowest transcoding rate. It is worth mentioning that the transcoding priorities of the different components of different target objects can be preset in the background server; for example, the human head has the highest transcoding priority, and the priorities of the upper body, lower body, limbs, and other parts decrease in sequence. In addition, components of different objects can share the same transcoding priority; for example, the human head and a car logo in one image can both have the highest priority. That is to say, each level of ROI region in the target frame image can contain multiple components of multiple objects.
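A small sketch of this grouping, assuming the segmentation output is a per-pixel class-id map; the class names, ids, and priorities below are illustrative presets rather than values fixed by this embodiment:

```python
import numpy as np

# Illustrative class-id -> transcoding-priority table (higher = more
# important); the names, ids, and priorities are assumed presets.
CLASS_NAMES = ["background", "head", "upper_body", "lower_body",
               "arms", "legs", "car_logo"]
PRIORITY = {"background": 0, "head": 4, "upper_body": 3,
            "lower_body": 2, "arms": 1, "legs": 1,
            "car_logo": 4}   # components of different objects may share
                             # the same (here: highest) priority

def roi_level_masks(seg_map):
    """Group a per-pixel class-id map into one boolean mask per ROI
    level, so each level may contain components of several objects."""
    masks = {}
    for class_id, name in enumerate(CLASS_NAMES):
        level = PRIORITY[name]
        region = seg_map == class_id
        masks[level] = masks.get(level, np.zeros_like(region)) | region
    return masks

seg = np.random.randint(0, len(CLASS_NAMES), size=(64, 64))
print({lvl: int(m.sum()) for lvl, m in roi_level_masks(seg).items()})
```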
In one embodiment, the background server may regularly train the semantic segmentation model based on the image materials in a preset training material set, where the image materials are annotated with the contour lines of the components of different physical objects.
In implementation, the background server can regularly retrain the semantic segmentation model to improve its accuracy in image semantic segmentation. To this end, the technical personnel of the video service platform can select various kinds of frame images from various types of videos and use an open-source labeling tool (such as labelme) to mark the contour lines of the components of the different physical objects in the frame images, and the annotated frame images can then be stored in the preset training material set as the image materials for training the semantic segmentation model. When marking the contour lines, the technician can selectively mark part of the physical objects in an image as required and can divide a physical object arbitrarily as needed. In this way, the background server can regularly train the semantic segmentation model based on the image materials in the preset training material set.
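For illustration, annotations produced with labelme can be rasterized into per-pixel training masks roughly as follows; this is a sketch assuming polygon shapes in labelme's JSON layout, and the file name and label-to-id table are hypothetical:

```python
import json
import numpy as np
from PIL import Image, ImageDraw

def labelme_to_mask(json_path, label_ids):
    """Rasterize labelme polygon annotations into a class-id mask
    (unknown labels fall back to background id 0)."""
    with open(json_path, encoding="utf-8") as f:
        ann = json.load(f)
    mask = Image.new("L", (ann["imageWidth"], ann["imageHeight"]), 0)
    draw = ImageDraw.Draw(mask)
    for shape in ann["shapes"]:
        points = [tuple(p) for p in shape["points"]]
        draw.polygon(points, fill=label_ids.get(shape["label"], 0))
    return np.array(mask)

# Hypothetical usage with an illustrative label -> id table:
# mask = labelme_to_mask("frame_0001.json",
#                        {"head": 1, "upper_body": 2, "lower_body": 3,
#                         "arms": 4, "legs": 5})
```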
In one embodiment, different semantic segmentation models can be selected for the semantic segmentation of different types of frame images. Accordingly, before feature extraction is performed on the target frame image, the following processing can take place: calling the target semantic segmentation model corresponding to the target video type of the video to be transcoded and the target image type of the target frame image.
In implementation, the background server can train dedicated semantic segmentation models for the frame images of videos of different video types, one semantic segmentation model being dedicated to the semantic segmentation of frame images in videos of one video type; for example, there can be different PSPNet models corresponding to food, gaming, sports, and other video types. In this way, before semantically segmenting the video to be transcoded, the background server can first detect the target video type to which the video to be transcoded belongs and then call the target semantic segmentation model corresponding to that video type. It follows that when the background server trains the semantic segmentation models, it can set up training material sets corresponding to multiple video types, where one training material set contains image materials of the same video type, so that the background server can use each training material set to train the semantic segmentation model for the corresponding video type.
Of course, the background server can also train dedicated semantic segmentation models for frame images of different image types, one semantic segmentation model being dedicated to the semantic segmentation of frame images of one image type; for example, there can be different PSPNet models corresponding to portrait images, food images, indoor-scene images, and so on. In this way, after obtaining the target frame image, the background server can first detect the target image type to which the target frame image belongs and then call the target semantic segmentation model corresponding to that image type. Likewise, when the background server trains the semantic segmentation models, it can set up training material sets corresponding to multiple image types, where one training material set contains image materials of the same image type, so that the background server can use each training material set to train the semantic segmentation model for the corresponding image type.
In this embodiment, dedicated semantic segmentation models can also be trained for frame images of different image types under different video types, one semantic segmentation model being dedicated to the semantic segmentation of frame images of one image type under one video type; for example, there can be different semantic segmentation models for the portrait images of food videos, the portrait images of sports videos, and so on. In this way, after obtaining the target frame image, the background server can first detect the target video type to which the video to be transcoded belongs and the target image type to which the target frame image belongs, and then call the target semantic segmentation model corresponding to both the target video type and the target image type. Likewise, when the background server trains the semantic segmentation models, it can set up training material sets corresponding to multiple image types under multiple video types, where one training material set contains image materials of the same image type under the same video type, so that the background server can use each training material set to train the semantic segmentation model for the image type under the corresponding video type.
Step 103: Transcode the multi-level ROI regions and the other regions of the target frame image at different transcoding rates.
In implementation, after the background server determines the multi-level ROI regions of the target frame image, it can transcode the ROI regions of each level and the regions of the target frame image other than the ROI regions at different transcoding rates. Here, the level of an ROI region can be determined by the image content within the region; for example, if the target frame image is a full-length portrait of a person, the head can be set as the highest-level ROI region, the upper body as the second-highest, the lower body as the middle level, the limbs as the low level, and the background as the lowest level. Different levels of ROI regions can be given different transcoding rates, and adjacent ROI regions can also share the same transcoding rate.
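One way such per-level rates could be fed to an encoder's rate control (an illustrative assumption; the embodiment does not mandate a particular encoder interface) is as a per-pixel quantizer-offset map derived from the level masks, where higher-priority regions get finer quantization:

```python
import numpy as np

# Illustrative quantizer offsets per ROI level: higher-priority regions
# get a negative offset (finer quantization, i.e. a higher local rate).
QP_OFFSET = {4: -6, 3: -4, 2: -2, 1: 0, 0: +4}   # assumed values

def qp_offset_map(level_masks, height, width):
    """Turn {ROI level: boolean mask} into a per-pixel QP-offset map
    that a rate-control step could consume."""
    qp = np.full((height, width), QP_OFFSET[0], dtype=np.int8)
    for level in sorted(level_masks):      # higher priority written last
        qp[level_masks[level]] = QP_OFFSET[level]
    return qp

demo = {0: np.ones((4, 4), dtype=bool), 4: np.eye(4, dtype=bool)}
print(qp_offset_map(demo, 4, 4))   # -6 on the diagonal, +4 elsewhere
```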
In one embodiment, a transcoding area can be used to set the transcoding rate of each region of the image. Accordingly, the processing of step 103 can be as follows: selecting a target-level transcoding rate in turn, in order of transcoding rate from high to low; and, according to the transcoding area corresponding to the target-level transcoding rate, selecting the transcoding region corresponding to the target-level transcoding rate from the as-yet-unselected ROI regions in order of transcoding priority from high to low.
Here, the target-level transcoding rate can be the transcoding rate of any level.
In implementation, the background server can support multiple levels of transcoding rate for video frame image transcoding. When transcoding the regions of the target frame image, it can first select a target-level transcoding rate in turn, in order of transcoding rate from high to low. The background server can then obtain the preset transcoding area corresponding to the target-level transcoding rate and, in order of transcoding priority from high to low, select a first ROI region from all the unselected ROI regions and compare the area of the first ROI region with the transcoding area. If the area of the first ROI region is smaller than the transcoding area, a second ROI region with the next-highest transcoding priority is selected, and the sum of the areas of the first and second ROI regions is compared with the transcoding area. If the sum of the areas of the two ROI regions is still smaller than the transcoding area, a third ROI region of the next priority level can be selected and the comparison continued, and so on, until the total area of all the selected ROI regions exceeds the transcoding area. The background server can then use all the selected ROI regions as the transcoding region corresponding to the target-level transcoding rate.
In detail, with the above processing, each selected transcoding region will be larger than the preset transcoding area, which may put load pressure on device performance and line bandwidth. Therefore, within the lowest-priority ROI region selected at each transcoding rate level, an excess-area region can be carved out and reassigned to the transcoding region of the next-lower transcoding rate; for example, the edge area of that ROI region can be chosen as the excess-area region.
In one embodiment, the background server may periodically adjust the transcoding area corresponding to each level of transcoding rate according to the current device performance load and line bandwidth load, as well as the attribute information of the videos to be transcoded.
In implementation, the background server can periodically detect the current device performance load and line bandwidth load and adjust the transcoding of each video to be transcoded on the current device according to the detection results. Specifically, the background server may first determine, according to the attribute information of all the videos to be transcoded, one or more videos whose transcoding needs adjusting; it can then, for each such video, adjust the transcoding area corresponding to each level of transcoding rate according to the video's attribute information and the aforementioned detection results. For example, if both the current device performance load and line bandwidth load are low, videos with higher popularity can be selected and the transcoding areas corresponding to the high transcoding rates increased; if both are high, videos with lower popularity can be selected and the transcoding areas corresponding to the high transcoding rates reduced. Of course, besides video popularity, the background server can also select the videos to be adjusted based on multi-dimensional attribute information such as the owning party, release time, video type, and video duration.
In one embodiment, in different types of videos the components of different types of physical objects correspond to different transcoding priorities. Accordingly, after the multiple components of at least one target object in the target frame image are determined, the following processing can take place: adjusting the transcoding priority corresponding to the components of each target object according to the video type of the video to be transcoded and the object type of each target object.
In implementation, after the background server performs semantic segmentation on the target frame image and obtains the multiple components of the target objects in the target frame image, it can obtain the video type of the video to be transcoded and the object type of each target object, and then adjust the transcoding priority corresponding to each component of each target object based on these two kinds of type information. It is understandable that different types of videos emphasize different on-screen content: in dance videos, for example, the torso and limbs of the human body draw more attention than the head, so the transcoding priorities of the torso and limbs can be raised and that of the head lowered; in talk-show videos, the transcoding priority of the human head can be higher than those of the torso and limbs; and in travel videos, compared with portrait videos, the transcoding priority of the human body can be lower than that of the scenery. In this embodiment, the background server can also adjust the transcoding priority corresponding to the components of each physical object based on the image type of each frame image and the object types of the physical objects in the frame image.
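Such type-dependent adjustments could be kept in a simple rule table, as in the sketch below; the video types, component names, and priority deltas are all illustrative assumptions:

```python
# Illustrative rule table: (video type, object type) -> per-component
# priority tweaks; all names and deltas here are assumptions.
PRIORITY_RULES = {
    ("dance", "person"): {"torso": +1, "limbs": +1, "head": -1},
    ("talk_show", "person"): {"head": +1},
    ("travel", "person"): {"head": -1, "torso": -1, "limbs": -1},
}

def adjust_priorities(base, video_type, object_type):
    """Apply video-type/object-type specific tweaks to the preset
    component transcoding priorities."""
    tweaks = PRIORITY_RULES.get((video_type, object_type), {})
    return {part: prio + tweaks.get(part, 0) for part, prio in base.items()}

print(adjust_priorities({"head": 4, "torso": 3, "limbs": 1},
                        "dance", "person"))
# -> {'head': 3, 'torso': 4, 'limbs': 2}
```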
In one embodiment, the transcoding rate of a designated region in the video frame image can be adjusted according to actual needs. Accordingly, the processing of step 103 can be as follows: receiving feature sharpening/blurring information for the video to be transcoded, and determining a target feature region in the target frame image according to the feature sharpening/blurring information; and transcoding the multi-level ROI regions, the target feature region, and the other regions of the target frame image at different transcoding rates.
In implementation, the video provider or the technical personnel of the video service platform can set feature sharpening/blurring information for the video to be transcoded at the background server, so as to raise or lower the transcoding clarity of one or more pieces of feature content in the video frame images. In this way, when transcoding the video, the background server can receive the feature sharpening/blurring information of the video to be transcoded and then determine, according to that information, the target feature region in the target frame image, the target feature region being the region that contains the feature content the sharpening/blurring information points to. Afterwards, the background server can transcode the multi-level ROI regions, the target feature region, and the other regions of the target frame image at different transcoding rates. Here, the feature sharpening/blurring information may directly contain a specific value for the transcoding rate of the target feature region, or it may contain an adjustment range for that transcoding rate.
In the embodiments of the present application, the background server obtains the target frame image of the video to be transcoded and generates a global feature map of the target frame image based on the feature extraction module of the semantic segmentation model; the feature segmentation module of the semantic segmentation model performs feature segmentation on the global feature map to determine the multi-level ROI regions of the target frame image; and the multi-level ROI regions and the other regions of the target frame image are transcoded at different transcoding rates. In this way, when transcoding a video, the background server uses the semantic segmentation model to segment multi-level ROI regions out of the video frame image, so that the extent of each ROI region can be determined flexibly and accurately for different scenes, maximally preserving the area of effective features within the ROI regions. Each ROI region is then transcoded at its own transcoding rate, so that every level of content in the video frame image occupies a reasonable share of the bit rate, which both guarantees the picture quality of the video and reduces the bandwidth consumed in video transmission.
Based on the same technical concept, an embodiment of the present application also provides an apparatus for transcoding a video. As shown in FIG. 7, the apparatus includes:
a semantic segmentation module 701, configured to obtain a target frame image of a video to be transcoded, generate a global feature map of the target frame image based on a feature extraction module of a semantic segmentation model, perform feature segmentation on the global feature map based on a feature segmentation module of the semantic segmentation model, and determine multi-level ROI regions of the target frame image;
a video transcoding module 702, configured to transcode the multi-level ROI regions and the other regions of the target frame image at different transcoding rates.
In one embodiment, the semantic segmentation module 701 is specifically configured to:
perform feature segmentation on the global feature map based on the semantic segmentation model to determine multiple components of at least one target object in the target frame image;
determine the multi-level ROI regions of the target frame image based on the transcoding priority corresponding to each of the components.
In one embodiment, the video transcoding module 702 is specifically configured to:
select a target-level transcoding rate in turn, in order of transcoding rate from high to low;
according to the transcoding area corresponding to the target-level transcoding rate, select the transcoding region corresponding to the target-level transcoding rate from the unselected ROI regions in order of transcoding priority from high to low.
In the embodiments of the present application, the background server obtains the target frame image of the video to be transcoded and generates a global feature map of the target frame image based on the feature extraction module of the semantic segmentation model; the feature segmentation module of the semantic segmentation model performs feature segmentation on the global feature map to determine the multi-level ROI regions of the target frame image; and the multi-level ROI regions and the other regions of the target frame image are transcoded at different transcoding rates. In this way, when transcoding a video, the background server uses the semantic segmentation model to segment multi-level ROI regions out of the video frame image, so that the extent of each ROI region can be determined flexibly and accurately for different scenes, maximally preserving the area of effective features within the ROI regions. Each ROI region is then transcoded at its own transcoding rate, so that every level of content in the video frame image occupies a reasonable share of the bit rate, which both guarantees the picture quality of the video and reduces the bandwidth consumed in video transmission.
FIG. 8 is a schematic structural diagram of a background server provided by an embodiment of the present application. The background server 800 may vary considerably in configuration or performance, and may include one or more central processing units 822 (for example, one or more processors), a memory 832, and one or more storage media 830 (for example, one or more mass storage devices) storing application programs 842 or data 844. The memory 832 and the storage medium 830 may be transient storage or persistent storage. The program stored in the storage medium 830 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the background server 800. In this embodiment, the central processing unit 822 may be configured to communicate with the storage medium 830 and execute, on the background server 800, the series of instruction operations in the storage medium 830.
The background server 800 may also include one or more power supplies 829, one or more wired or wireless network interfaces 850, one or more input/output interfaces 858, one or more keyboards 856, and/or one or more operating systems 841, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and so on.
The background server 800 may include a memory and one or more programs, where the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs containing instructions for carrying out the above transcoding of the video.
A person of ordinary skill in the art can understand that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware, where the program can be stored in a computer-readable storage medium, and the storage medium mentioned above can be a read-only memory, a magnetic disk, an optical disc, or the like.
The above are only some embodiments of the present application and are not intended to limit the present application. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present application shall fall within the scope of protection of the present application.

Claims (15)

  1. A method for transcoding a video, comprising:
    obtaining a target frame image of a video to be transcoded, and generating a global feature map of the target frame image based on a feature extraction module of a semantic segmentation model;
    performing feature segmentation on the global feature map based on a feature segmentation module of the semantic segmentation model, and determining multi-level ROI regions of the target frame image;
    transcoding the multi-level ROI regions and the other regions of the target frame image at different transcoding rates.
  2. The method according to claim 1, wherein the performing feature segmentation on the global feature map based on the feature segmentation module of the semantic segmentation model and determining the multi-level ROI regions of the target frame image comprises:
    performing feature segmentation on the global feature map based on the feature segmentation module of the semantic segmentation model, and determining multiple components of at least one target object in the target frame image;
    determining the multi-level ROI regions of the target frame image based on a transcoding priority corresponding to each of the components.
  3. The method according to claim 2, further comprising:
    regularly training the semantic segmentation model based on image materials in a preset training material set, wherein the image materials are annotated with contour lines of the components of different physical objects.
  4. The method according to claim 1, wherein before the generating the global feature map of the target frame image based on the feature extraction module of the semantic segmentation model, the method further comprises:
    calling a target semantic segmentation model corresponding to a target video type of the video to be transcoded and a target image type of the target frame image.
  5. The method according to claim 2, wherein the higher the transcoding priority corresponding to an ROI region, the higher its transcoding rate; and the other regions are transcoded at the lowest transcoding rate.
  6. The method according to claim 5, wherein the transcoding the multi-level ROI regions and the other regions of the target frame image at different transcoding rates comprises:
    selecting a target-level transcoding rate in turn, in order of transcoding rate from high to low;
    according to a transcoding area corresponding to the target-level transcoding rate, selecting a transcoding region corresponding to the target-level transcoding rate from the unselected ROI regions in order of transcoding priority from high to low.
  7. The method according to claim 6, further comprising:
    periodically adjusting the transcoding area corresponding to each level of transcoding rate according to the current device performance load and line bandwidth load, as well as the attribute information of the video to be transcoded.
  8. The method according to claim 2, wherein after the determining the multiple components of the at least one target object in the target frame image, the method further comprises:
    adjusting the transcoding priority corresponding to the components of each target object according to the video type of the video to be transcoded and the object type of each target object.
  9. The method according to claim 1, wherein the transcoding the multi-level ROI regions and the other regions of the target frame image at different transcoding rates comprises:
    receiving feature sharpening/blurring information of the video to be transcoded, and determining a target feature region in the target frame image according to the feature sharpening/blurring information;
    transcoding the multi-level ROI regions, the target feature region, and the other regions of the target frame image at different transcoding rates.
  10. The method according to claim 1, wherein the semantic segmentation model uses a TensorRT engine for accelerated computation.
  11. An apparatus for transcoding a video, comprising:
    a semantic segmentation module, configured to obtain a target frame image of a video to be transcoded, generate a global feature map of the target frame image based on a feature extraction module of a semantic segmentation model, perform feature segmentation on the global feature map based on a feature segmentation module of the semantic segmentation model, and determine multi-level ROI regions of the target frame image;
    a video transcoding module, configured to transcode the multi-level ROI regions and the other regions of the target frame image at different transcoding rates.
  12. The apparatus according to claim 11, wherein the semantic segmentation module is specifically configured to:
    perform feature segmentation on the global feature map based on the feature segmentation module of the semantic segmentation model, and determine multiple components of at least one target object in the target frame image;
    determine the multi-level ROI regions of the target frame image based on a transcoding priority corresponding to each of the components.
  13. The apparatus according to claim 12, wherein the video transcoding module is specifically configured to:
    select a target-level transcoding rate in turn, in order of transcoding rate from high to low;
    according to a transcoding area corresponding to the target-level transcoding rate, select a transcoding region corresponding to the target-level transcoding rate from the unselected ROI regions in order of transcoding priority from high to low.
  14. A background server, comprising a processor and a memory, wherein the memory stores at least one instruction, at least one program, a code set, or an instruction set that is loaded and executed by the processor to implement the method for transcoding a video according to any one of claims 1 to 10.
  15. A computer-readable storage medium, storing at least one instruction, at least one program, a code set, or an instruction set that is loaded and executed by a processor to implement the method for transcoding a video according to any one of claims 1 to 10.
PCT/CN2020/097172 2020-04-30 2020-06-19 Method and apparatus for transcoding video WO2021217829A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP20792515.7A EP3923585A1 (en) 2020-04-30 2020-06-19 Video transcoding method and device
US17/088,442 US11166035B1 (en) 2020-04-30 2020-11-03 Method and device for transcoding video

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010365039.9 2020-04-30
CN202010365039.9A CN111629212B (zh) Method and apparatus for transcoding video

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/088,442 Continuation US11166035B1 (en) 2020-04-30 2020-11-03 Method and device for transcoding video

Publications (1)

Publication Number Publication Date
WO2021217829A1 (zh)

Family

ID=72273001

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/097172 WO2021217829A1 (zh) Method and apparatus for transcoding video

Country Status (3)

Country Link
EP (1) EP3923585A1 (zh)
CN (1) CN111629212B (zh)
WO (1) WO2021217829A1 (zh)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022067656A1 * 2020-09-30 2022-04-07 华为技术有限公司 Image processing method and apparatus
CN112291563A (zh) * 2020-10-22 2021-01-29 咪咕视讯科技有限公司 Video encoding method and device, and computer-readable storage medium
CN112954354B (zh) * 2021-02-01 2022-07-22 北京字跳网络技术有限公司 Video transcoding method, apparatus, device, and medium
WO2023053364A1 (ja) * 2021-09-30 2023-04-06 楽天グループ株式会社 Information processing device, information processing method, and information processing program
CN114897916A (zh) * 2022-05-07 2022-08-12 虹软科技股份有限公司 Image processing method and apparatus, non-volatile readable storage medium, and electronic device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140177706A1 (en) * 2012-12-21 2014-06-26 Samsung Electronics Co., Ltd Method and system for providing super-resolution of quantized images and video
CN105306960A (zh) * 2015-10-18 2016-02-03 北京航空航天大学 一种用于传输高质量在线课程视频的动态自适应流系统
CN106791856A (zh) * 2016-12-28 2017-05-31 天津天地伟业生产力促进有限公司 一种基于自适应感兴趣区域的视频编码方法

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102984495A (zh) * 2012-12-06 2013-03-20 北京小米科技有限责任公司 Video image processing method and apparatus
CN108475331B (zh) * 2016-02-17 2022-04-05 英特尔公司 Method, apparatus, system, and computer-readable medium for object detection
US10380741B2 (en) * 2016-12-07 2019-08-13 Samsung Electronics Co., Ltd System and method for a deep learning machine for object detection
US10321728B1 (en) * 2018-04-20 2019-06-18 Bodygram, Inc. Systems and methods for full body measurements extraction
CN109862019B (zh) * 2019-02-20 2021-10-22 联想(北京)有限公司 Data processing method, apparatus, and system
CN109889839B (zh) * 2019-03-27 2020-11-20 上海交通大学 Region-of-interest image encoding and decoding system and method based on deep learning
CN110784745B (zh) * 2019-11-26 2021-12-07 科大讯飞股份有限公司 Video transmission method, apparatus, system, device, and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140177706A1 (en) * 2012-12-21 2014-06-26 Samsung Electronics Co., Ltd Method and system for providing super-resolution of quantized images and video
CN105306960A (zh) * 2015-10-18 2016-02-03 北京航空航天大学 一种用于传输高质量在线课程视频的动态自适应流系统
CN106791856A (zh) * 2016-12-28 2017-05-31 天津天地伟业生产力促进有限公司 一种基于自适应感兴趣区域的视频编码方法

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ANONYMOUS: "Development of Model Compression and Inference Acceleration Algorithms of Image Super Resolution Deep Neural Networks", CHINA MASTER’S THESES FULL-TEXT DATABASE, 15 March 2020 (2020-03-15), pages 1 - 101, XP055794585 *
LI SHI-JING; QING LIN-BO; HE XIAO-HAI; HAN JIE: "Road Scene Segmentation Based on NVIDIA Jetson TX2", COMPUTER SYSTEMS & APPLICATIONS, vol. 28, no. 1, 15 January 2019 (2019-01-15), pages 239 - 244, XP055794567, ISSN: 1003-3254, DOI: 10.15888/j.cnki.csa.006730 *

Also Published As

Publication number Publication date
CN111629212A (zh) 2020-09-04
EP3923585A4 (en) 2021-12-15
CN111629212B (zh) 2023-01-20
EP3923585A1 (en) 2021-12-15

Similar Documents

Publication Publication Date Title
WO2021217829A1 (zh) Method and apparatus for transcoding video
US20220261960A1 (en) Super-resolution reconstruction method and related apparatus
JP6811796B2 (ja) Real-time overlay placement in video for augmented reality applications
KR102145220B1 (ko) Method and apparatus for converting a two-dimensional image into a three-dimensional image using deep learning
CN112235520A (zh) Image processing method and apparatus, electronic device, and storage medium
US20230082715A1 (en) Method for training image processing model, image processing method, apparatus, electronic device, and computer program product
CN109523558A (zh) Portrait segmentation method and system
CN115294055A (zh) Image processing method and apparatus, electronic device, and readable storage medium
JP2023001926A (ja) Image fusion method and apparatus, training method and apparatus for an image fusion model, electronic device, storage medium, and computer program
Xue et al. Enhancement and fusion of multi-scale feature maps for small object detection
WO2021217828A1 (zh) Method and apparatus for transcoding video
CN113344794B (zh) Image processing method and apparatus, computer device, and storage medium
US11166035B1 (en) Method and device for transcoding video
JP7367187B2 (ja) Unoccluded video overlay
US11232544B1 (en) History clamping for denoising dynamic ray-traced scenes using temporal accumulation
WO2020259123A1 (zh) Image quality adjustment method and apparatus, and readable storage medium
CN111696034A (zh) Image processing method and apparatus, and electronic device
CN104123707B (zh) Single-image super-resolution reconstruction method based on local rank prior
WO2023273515A1 (zh) Target detection method and apparatus, electronic device, and storage medium
US20230021463A1 (en) Multi-frame image super resolution system
US20210344937A1 (en) Method and device for transcoding video
Guo et al. No-reference omnidirectional video quality assessment based on generative adversarial networks
US20230336799A1 (en) Video streaming scaling using virtual resolution adjustment
US20230053776A1 (en) Region-of-interest (roi) guided sampling for ai super resolution transfer learning feature adaptation
CN113505680B (zh) Content-based method for detecting inappropriate content in long-duration complex-scene video

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2020792515

Country of ref document: EP

Effective date: 20201027

NENP Non-entry into the national phase

Ref country code: DE