WO2021164216A1 - Video encoding method, apparatus, device, and medium - Google Patents

Video encoding method, apparatus, device, and medium

Info

Publication number
WO2021164216A1
Authority
WO
WIPO (PCT)
Prior art keywords
area
image frame
code rate
encoding
coding
Prior art date
Application number
PCT/CN2020/108788
Other languages
English (en)
French (fr)
Inventor
全赛彬
岳泊暄
王成
龚骏辉
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2021164216A1

Links

Images

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/146 Data rate or code amount at the encoder output
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/24 Aligning, centring, orientation detection or correction of the image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/124 Quantisation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/17 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding, the unit being an image region, e.g. an object
    • H04N19/172 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding, the unit being an image region, e.g. an object, the region being a picture, frame or field
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/42 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation

Definitions

  • This application relates to the field of computer technology, and in particular to a video encoding method, apparatus, device, and computer-readable storage medium.
  • In a typical deployment, a front-end capture device, such as a camera, collects video data and then stores the data in a back-end storage device so that it can be provided to the corresponding user.
  • Videos usually need to be kept for a period of time. For example, in the security industry, videos need to be kept for at least 30 days, panoramic images (large images) for at least 90 days, and target images (small images) for at least 365 days. If the video data collected by the front-end capture device were transmitted directly to the back-end storage device for storage, it would impose considerable transmission and storage pressure and increase storage and transmission costs.
  • To this end, the embodiments of the present application provide a video encoding method that guarantees encoding quality while realizing video compression, thereby reducing the storage and transmission costs of the video.
  • This application also provides corresponding devices, equipment, media, computer-readable storage media, and computer program products.
  • this application provides a video encoding method.
  • This method supports the extraction of N-level regions of interest from the target image frame.
  • The N-level regions of interest include at least an object area, a feature part area of the object, and an isolation area used to isolate the background area of the target image frame from the feature part area of the object.
  • Code stream features are preset for the above N-level regions of interest, and the code stream features include at least a code rate.
  • The code rates set for the N-level regions of interest satisfy a preset relationship.
  • The preset relationship is specifically that the code rate of the background area is less than the code rate of the object area and the code rate of the isolation area, and the code rate of the object area is less than the code rate of the feature part area of the object.
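  • Expressed compactly (using the subscript notation that appears later in the detailed description), the preset relationship above can be written as:

```latex
\mathrm{bitrate}_{\mathrm{bg}} < \mathrm{bitrate}_{\mathrm{obj}} < \mathrm{bitrate}_{\mathrm{feat}},
\qquad
\mathrm{bitrate}_{\mathrm{bg}} < \mathrm{bitrate}_{\mathrm{isol}}
```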
  • In this way, the coding strategy of the target image frame can be determined according to the code stream features set for the above N-level regions of interest, and the target image frame is coded according to the coding strategy to obtain a target video stream; the code rate of the N-level regions of interest in the image frames of the target video stream satisfies the foregoing preset relationship.
  • Among them, the code rate set for the background area is the lowest, and the coding strategy determined from code stream features that include this lowest code rate can include a larger quantization parameter; the quantization parameter is strongly correlated with coding quality and compression rate, so encoding with a larger quantization parameter retains only minimal subjective quality while improving the video compression rate.
  • The code rate set for the object area is higher than the code rate of the background area; for example, it can be the second lowest.
  • The coding strategy determined from code stream features that include this second-lowest code rate can include a mid-range quantization parameter, and encoding with a mid-range quantization parameter strikes a balance between compression rate and object recognition rate.
  • The code rate set for the feature part area of the object is higher still (higher than the code rate set for the object area), and the coding strategy determined from code stream features that include this higher code rate may include a smaller quantization parameter; encoding with a smaller quantization parameter guarantees coding quality and ensures the object recognition rate.
  • According to the encoding algorithm, when the pixels in any region of the target image frame are encoded, nearby pixels are used as references, and the coding quality of the two is positively correlated. That is, with the coding strategy unchanged, the higher the code rate of the pixels near a region, the higher the image quality of that region after coding.
  • For this reason, the embodiment of the present application also increases the code rate of the background area near the feature part area of the object, thereby improving the coding quality of the feature part area of the object.
  • The area near the feature part area of the object whose set code rate is increased is called the isolation area.
  • The isolation area is generally a ring-shaped area surrounding the feature part area of the object; by enclosing the feature part area, it isolates the feature part area of the object from the background area.
  • The isolation area generally includes a part of the background area and, in some cases, also a part of the object area.
  • The remaining part of the background area is still called the background area (unless otherwise specified, the term background area in the embodiments of this application refers to this remaining area), and the remaining part of the object area is still called the object area (unless otherwise specified, the term object area in the embodiments of this application refers to this remaining area).
  • The area near the feature part area of the object does not occupy a large proportion of the entire target image frame, so even if its code rate is increased, the target image frame as a whole still achieves a high compression rate.
  • The average compression rate of the entire image increases little compared with not increasing that code rate.
  • Since the feature part area of the object contributes heavily to object recognition, the embodiment of the present application significantly improves the coding quality of the image frame while only slightly increasing its size.
  • In some possible implementations, the N-level regions of interest can be extracted as follows: first, the object area and the feature part area of the object are extracted from the target image frame, and then expansion (dilation) processing is performed on the feature part area of the object, that is, the feature part area is expanded outward through convolution.
  • The area gained by this outward expansion, that is, the difference between the feature part area of the object before and after the expansion processing, is the isolation area. The extracted feature part area of the object may be incomplete, especially when it is extracted by model inference. For this reason, the embodiment of the present application obtains the isolation area by performing expansion processing on the feature part area of the object and sets a higher code rate for it, so as to compensate for the incomplete part of the region of interest and prevent it from being encoded into a low-code-rate video stream, which would affect the overall coding quality.
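  • As a concrete illustration of the expansion step, the sketch below derives the isolation area from a binary mask of the feature part area by morphological dilation. This is a minimal sketch under assumptions: it uses OpenCV's cv2.dilate, and the 15*15 kernel size is illustrative, not specified by the application.

```python
import cv2
import numpy as np

def isolation_area(feat_mask: np.ndarray, kernel_size: int = 15) -> np.ndarray:
    """feat_mask: binary 0/1 uint8 mask of the feature part area (e.g. faces)."""
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    dilated = cv2.dilate(feat_mask, kernel)  # expand the region outward
    return dilated - feat_mask               # ring-shaped difference area
```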
  • the coding strategy includes coding parameters.
  • the coding parameters of the target image frame can be determined by using the coding strategy model according to the code stream characteristics of the N-level region of interest.
  • the target image frame can be quantized according to the encoding parameter, thereby realizing video encoding.
  • This method uses artificial intelligence (AI) to learn coding-strategy decision-making experience. Based on this experience, coding strategies can be determined automatically for different regions from the preset code rate features, without manually setting coding strategies by experience; end-to-end automatic video coding improves coding efficiency.
  • the coding strategy model based on AI can be applied to a variety of scenarios to meet the general requirements of different scenarios.
  • In some cases, the isolation area overlaps with the background area, or the isolation area overlaps with the object area; in such cases, the coding parameters of the overlapping region can be determined based on the code stream features set for the region of interest with the higher code rate.
  • In some possible implementations, the coding strategy model can be obtained by training as follows: obtain a training video stream, encode the training video stream using the coding strategies in a coding strategy library, determine the compression rate and quality of each encoded video stream, determine from the compression rate and quality the target coding strategy that matches the training video stream, and finally use the training video stream and the target coding strategy (as the label of the training video stream) for model training to obtain the coding strategy model.
  • the reference image frame may be down-sampled to obtain the target image frame. Since the size of the target image frame obtained by downsampling is small and the resolution is relatively low, encoding based on the target image frame can reduce the amount of calculation, improve the encoding efficiency, and shorten the encoding time, thereby meeting real-time service requirements.
  • the reference image frame may be an original image frame generated by an image sensor of a video capture device such as a complementary metal oxide semiconductor (CMOS) sensor or a charge-coupled device (CCD) sensor through photoelectric conversion.
  • the video capture device may also process the original image frame through an image signal processor (ISP), for example, perform processing such as removing bad pixels, denoising, rotating, sharpening, and/or color enhancement.
  • the reference image frame may also be an image frame processed by ISP. It should also be noted that the reference image frame can be directly obtained by the video capture device, or can be obtained by decoding the input video stream.
  • the code rate set for the isolated area may also be higher than the code rate set for the target area.
  • this application provides a video encoding device.
  • the device includes an extraction module, a determination module and an encoding module.
  • the extraction module is used to extract N-level regions of interest from the target image frame.
  • the N-level interest area includes at least an object area, a characteristic part area of the object, and an isolation area.
  • the isolation area is used to isolate a background area in the target image frame and a characteristic part area of the object.
  • the determining module is configured to determine the coding strategy of the target image frame according to the code stream characteristics set for the N-level region of interest, and the code stream characteristics include at least a code rate.
  • The code rates set for the N-level regions of interest satisfy a preset relationship, and the preset relationship includes that the code rate of the background area is less than the code rate of the object area and the code rate of the isolation area, and the code rate of the object area is less than the code rate of the feature part area of the object.
  • the encoding module is configured to encode the target image frame according to the encoding strategy to obtain a target video stream, and the code rate of the N-level region of interest in the image frame of the target video stream meets the preset relation.
  • In some possible implementations, the extraction module is specifically configured to extract the object area and the feature part area of the object from the target image frame, and perform expansion processing on the feature part area of the object to obtain the isolation area.
  • The isolation area is the difference between the feature part area of the object before and after the expansion processing.
  • In some possible implementations, the encoding strategy includes encoding parameters.
  • The determining module is specifically configured to use a coding strategy model to determine the coding parameters of the target image frame according to the code stream features set for the N-level regions of interest.
  • The encoding module is specifically configured to quantize the target image frame according to the coding parameters, thereby realizing video coding.
  • In some possible implementations, when regions of interest overlap, the determining module is specifically configured to use the coding strategy model to determine the coding parameters of the overlapping region according to the code stream features set for the region of interest with the higher code rate.
  • the device further includes:
  • The training module is configured to obtain a training video stream, encode the training video stream using the coding strategies in the coding strategy library, determine the compression rate and quality of each encoded video stream, determine from these the target coding strategy that matches the training video stream, and train with the training video stream and the target coding strategy to obtain the coding strategy model.
  • the device further includes:
  • the down-sampling module is used to down-sample the reference image frame to obtain the target image frame.
  • In some possible implementations, the preset relationship further includes: the code rate of the object area is less than the code rate of the isolation area.
  • the present application provides a device including a processor and a memory.
  • the processor and the memory communicate with each other.
  • the processor is configured to execute instructions stored in the memory, so that the device executes the video encoding method in the first aspect or any implementation manner of the first aspect.
  • The present application provides a computer-readable storage medium having instructions stored therein which, when run on a device such as a computer device, cause the device to execute the video encoding method described in the first aspect or any one of its implementation manners.
  • the present application provides a computer program product containing instructions, which when run on a device, causes the device to execute the video encoding method described in the first aspect or any one of the implementation manners of the first aspect.
  • FIG. 1 is a schematic diagram of a scene of a video encoding method provided by an embodiment of the application
  • FIG. 2A is a schematic diagram of a scene of a video encoding method provided by an embodiment of this application;
  • FIG. 2B is a schematic diagram of a scene of a video encoding method provided by an embodiment of this application;
  • FIG. 2C is a schematic diagram of a scene of a video encoding method provided by an embodiment of this application;
  • FIG. 2D is a schematic diagram of a scene of a video encoding method provided by an embodiment of this application;
  • FIG. 3 is a schematic structural diagram of a video encoding device provided by an embodiment of the application.
  • FIG. 4 is a flowchart of a video encoding method provided by an embodiment of the application.
  • FIG. 5 is a schematic diagram of an N-level region of interest provided by an embodiment of this application.
  • FIG. 6 is a schematic structural diagram of a device provided by an embodiment of the application.
  • FIG. 7 is a schematic structural diagram of a device provided by an embodiment of the application.
  • Video specifically refers to a sequence of image frames whose rate of change reaches a preset rate.
  • The preset rate may specifically be 24 frames per second. That is, when consecutive image frames change at 24 frames per second or more, according to the principle of persistence of vision, the human eye cannot distinguish a single static picture; the consecutive image frames then present a smooth, continuous visual effect, and can therefore be called a video.
  • Video captured by video capture devices such as cameras often occupies more storage space, which brings greater challenges to video transmission and storage.
  • video coding is also called video compression, which is essentially a data compression method.
  • The original video data is processed in accordance with a set standard or protocol, for example by predicting, transforming, and quantizing the video, to reduce redundancy in the original video data, thereby compressing it and saving storage space and transmission bandwidth.
  • The original video data can be restored through video decoding. Since information is often lost in the process of video encoding, the amount of information in the restored video may be less than that of the original video.
  • an ROI refers to an area that needs to be processed in a video image frame outlined in a box, circle, ellipse, or irregular polygon.
  • For example, in person surveillance, the ROI can be areas such as a human body or a face.
  • In traffic scenes, the ROI can be areas such as cars and license plates.
  • In online education scenes, the ROI may be areas such as teachers, students, and writing boards.
  • the industry has proposed a method for identifying ROI based on deep neural networks (DNN), and then performing video encoding based on the ROI.
  • the recognition of ROI based on deep neural network can be divided into two methods, including target detection method and semantic segmentation method.
  • the target detection method requires DNN to be able to judge the type of the target (also called object) in the image frame, and mark the position of the target in the image frame, that is, to complete the two tasks of target classification and positioning at the same time.
  • the target refers to the object being photographed in the image frame, including a person or a vehicle, and so on.
  • Typical target detection networks include region-based convolutional neural networks (RCNN), you-only-look-once (YOLO) networks, and single-shot multibox detectors (SSD).
  • Semantic segmentation methods require the DNN to recognize image frames at the pixel level (or at a coarser granularity), that is, to label the object category to which each pixel (or block of pixels) in the image frame belongs.
  • Typical semantic segmentation networks include U-Net, fully convolutional networks (FCN), and mask-based RCNN (mask RCNN).
  • For example, the method identifies whether a pixel belongs to a license plate, and thereby identifies the license plate ROI.
  • the above-mentioned target detection method or semantic segmentation method usually recognizes two-level ROI, namely, the object-level ROI and the feature-part-level ROI of the object.
  • the object-level ROI may specifically be a human body, a vehicle, etc.
  • The feature-part-level ROI of the object may specifically be a human face, a license plate, and so on.
  • an embodiment of the present application provides a video encoding method.
  • When a video is encoded, the embodiment of this application first extracts N-level ROIs from the target image frame.
  • Each area in the video frame has the following code rate characteristics.
  • The background part outside the ROIs (the background area) has the lowest code rate.
  • The coding strategy determined from code stream features that include this lowest code rate can have a larger quantization parameter.
  • The quantization parameter is strongly correlated with coding quality and compression rate: although video quality drops when encoding with a larger quantization parameter, the video compression rate improves.
  • The code rate of the object-level ROI (the object area) is higher than the code rate of the background area.
  • For example, the object area may have the second-lowest code rate.
  • The coding strategy determined from code stream features that include this second-lowest code rate can have a mid-range quantization parameter, and encoding with a mid-range quantization parameter balances compression rate against object recognition rate.
  • The feature-part-level ROI (the feature part area of the object) has a higher code rate (higher than the code rate of the object area).
  • The coding strategy determined from code stream features that include this higher code rate can have a smaller quantization parameter; encoding with a smaller quantization parameter guarantees coding quality and ensures the object recognition rate.
  • The isolation ROI (the isolation area) has a higher code rate (higher than the code rate of the background area).
  • The embodiment of the present application additionally proposes the concept of the isolation ROI. Specifically, according to the encoding algorithm, when the pixels in any region of the target image frame are encoded, nearby pixels are used as references, and the coding quality of the two is positively correlated. That is, with the coding strategy unchanged, the higher the code rate of the pixels near a region, the higher the image quality of that region after coding.
  • For this reason, the embodiment of the present application also increases the code rate of the area near the feature part area of the object (that is, the isolation area, which includes at least a part of the background area), thereby improving the coding quality of the feature part area of the object.
  • The area near the feature part area of the object does not occupy a large proportion of the entire target image frame, so even if its code rate is increased, the target image frame as a whole still achieves a high compression rate.
  • The average compression rate of the entire image increases little compared with not increasing that code rate.
  • Since the feature part area of the object contributes heavily to object recognition, increasing the code rate of the area near it greatly improves its coding quality, thereby significantly improving the coding quality of the target image frame. That is, while the size of the image frame increases only slightly, the coding quality of the image frame is markedly improved and the requirements of services such as object recognition are met.
  • the video encoding method provided in this application can be applied to multiple scenarios in different fields.
  • The above video encoding method can be applied to person/vehicle recognition scenes in the security field, for example recognizing the faces of people entering and leaving a community to realize personnel control, or recognizing the license plates of passing vehicles to realize vehicle management and control.
  • The above video encoding method can also be used in video conference scenes, or in live video scenes in industries such as new media and online education, which will not be described in detail here.
  • the following text takes a security scene as an example to illustrate the video encoding method.
  • the video encoding method provided in this application can be applied to the application scenario shown in FIG. 1.
  • the application scenario includes a video capture device 102, a storage device 104, and a terminal device 106.
  • the video capture device 102 is connected to the storage device 104 through a network
  • the storage device 104 is connected to the terminal device 106 through a network.
  • the video acquisition device 102 collects video data, and transmits the video data to the storage device 104 for storage.
  • the user can initiate a video acquisition operation through the terminal device 106, and the terminal device 106, in response to the video acquisition operation, acquires video data, and plays the video data for the user to view.
  • The video capture device 102 may be a camera, for example a network camera (Internet protocol camera, IPC) or a software-defined camera (SDC).
  • the camera is composed of a lens and an image sensor (such as CMOS, CCD).
  • the storage device 104 may be a storage server or a cloud server.
  • the terminal device can be a smart phone, a tablet computer, a desktop computer, and so on.
  • the video encoding method provided in this application is specifically executed by a video encoding device.
  • The video encoding device is software or hardware that can be deployed in the aforementioned video capture device 102 or in the storage device 104, and can also be deployed in a video encoding device 108 other than the aforementioned video capture device 102, storage device 104, and terminal device 106.
  • the video encoding apparatus 200 may be deployed in the video capture device 102 at the front end.
  • the video encoding device 200 specifically includes a communication module 202, an extraction module 206, a determination module 208, and an encoding module 210.
  • The communication module 202 is used to obtain the original image frames collected by the video capture device 102, so as to obtain the target image frames.
  • the extraction module 206 is used for extracting an N-level ROI from the target image frame, and the N-level ROI includes at least an object area, a feature part area of the object, and an isolation area. The isolation area is used to isolate the characteristic part of the object and the background area in the target image frame.
  • a code stream feature is preset for the above-mentioned N-level ROI, and the code stream feature at least includes a code rate.
  • Among them, the code rates set for the N-level ROI satisfy the following preset relationship:
  • bitrate_bg < bitrate_obj < bitrate_feat, bitrate_bg < bitrate_isol
  • where bitrate is the code rate, the subscript bg identifies the background area, the subscript obj identifies the object area, the subscript feat identifies the feature part area of the object, and the subscript isol identifies the isolation area.
  • That is, the code rate of the background area is the lowest, the code rate of the object area is less than the code rate of the feature part area of the object, and the code rate of the isolation area is greater than the code rate of the background area.
  • In some possible implementations, the code rate of the isolation area can also be greater than the code rate of the object area, which can further improve the coding quality. It should be noted that the code rates of the feature part area of the object and the isolation area can be set according to actual needs. In some possible implementations, the code rate of the isolation area can be set to be greater than the code rate of the feature part area of the object, as shown below:
  • bitrate_feat < bitrate_isol
  • In other possible implementations, the code rate of the isolation area can also be set to be less than or equal to the code rate of the feature part area of the object, as shown below:
  • bitrate_isol ≤ bitrate_feat
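  • The sketch below checks the baseline preset relationship for a set of per-region code rates; the dictionary keys and the example values in kbit/s are illustrative assumptions.

```python
def satisfies_preset_relation(r: dict) -> bool:
    """r maps region names to code rates, e.g. in kbit/s."""
    return r["bg"] < r["obj"] < r["feat"] and r["bg"] < r["isol"]

rates = {"bg": 200, "obj": 800, "feat": 2000, "isol": 1200}
assert satisfies_preset_relation(rates)
```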
  • the determining module 208 determines the encoding strategy of the target image frame according to the code stream characteristics set for the above-mentioned N-level ROI.
  • the encoding module 210 is specifically configured to encode a target image frame according to an encoding strategy to obtain an encoded target video stream, and the code rate of the N-level ROI in the image frame of the target video stream satisfies the preset relationship.
  • In this way, the target video stream can still maintain good compression performance, reducing the transmission bandwidth from the video capture device 102 to the storage device 104 and from the storage device 104 to the terminal device 106, and saving the storage space of the back-end storage device 104 and the terminal device 106.
  • the target video stream can also obtain better coding quality to meet the needs of business such as object recognition.
  • the video encoding apparatus 200 may also be deployed in the back-end storage device 104. Since what is input to the video encoding device 200 is an encoded video stream, based on the video encoding device 200 shown in FIG. 2A, the video encoding device 200 shown in FIG. 2B further includes a decoding module 204.
  • the communication module 202 is used to obtain the original video stream, and the decoding module is used to decode the original video stream to obtain the original image frame.
  • the target image frame can be obtained based on the original image frame.
  • the extraction module 206 extracts the N-level ROI from the target image frame, and the determination module 208 determines the encoding strategy of the target image frame according to the code stream characteristics set for the N-level ROI.
  • the coding strategy includes coding parameters.
  • the coding parameters may include quantization parameters, which can represent the quantization step size. The smaller the quantization step size, the higher the quantization accuracy and the higher the coding quality. In some cases, the coding parameters may also include coding unit block reference trends and so on.
  • the encoding module 210 encodes the target image frame according to the foregoing encoding strategy to obtain the target video stream.
  • the back-end storage device 104 compresses the video through the above-mentioned video encoding device 200, which saves storage space and reduces storage costs. In addition, the transmission bandwidth of the video from the back-end storage device 104 to the terminal device 106 is reduced, and the transmission cost is reduced.
  • the target video stream has good coding quality, which can meet the needs of object recognition and other services.
  • the video encoding apparatus 200 may also be deployed in the video encoding device 108.
  • the video encoding device 108 may be a box device.
  • the video encoding device 108 includes an input interface 1082, an output interface 1084, a processor 1086, and a memory 1088.
  • the memory 1088 stores instructions and data.
  • the instruction may specifically include an instruction of the video encoding apparatus 200, and the data may specifically include the original video stream input through the input interface 1082.
  • the processor 1086 can call the instruction of the video encoding device 200 to decode the original video stream, and then extract the N-level ROI including the isolation area, and determine the encoding strategy based on the code stream characteristics set for the N-level ROI, based on The coding strategy performs coding to obtain the target video stream, thereby realizing video compression.
  • the memory 1088 also stores the above-mentioned compressed target video stream.
  • the output interface 1084 can output the target video stream for viewing.
  • The input interface 1082 and the output interface 1084 may be network ports, such as Ethernet ports.
  • the video encoding device 108 may use an Ethernet card as the network port driver chip.
  • the processor 1086 may be a camera-based system on chip (camera system on chip, camera SOC). As an example, the camera SOC may be Hi3559.
  • the video stream enters the video encoding device 108 from the front-end video acquisition device 102 to perform video encoding, thereby achieving video compression.
  • the video encoding device 108 extracts the N-level ROI, determines an encoding strategy according to the bit stream characteristics set for the N-level ROI, and re-encodes the video based on this, so as to ensure video encoding quality while achieving video compression.
  • the compressed video stream can be stored in the back-end storage device 104 through the network, which reduces the bandwidth entering the storage device 104 and saves storage space.
  • When the video encoding device 108 is deployed between the storage device 104 and the terminal device 106 and the user requests video data, the video data enters the video encoding device 108 from the back-end storage device 104, and the video encoding device 108 recognizes the multi-level ROI.
  • From the code stream features set for the multi-level ROI, a more accurate coding strategy can be determined.
  • The target image frame is then coded according to this strategy.
  • The video is compressed as much as possible, so that the bandwidth required by the terminal device 106 is reduced, stalling is avoided, and the user experience is improved.
  • the method includes:
  • the video encoding device 200 acquires a target image frame.
  • the video encoding device 200 can directly acquire the original image frames.
  • the original image frame refers to an image frame generated by photoelectric conversion of an image sensor with a photosensitive function in the video capture device 102, such as a CMOS or CCD.
  • After the video encoding device 200 obtains the original image frame, it can obtain the target image frame from the original image frame, so as to perform encoding based on the target image frame.
  • the video capture device 102 also includes an image signal processor (ISP), and the video capture device 102 can also process the original image frame through the ISP, such as removing dead pixels, denoising, rotating, Processing such as sharpening and/or color enhancement.
  • The video encoding device 200 may also obtain an ISP-processed image frame, and obtain the target image frame from the ISP-processed image frame, so as to perform encoding based on the target image frame.
  • the video encoding device 200 can obtain the original video stream.
  • the original video stream refers to the video stream input to the video encoding device 200.
  • The original video stream may be a video stream obtained by encoding the original image frames collected by the video capture device 102 (or, when the video capture device 102 includes an ISP, the ISP-processed image frames).
  • After obtaining the original video stream, the video encoding device 200 can decode it to obtain the original image frames (or, when the video capture device 102 includes an ISP, the ISP-processed image frames).
  • The target image frame can then be obtained from the original image frames (or the ISP-processed image frames), so as to perform encoding based on the target image frame.
  • the original image frame and the image frame processed by the ISP are collectively referred to as the reference image frame in the embodiment of the present application.
  • the video encoding apparatus 200 may directly encode the reference image frame as the target image frame, so that the encoded video quality can be guaranteed.
  • the video encoding device 200 may also down-sample the reference image frame, and use the image frame obtained by the down-sampling as the target image frame.
  • the following is an example of down-sampling the original image frame.
  • Down-sampling the original image frame specifically refers to down-sampling from the spatial dimension.
  • For example, if the size of the original image frame is 1080*720, the video encoding device 200 may convert each 4*4 pixel block in the original image frame into a new pixel to obtain the target image frame. In this way, the resolution of the target image frame is reduced, and its size becomes 270*180.
  • the video encoding apparatus 200 may also perform down-sampling from the time dimension to reduce the amount of calculation and meet real-time requirements.
  • the processing of the pixel block by the video encoding device 200 may be to take an average of the pixel values as the pixel value of the new pixel.
  • the processing of the pixel block by the video coding device 200 may also be a median value, a maximum value, or a minimum value of the pixel value, which is not limited in the embodiment of the present application.
  • An image frame is generally composed of multiple pixels, and each pixel has a pixel value. Adjacent pixels in the image frame can form a pixel block (one pixel can also be regarded as a 1*1 pixel block), and converting an n*n (n greater than 1) pixel block into a 1*1 pixel block according to the pixel values achieves down-sampling of the image frame.
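  • The sketch below shows spatial down-sampling by averaging n*n pixel blocks, as described above; the block size of 4 and the NumPy reshaping trick are illustrative assumptions (a median, maximum, or minimum could replace the mean, as noted earlier).

```python
import numpy as np

def downsample(frame: np.ndarray, n: int = 4) -> np.ndarray:
    """frame: H*W grayscale array with H and W divisible by n."""
    h, w = frame.shape
    blocks = frame.reshape(h // n, n, w // n, n)
    return blocks.mean(axis=(1, 3)).astype(frame.dtype)  # one pixel per n*n block

small = downsample(np.zeros((720, 1080), dtype=np.uint8))  # 1080*720 -> 270*180
```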
  • a pixel block formed by a plurality of adjacent pixels may be used as a unit for encoding.
  • the smallest pixel block for encoding the image frame is the coding unit block.
  • the size of the coding unit block may be the same as or different from the size of the pixel block during downsampling. This embodiment does not limit this.
  • Each pixel in the image frame has its own ROI type, where the ROI type indicates whether the pixel belongs to the object area, the feature part area of the object, the isolation area, or the background area; pixels of the same ROI type aggregate to form the corresponding area, namely the object area, the feature part area of the object, the isolation area, or the background area.
  • the video encoding device 200 extracts an N-level ROI from the target image frame.
  • the N-level ROI includes at least the object area, the characteristic part area of the object, and the isolation area.
  • The object area is called the first-level region of interest, L1-ROI;
  • the feature part area of the object is called the second-level region of interest, L2-ROI;
  • and the isolation area is called the third-level region of interest, L3-ROI.
  • the target image frame also includes a background area.
  • the video encoding device 200 may regard the area other than the ROI in the target image frame as a background area.
  • the so-called L1-ROI is the area corresponding to the object in the target image frame.
  • the object refers to the entity that needs to be detected or controlled in the target image frame, and the entity may be a human body, a vehicle, and so on.
  • L2-ROI refers to the region corresponding to the characteristic part of the object in the target image frame.
  • the characteristic part refers to the part used to identify the object.
  • the characteristic part may be a face or head and shoulders, etc.
  • the characteristic part may be a license plate.
  • L3-ROI refers to the area separating the background area and L2-ROI in the target image frame. Considering the positional relationship between the background area and the L2-ROI, the L3-ROI may be an area surrounding the L2-ROI.
  • The embodiment of the present application uses a person recognition scene as an example to illustrate the foregoing background area, L1-ROI, L2-ROI, and L3-ROI.
  • the video encoding device 200 recognizes the human body from the target image frame, and then extracts the face area (in some cases, the head and shoulders area including the face) for each recognized human body as L2 -ROI, the L2-ROI can be specifically referred to the area outlined by the innermost dashed box in FIG. 5. Then, the area surrounding the L2-ROI is extracted as the L3-ROI.
  • the L3-ROI refer to the area outlined by the dashed frame surrounding the innermost layer in FIG. 5.
  • the area except the L2-ROI and the L3-ROI area in the human body area is taken as the L1-ROI, and the L1-ROI can be specifically referred to the area outlined by the dashed box adjacent to the dashed box corresponding to the L3-ROI in FIG. 5.
  • extract the area except L1-ROI, L2-ROI and L3-ROI in the target image frame as the background area.
  • FIG. 5 illustrates an example in which the isolation area L3-ROI completely surrounds the feature part area L2-ROI of the object.
  • In some possible implementations, the isolation area L3-ROI may also only partially surround the feature part area L2-ROI of the object, as long as it realizes isolation between the background area and the feature part area L2-ROI of the object.
  • For example, the isolation area L3-ROI may include only the part in contact with the background area, and not the part in contact with the object area L1-ROI.
  • the video encoding device 200 may be implemented by using a pre-trained ROI extraction model when extracting an N-level ROI.
  • the ROI extraction model can be trained based on a simplified semantic segmentation network with a CNN as the backbone.
  • the input of the ROI extraction model is a single target image frame, and the output is an ROI category with a certain size area (for example, 16*16 or 4*4) as the granularity.
  • the ROI category specifically includes L1, L2, L3, or non-ROI (referred to as background in this embodiment).
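  • The sketch below turns a per-pixel class map (as a segmentation network might output) into block-granularity ROI categories by majority vote; the class ids, the 16-pixel granularity, and the voting rule are illustrative assumptions, not the application's actual network output format.

```python
import numpy as np

def roi_category_map(per_pixel: np.ndarray, block: int = 16) -> np.ndarray:
    """per_pixel: H*W array of class ids (0=background, 1=L1, 2=L2),
    with H and W divisible by block. Returns one category per block."""
    h, w = per_pixel.shape
    tiles = per_pixel.reshape(h // block, block, w // block, block)
    tiles = tiles.transpose(0, 2, 1, 3).reshape(h // block, w // block, -1)
    # majority vote inside each block decides its ROI category
    return np.apply_along_axis(lambda t: np.bincount(t).argmax(), 2, tiles)
```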
  • the video encoding device 200 may first extract L1-ROI and L2-ROI from the target image frame.
  • the video encoding device 200 may also extract the background area together.
  • In specific implementation, the video encoding device 200 may use the ROI extraction model to identify, at the granularity of an area of a certain size in a single target image frame, whether the ROI category is L1, L2, or non-ROI, so as to extract the L1-ROI, the L2-ROI, and the background area. It should be noted that, considering that the extracted regions may be incomplete, the L1-ROI and the L2-ROI can then be expanded.
  • expansion specifically refers to the convolution processing of the area to be expanded, so that the area to be expanded expands to the outside.
  • the convolution processing of the area to be expanded is implemented based on a preset convolution kernel.
  • the video encoding device 200 may perform expansion processing on the L2-ROI, that is, using the convolution kernel to convolve the L2-ROI, so that the L2-ROI expands outward to obtain the L2-ROI'.
  • the video encoding device 200 can make a difference between the L2-ROI' after the expansion process and the L2-ROI before the expansion process, so as to obtain the L3-ROI.
  • L3-ROI is the difference between L2-ROI' and L2-ROI.
  • the N-level ROI can also be a 4-level or more ROI.
  • the video encoding device 200 may also extract at least one area of the eyes, nose, mouth, etc. of a person's face to obtain an enhanced area.
  • This enhanced region is also called the fourth-level region of interest, that is, L4-ROI.
  • The L4-ROI has a higher code rate, at least not lower than the code rate of the L2-ROI.
  • Encoding the L4-ROI at a higher code rate can enhance the recognizability of human faces and improve the recognition rate.
  • In this embodiment, the L4-ROI is located within the L2-ROI; in other embodiments, the L4-ROI can also be located in other positions, for example between the L1-ROI and the L3-ROI, or between the L3-ROI and the background area, which will not be detailed here.
  • the video encoding device 200 determines the encoding strategy of the target image frame according to the bitstream characteristics set for the N-level region of interest.
  • the code stream feature refers to the feature related to the compression rate and/or quality of the video stream.
  • the code stream characteristics include at least the code rate.
  • The code rates set for the N-level ROI satisfy the preset relationship, that is, the code rate of the background area is less than the code rate of the L1-ROI and the code rate of the L3-ROI, and the code rate of the L1-ROI is less than the code rate of the L2-ROI.
  • Specifically, the code rate set for the background area is the lowest, and the code rates set for the L1-ROI and the L3-ROI are higher than that of the background area.
  • The code rate of the L1-ROI may be the second lowest, and the code rate of the L3-ROI may be higher than the code rate of the L1-ROI.
  • The code rate set for the L2-ROI is relatively high, specifically higher than the code rate set for the L1-ROI. The code rate set for the L2-ROI may be less than the code rate set for the L3-ROI; of course, it can also be greater than or equal to the code rate set for the L3-ROI.
  • the code stream feature also includes the size of the feature part of the object, such as the size of a human face/license plate.
  • the bit stream feature also includes the input bit rate.
  • the code stream feature may also include motion complexity.
  • the bitstream feature may also include the ROI percentage.
  • the proportion of the ROI can be determined by a ratio sequence composed of the ratio of the size of the ROI to the size of the target image frame within a preset time period.
  • the mean value or the median value of the ratio sequence can be determined as the proportion of ROI.
  • The video encoding device 200 may also determine the ROI proportion according to the range into which most of the ratios in the ratio sequence fall.
  • the foregoing preset time period is specifically a time period determined based on at least one group of pictures (GOP). Assuming that one GOP is 150 frames and the frame change rate is 25 frames per second, the preset time period may be 6 seconds.
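  • The sketch below estimates the ROI proportion from a ratio sequence collected over one such period; the choice of the mean (the median works equally well, as noted above) and the per-frame ROI pixel counts as inputs are illustrative assumptions.

```python
import statistics

def roi_proportion(roi_sizes: list[int], frame_size: int) -> float:
    """roi_sizes: per-frame ROI pixel counts over a preset period, e.g. one
    150-frame GOP (about 6 s at 25 fps); frame_size: pixels per frame."""
    ratios = [s / frame_size for s in roi_sizes]
    return statistics.mean(ratios)  # or statistics.median(ratios)
```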
  • the coding strategy includes coding parameters.
  • the coding parameters may include quantization parameters (QP).
  • QP can represent the quantization step size, the smaller the QP value, the smaller the quantization step size, the higher the quantization accuracy, and the higher the coding quality.
  • The QP usually exists in the form of a quantization parameter map (QP map) that matches the size of the target image frame.
  • the video encoding device 200 may determine the quantization parameter corresponding to each region according to the code stream characteristics of each region, and then form a QP map corresponding to the entire target image frame according to the quantization parameter corresponding to each region, thereby obtaining an encoding strategy.
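  • The sketch below assembles a QP map from a per-block ROI category map, with a lower QP (finer quantization, higher quality) for higher-code-rate regions; the specific QP values and category ids are illustrative assumptions, not values taken from the application.

```python
import numpy as np

# illustrative QP per ROI level (0=background, 1=L1, 2=L2, 3=L3);
# a lower QP means a smaller quantization step and higher coding quality
QP_BY_LEVEL = {0: 40, 1: 32, 3: 28, 2: 24}

def build_qp_map(categories: np.ndarray) -> np.ndarray:
    """categories: per-block ROI level ids; returns a QP map of the same shape."""
    qp_map = np.empty(categories.shape, dtype=np.uint8)
    for level, qp in QP_BY_LEVEL.items():
        qp_map[categories == level] = qp
    return qp_map
```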
  • In some embodiments, the coding parameters further include a coding unit (CU) block reference trend, and the CU block reference trend can take the value intra, inter, or skip.
  • The encoding parameters may also include a GOP mode, and the GOP mode may be the normal mode (normalP) or the smart mode (smartP).
  • the coding parameters may also include motion compensation QP.
  • In some embodiments, the video encoding device 200 may use the code stream features set for the N-level ROI as the encoding target, for example the code rates set for the N-level ROI, establish a system of equations in the coding parameters, and obtain, by solving the system of equations, the coding parameters that meet the above encoding target as the coding strategy.
  • In other embodiments, the video encoding device 200 may use a coding strategy model to determine the coding parameters of the target image frame according to the code stream features set for the N-level regions of interest, so as to obtain the coding strategy of the target image frame.
  • The coding strategy model includes the correspondence between code stream features and coding parameters.
  • The video encoding device 200 inputs the code stream features into the coding strategy model, and the coding strategy model determines the coding parameters matching the code stream features based on the above correspondence and outputs them, thereby obtaining the coding strategy.
  • the coding strategy model can be obtained by model training through historical video streams as training samples.
  • In specific implementation, the video encoding device 200 may obtain a large number of historical video streams as training video streams, encode the training video streams using the coding strategies in the coding strategy library, and determine the compression rate and quality of the video streams encoded by each coding strategy.
  • The quality can be characterized by one or more of peak signal-to-noise ratio (PSNR) and structural similarity (structural similarity index, SSIM).
  • the video encoding device 200 may select a target encoding strategy that matches the training video stream according to the compression rate and quality.
  • the video encoding device 200 may use an encoding strategy adopted by a video stream with a higher compression rate and higher quality after encoding as a target encoding strategy.
  • the video encoding device 200 can use the target encoding strategy as the label of the training video stream. In this way, a training video stream and its target encoding strategy can form a training sample, and the video encoding device 200 can use the above training samples for model training, thereby obtaining the encoding strategy Model.
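  • The sketch below illustrates how a target coding strategy could be selected as the label of a training video stream; encode(), compression_rate(), and psnr() are hypothetical helpers standing in for a real encoder and the quality metrics named above, and the additive score is only one possible trade-off.

```python
def label_training_stream(stream, strategy_library):
    """Pick the strategy whose output compresses well while keeping quality high."""
    best, best_score = None, float("-inf")
    for strategy in strategy_library:
        encoded = encode(stream, strategy)                         # hypothetical helper
        score = compression_rate(encoded) + psnr(stream, encoded)  # hypothetical helpers
        if score > best_score:
            best, best_score = strategy, score
    return best  # used as the label of this training video stream
```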
  • the video encoding apparatus 200 may cluster encoding strategies to generate an encoding strategy library.
  • the coding strategies can be clustered according to at least one of factors such as scene, resolution, time, or weather.
  • the scene may include animation, surveillance, action movies, ordinary movies, sports events, news and/or games, etc.
  • the resolution can include 720P, 1080P, 2K or 4K and so on.
  • Time can include day or night.
  • the weather may include cloudy, sunny, rainy, foggy, snowy and so on.
  • the video encoding device 200 may traverse the encoding strategy library according to the above categories, and perform encoding based on each encoding strategy in each category.
  • When regions of interest overlap, the video encoding device 200 can use the coding strategy model to determine the coding parameters of the overlapping area according to the code stream features set for the ROI with the higher code rate. For example, when the L2-ROI and the L3-ROI overlap, the coding strategy model can determine the coding parameters of the overlapping area based on the code stream features set for the L3-ROI.
  • the video encoding device 200 encodes the target image frame according to the encoding strategy to obtain a target video stream.
  • the video encoding apparatus 200 may quantize the target image frame according to the encoding parameters, especially the QP map in the encoding parameters, so as to implement video encoding.
  • the coding parameters may also include CU block reference trend, GOP mode and/or motion compensation QP.
  • the video encoding device 200 may also determine whether to adopt intra prediction, inter prediction, or skip based on the CU block reference trend.
  • The video encoding device 200 may also determine whether to adopt the normalP mode or the smartP mode based on the GOP mode.
  • the video encoding device 200 may also quantize the target image frame based on the motion compensation QP.
  • The image frames of the target video stream encoded according to the coding strategy satisfy the preset code rate relationship. That is, in the image frames of the target video stream,
  • the code rate of the background area is lower than the code rate of the object area, and the code rate of the object area is lower than the code rate of the feature part area of the object and the code rate of the isolation area.
  • the code rate of the background area in the image frame of the target video stream is the lowest. Although the quality of the background area is reduced, the compression rate of the background area is improved.
  • the code rate of the object area in the image frame of the target video stream is higher than the code rate of the background area, so the quality and compression rate of the object area are balanced.
  • the feature area of the object in the image frame of the target video stream has a higher code rate. Although the compression rate of the feature area of the object is reduced, the quality of the feature area of the object is improved and the object recognition rate is guaranteed.
  • In addition, the embodiment of the present application also increases the code rate of the area near the feature part area of the object (the area with the increased code rate forms the isolation area described above). Because the pixels in one area refer to nearby pixels during encoding, the coding quality of the feature part area of the object is affected by the isolation area; since the isolation area has a higher code rate, the coding quality of the feature part area of the object is also improved, which can meet the needs of object recognition and other services.
  • The target image frame as a whole still has a high compression rate.
  • Since the feature part area of the object contributes heavily to object recognition, the improvement in its coding quality significantly improves the coding quality of the target image frame. That is, while the size of the image frame increases only slightly, the coding quality of the image frame is significantly improved.
  • the above is mainly an example of an online video encoding process.
  • The video encoding method provided in the embodiment of the present application can also be used to compress or convert the format of a local video file. Specifically, after the local video stream is input, it is decapsulated and then decoded to obtain the target image frame; the N-level ROI, including the object-level ROI, the feature-part-level ROI, and the isolation ROI, is then extracted from the target image frame.
  • the coding strategy is determined based on the bit stream characteristics set for the N-level ROI, and encoding is performed based on the coding strategy, so as to achieve local video file compression or format conversion.
  • The video encoding apparatus 200 includes:
  • the communication module 202, configured to obtain a target image frame;
  • the extraction module 206, configured to extract an N-level region of interest from the target image frame, where the N-level region of interest includes at least an object area, a characteristic part area of the object, and an isolation area, and the isolation area is used to isolate the background area in the target image frame from the characteristic part area of the object;
  • the determining module 208, configured to determine the encoding strategy of the target image frame according to the code stream characteristics set for the N-level region of interest, where the code stream characteristics include at least a code rate, the code rates set for the N-level region of interest satisfy a preset relationship, and the preset relationship includes that the code rate of the background area is less than the code rate of the object area and the code rate of the isolation area, and that the code rate of the object area is less than the code rate of the characteristic part area of the object;
  • the encoding module 210, configured to encode the target image frame according to the encoding strategy to obtain a target video stream, where the code rate of the N-level region of interest in the image frames of the target video stream satisfies the preset relationship.
  • In practical applications, the video encoding apparatus 200 may further include a decoding module 204. The communication module 202 is then specifically configured to obtain an original video stream, which is a video stream obtained by encoding original image frames (or, when the video capture device includes an ISP, image frames processed by the ISP). The decoding module 204 is configured to decode the original video stream to obtain the original image frames (or the ISP-processed image frames), from which the target image frame can be obtained. The extraction module 206 extracts the N-level region of interest from the target image frame, the determining module 208 determines the encoding strategy of the target image frame based on the code stream characteristics set for the N-level region of interest, and the encoding module 210 encodes the target image frame based on the encoding strategy to obtain a target video stream in whose image frames the code rate of the N-level region of interest satisfies the preset relationship.
  • In some possible implementations, the extraction module 206 is specifically configured to: extract the object area and the characteristic part area of the object from the target image frame; and perform dilation processing on the characteristic part area of the object to obtain the isolation area, where the isolation area is the difference of the characteristic part area of the object before and after the dilation processing.
  • In some possible implementations, the encoding strategy includes encoding parameters. The determining module 208 is specifically configured to determine the encoding parameters of the target image frame by using an encoding strategy model according to the code stream characteristics set for the N-level region of interest, and the encoding module 210 is specifically configured to quantize the target image frame according to the encoding parameters, thereby implementing video encoding.
  • In some possible implementations, the determining module 208 is specifically configured to: when regions of interest of different levels overlap, use the encoding strategy model to determine the encoding parameters of the overlapping area according to the code stream characteristics set for the region of interest with the higher code rate.
  • In some possible implementations, the apparatus 200 further includes a training module, configured to obtain a training video stream, encode the training video stream with the encoding strategies in an encoding strategy library, determine the compression rate and quality of the encoded video streams, determine, according to the compression rate and the quality, a target encoding strategy matching the training video stream, and perform model training with the training video stream and the target encoding strategy to obtain the encoding strategy model.
  • In some possible implementations, the apparatus 200 further includes a down-sampling module, configured to down-sample a reference image frame to obtain the target image frame. In some possible implementations, the preset relationship further includes: the code rate of the object area is less than the code rate of the isolation area.
  • The video encoding apparatus 200 according to the embodiment of the present application may correspondingly perform the video encoding method described in the embodiment of the present application, and the above and other operations and/or functions of the modules in the video encoding apparatus 200 are respectively intended to implement the corresponding procedures of the methods; for brevity, they are not repeated here.
  • The functions of the video encoding apparatus 200 may be implemented by an independent video encoding device 108, whose specific implementation may refer to the description of the related content in FIG. 3. The functions of the encoding apparatus 200 may also be implemented by the video capture device 102 or the storage device 104.
  • FIG. 6 and FIG. 7 also each provide a device. The device shown in FIG. 6 implements the functions of the video encoding apparatus 200 in addition to the video capture function, and the device shown in FIG. 7 implements the functions of the video encoding apparatus 200 in addition to the video storage function.
  • The device 600 includes a bus 601, a processor 602, a communication interface 603, a memory 604, and an image sensor 605. The processor 602, the memory 604, the communication interface 603, and the image sensor 605 communicate through the bus 601. The bus 601 may be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus, etc., and can be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is used in FIG. 6, which does not mean that there is only one bus or one type of bus.
  • The communication interface 603 is used for communication with the outside. The image sensor 605 is a photoelectric conversion device: using its photoelectric conversion function, it converts the light image on its photosensitive surface into an electrical signal proportional to the light image, thereby obtaining original image frames, from which the target image frame can be obtained.
  • The processor 602 may be a central processing unit (CPU). The memory 604 may include volatile memory, such as random access memory (RAM), and may also include non-volatile memory, such as read-only memory (ROM), flash memory, an HDD, or an SSD.
  • The memory 604 stores executable code, and the processor 602 executes the executable code to perform the aforementioned video encoding method. Specifically, when the modules of the video encoding apparatus 200 are implemented in software, the software or program code required for the functions of the extraction module 206, the determining module 208, and the encoding module 210 in FIG. 2A is stored in the memory 604.
  • The functions of the communication module 202 may be implemented through the bus 601 and the communication interface 603, where the bus 601 is used for internal communication and the communication interface 603 for communication with the outside. The original image frames generated by the photoelectric conversion of the image sensor 605 may be transmitted to the processor 602 via the bus 601 for down-sampling to obtain the target image frame, which may be stored in the memory 604; the processor 602 executes the program code corresponding to each module stored in the memory 604 and performs the video encoding method provided in the embodiment of the present application on the target image frame to obtain the target video stream.
  • In some possible implementations, the processor 602 may also include an image signal processor (ISP). The ISP can process the original image frames generated by the image sensor 605, for example performing de-noising and sharpening. The processor 602 then down-samples the ISP-processed image frames to obtain the target image frame, executes the program code corresponding to each module stored in the memory 604, and performs the video encoding method provided in the embodiment of the present application on the target image frame to obtain the target video stream. The memory 604 may also store the target video stream and transmit it to other devices through the communication interface 603.
  • The device 700 includes a bus 701, a processor 702, a communication interface 703, and a memory 704. The processor 702, the memory 704, and the communication interface 703 communicate through the bus 701. When the modules described in the FIG. 2B embodiment are implemented in software, the software or program code required for the functions of the decoding module 204, the extraction module 206, the determining module 208, and the encoding module 210 in FIG. 2B is stored in the memory 704. The functions of the communication module 202 may be implemented through the bus 701 and the communication interface 703, where the bus 701 is used for internal communication and the communication interface 703 for external communication. Specifically, the communication interface 703 may receive a video stream input by another device, referred to as an original video stream in the embodiment of the present application. The processor 702 executes the program code corresponding to each module stored in the memory 704 and performs the video encoding method provided in the embodiment of the present application on the original video stream: the encoding strategy of the target image frame is determined based on the code stream characteristics set for the N-level ROI, and the target image frame is encoded according to that strategy to obtain the target video stream.
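  • Purely as a sketch, the flow just described for the device 700 can be summarized by the following Python function; the five callables in `modules` are hypothetical stand-ins for the decoding module 204, the down-sampling step, the extraction module 206, the determining module 208, and the encoding module 210, supplied by the surrounding system rather than defined by the application.

    def encode_stream(original_stream, modules):
        # modules: dict of callables (an assumption of this sketch).
        target = []
        for ref_frame in modules["decode"](original_stream):
            frame = modules["downsample"](ref_frame)
            labels = modules["extract"](frame)       # N-level ROI labels
            strategy = modules["decide"](labels)     # per set stream features
            target.append(modules["encode"](frame, strategy))
        return target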
  • An embodiment of the present application also provides a computer-readable storage medium, including instructions which, when run on a computer device, cause the computer device to execute the above video encoding method applied to the video encoding apparatus 200.
  • The embodiments of the present application also provide a computer program product; when the computer program product is executed by a computer device, the computer device performs any one of the foregoing video encoding methods. The computer program product may be a software installation package, which can be downloaded and executed on a computer whenever any one of the foregoing video encoding methods needs to be used.
  • The device embodiments described above are only illustrative: the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, i.e., they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solutions of the embodiments. The connection relationship between the modules indicates that they have a communication connection between them, which may be specifically implemented as one or more communication buses or signal lines.
  • This application can be implemented by means of software plus the necessary general-purpose hardware, and certainly also by dedicated hardware including application-specific integrated circuits, dedicated CPUs, dedicated memory, dedicated components, and so on. In general, all functions completed by a computer program can easily be implemented with corresponding hardware, and the specific hardware structures used to achieve the same function can be diverse, such as analog circuits, digital circuits, or dedicated circuits. In most cases, however, a software program implementation is the better implementation for this application.
  • The technical solution of this application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a readable storage medium, such as a computer floppy disk, a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disc, and includes several instructions to make a computer device (which may be a personal computer, a training device, or a network device, etc.) execute the methods described in the embodiments of this application.
  • The computer program product includes one or more computer instructions. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center by wired (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (such as infrared, radio, or microwave) means.
  • The computer-readable storage medium may be any available medium that a computer can store, or a data storage device such as a training device or a data center integrating one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)).

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

A video encoding method, including: extracting three or more levels of regions of interest from a target image frame, the regions of interest including at least an object area, a characteristic part area of the object, and an isolation area for isolating the background area in the target image frame from the characteristic part area of the object; setting different code rates for these regions of interest, where the code rate of the background area is less than the code rate of the object area and the code rate of the isolation area, and the code rate of the object area is less than the code rate of the characteristic part area of the object; selecting the corresponding encoding strategy according to the region; and encoding the target image frame according to the encoding strategy to obtain a target video stream in whose image frames the code rate of each region conforms to the settings. By increasing the code rate of the area near the characteristic part area of the object, an isolation area is formed to isolate the background area from the characteristic part area of the object, which enhances the quality of the characteristic part area of the object and guarantees the coding quality.

Description

Video encoding method, apparatus, device, and medium

This application claims priority to Chinese Patent Application No. 202010108310.0, entitled "Video encoding method, apparatus, device, and medium" and filed with the China National Intellectual Property Administration on February 21, 2020, which is incorporated herein by reference in its entirety.

Technical Field

This application relates to the field of computer technology, and in particular to a video encoding method, apparatus, and device, and a computer-readable storage medium.

Background

With the advent of the information age, more and more industries have begun to use technologies such as video to implement their services. For example, the security industry uses video surveillance to monitor people or vehicles, organizations such as enterprises and institutions use video conferencing for remote work, and the new media and education industries use live video streaming for information dissemination.

The implementation of these services relies on front-end capture devices and back-end storage devices. Specifically, a front-end capture device such as a camera collects video data and then stores the data on a back-end storage device so that it can be provided to the corresponding users. According to the relevant regulations, video usually needs to be kept for a period of time. For example, in the security industry, video must be kept for at least 30 days, panoramic images (large images) for at least 90 days, and target images (small images) for at least 365 days. If the video data collected by the front-end capture device were transmitted directly to the back-end storage device for storage, it would bring considerable transmission and storage pressure and increase storage and transmission costs.

On this basis, the industry urgently needs a video encoding method that reduces the storage and transmission costs of video.
Summary

The embodiments of this application provide a video encoding method; encoding with this method guarantees coding quality and achieves video compression, thereby reducing the storage and transmission costs of video. This application also provides a corresponding apparatus, device, medium, computer-readable storage medium, and computer program product.

In a first aspect, this application provides a video encoding method. The method supports extracting an N-level region of interest from a target image frame, the N-level region of interest including at least an object area, a characteristic part area of the object, and an isolation area for isolating the background area in the target image frame from the characteristic part area of the object. Code stream characteristics, including at least a code rate, are set in advance for the N-level region of interest. The code rates set for the N-level region of interest satisfy a preset relationship, namely that the code rate of the background area is less than the code rate of the object area, the code rate of the characteristic part area of the object, and the code rate of the isolation area, and the code rate of the object area is less than the code rate of the characteristic part area of the object. The encoding strategy of the target image frame can be determined according to the code stream characteristics set for the N-level region of interest, and the target image frame is encoded according to the encoding strategy to obtain a target video stream in whose image frames the code rate of the N-level region of interest satisfies the preset relationship.

In this method, the code rate set for the background area is the lowest. The encoding strategy determined from code stream characteristics including this lowest code rate can include a larger quantization parameter; since the quantization parameter is strongly correlated with coding quality and compression rate, encoding with a larger quantization parameter can retain only the minimum subjective quality and improve the video compression rate. The code rate set for the object area is higher than that of the background area, for example the second lowest; the encoding strategy determined from code stream characteristics including this second-lowest code rate can include a middling quantization parameter, and encoding with a middling quantization parameter balances the compression rate and the object recognition rate. The code rate set for the characteristic part area of the object is relatively high (higher than the code rate set for the object area); the encoding strategy determined from code stream characteristics including this higher code rate can include a smaller quantization parameter, and encoding with a smaller quantization parameter guarantees coding quality and ensures the object recognition rate.

Further, according to the encoding algorithm, when the pixels of any area in the target image frame are encoded, nearby pixels of that area are referenced, and the coding quality of the two is positively correlated. That is, with the encoding strategy unchanged, the higher the code rate of the pixels near an area, the higher the image quality of that area after encoding. Recognizing an object in an image is mainly based on the characteristic part area of the object. On this basis, the embodiment of this application also increases the code rate of the background area near the characteristic part area of the object, thereby improving the coding quality of the characteristic part area of the object.

For ease of distinction, the embodiment of this application refers to the area near the characteristic part area of the object whose set code rate has been increased as the isolation area. The isolation area is generally a ring-shaped area surrounding the characteristic part area of the object; by enclosing the characteristic part area, it isolates the characteristic part area of the object from the background area. The isolation area generally includes a part of the background area and, in some cases, also a part of the object area. The embodiment of this application still refers to the remaining part of the background area as the background area (unless otherwise specified, the term background area in the embodiments of this application refers to that area), and still refers to the remaining part of the object area as the object area (unless otherwise specified, the term object area in the embodiments of this application refers to that area).

Considering that the background area accounts for a large proportion of the entire target image frame while the area near the characteristic part area of the object occupies only a small range, the target image frame as a whole still has a high compression rate even after the code rate of the area near the characteristic part area is increased. In other words, compared with not increasing that code rate, the average compression rate of the entire image does not change much. The characteristic part area of the object, however, contributes greatly to object recognition; by increasing the code rate of the area near it, the coding quality of the characteristic part area can be greatly improved, and the coding quality of the target image frame is thereby significantly improved. That is, the embodiment of this application significantly improves the coding quality of the image frame while only slightly increasing its size.

In some possible implementations, the N-level region of interest can be extracted as follows. First, the object area and the characteristic part area of the object are extracted from the target image frame; then dilation processing is performed on the characteristic part area of the object, i.e., the characteristic part area is expanded outward by convolution, and the outward-expanded area, that is, the difference of the characteristic part area before and after the dilation processing, is the isolation area. Because the extracted characteristic part area of the object may be incomplete, especially when it is extracted by model inference, the embodiment of this application can obtain the isolation area by dilating the characteristic part area and set a higher code rate for it, thereby compensating for the incomplete part of the region of interest and preventing it from being encoded into a low-code-rate video stream, which would affect the overall coding quality.

In some possible implementations, the encoding strategy includes encoding parameters. When the encoding strategy of the target image frame is determined according to the code stream characteristics of the N-level region of interest, this can be done by means of an encoding strategy model. Specifically, the encoding parameters of the target image frame can be determined by the encoding strategy model according to the code stream characteristics of the N-level region of interest; correspondingly, the target image frame can be quantized according to the encoding parameters, thereby implementing video encoding. This method learns encoding strategy decision experience in an artificial intelligence (AI) manner; based on this decision experience, encoding strategies can be determined automatically for different areas according to the preset code rate characteristics, without manually setting encoding strategies based on experience, thus achieving end-to-end automatic video encoding and improving encoding efficiency. In addition, the encoding strategy model obtained through AI can be applied to a variety of scenarios, meeting the generalization needs of different scenarios.

In some possible implementations, considering that some regions of interest may overlap, for example the isolation area overlapping the background area or the isolation area overlapping the object area, the encoding parameters corresponding to the overlapping area can be determined by the encoding strategy model according to the code stream characteristics set for the region of interest with the higher code rate.

In some possible implementations, the encoding strategy model can be trained as follows. A training video stream is obtained; the training video stream is encoded with the encoding strategies in an encoding strategy library; the compression rate and quality of the encoded video streams are determined; a target encoding strategy matching the training video stream is then determined according to the compression rate and the quality; and finally model training is performed with the training video stream and the target encoding strategy (as the label of the training video stream) to obtain the encoding strategy model.

In some possible implementations, in view of real-time service requirements, a reference image frame can be down-sampled after being obtained, so as to obtain the target image frame. Because the down-sampled target image frame has a smaller size and relatively lower resolution, encoding based on it reduces the amount of computation, improves encoding efficiency, and shortens encoding time, thereby meeting real-time service requirements.

The reference image frame may be an original image frame generated by photoelectric conversion of an image sensor of the video capture device, such as a complementary metal oxide semiconductor (CMOS) sensor or a charge-coupled device (CCD) sensor. In some cases, the video capture device may also process the original image frame through an image signal processor (ISP), for example performing defective-pixel removal, de-noising, rotation, sharpening and/or color enhancement. On this basis, the reference image frame may also be an ISP-processed image frame. It should also be noted that the reference image frame may be obtained directly from the video capture device, or obtained by decoding an input video stream.

In some possible implementations, in order to obtain better coding quality, the code rate set for the isolation area may also be higher than the code rate set for the object area.
In a second aspect, this application provides a video encoding apparatus. The apparatus includes an extraction module, a determining module, and an encoding module. The extraction module is configured to extract an N-level region of interest from a target image frame, the N-level region of interest including at least an object area, a characteristic part area of the object, and an isolation area, the isolation area being used to isolate the background area in the target image frame from the characteristic part area of the object. The determining module is configured to determine the encoding strategy of the target image frame according to the code stream characteristics set for the N-level region of interest, the code stream characteristics including at least a code rate; the code rates set for the N-level region of interest satisfy a preset relationship, the preset relationship including that the code rate of the background area is less than the code rate of the object area and the code rate of the isolation area, and the code rate of the object area is less than the code rate of the characteristic part area of the object. The encoding module is configured to encode the target image frame according to the encoding strategy to obtain a target video stream in whose image frames the code rate of the N-level region of interest satisfies the preset relationship.

In some possible implementations, the extraction module is specifically configured to:

extract the object area and the characteristic part area of the object from the target image frame;

perform dilation processing on the characteristic part area of the object to obtain the isolation area, the isolation area being the difference of the characteristic part area of the object before and after the dilation processing.

In some possible implementations, the encoding strategy includes encoding parameters, and the determining module is specifically configured to:

determine the encoding parameters of the target image frame by using an encoding strategy model according to the code stream characteristics set for the N-level region of interest;

the encoding module is specifically configured to:

quantize the target image frame according to the encoding parameters, thereby implementing video encoding.

In some possible implementations, the determining module is specifically configured to:

when regions of interest of different levels overlap, determine, by using the encoding strategy model, the encoding parameters corresponding to the overlapping area according to the code stream characteristics set for the region of interest with the higher code rate.

In some possible implementations, the apparatus further includes:

a training module, configured to obtain a training video stream, encode the training video stream with the encoding strategies in an encoding strategy library, determine the compression rate and quality of the encoded video streams, determine a target encoding strategy matching the training video stream according to the compression rate and the quality, and perform model training with the training video stream and the target encoding strategy to obtain the encoding strategy model.

In some possible implementations, the apparatus further includes:

a down-sampling module, configured to down-sample a reference image frame to obtain the target image frame.

In some possible implementations, the preset relationship further includes: the code rate of the object area is less than the code rate of the isolation area.

In a third aspect, this application provides a device, the device including a processor and a memory that communicate with each other. The processor is configured to execute the instructions stored in the memory, so that the device performs the video encoding method of the first aspect or any implementation of the first aspect.

In a fourth aspect, this application provides a computer-readable storage medium storing instructions which, when run on a device such as a computer device, cause the device to perform the video encoding method described in the first aspect or any implementation of the first aspect.

In a fifth aspect, this application provides a computer program product containing instructions which, when run on a device, causes the device to perform the video encoding method described in the first aspect or any implementation of the first aspect.

On the basis of the implementations provided in the above aspects, this application can be further combined to provide more implementations.
Brief Description of the Drawings

In order to explain the technical methods of the embodiments of this application more clearly, the drawings required in the embodiments are briefly introduced below.

FIG. 1 is a schematic diagram of a scenario of a video encoding method provided by an embodiment of this application;

FIG. 2A is a schematic diagram of a scenario of a video encoding method provided by an embodiment of this application;

FIG. 2B is a schematic diagram of a scenario of a video encoding method provided by an embodiment of this application;

FIG. 2C is a schematic diagram of a scenario of a video encoding method provided by an embodiment of this application;

FIG. 2D is a schematic diagram of a scenario of a video encoding method provided by an embodiment of this application;

FIG. 3 is a schematic structural diagram of a video encoding device provided by an embodiment of this application;

FIG. 4 is a flowchart of a video encoding method provided by an embodiment of this application;

FIG. 5 is a schematic diagram of an N-level region of interest provided by an embodiment of this application;

FIG. 6 is a schematic structural diagram of a device provided by an embodiment of this application;

FIG. 7 is a schematic structural diagram of a device provided by an embodiment of this application.
Detailed Description

The solutions in the embodiments provided by this application are described below with reference to the drawings of this application.

The terms "first", "second", and the like in the specification, claims, and drawings of this application are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence. It should be understood that terms used in this way are interchangeable where appropriate; this is merely the way objects with the same attributes are distinguished in the description of the embodiments of this application.

To facilitate understanding of the technical solutions of this application, some of the technical terms involved are first introduced.

A video specifically refers to a sequence of image frames whose rate of change reaches a preset rate, which may specifically be 24 frames per second. That is, when consecutive image frames change at 24 frames per second, according to the principle of persistence of vision, the human eye cannot distinguish a single static picture; the consecutive image frames then present a smooth, continuous visual effect and can therefore be called a video.

Video collected by a video capture device such as a camera often occupies a large amount of storage space, which brings considerable challenges to the transmission and storage of video. On this basis, the industry has introduced video encoding technology. Video encoding, also called video compression, is essentially a data compression method: the original video data is processed according to a set standard or protocol, for example the video is predicted, transformed, and quantized to reduce the redundancy in the original video data, thereby compressing it, which saves storage space and transmission bandwidth. When the video is played or analyzed, the original video data can be restored by video decoding. Because information is often lost in the video encoding process, the amount of information in the restored video may be less than in the original video.

A key aspect of video encoding is identifying the region of interest (ROI). In this application, the ROI refers to the area in a video image frame that needs to be processed, outlined by a box, circle, ellipse, irregular polygon, or the like. For example, in a person recognition scenario in the security field, the ROI may be an area such as a human body or a human face; in a vehicle recognition scenario in the security field, the ROI may be an area such as a vehicle or a license plate. As another example, in a live teaching scenario in the education industry, the ROI may be an area such as the teacher, the students, or the writing board.

For identifying the ROI, the industry has proposed a method of identifying the ROI based on deep neural networks (DNN) and then performing video encoding according to that ROI. Identifying the ROI based on deep neural networks can be divided into two methods: object detection methods and semantic segmentation methods.

Object detection methods require the DNN to determine the category of a target (also called an object) in the image frame and mark the target's position in the image frame, i.e., to complete the two tasks of target classification and localization at the same time. The target refers to the photographed entity in the image frame, such as a person or a vehicle. In practical applications, regions with convolutional neural networks (RCNN), the you only look once (YOLO) network, or the single shot multibox detector (SSD) can be used to complete the above two tasks of target classification and localization.

Semantic segmentation methods require the DNN to recognize the image frame at the pixel level (or above), i.e., to annotate the object category to which each pixel (or above) in the image frame belongs. In practical applications, a U-shaped network (U-Net), fully convolutional networks (FCN), or a mask-based RCNN (mask RCNN) can be used to identify the object category of each pixel (or above) in the image frame, for example to identify whether a pixel belongs to a license plate and thereby identify the license plate ROI.

The above object detection or semantic segmentation methods usually identify a two-level ROI, namely the object-level ROI and the characteristic part-level ROI of the object. The object-level ROI may specifically be a human body, a vehicle, etc., and the characteristic part-level ROI may be a human face, a license plate, etc. Although this method of identifying a two-level ROI and encoding based on it improves the compression rate to some extent, the quality of the encoded video stream is not high and can hardly meet the needs of services such as object recognition.
On this basis, the embodiments of this application provide a video encoding method. Considering that the video stream collected by a video capture device often occupies a large amount of storage space, in order to reduce transmission and storage pressure and reduce transmission and storage costs, the embodiment of this application first extracts an N-level ROI from the target image frame when performing video encoding. The N-level ROI includes at least an object area, a characteristic part area of the object, and an isolation area for isolating the background area in the target image frame from the above characteristic part area (in other words, when N = 3, the isolation area is located between the object area and the characteristic part area of the object). That is, in addition to extracting the object-level ROI and the characteristic part-level ROI, an isolation ROI is also extracted.

After the video encoding method provided by the embodiment of this application is used, the areas in a video frame have the following code rate characteristics.

(1) The background part outside the ROI (the background area) has the lowest code rate. The encoding strategy determined from code stream characteristics including the lowest code rate can have a larger quantization parameter. Since the quantization parameter is strongly correlated with coding quality and compression rate, encoding with a larger quantization parameter reduces the video quality but improves the video compression rate.

(2) The code rate of the object-level ROI (the object area) is higher than that of the background area; for example, the object area may have the second-lowest code rate. The encoding strategy determined from code stream characteristics including the second-lowest code rate can have a middling quantization parameter, and encoding with a middling quantization parameter balances the compression rate and the object recognition rate.

(3) The characteristic part-level ROI (the characteristic part area of the object) has a higher code rate (higher than that of the object area). The encoding strategy determined from code stream characteristics including this higher code rate can have a smaller quantization parameter, and encoding with a smaller quantization parameter guarantees coding quality and ensures the object recognition rate.

(4) The isolation ROI (the isolation area) has a higher code rate (higher than that of the background area). On the basis of the above three areas, the embodiment of the present invention additionally proposes the concept of the isolation ROI. Specifically, according to the encoding algorithm, when the pixels of any area in the target image frame are encoded, nearby pixels of that area are referenced, and the coding quality of the two is positively correlated. That is, with the encoding strategy unchanged, the higher the code rate of the pixels near an area, the higher the image quality of that area after encoding. Recognizing an object in an image is mainly based on the characteristic part area of the object. On this basis, the embodiment of this application also increases the code rate of the area near the characteristic part area of the object (i.e., the isolation area, which includes at least part of the background area), thereby improving the coding quality of the characteristic part area of the object.

Considering that the background area accounts for a large proportion of the entire target image frame while the area near the characteristic part area of the object occupies only a small range, the target image frame as a whole still has a high compression rate even after the code rate of the area near the characteristic part area is increased. In other words, compared with not increasing that code rate, the average compression rate of the entire image does not change much. The characteristic part area of the object, however, contributes greatly to object recognition; by increasing the code rate of the area near it, the coding quality of the characteristic part area can be greatly improved, and the coding quality of the target image frame is thereby significantly improved. That is, the embodiment of this application significantly improves the coding quality of the image frame while only slightly increasing its size, meeting the needs of services such as object recognition.

The video encoding method provided by this application can be applied to a variety of scenarios in different fields. Specifically, it can be applied to person/vehicle recognition scenarios in the security field, for example recognizing the faces of people entering and leaving a residential community for personnel control, or recognizing the license plates of passing vehicles for vehicle control. The above video method can also be used in video conferencing scenarios, or in live video scenarios in industries such as new media and online education, which are not described in detail here. For ease of understanding, the security scenario is used below as an example to illustrate the video encoding method.
The video encoding method provided by this application can be applied to the application scenario shown in FIG. 1. The application scenario includes a video capture device 102, a storage device 104, and a terminal device 106; the video capture device 102 is connected to the storage device 104 through a network, and the storage device 104 is connected to the terminal device 106 through a network. The video capture device 102 collects video data and transmits the video data to the storage device 104 for storage. A user can initiate a video acquisition operation through the terminal device 106, which, in response to the video acquisition operation, obtains the video data and plays it for the user to view.

The video capture device 102 may be a camera, for example an internet protocol camera (IPC) or a software define camera (SDC). A camera consists of a lens and an image sensor (for example CMOS, CCD). The storage device 104 may be a storage server or a cloud server. The terminal device may be a smartphone, a tablet, a desktop computer, and so on.

The video encoding method provided by this application is specifically executed by a video encoding apparatus. The video encoding apparatus is software or hardware, and may be deployed in the above video capture device 102 or in the storage device 104, or in a video encoding device other than the above video capture device 102, storage device 104, and terminal device 106.

As shown in FIG. 2A, the video encoding apparatus 200 may be deployed in the front-end video capture device 102. The video encoding apparatus 200 specifically includes a communication module 202, an extraction module 206, a determining module 208, and an encoding module 210. The communication module 202 is configured to obtain the original image frames collected by the video capture device 102, so as to obtain the target image frame. The extraction module 206 is configured to extract an N-level ROI from the target image frame, the N-level ROI including at least an object area, a characteristic part area of the object, and an isolation area. The isolation area is used to isolate the characteristic part of the object from the background area in the target image frame.
Code stream characteristics, including at least a code rate, are set in advance for the above N-level ROI. The code rates set for the N-level ROI satisfy the following preset relationship:

bitrate_bg < bitrate_obj < bitrate_feat, bitrate_bg < bitrate_isol             (1)

where bitrate denotes the code rate, the subscript bg denotes the background area, the subscript obj denotes the object area, the subscript feat denotes the characteristic part area (feature) of the object, and the subscript isol denotes the isolation area. That is, the background area has the lowest code rate, the code rate of the object area is less than the code rate of the characteristic part area of the object, and the code rate of the isolation area is greater than the code rate of the background area.
Further, the code rate of the isolation area may also be greater than the code rate of the object area, which can further improve the coding quality. It should be noted that the code rates of the characteristic part area of the object and of the isolation area can be set according to actual needs. In some possible implementations, the code rate of the isolation area may be set greater than the code rate of the characteristic part area of the object, as follows:

bitrate_feat < bitrate_isol             (2)

Of course, in other possible implementations, the code rate of the isolation area may also be set less than or equal to the code rate of the characteristic part area of the object, as follows:

bitrate_isol ≤ bitrate_feat             (3)
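As a minimal illustration only, relations (1) to (3) can be expressed in Python as a check on the code rates set for the four areas; the dictionary keys below are hypothetical names, not terms fixed by the application.

    def satisfies_preset_relation(rates: dict, strict_isolation: bool = False) -> bool:
        # Relation (1): bg < obj < feat, and bg < isol.
        ok = (rates["bg"] < rates["obj"] < rates["feat"]
              and rates["bg"] < rates["isol"])
        if strict_isolation:
            # Optional relation (2): the isolation area is set even
            # higher than the characteristic part area of the object.
            return ok and rates["feat"] < rates["isol"]
        return ok  # relation (3) (isol <= feat) is then also acceptable

    # e.g. satisfies_preset_relation({"bg": 1, "obj": 2, "feat": 4, "isol": 3}) -> True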
The determining module 208 determines the encoding strategy of the target image frame according to the code stream characteristics set for the above N-level ROI. The encoding module 210 is specifically configured to encode the target image frame according to the encoding strategy to obtain an encoded target video stream in whose image frames the code rate of the N-level ROI satisfies the preset relationship.

In this way, the target video stream can still maintain good compression performance, reducing the transmission bandwidth for transfer to the storage device 104 and from the storage device 104 to the terminal device 106, and saving storage space on the back-end storage device 104 and the terminal device 106. At the same time, the target video stream can also obtain good coding quality and meet the needs of services such as object recognition.

As shown in FIG. 2B, the video encoding apparatus 200 may also be deployed in the back-end storage device 104. Because the input to the video encoding apparatus 200 is an already encoded video stream, the video encoding apparatus 200 shown in FIG. 2B further includes a decoding module 204 on the basis of the video encoding apparatus 200 shown in FIG. 2A. The communication module 202 is configured to obtain the original video stream, and the decoding module is configured to decode the original video stream to obtain the original image frames, from which the target image frame can be obtained. The extraction module 206 extracts the N-level ROI from the target image frame, and the determining module 208 determines the encoding strategy of the target image frame according to the code stream characteristics set for the N-level ROI. The encoding strategy includes encoding parameters, which may include a quantization parameter; the quantization parameter characterizes the quantization step, and the smaller the quantization step, the higher the quantization precision and the higher the coding quality. In some cases, the encoding parameters may also include the coding unit block reference trend, and so on. The encoding module 210 encodes the target image frame according to the above encoding strategy to obtain the target video stream. By compressing video through the above video encoding apparatus 200, the back-end storage device 104 saves storage space and reduces storage costs, and the transmission bandwidth for video transferred from the back-end storage device 104 to the terminal device 106 is reduced, lowering transmission costs. In addition, the target video stream has good coding quality and can meet the needs of services such as object recognition.

As shown in FIG. 2C or FIG. 2D, the video encoding apparatus 200 may also be deployed in a video encoding device 108, which may be a box-type device. As shown in FIG. 3, the video encoding device 108 includes an input interface 1082, an output interface 1084, a processor 1086, and a memory 1088. The memory 1088 stores instructions and data; the instructions may specifically include the instructions of the video encoding apparatus 200, and the data may specifically include the original video stream input through the input interface 1082. The processor 1086 can invoke the instructions of the video encoding apparatus 200 to decode the original video stream, then extract the N-level ROI including the isolation area, determine the encoding strategy based on the code stream characteristics set for the N-level ROI, and encode based on that strategy to obtain the target video stream, thereby achieving video compression. The memory 1088 also stores the above compressed target video stream, and the output interface 1084 can output the target video stream for viewing.

The input interface 1082 and the output interface 1084 may be network ports, for example Ethernet ports. For an Ethernet port, the video encoding device 108 may use an Ethernet card as the network port driver chip. The processor 1086 may be a camera system on chip (camera SOC); as an example, the camera SOC may be a Hi3559.

When the video encoding device 108 is deployed between the video capture device 102 and the storage device 104, the video stream enters the video encoding device 108 from the front-end video capture device 102 for video encoding, thereby achieving video compression. The video encoding device 108 extracts the N-level ROI, determines the encoding strategy according to the code stream characteristics set for the N-level ROI, and re-encodes the video on this basis, guaranteeing video coding quality while achieving video compression. The compressed video stream can be stored on the back-end storage device 104 over the network, reducing the bandwidth into the storage device 104 and saving storage space.

When the video encoding device 108 is deployed between the storage device 104 and the terminal device 106, a user requests video data, which enters the video encoding device 108 from the back-end storage device 104. The video encoding device 108 identifies the multi-level ROI, and a relatively precise encoding strategy can be determined based on the code stream characteristics set for the multi-level ROI. The target image frame is encoded based on that strategy, compressing the video as much as possible while guaranteeing coding quality, so that the bandwidth of the terminal device 106 is reduced, stalling is avoided, and the user experience is improved.

It should be noted that, similarly to FIG. 2B, the input to the video encoding apparatus 200 in FIG. 2C and FIG. 2D is a video stream; on this basis, the video encoding apparatus 200 further includes a decoding module for decoding the video stream, after which encoding is performed according to the method of this application to obtain the target video stream.

To make the technical solutions of this application easier to understand, the video encoding method of this application is introduced below with reference to specific embodiments.
Referring to the flowchart of the video encoding method shown in FIG. 4, the method includes:

S402: the video encoding apparatus 200 obtains a target image frame.

When the video encoding apparatus 200 is deployed in the video capture device 102, the video encoding apparatus 200 can obtain the original image frames directly. An original image frame is an image frame generated by photoelectric conversion of a photosensitive image sensor, such as a CMOS or CCD, in the video capture device 102. After obtaining an original image frame, the video encoding apparatus 200 can obtain the target image frame from the original image frame, so as to encode based on the target image frame.

In some cases, the video capture device 102 further includes an image signal processor (ISP), and the video capture device 102 can process the original image frames through the ISP, for example performing defective-pixel removal, de-noising, rotation, sharpening and/or color enhancement. The video encoding apparatus 200 can also obtain the ISP-processed image frames and obtain the target image frame from the ISP-processed image frames, so as to encode based on the target image frame.

When the video encoding apparatus 200 is deployed in the storage device 104 or the video encoding device 108, the video encoding apparatus 200 can obtain an original video stream. The original video stream is the video stream input to the video encoding apparatus 200, which may be generated by the video capture device 102 collecting original image frames and encoding the original image frames (or, when the video capture device 102 includes an ISP, the ISP-processed image frames). On this basis, after obtaining the original video stream, the video encoding apparatus 200 can decode the original video stream to obtain the original image frames (or, when the video capture device 102 includes an ISP, the ISP-processed image frames), from which the target image frame can be obtained for encoding.

For ease of description, the embodiments of this application collectively refer to the original image frames and the ISP-processed image frames as reference image frames. In some implementations, the video encoding apparatus 200 can directly use the reference image frame as the target image frame for encoding, which guarantees the quality of the encoded video. Of course, considering real-time requirements, in some implementations the video encoding apparatus 200 can also down-sample the reference image frame and use the image frame obtained by down-sampling as the target image frame.

The following takes down-sampling the original image frame as an example. Down-sampling the original image frame specifically refers to down-sampling in the spatial dimension. For ease of understanding, a specific example follows: the size of the original image frame is 1080*720, and the video encoding apparatus 200 can convert each 4*4 pixel block of the original image frame into one new pixel to obtain the target image frame; the resolution of the target image frame is thus reduced and its size shrinks to 270*180. In some implementations, the video encoding apparatus 200 can also down-sample in the temporal dimension to reduce the amount of computation and meet real-time requirements. The processing of a pixel block by the video encoding apparatus 200 may be taking the mean of the pixel values as the pixel value of the new pixel; of course, in some possible implementations the processing of a pixel block may also be taking the median, maximum, or minimum of the pixel values, which is not limited in the embodiments of this application.
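The 4*4 block-averaging example above can be sketched in Python as follows (illustration only; as noted, the mean can be replaced by the median, maximum, or minimum).

    import numpy as np

    def downsample(frame: np.ndarray, n: int = 4) -> np.ndarray:
        # Spatial down-sampling: each n x n pixel block becomes one pixel,
        # e.g. a 720x1080 frame becomes 180x270 for n = 4.
        h, w = frame.shape[:2]
        blocks = frame[:h - h % n, :w - w % n].reshape(h // n, n, w // n, n, -1)
        return blocks.mean(axis=(1, 3)).astype(frame.dtype)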
It should be noted that an image frame generally consists of multiple pixels, each having a pixel value. Adjacent pixels in an image frame can form a pixel block (one pixel can also be regarded as a 1*1 pixel block); converting an n*n (n greater than 1) pixel block into a 1*1 pixel block according to the pixel values achieves down-sampling of the image frame. In addition, when an image frame is encoded, the encoding can also be performed in units of pixel blocks formed by multiple adjacent pixels; the smallest pixel block used to encode an image frame is the coding unit block. The size of the coding unit block may be the same as or different from the size of the pixel block used for down-sampling, which is not limited in this embodiment.

In addition, each pixel in the image frame has its own ROI type, which specifically indicates whether the pixel belongs to the object area, the characteristic part area of the object, the isolation area, the background area, and so on; pixels of the same ROI type cluster to form the corresponding area, i.e., the object area, the characteristic part area of the object, the isolation area, the background area, and so on.

S404: the video encoding apparatus 200 extracts an N-level ROI from the target image frame.

The N-level ROI includes at least an object area, a characteristic part area of the object, and an isolation area. For ease of description, the object area is called the first-level region of interest, i.e., the L1-ROI; the characteristic part area of the object is called the second-level region of interest, i.e., the L2-ROI; and the isolation area is called the third-level region of interest, i.e., the L3-ROI. The target image frame also includes a background area.

In specific implementations, the video encoding apparatus 200 can regard the areas of the target image frame other than the ROI as the background area. The L1-ROI is the area corresponding to an object in the target image frame, where the object is an entity in the target image frame that needs to be detected or controlled, such as a human body or a vehicle. The L2-ROI is the area corresponding to a characteristic part of the object in the target image frame, where a characteristic part is a part used to recognize the object: for a human body the characteristic part may be the face or the head and shoulders, and for a vehicle it may be the license plate. The L3-ROI is the area in the target image frame that isolates the background area from the L2-ROI; considering the positional relationship between the background area and the L2-ROI, the L3-ROI may be an area surrounding the L2-ROI.

For ease of understanding, the embodiment of this application also illustrates the above background area, L1-ROI, L2-ROI, and L3-ROI with a person recognition scenario. As shown in FIG. 5, the video encoding apparatus 200 identifies human bodies from the target image frame, and then for each identified human body extracts the face area (in some cases, the head-and-shoulders area including the face) as the L2-ROI; see the area outlined by the innermost dashed box in FIG. 5. It then extracts the area surrounding the L2-ROI as the L3-ROI; see the area outlined around the above innermost dashed box in FIG. 5. Next, the human body area excluding the L2-ROI and L3-ROI areas is taken as the L1-ROI; see the area outlined by the dashed box adjacent to the dashed box corresponding to the L3-ROI in FIG. 5. Finally, the area of the target image frame excluding the L1-ROI, L2-ROI, and L3-ROI is extracted as the background area.

FIG. 5 illustrates the case where the isolation area L3-ROI completely surrounds the characteristic part area L2-ROI of the object. In some possible implementations, the isolation area L3-ROI may also partially enclose the characteristic part area L2-ROI, achieving the isolation of the background area from the characteristic part area L2-ROI. For example, in the example of FIG. 5, the isolation area L3-ROI may include only the part in contact with the background area, and not the part in contact with the object area L1-ROI.

When extracting the N-level ROI, the video encoding apparatus 200 can use a pre-trained ROI extraction model, which may be trained from a simplified semantic segmentation network whose backbone is a CNN. The input of the ROI extraction model is a single target image frame, and the output is the ROI category at the granularity of an area of a certain size (for example 16*16 or 4*4), where the ROI category specifically includes L1, L2, L3, or non-ROI (in this embodiment, the background).
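Purely to illustrate this output granularity, the sketch below reduces per-pixel class scores from a segmentation network to one ROI category per 16*16 region; the class indices are hypothetical.

    import numpy as np

    BG, L1, L2, L3 = 0, 1, 2, 3  # hypothetical category indices

    def block_roi_labels(pixel_scores: np.ndarray, block: int = 16) -> np.ndarray:
        # pixel_scores: (H, W, C) class scores; returns one label per block.
        h, w, c = pixel_scores.shape
        bh, bw = h // block, w // block
        blocks = pixel_scores[:bh * block, :bw * block].reshape(bh, block, bw, block, c)
        return blocks.mean(axis=(1, 3)).argmax(axis=-1)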
In some implementations, the video encoding apparatus 200 can first extract the L1-ROI and the L2-ROI from the target image frame; of course, in practical applications, the video encoding apparatus 200 can also extract the background area at the same time. Specifically, the video encoding apparatus 200 can use the ROI extraction model to identify, at the granularity of an area of a certain size in the single target image frame, whether the ROI category is L1, L2, or non-ROI, thereby extracting the L1-ROI, the L2-ROI, and the background area. It should be noted that, considering that incomplete parts may be produced during model inference, the above L1-ROI and L2-ROI can be dilated. Dilation specifically refers to convolving the area to be dilated so that the area to be dilated expands outward; the convolution of the area to be dilated is based on a preset convolution kernel.

After extracting the L2-ROI, the video encoding apparatus 200 can perform dilation processing on the L2-ROI, i.e., convolve the L2-ROI with the convolution kernel so that the L2-ROI expands outward, obtaining L2-ROI'. The video encoding apparatus 200 can take the difference between the dilated L2-ROI' and the pre-dilation L2-ROI to obtain the L3-ROI; the L3-ROI is the difference between L2-ROI' and L2-ROI.
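A minimal sketch of this dilation step, assuming OpenCV is available and the L2-ROI is given as a binary mask; the 15*15 kernel is an assumed setting that controls the width of the isolation ring.

    import cv2
    import numpy as np

    def isolation_region(l2_mask: np.ndarray, ksize: int = 15) -> np.ndarray:
        # L3-ROI = dilate(L2-ROI) - L2-ROI: the ring gained by dilation.
        kernel = np.ones((ksize, ksize), np.uint8)
        dilated = cv2.dilate(l2_mask, kernel)   # L2-ROI'
        return cv2.subtract(dilated, l2_mask)   # difference before/after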
The above illustrates the case where the N-level ROI is a 3-level ROI. In practical applications, the N-level ROI may also be a 4-level or higher ROI. For example, in a person recognition scenario in the security field, the video encoding apparatus 200 can also extract at least one area among the eyes, nose, and mouth of the face to obtain an enhanced area, also called the fourth-level region of interest, i.e., the L4-ROI. The L4-ROI has a higher code rate, at least not lower than that of the L2-ROI. On the basis of the L2-ROI, the higher-code-rate L4-ROI can enhance the ability to recognize faces and improve the recognition rate. In this embodiment, the L4-ROI is located within the L2-ROI; in other embodiments, the L4-ROI may also be located elsewhere, for example between the L1-ROI and the L3-ROI, or between the L3-ROI and the background area, which is not detailed here.

S406: the video encoding apparatus 200 determines the encoding strategy of the target image frame according to the code stream characteristics set for the N-level region of interest.

Code stream characteristics are characteristics correlated with the compression rate and/or quality of the video stream, and include at least the code rate. In the embodiment of this application, the code rates set for the N-level ROI satisfy the preset relationship, i.e., the code rate of the background area is less than the code rate of the L1-ROI and the code rate of the L3-ROI, and the code rate of the L1-ROI is less than the code rate of the L2-ROI.

In other words, the code rate set for the background area is the lowest, and the code rates set for the L1-ROI and the L3-ROI are higher than that of the background area; for example, the L1-ROI may be the second lowest, and the code rate of the L3-ROI may be higher than that of the L1-ROI. The code rate set for the L2-ROI is relatively high, specifically higher than that set for the L1-ROI. The code rate set for the L2-ROI may be less than that set for the L3-ROI; of course, the code rate set for the L2-ROI may also be greater than or equal to that set for the L3-ROI.

In some implementations, the code stream characteristics also include the size of the characteristic part of the object, such as the size of the face/license plate. When the input to the video encoding apparatus 200 is a video stream, the code stream characteristics also include the input code rate. When the object in the target image frame is in motion, the code stream characteristics may also include the motion complexity.

In some implementations, the code stream characteristics may also include the ROI proportion. The ROI proportion can be determined from the sequence of ratios of the ROI size to the target image frame size over a preset period of time. For example, the mean or the median of the ratio sequence can be taken as the ROI proportion; of course, the video encoding apparatus 200 can also determine the ROI proportion from the range into which most ratios in the sequence fall. The above preset period is specifically a period determined based on at least one group of pictures (GOP). Assuming a GOP of 150 frames and a frame change rate of 25 frames per second, the preset period may be 6 seconds.
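As an illustration only, the ROI proportion over such a GOP-based window could be computed from the ratio sequence as follows.

    import numpy as np

    def roi_proportion(roi_sizes, frame_size, use_median: bool = True) -> float:
        # roi_sizes: per-frame ROI areas over the preset window (e.g. one
        # 150-frame GOP at 25 fps, i.e. 6 s); frame_size: frame area.
        ratios = np.asarray(roi_sizes, dtype=np.float64) / float(frame_size)
        return float(np.median(ratios) if use_median else ratios.mean())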
The encoding strategy includes encoding parameters, which may include the quantization parameter (QP). The QP characterizes the quantization step: the smaller the QP value, the smaller the quantization step, the higher the quantization precision, and the higher the coding quality. It should be noted that the QP usually exists in the form of a QP map matching the size of the target image frame. The video encoding apparatus 200 can determine the quantization parameter of each area according to the code stream characteristics of that area, and then form the QP map of the entire target image frame from the quantization parameters of the areas, thereby obtaining the encoding strategy.
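A sketch of assembling such a QP map, under assumed per-region QP values that respect the preset rate relation (a lower QP yields finer quantization and hence a higher code rate); neither the labels nor the values are fixed by the application.

    import numpy as np

    BG, L1, L2, L3 = 0, 1, 2, 3                      # hypothetical labels
    QP_BY_REGION = {BG: 40, L1: 32, L3: 28, L2: 24}  # assumed values only

    def build_qp_map(label_map: np.ndarray) -> np.ndarray:
        # label_map: one region label per block; result: one QP per block,
        # which together form the QP map of the whole target image frame.
        qp_map = np.empty(label_map.shape, dtype=np.int32)
        for region, qp in QP_BY_REGION.items():
            qp_map[label_map == region] = qp
        return qp_map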
In some possible implementations, the encoding parameters also include the coding unit (CU) block reference trend, which can take the value intra, inter, or skip. The encoding parameters may also include the GOP mode, which can take the value normal mode (nomalp) or smart mode (smartp). For objects in motion, the encoding parameters may also include the motion-compensation QP.

When determining the encoding strategy, the video encoding apparatus 200 can take the code stream characteristics set for the N-level ROI as the encoding target, for example taking the code rates set for the N-level ROI as the encoding target, establish a system of equations in the encoding parameters, and obtain, by solving the system of equations, the encoding parameters that satisfy the above encoding target, as the encoding strategy.

In some implementations, the video encoding apparatus 200 can determine the encoding parameters of the target image frame by using an encoding strategy model according to the code stream characteristics set for the N-level region of interest, thereby obtaining the encoding strategy of the target image frame. The encoding strategy model contains the correspondence between code stream characteristics and encoding parameters; the video encoding apparatus 200 inputs the code stream characteristics into the encoding strategy model, which can determine the encoding parameters matching the code stream characteristics based on the above correspondence and output those encoding parameters, thereby obtaining the encoding strategy.
The encoding strategy model can be obtained by model training with historical video streams as training samples. Specifically, the video encoding apparatus 200 can obtain massive historical video streams as training video streams, then encode the training video streams with the encoding strategies in the encoding strategy library, and determine the compression rate and quality of the video streams encoded with the various encoding strategies, where quality can be characterized by one or more of the peak signal to noise ratio (PSNR) and the structural similarity index (SSIM). The video encoding apparatus 200 can select the target encoding strategy matching the training video stream according to the compression rate and the quality; for example, the video encoding apparatus 200 can take the encoding strategy used by an encoded video stream with both a higher compression rate and higher quality as the target encoding strategy.

The video encoding apparatus 200 can use the target encoding strategy as the label of the training video stream; in this way, a training video stream and its target encoding strategy form a training sample, and the video encoding apparatus 200 can perform model training with these training samples to obtain the encoding strategy model.
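The labeling procedure can be sketched as follows; the encode, psnr, and ssim callables, the byte-string representation of streams, and the equal-weight score combining compression and quality are all assumptions, since the application does not fix how compression rate and quality are traded off.

    def build_training_samples(train_streams, strategy_library, encode, psnr, ssim):
        samples = []
        for stream in train_streams:               # streams as byte strings
            scored = []
            for strategy in strategy_library:
                encoded = encode(stream, strategy)
                quality = 0.5 * psnr(stream, encoded) + 0.5 * ssim(stream, encoded)
                compression = len(stream) / max(len(encoded), 1)
                scored.append((compression * quality, strategy))
            best = max(scored, key=lambda s: s[0])[1]
            samples.append((stream, best))         # strategy as the label
        return samples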
In some possible implementations, the video encoding apparatus 200 can cluster the encoding strategies to generate the encoding strategy library, for example by at least one of factors such as scene, resolution, time, or weather. Scenes may include animation, surveillance, action movies, ordinary movies, sports events, news and/or games; resolutions may include 720P, 1080P, 2K, or 4K, etc.; time may include day or night; weather may include cloudy, sunny, rainy, foggy, snowy, and so on. The video encoding apparatus 200 can traverse the encoding strategy library by the above categories and encode based on each encoding strategy under each category.

It should also be noted that, considering that some areas may overlap, for example when ROIs of different levels overlap, the video encoding apparatus 200 can use the encoding strategy model to determine the encoding parameters corresponding to the overlapping area according to the code stream characteristics set for the ROI with the higher code rate. For example, when the L2-ROI and the L3-ROI overlap, the encoding parameters corresponding to the overlapping area can be determined by the encoding strategy model based on the code stream characteristics set for the L3-ROI.
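As a sketch, the overlap rule amounts to keeping, for each overlapped block, the code stream characteristics of the highest-code-rate ROI level covering it; the data layout below is an assumption.

    def resolve_overlap(levels_at_block, characteristics):
        # levels_at_block: ROI levels covering one block, e.g. ["L2", "L3"];
        # characteristics: level -> its code stream characteristics,
        # which include the code rate set for that level.
        top = max(levels_at_block, key=lambda lv: characteristics[lv]["bitrate"])
        return characteristics[top]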
S408: the video encoding apparatus 200 encodes the target image frame according to the encoding strategy to obtain the target video stream.

Specifically, the video encoding apparatus 200 can quantize the target image frame according to the encoding parameters, in particular the QP map among the encoding parameters, thereby implementing video encoding. It should be noted that the encoding parameters may also include the CU block reference trend, the GOP mode, and/or the motion-compensation QP. When encoding, the video encoding apparatus 200 can also determine whether to adopt intra prediction, inter prediction, or skip based on the CU block reference trend; of course, the video encoding apparatus can also determine whether to adopt nomalp or smartp based on the GOP mode. For objects in motion, the video encoding apparatus 200 can also quantize the target image frame based on the motion-compensation QP.

Because the above encoding strategy is determined according to the code rates set for the N-level ROI, the image frames of the target video stream encoded according to this strategy satisfy the preset code rate relationship: in the image frames of the target video stream, the code rate of the background area is less than the code rate of the object area, and the code rate of the object area is less than the code rate of the characteristic part area of the object and the code rate of the isolation area.

The background area in the image frames of the target video stream has the lowest code rate; although the quality of the background area is reduced, the compression rate of the background area is improved. The code rate of the object area in the image frames of the target video stream is higher than that of the background area, so the quality and compression rate of the object area are balanced. The characteristic part area of the object in the image frames of the target video stream has a higher code rate; although the compression rate of the characteristic part area is reduced, its quality is improved, which guarantees the object recognition rate.

Moreover, the embodiment of this application also increases the code rate of the area near the characteristic part area of the object (the area with the increased code rate forms the isolation area described above). Because the pixels of an area refer to nearby pixels during encoding, the coding quality of the characteristic part area of the object is affected by the isolation area; since the isolation area has a higher code rate, the coding quality of the characteristic part area of the object is also improved, which can meet the needs of services such as object recognition.

Because the background area accounts for a large proportion of the entire target image frame, the target image frame as a whole still has a high compression rate even after the code rate of the area near the characteristic part area is increased. The characteristic part area of the object contributes greatly to object recognition; by increasing the code rate of the area near it, the coding quality of the characteristic part area can be greatly improved, and the coding quality of the target image frame is thereby significantly improved. That is, the embodiment of this application significantly improves the coding quality of the image frame while only slightly increasing its size.

The above mainly takes the encoding of online video as an example; the video encoding method provided in the embodiments of this application can also be used to compress or convert the format of local video files. Specifically, after a local video stream is input, the local video stream is decapsulated and then decoded to obtain the target image frame; the N-level ROI including the object-level ROI, the characteristic part-level ROI, and the isolation ROI is extracted from the target image frame; the encoding strategy is determined based on the code stream characteristics set for the N-level ROI; and encoding is performed based on that strategy, thereby achieving local video file compression or format conversion.
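For the local-file variant, the demultiplex-decode-re-encode loop could look like the following sketch using the PyAV library (an assumption; any demuxing and decoding library would serve). The ROI extraction and strategy steps are left as a placeholder comment, and the H.264 codec and 25 fps output are assumed settings.

    import av

    def recompress(path_in: str, path_out: str) -> None:
        src = av.open(path_in)                  # decapsulate the local file
        dst = av.open(path_out, mode="w")
        out = dst.add_stream("h264", rate=25)
        out.width = src.streams.video[0].codec_context.width
        out.height = src.streams.video[0].codec_context.height
        for frame in src.decode(video=0):       # decoded target image frames
            # ...extract the N-level ROI and derive the strategy here...
            for packet in out.encode(frame.reformat(format="yuv420p")):
                dst.mux(packet)
        for packet in out.encode():             # flush the encoder
            dst.mux(packet)
        dst.close()
        src.close()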
The video encoding method provided by the embodiments of this application has been introduced above with reference to FIG. 1 to FIG. 5; next, the related apparatus and devices provided by this application are introduced with reference to the drawings.

Referring to the schematic structural diagram of the video encoding apparatus 200 shown in FIG. 2A, the video encoding apparatus 200 includes:

the communication module 202, configured to obtain a target image frame;

the extraction module 206, configured to extract an N-level region of interest from the target image frame, the N-level region of interest including at least an object area, a characteristic part area of the object, and an isolation area, the isolation area being used to isolate the background area in the target image frame from the characteristic part area of the object, the code rate of the background area being less than the code rate of the object area, and the code rate of the object area being less than the code rate of the characteristic part area of the object and the code rate of the isolation area;

the determining module 208, configured to determine the encoding strategy of the target image frame according to the code stream characteristics set for the N-level region of interest, the code stream characteristics including at least a code rate, the code rates set for the N-level region of interest satisfying a preset relationship, the preset relationship including that the code rate of the background area is less than the code rate of the object area and the code rate of the isolation area, and the code rate of the object area is less than the code rate of the characteristic part area of the object;

the encoding module 210, configured to encode the target image frame according to the encoding strategy to obtain a target video stream, where the code rate of the N-level region of interest in the image frames of the target video stream satisfies the preset relationship.
In practical applications, referring to FIG. 2B to FIG. 2D, the video encoding apparatus 200 may further include a decoding module 204. The communication module 202 is specifically configured to obtain an original video stream, which is a video stream obtained by encoding original image frames (or, when the video capture device includes an ISP, ISP-processed image frames). The decoding module 204 is configured to decode the original video stream to obtain the original image frames (or the ISP-processed image frames), from which the target image frame can be obtained. The extraction module 206 extracts the N-level region of interest from the target image frame, the determining module 208 determines the encoding strategy of the target image frame according to the code stream characteristics set for the N-level region of interest, and the encoding module 210 encodes the target image frame based on the encoding strategy to obtain a target video stream in whose image frames the code rate of the N-level region of interest satisfies the preset relationship.

In some possible implementations, the extraction module 206 is specifically configured to:

extract the object area and the characteristic part area of the object from the target image frame;

perform dilation processing on the characteristic part area of the object to obtain the isolation area, the isolation area being the difference of the characteristic part area of the object before and after the dilation processing.

In some possible implementations, the encoding strategy includes encoding parameters, and the determining module 208 is specifically configured to:

determine the encoding parameters of the target image frame by using an encoding strategy model according to the code stream characteristics set for the N-level region of interest;

the encoding module 210 is specifically configured to:

quantize the target image frame according to the encoding parameters, thereby implementing video encoding.

In some possible implementations, the determining module 208 is specifically configured to:

when regions of interest of different levels overlap, determine, by using the encoding strategy model, the encoding parameters corresponding to the overlapping area according to the code stream characteristics set for the region of interest with the higher code rate.

In some possible implementations, the apparatus 200 further includes:

a training module, configured to obtain a training video stream, encode the training video stream with the encoding strategies in an encoding strategy library, determine the compression rate and quality of the encoded video streams, determine a target encoding strategy matching the training video stream according to the compression rate and the quality, and perform model training with the training video stream and the target encoding strategy to obtain the encoding strategy model.

In some possible implementations, the apparatus 200 further includes:

a down-sampling module, configured to down-sample a reference image frame to obtain the target image frame.

In some possible implementations, the preset relationship further includes: the code rate of the object area is less than the code rate of the isolation area.
The video encoding apparatus 200 according to the embodiments of this application may correspondingly perform the video encoding method described in the embodiments of this application, and the above and other operations and/or functions of the modules in the video encoding apparatus 200 are respectively intended to implement the corresponding procedures of the methods in FIG. 4; for brevity, they are not repeated here.

The functions of the above video encoding apparatus 200 may be implemented by an independent video encoding device 108, for whose specific implementation reference may be made to the description of the related content in FIG. 3. In addition, the functions of the above encoding apparatus 200 may also be implemented by the video capture device 102 or the storage device 104.
FIG. 6 and FIG. 7 also each provide a device. The device shown in FIG. 6 implements the functions of the video encoding apparatus 200 in addition to the video capture function, and the device shown in FIG. 7 implements the functions of the video encoding apparatus 200 in addition to the video storage function.

The device 600 includes a bus 601, a processor 602, a communication interface 603, a memory 604, and an image sensor 605. The processor 602, the memory 604, the communication interface 603, and the image sensor 605 communicate through the bus 601. The bus 601 may be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus, etc., and can be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is used in FIG. 6, which does not mean that there is only one bus or one type of bus. The communication interface 603 is used for communication with the outside.

The image sensor 605 is a photoelectric conversion device: using its photoelectric conversion function, it converts the light image on its photosensitive surface into an electrical signal proportional to the light image, thereby obtaining original image frames, from which the target image frame can be obtained. The processor 602 may be a central processing unit (CPU). The memory 604 may include volatile memory, such as random access memory (RAM), and may also include non-volatile memory, such as read-only memory (ROM), flash memory, an HDD, or an SSD.

The memory 604 stores executable code, and the processor 602 executes the executable code to perform the aforementioned video encoding method.

Specifically, when the embodiment shown in FIG. 2A is implemented and the modules of the video encoding apparatus 200 described in the FIG. 2A embodiment are implemented in software, the software or program code required for the functions of the extraction module 206, the determining module 208, and the encoding module 210 in FIG. 2A is stored in the memory 604. The functions of the communication module 202 may be implemented through the bus 601 and the communication interface 603, where the bus 601 is used for internal communication and the communication interface 603 for communication with the outside. The original image frames generated by the photoelectric conversion of the image sensor 605 may be transmitted to the processor 602 via the bus 601 for down-sampling to obtain the target image frame, which may be stored in the memory 604; the processor 602 executes the program code corresponding to each module stored in the memory 604 and performs the video encoding method provided in the embodiments of this application on the target image frame to obtain the target video stream.

It should be noted that, in some possible implementations, the processor 602 may further include an image signal processor (ISP). The ISP can process the original image frames generated by the image sensor 605, for example performing de-noising and sharpening. The processor 602 down-samples the ISP-processed image frames to obtain the target image frame, executes the program code corresponding to each module stored in the memory 604, and performs the video encoding method provided in the embodiments of this application on the target image frame to obtain the target video stream. It should be noted that the memory 604 may also store the above target video stream and transmit the target video stream to other devices through the communication interface 603.

The device 700 includes a bus 701, a processor 702, a communication interface 703, and a memory 704. The processor 702, the memory 704, and the communication interface 703 communicate through the bus 701. When the device 700 implements the embodiment shown in FIG. 2B and the modules described in the FIG. 2B embodiment are implemented in software, the software or program code required for the functions of the decoding module 204, the extraction module 206, the determining module 208, and the encoding module 210 in FIG. 2B is stored in the memory 704. The functions of the communication module 202 may be implemented through the bus 701 and the communication interface 703, where the bus 701 is used for internal communication and the communication interface 703 for external communication.

Specifically, the communication interface 703 may receive a video stream input by another device, referred to as an original video stream in the embodiments of this application. The processor 702 executes the program code corresponding to each module stored in the memory 704 and performs the video encoding method provided in the embodiments of this application on the original video stream. This specifically includes decoding the original video stream to obtain the original image frames (or ISP-processed image frames), down-sampling these image frames to obtain the target image frame, extracting from the target image frame the N-level ROI including the object area, the characteristic part area of the object, and the isolation area, determining the encoding strategy of the target image frame based on the code stream characteristics set for the N-level ROI, and encoding the target image frame according to the encoding strategy of the target image frame to obtain the target video stream.
An embodiment of this application also provides a computer-readable storage medium, including instructions which, when run on a computer device, cause the computer device to execute the above video encoding method applied to the video encoding apparatus 200.
An embodiment of this application also provides a computer program product; when the computer program product is executed by a computer device, the computer device performs any one of the foregoing video encoding methods. The computer program product may be a software installation package, which can be downloaded and executed on a computer whenever any one of the foregoing video encoding methods needs to be used. It should also be noted that the apparatus embodiments described above are merely illustrative: the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, i.e., they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solutions of the embodiments. In addition, in the drawings of the apparatus embodiments provided by this application, the connection relationship between modules indicates that they have a communication connection between them, which may be specifically implemented as one or more communication buses or signal lines.

Through the description of the above implementations, those skilled in the art can clearly understand that this application can be implemented by means of software plus the necessary general-purpose hardware, and certainly also by dedicated hardware including application-specific integrated circuits, dedicated CPUs, dedicated memory, dedicated components, and so on. In general, all functions completed by a computer program can easily be implemented with corresponding hardware, and the specific hardware structures used to achieve the same function can be diverse, such as analog circuits, digital circuits, or dedicated circuits. For this application, however, a software program implementation is the better implementation in most cases. Based on this understanding, the technical solution of this application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a readable storage medium, such as a computer floppy disk, a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disc, and includes several instructions to make a computer device (which may be a personal computer, a training device, or a network device, etc.) execute the methods described in the embodiments of this application.

In the above embodiments, implementation may be in whole or in part by software, hardware, firmware, or any combination thereof. When software is used, implementation may be in whole or in part in the form of a computer program product.

The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions according to the embodiments of this application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center by wired (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (such as infrared, radio, or microwave) means. The computer-readable storage medium may be any available medium that a computer can store, or a data storage device such as a training device or data center integrating one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)), etc.

Claims (16)

  1. A video encoding method, characterized in that the method comprises:
    extracting an N-level region of interest from a target image frame, the N-level region of interest comprising at least an object area, a characteristic part area of an object, and an isolation area, the isolation area being used to isolate a background area in the target image frame from the characteristic part area of the object;
    determining an encoding strategy of the target image frame according to code stream characteristics set for the N-level region of interest, the code stream characteristics comprising at least a code rate, the code rates set for the N-level region of interest satisfying a preset relationship, the preset relationship comprising that the code rate of the background area is less than the code rate of the object area and the code rate of the isolation area, and the code rate of the object area is less than the code rate of the characteristic part area of the object;
    encoding the target image frame according to the encoding strategy to obtain a target video stream, wherein the code rate of each level of region of interest among the N-level region of interest in the image frames of the target video stream satisfies the preset relationship.
  2. The method according to claim 1, characterized in that the extracting an N-level region of interest from a target image frame comprises:
    extracting the object area and the characteristic part area of the object from the target image frame;
    performing dilation processing on the characteristic part area of the object to obtain the isolation area, the isolation area being the difference of the characteristic part area of the object before and after the dilation processing.
  3. The method according to claim 1 or 2, characterized in that the encoding strategy comprises encoding parameters, and the determining an encoding strategy of the target image frame according to code stream characteristics set for the N-level region of interest comprises:
    determining the encoding parameters of the target image frame by using an encoding strategy model according to the code stream characteristics set for the N-level region of interest;
    and the encoding the target image frame according to the encoding strategy comprises:
    quantizing the target image frame according to the encoding parameters, thereby implementing video encoding.
  4. The method according to claim 3, characterized in that the determining the encoding parameters of the target image frame by using an encoding strategy model according to the code stream characteristics set for the N-level region of interest comprises:
    when regions of interest of different levels overlap, determining, by using the encoding strategy model, the encoding parameters corresponding to the overlapping area according to the code stream characteristics set for the region of interest with the higher code rate.
  5. The method according to claim 3, characterized in that the method further comprises:
    obtaining a training video stream;
    encoding the training video stream with the encoding strategies in an encoding strategy library, and determining the compression rate and quality of the encoded video streams;
    determining, according to the compression rate and the quality, a target encoding strategy matching the training video stream;
    performing model training with the training video stream and the target encoding strategy to obtain the encoding strategy model.
  6. The method according to any one of claims 1 to 5, characterized in that the method further comprises:
    down-sampling a reference image frame to obtain the target image frame.
  7. The method according to any one of claims 1 to 5, characterized in that the preset relationship further comprises: the code rate of the object area is less than the code rate of the isolation area.
  8. A video encoding apparatus, characterized in that the apparatus comprises:
    an extraction module, configured to extract an N-level region of interest from a target image frame, the N-level region of interest comprising at least an object area, a characteristic part area of an object, and an isolation area, the isolation area being used to isolate a background area in the target image frame from the characteristic part area of the object;
    a determining module, configured to determine an encoding strategy of the target image frame according to code stream characteristics set for the N-level region of interest, the code stream characteristics comprising at least a code rate, the code rates set for the N-level region of interest satisfying a preset relationship, the preset relationship comprising that the code rate of the background area is less than the code rate of the object area and the code rate of the isolation area, and the code rate of the object area is less than the code rate of the characteristic part area of the object;
    an encoding module, configured to encode the target image frame according to the encoding strategy to obtain a target video stream, wherein the code rate of each level of region of interest among the N-level region of interest in the image frames of the target video stream satisfies the preset relationship.
  9. The apparatus according to claim 8, characterized in that the extraction module is specifically configured to:
    extract the object area and the characteristic part area of the object from the target image frame;
    perform dilation processing on the characteristic part area of the object to obtain the isolation area, the isolation area being the difference of the characteristic part area of the object before and after the dilation processing.
  10. The apparatus according to claim 8 or 9, characterized in that the encoding strategy comprises encoding parameters, and the determining module is specifically configured to:
    determine the encoding parameters of the target image frame by using an encoding strategy model according to the code stream characteristics set for the N-level region of interest;
    the encoding module is specifically configured to:
    quantize the target image frame according to the encoding parameters, thereby implementing video encoding.
  11. The apparatus according to claim 10, characterized in that the determining module is specifically configured to:
    when regions of interest of different levels overlap, determine, by using the encoding strategy model, the encoding parameters corresponding to the overlapping area according to the code stream characteristics set for the region of interest with the higher code rate.
  12. The apparatus according to claim 10, characterized in that the apparatus further comprises:
    a training module, configured to obtain a training video stream, encode the training video stream with the encoding strategies in an encoding strategy library, determine the compression rate and quality of the encoded video streams, determine a target encoding strategy matching the training video stream according to the compression rate and the quality, and perform model training with the training video stream and the target encoding strategy to obtain the encoding strategy model.
  13. The apparatus according to any one of claims 8 to 12, characterized in that the apparatus further comprises:
    a down-sampling module, configured to down-sample a reference image frame to obtain the target image frame.
  14. The apparatus according to any one of claims 8 to 12, characterized in that the preset relationship further comprises: the code rate of the object area is less than the code rate of the isolation area.
  15. A device, characterized in that the device comprises a processor and a memory;
    the processor is configured to execute the instructions stored in the memory, so that the device performs the video encoding method according to any one of claims 1 to 7.
  16. A computer-readable storage medium, comprising instructions which, when run on a device, cause the device to perform the video encoding method according to any one of claims 1 to 7.
PCT/CN2020/108788 2020-02-21 2020-08-13 一种视频编码方法、装置、设备及介质 WO2021164216A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010108310.0A CN113301336A (zh) 2020-02-21 2020-02-21 一种视频编码方法、装置、设备及介质
CN202010108310.0 2020-02-21

Publications (1)

Publication Number Publication Date
WO2021164216A1 true WO2021164216A1 (zh) 2021-08-26

Family

ID=77317602

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/108788 WO2021164216A1 (zh) 2020-02-21 2020-08-13 一种视频编码方法、装置、设备及介质

Country Status (2)

Country Link
CN (1) CN113301336A (zh)
WO (1) WO2021164216A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113784084B (zh) * 2021-09-27 2023-05-23 联想(北京)有限公司 一种处理方法及装置
CN114565966A (zh) * 2022-04-26 2022-05-31 全时云商务服务股份有限公司 一种人脸视频图像处理方法及装置

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090322915A1 (en) * 2008-06-27 2009-12-31 Microsoft Corporation Speaker and Person Backlighting For Improved AEC and AGC
US20100067766A1 (en) * 2008-09-18 2010-03-18 Siemens Medical Solutions Usa, Inc. Extracting location information using difference images from a non-parallel hole collimator
CN101931800A (zh) * 2009-06-24 2010-12-29 财团法人工业技术研究院 利用有限可变比特率控制的感兴趣区域编码方法与系统
US20130121588A1 (en) * 2011-11-14 2013-05-16 Yukinori Noguchi Method, apparatus, and program for compressing images, and method, apparatus, and program for decompressing images
CN103179405A (zh) * 2013-03-26 2013-06-26 天津大学 一种基于多级感兴趣区域的多视点视频编码方法
CN107483920A (zh) * 2017-08-11 2017-12-15 北京理工大学 一种基于多层级质量因子的全景视频评估方法及系统
CN107580217A (zh) * 2017-08-31 2018-01-12 郑州云海信息技术有限公司 编码方法及其装置

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023024519A1 (zh) * 2021-08-27 2023-03-02 华为技术有限公司 一种视频处理方法、装置、设备以及系统
CN113923476A (zh) * 2021-09-30 2022-01-11 支付宝(杭州)信息技术有限公司 一种基于隐私保护的视频压缩方法及装置
CN113923476B (zh) * 2021-09-30 2024-03-26 支付宝(杭州)信息技术有限公司 一种基于隐私保护的视频压缩方法及装置
CN114531599A (zh) * 2022-04-25 2022-05-24 中国医学科学院阜外医院深圳医院(深圳市孙逸仙心血管医院) 一种用于医疗图像存储的图像压缩方法
CN114531599B (zh) * 2022-04-25 2022-06-21 中国医学科学院阜外医院深圳医院(深圳市孙逸仙心血管医院) 一种用于医疗图像存储的图像压缩方法
CN116033189A (zh) * 2023-03-31 2023-04-28 卓望数码技术(深圳)有限公司 基于云边协同的直播互动视频分区智能控制方法和系统
CN116156196A (zh) * 2023-04-19 2023-05-23 探长信息技术(苏州)有限公司 一种用于视频数据的高效传输方法

Also Published As

Publication number Publication date
CN113301336A (zh) 2021-08-24

Similar Documents

Publication Publication Date Title
WO2021164216A1 (zh) 一种视频编码方法、装置、设备及介质
JP2020508010A (ja) 画像処理およびビデオ圧縮方法
Korshunov et al. Video quality for face detection, recognition, and tracking
WO2022067656A1 (zh) 一种图像处理方法及装置
WO2023016155A1 (zh) 图像处理方法、装置、介质及电子设备
CN111726633A (zh) 基于深度学习和显著性感知的压缩视频流再编码方法
WO2023005740A1 (zh) 图像编码、解码、重建、分析方法、系统及电子设备
CN109547803B (zh) 一种时空域显著性检测及融合方法
TWI539407B (zh) 移動物體偵測方法及移動物體偵測裝置
CN111491167B (zh) 图像编码方法、转码方法、装置、设备以及存储介质
CN116803079A (zh) 视频和相关特征的可分级译码
CN111970509B (zh) 一种视频图像的处理方法、装置与系统
Liu et al. Video quality assessment using space–time slice mappings
TWI512685B (zh) 移動物體偵測方法及其裝置
WO2022253249A1 (zh) 特征数据编解码方法和装置
CN111626178A (zh) 一种基于新时空特征流的压缩域视频动作识别方法和系统
CN113507611B (zh) 图像存储方法、装置、计算机设备和存储介质
CN114157870A (zh) 编码方法、介质及电子设备
Yuan et al. Object shape approximation and contour adaptive depth image coding for virtual view synthesis
Ko et al. An energy-quality scalable wireless image sensor node for object-based video surveillance
Chen et al. Quality-of-content (QoC)-driven rate allocation for video analysis in mobile surveillance networks
Katakol et al. Distributed learning and inference with compressed images
WO2023193629A1 (zh) 区域增强层的编解码方法和装置
US11095901B2 (en) Object manipulation video conference compression
Pourreza et al. Recognizing compressed videos: Challenges and promises

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20920088

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20920088

Country of ref document: EP

Kind code of ref document: A1