CN114567778B - Video coding method and system - Google Patents

Video coding method and system

Info

Publication number
CN114567778B
Authority
CN
China
Prior art keywords
interest
pixel
coding
pixels
representing
Prior art date
Legal status
Active
Application number
CN202210450047.2A
Other languages
Chinese (zh)
Other versions
CN114567778A (en)
Inventor
黄震坤
岑裕
Current Assignee
Beijing Yunzhong Rongxin Network Technology Co ltd
Original Assignee
Beijing Yunzhong Rongxin Network Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Yunzhong Rongxin Network Technology Co ltd filed Critical Beijing Yunzhong Rongxin Network Technology Co ltd
Priority to CN202210450047.2A
Publication of CN114567778A
Application granted
Publication of CN114567778B
Legal status: Active
Anticipated expiration

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/146: Data rate or code amount at the encoder output
    • H04N19/149: Data rate or code amount at the encoder output by estimating the code amount by means of a model, e.g. mathematical model or statistical model
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/167: Position within a video image, e.g. region of interest [ROI]

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Algebra (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention relates to the technical field of multimedia video image information processing and discloses a video coding method and system. The video coding method comprises the following steps: converting an input video stream into an RGB image; performing target detection and motion detection on the RGB image to identify target pixels and motion pixels of the RGB image; performing fusion processing on the target pixels and the motion pixels to determine pixels of interest and pixels of non-interest in the RGB image; determining a region of interest and a non-region-of-interest according to the distribution of the pixels of interest and the pixels of non-interest; and allocating coding rates to the region of interest and the non-region-of-interest according to a set target code rate. The invention keeps the QP of the region of interest smaller and thereby improves the definition of the region of interest.

Description

Video coding method and system
Technical Field
The present invention relates to the field of multimedia video image information processing technologies, and in particular, to a video encoding method and system.
Background
Currently, more and more conferences are moving from offline to online. Network conferences generally require high-definition image quality, while network bandwidth changes continuously because of network complexity, which brings more and more challenges to video compression and transmission. High-definition and ultra-high-definition video compression coding is therefore an indispensable technical means, and the performance and complexity of video compression coding directly influence the application range and potential of high-definition and ultra-high-definition video.
Therefore, increasing the compression ratio of video coding and reducing its complexity while maintaining a certain video quality is a generally pursued goal. In the prior art, coding standards such as H.264, HEVC and VVC have appeared in succession, but the coding speed still cannot meet the actual requirements of high-definition and ultra-high-definition video compression. Chinese patent publication No. CN113115043A discloses a video encoder, a video encoding system and a video encoding method, which mainly adopt a scheme in which each frame of image is encoded jointly by a plurality of video encoders, thereby reducing encoding time, reducing encoding delay and realizing real-time encoding of a high-definition video source.
Real-time audio and video technology is a terminal service that provides the industry with full-scene, fully interactive and fully real-time audio and video services featuring high concurrency, low delay, high-definition smoothness, safety and reliability. However, with real-time audio and video, decoding and playback must still proceed under high-definition video coding even when bandwidth is limited. If the encoder increases the QP, the picture becomes blurred; region-of-interest-based encoding is one of the methods for solving this kind of problem. For example, Chinese patent publication No. CN106162177A discloses a video encoding method and apparatus that determine the region of interest by identifying moving objects and perform high-fidelity encoding by smoothing filtering; as another example, Chinese patent publication No. CN103297754A discloses a surveillance-video adaptive region-of-interest coding system that uses ROI detection and H.264 coding to achieve a compromise between H.264-based data compression and high-quality storage of key information. Therefore, in the process of video compression and transmission, how to maintain the definition of the region of interest while the code rate remains unchanged, so as to reduce network bandwidth occupation and allow users to enjoy watching ultra-high-definition video at low bandwidth/network speed, has become a problem to be solved urgently.
Disclosure of Invention
In view of the above defects or shortcomings in the prior art, the present invention provides a video encoding method and system that extract the region of interest in a video by combining target detection with motion detection and then allocate the coding rate based on game theory.
In an aspect of the present invention, there is provided a video encoding method, including:
converting an input video stream into an RGB image;
performing target detection and motion detection on the RGB image to identify target pixels and motion pixels of the RGB image;
performing fusion processing on the target pixel and the motion pixel to determine an interested pixel and a non-interested pixel in the RGB image;
determining an interested area and a non-interested area according to the distribution of the interested pixels and the non-interested pixels;
and distributing coding rates for the interested regions and the non-interested regions according to the set target code rate.
Further, the step of allocating coding rate to the interested region and the non-interested region according to the set target coding rate comprises:
calculating the code rate of the region of interest as the value that minimizes the rate-allocation objective, wherein D1 is the R-D function of the region of interest and D2 is the R-D function of the non-region-of-interest; a weight represents the overall coding quality; M represents the number of coding tree units of the region of interest; N represents the number of coding tree units of the non-region-of-interest; the objective further depends, for the i-th coding tree unit, on its coding complexity (together with the coding complexity of the (i-1)-th coding tree unit), its number of bits per pixel and its number of pixels, as well as on the set target code rate, the code rate of the region of interest, a first constant, a second constant, model parameters with respective initial settings, the natural logarithms of the corresponding model terms, the number of bits (or the total number of pixels) occupied by the compressed RGB image, and the number of bits actually consumed.
Further, the motion detection of the RGB image includes:
taking a Gaussian mixture model (GMM) as a background model of a static scene without intruding objects, and taking the pixels in the current RGB image that do not match the background model as the motion pixels.
Further, the fusion process includes:
if the pixel in the RGB image belongs to the target pixel and the motion pixel at the same time, the pixel is judged as the interested pixel.
Further, the step of determining the regions of interest and the regions of non-interest based on the distribution of the pixels of interest and the pixels of non-interest comprises:
if the proportion of the interested pixels in all the pixels of the coding block exceeds or equals to a set proportion threshold value, the coding block is an interested area, otherwise, the coding block is a non-interested area.
In another aspect of the present invention, there is provided a video encoding system including:
a conversion module configured to convert an input video stream into an RGB image;
a detection module configured to perform target detection and motion detection on the RGB image to identify target pixels and motion pixels of the RGB image;
a fusion module configured to perform fusion processing on the target pixel and the motion pixel to determine a pixel of interest and a non-pixel of interest in the RGB image;
a determination module configured to determine regions of interest and regions of non-interest from the distribution of the pixels of interest and the pixels of non-interest;
and the code rate allocation module is configured to allocate coding code rates to the interested region and the non-interested region according to the set target code rate.
Further, the code rate allocation module is further configured to:
calculate the code rate of the region of interest as the value that minimizes the rate-allocation objective, wherein D1 is the R-D function of the region of interest and D2 is the R-D function of the non-region-of-interest; a weight represents the overall coding quality; M represents the number of coding tree units of the region of interest; N represents the number of coding tree units of the non-region-of-interest; the objective further depends, for the i-th coding tree unit, on its coding complexity (together with the coding complexity of the (i-1)-th coding tree unit), its number of bits per pixel and its number of pixels, as well as on the set target code rate, the code rate of the region of interest, a first constant, a second constant, model parameters with respective initial settings, the natural logarithms of the corresponding model terms, the number of bits (or the total number of pixels) occupied by the compressed RGB image, and the number of bits actually consumed.
Further, the detection module is further configured to:
taking a Gaussian mixture model (GMM) as a background model of a static scene without intruding objects, and taking the pixels in the current RGB image that do not match the background model as the motion pixels.
Further, the fusion module is further configured to:
if a pixel in the RGB image belongs to both the target pixel and the motion pixel, the pixel is determined to be the pixel of interest.
Further, the determination module is further configured to:
if the proportion of the interested pixel in all the pixels of the coding block exceeds or equals to a set proportion threshold value, the coding block is an interested area, otherwise, the coding block is a non-interested area.
According to the video coding method and system, the interesting region in the video is extracted by adopting a mode of combining target detection and motion detection, and the coding code rate is distributed by adopting a mode of combining game theory, so that the QP of the interesting region is smaller, and the definition of the interesting region is improved.
Drawings
Other features, objects and advantages of the present invention will become more apparent upon reading of the detailed description of non-limiting embodiments thereof, made with reference to the following drawings:
fig. 1 is a flowchart of a video encoding method according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a video coding system according to an embodiment of the present invention;
fig. 3 is a schematic composition diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the description of the invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be understood that although the terms first, second, third, etc. may be used to describe the acquisition modules in embodiments of the present invention, these acquisition modules should not be limited to these terms. These terms are only used to distinguish the acquisition modules from each other.
The word "if" as used herein may be interpreted as "at … …" or "when … …" or "in response to a determination" or "in response to a detection", depending on the context. Similarly, the phrase "if determined" or "if detected (a stated condition or event)" may be interpreted as "upon determining" or "in response to determining" or "upon detecting (a stated condition or event)" or "in response to detecting (a stated condition or event)" depending on the context.
It should be noted that the terms "upper," "lower," "left," "right," and the like used in the description of the embodiments of the present invention are illustrated in the drawings, and should not be construed as limiting the embodiments of the present invention. In addition, in this context, it is also to be understood that when an element is referred to as being "on" or "under" another element, it can be directly formed on "or" under "the other element or be indirectly formed on" or "under" the other element through an intermediate element.
One embodiment of the present invention provides a video coding method, which can greatly reduce network bandwidth occupation by combining a region of interest ROI with a video coding technology, so that a user can enjoy watching of an ultra high definition video at a low bandwidth/network speed.
Referring to fig. 1, the video encoding method of the present embodiment includes two parts, namely, target detection and rate allocation, and specifically includes the following steps:
step S101, converting an input video stream into an RGB image;
specifically, in this embodiment, an original yuv video stream is acquired from a camera, and a video file is defined as input. A 264 encoder used in WebRTC and open source OpenH264 provided by cisco will be described as an example. In the encoder, input.yuv video data input is converted into an RGB image.
Step S102, carrying out target detection and motion detection on the RGB image to identify target pixels and motion pixels of the RGB image;
object detection is an image segmentation based on object geometry and statistical features. The method combines the segmentation and the identification of the target into a whole, and the accuracy and the real-time performance of the method are important capabilities of the whole system. The present embodiment adopts the YOLOv4 model for target detection. The YOLOv4 model designs a powerful and efficient detection model, and the model can be trained by 1080 Ti and 2080 Ti, which is an ultrafast and accurate model. In the detection model training stage, the detection model verifies the effects of some most advanced Bag-of-freebes and Bag-of-Specials methods, and a plurality of SOTA methods are modified to make the single GPU training more efficient, such as CBN, PAN, SAM and the like. A complete YOLOv4 model includes: CSPDarknet53 (backbone) + SPP + PAN (Neck, i.e. feature enhancement module) + YoloV 3. The YOLOv4 model uses "comp" techniques of CutMix, Mosaic data enhancement, DropBlock regularization, label smoothing, CIoU-loss, CmBN, self-confrontation training, each object assigned to multiple anchors. The "special" skills used include: mish activation, cross-phase space Connectivity (CSP), multiple-input-weight residual connectivity, SPP-block, SAM-block, PAN, DIoU-NMS. The input to YOLOv4 is the original image and the output is the detected target pixel.
In this embodiment, motion detection extracts the motion region with a Gaussian mixture model. A Gaussian mixture model is a probability model that represents a distribution as a mixture of K sub-distributions; in other words, it represents the probability distribution of the observed data in the population as a mixture composed of K sub-distributions. The Gaussian mixture model does not require the observed data to indicate which sub-distribution they come from in order to compute their probability under the overall distribution. It can be regarded as a combination of K single Gaussian models, which act as the hidden variables of the mixture. In principle, any probability distribution can be used as the mixture component; Gaussians are used here because of their good mathematical properties and computational performance.
The invention adopts a Gaussian mixture model (GMM) to detect the motion region. In a monitoring system the shooting background is a fixed scene with little change, and a static scene without intruding objects has regular characteristics that can be described by a background model. The GMM describes the background features as a weighted sum of several mixed Gaussian models, i.e., it serves as the background model. Pixels in the current RGB image that do not match the background model are taken as motion pixels, i.e., they identify intruding objects; pixels in the current RGB image that match the background model are taken as background pixels.
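A minimal sketch of the mixture-of-Gaussians background subtraction described above, using OpenCV's built-in MOG2 background subtractor; the history length and variance threshold are assumptions.

```python
import cv2

# MOG2 maintains a per-pixel mixture of Gaussians as the background model.
bg_model = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16, detectShadows=False)

def motion_pixel_mask(rgb):
    """Return a boolean mask of pixels that do not match the background model."""
    fg = bg_model.apply(rgb)   # 255 where the pixel deviates from the learned background
    return fg > 0
```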
Step S103, carrying out fusion processing on the target pixel and the motion pixel to determine an interested pixel and a non-interested pixel in the RGB image;
specifically, a binary fusion mode is adopted as a region fusion method, that is, if a pixel in an RGB image belongs to both a target pixel and a motion pixel, the pixel is determined as an interested pixel. In other words, for a pixel, if the pixel belongs to both the target pixel detected by the target detection model and the motion pixel detected by the gaussian mixture model, the pixel is determined as the pixel of interest, otherwise, the pixel is determined as the non-pixel of interest.
Step S104, determining an interested area and a non-interested area according to the distribution of the interested pixels and the non-interested pixels;
specifically, since video encoding employs block-based compression, the block-based compression unit is not a single pixel, but is a 4 × 4, 8 × 8, or 16 × 16 block. In h.264, the macroblock 16 × 16 scheme is used for encoding and compression. There are situations where the pixel of interest cannot fully fill the macroblock, in which case a scaling threshold needs to be determined. When the proportion of the interested pixels of a macro block reaches or exceeds a set proportion threshold value, the macro block is considered to be the interested macro block or the interested area. The judgment rule of this embodiment is: if the pixel of interest percentage of a coding block exceeds or equals to 80% of the whole macroblock pixels, the coding block is defined as the macroblock of interest/region of interest.
And step S105, distributing coding rate to the interested region and the non-interested region according to the set target code rate.
The present embodiment employs a game theory based model in the rate allocation scheme.
The coding quality of the region of interest acts as the leader and the coding quality of the non-region-of-interest acts as the follower: under the set target code rate, the leader determines the code rate allocated to the region of interest and the follower determines the code rate allocated to the non-region-of-interest. The utility of the region of interest depends not only on the region itself but also affects the coding quality of the whole RGB image, while the non-region-of-interest can only achieve its optimal utility with the remaining code rate.
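The exact objective and R-D functions are those given by the formulas of this embodiment; purely to illustrate the leader-follower idea, the sketch below assumes a simple hyperbolic R-D model D(R) = c * R**(-k) for each region and finds the leader's code rate R1 numerically by minimizing a weighted sum of the two distortions under the target code rate, leaving the remainder to the follower. The model form, the constants and the use of SciPy are assumptions, not the formulas of the patent.

```python
from scipy.optimize import minimize_scalar

def allocate_rates(R_target, w=0.7, c1=1.0, k1=1.2, c2=1.0, k2=1.2):
    """Split R_target into (R1 for the ROI, R2 for the non-ROI).

    Assumed hyperbolic R-D model: D(R) = c * R**(-k) for each region.
    The ROI acts as the leader: it picks R1 to minimize the weighted
    overall distortion, and the non-ROI (follower) gets the remainder.
    """
    def overall_distortion(r1):
        d_roi = c1 * r1 ** (-k1)                 # distortion of the region of interest
        d_bg = c2 * (R_target - r1) ** (-k2)     # distortion of the non-region-of-interest
        return w * d_roi + (1.0 - w) * d_bg      # weight w favours the ROI

    eps = 1e-3 * R_target
    res = minimize_scalar(overall_distortion, bounds=(eps, R_target - eps), method="bounded")
    r1 = res.x
    return r1, R_target - r1

if __name__ == "__main__":
    r1, r2 = allocate_rates(R_target=2000.0)     # e.g. 2000 kbit/s total
    print(f"ROI rate: {r1:.1f}, non-ROI rate: {r2:.1f}")
```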
Specifically, the code rate R1 of the region of interest is computed as the value that minimizes the rate-allocation objective, for example by taking the partial derivative of the objective with respect to R1 and solving for the value at which the partial derivative equals zero. In this objective, D1 is the R-D function of the region of interest and D2 is the R-D function of the non-region-of-interest; a weight represents the overall coding quality; M represents the number of coding tree units of the region of interest; N represents the number of coding tree units of the non-region-of-interest; the R-D terms are written separately for the region of interest and for the non-region-of-interest. For the i-th coding tree unit, the model uses its coding complexity (updated from the coding complexity of the (i-1)-th coding tree unit), its number of bits per pixel and its number of pixels; the coding complexity is a parameter that is continuously updated according to the video content. The set target code rate and the code rate R1 of the region of interest enter the model together with a first constant and a second constant, which in this embodiment are set to 0.1 and 0.05, respectively. One model parameter basically stabilizes around 1 and is set to 1 in practice; its initial value defaults to 1.367. The initial value of the coding complexity defaults to 3.2003. The parameter updates use the natural logarithms of the corresponding model terms, the number of bits (or the total number of pixels) occupied by the compressed RGB image, and the number of bits actually consumed.
The video coding method provided by the embodiment can enable the QP of the region of interest to be smaller, and improves the definition of the region of interest.
Referring to fig. 2, another embodiment of the present invention further provides a video encoding system 200, which includes a conversion module 201, a detection module 202, a fusion module 203, a determination module 204, and a rate allocation module 205. The video coding system 200 is configured to perform the method steps in the above-described method embodiments.
Specifically, the method comprises the following steps:
a conversion module 201 configured to convert an input video stream into an RGB image;
a detection module 202 configured to perform target detection and motion detection on the RGB image to identify target pixels and motion pixels of the RGB image;
a fusion module 203 configured to perform fusion processing on the target pixel and the motion pixel to determine a pixel of interest and a non-pixel of interest in the RGB image;
a determination module 204 configured to determine regions of interest and regions of non-interest based on the distribution of the pixels of interest and the pixels of non-interest;
and the code rate allocation module 205 is configured to allocate coding code rates to the interested region and the non-interested region according to the set target code rate.
Further, the code rate allocation module 205 is further configured to compute the code rate R1 of the region of interest as the value that minimizes the rate-allocation objective, for example by taking the partial derivative of the objective with respect to R1 and solving for the value at which the partial derivative equals zero, wherein D1 is the R-D function of the region of interest and D2 is the R-D function of the non-region-of-interest; a weight represents the overall coding quality; M represents the number of coding tree units of the region of interest; N represents the number of coding tree units of the non-region-of-interest; for the i-th coding tree unit the objective uses its coding complexity (together with the coding complexity of the (i-1)-th coding tree unit), its number of bits per pixel and its number of pixels, as well as the set target code rate, the code rate of the region of interest, a first constant, a second constant, model parameters with respective initial settings, the natural logarithms of the corresponding model terms, the number of bits (or the total number of pixels) occupied by the compressed RGB image, and the number of bits actually consumed.
Further, the detection module 202 is configured to: take a Gaussian mixture model (GMM) as a background model of a static scene without intruding objects, and take the pixels in the current RGB image that do not match the background model as the motion pixels.
Further, the fusion module 203 is configured to: if the pixel in the RGB image belongs to the target pixel and the motion pixel at the same time, the pixel is judged as the interested pixel.
Further, the determination module 204 is configured to: if the proportion of the interested pixels in all the pixels of the coding block exceeds or equals to a set proportion threshold value, the coding block is an interested area, otherwise, the coding block is a non-interested area.
It should be noted that, the video coding system 200 provided in this embodiment is corresponding to a technical solution that can be used to implement each method embodiment, and the implementation principle and technical effect are similar to those of the method, and are not described herein again.
The invention further provides electronic equipment for executing the method embodiment. Referring now specifically to fig. 3, a schematic diagram of a structure suitable for implementing the electronic device 300 in the present embodiment is shown. The electronic device 300 in the present embodiment may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), a wearable electronic device, and the like, and a stationary terminal such as a digital TV, a desktop computer, a smart home device, and the like. The electronic device shown in fig. 3 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 3, the electronic device 300 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 301 that may perform various appropriate actions and processes to implement the methods of the various embodiments as described herein, according to a program stored in a Read Only Memory (ROM) 302 or a program loaded from a storage device 308 into a Random Access Memory (RAM) 303. In the RAM 303, various programs and data necessary for the operation of the electronic apparatus 300 are also stored. The processing device 301, the ROM 302, and the RAM 303 are connected to each other via a bus 304. An input/output (I/O) interface 305 is also connected to bus 304.
Generally, the following devices may be connected to the I/O interface 305: input devices 306 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, or the like; an output device 307 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage devices 308 including, for example, magnetic tape, hard disk, etc.; and a communication device 309. The communication means 309 may allow the electronic device 300 to communicate with other devices, wireless or wired, to exchange data. While fig. 3 illustrates an electronic device 300 having various means, it is to be understood that not all illustrated means are required to be implemented or provided, and that more or fewer means may be alternatively implemented or provided.
The above description is that of the preferred embodiment of the invention only. It will be appreciated by those skilled in the art that the scope of the disclosure herein is not limited to the particular combination of features described above, but also encompasses other embodiments in which any combination of the features described above or their equivalents is encompassed without departing from the spirit of the disclosure. For example, the above features and (but not limited to) features having similar functions disclosed in the present invention are mutually replaced to form the technical solution.

Claims (8)

1. A video encoding method, comprising:
converting an input video stream into an RGB image;
carrying out target detection and motion detection on the RGB image so as to identify target pixels and motion pixels of the RGB image;
performing fusion processing on the target pixel and the motion pixel to determine an interested pixel and a non-interested pixel in the RGB image;
determining regions of interest and regions of non-interest according to the distribution of the pixels of interest and the pixels of non-interest;
allocating coding rates to the interested region and the non-interested region according to a set target code rate, which specifically comprises the following steps:
calculating the code rate of the region of interest as the value that minimizes the rate-allocation objective, wherein D1 is the R-D function of the region of interest, D2 is the R-D function of the non-region-of-interest, and a weight represents the overall coding quality; the objective uses, for the i-th coding tree unit, its coding complexity (together with the coding complexity of the (i-1)-th coding tree unit), its number of bits per pixel and its number of pixels, and, for the j-th coding tree unit, its coding complexity (together with the coding complexity of the (j-1)-th coding tree unit), its number of bits per pixel and its number of pixels; it further uses the number of coding tree units of the non-region-of-interest, the set target code rate, the code rate of the region of interest, a first constant, a second constant, model parameters with respective initial settings, the natural logarithms of the corresponding model terms, the number of bits (or the total number of pixels) occupied by the compressed RGB image, and the number of bits actually consumed.
2. A video coding method according to claim 1, wherein the motion detection of the RGB image comprises:
taking a Gaussian mixture model GMM as a background model of a static scene without intruding objects; and taking pixels in the current RGB image which do not match the background model as motion pixels.
3. The video coding method according to claim 1, wherein the fusing the target pixel and the motion pixel comprises:
and if the pixel in the RGB image belongs to the target pixel and the motion pixel at the same time, judging the pixel as the pixel of interest.
4. The video coding method according to claim 1, wherein the step of determining regions of interest and regions of non-interest based on the distribution of the pixels of interest and the pixels of non-interest comprises:
if the ratio of the interested pixel in all the pixels of the coding block exceeds or is equal to a set proportion threshold value, the coding block is an interested area, otherwise, the coding block is a non-interested area.
5. A video coding system, comprising:
a conversion module configured to convert an input video stream into an RGB image;
a detection module configured to perform target detection and motion detection on the RGB image to identify target pixels and motion pixels of the RGB image;
a fusion module configured to perform fusion processing on the target pixel and the motion pixel to determine a pixel of interest and a non-pixel of interest in the RGB image;
a determination module configured to determine regions of interest and regions of non-interest from the distribution of the pixels of interest and the pixels of non-interest;
a code rate allocation module configured to allocate coding code rates to the region of interest and the non-region of interest according to a set target code rate, including:
calculating the code rate of the region of interest as the value that minimizes the rate-allocation objective, wherein D1 is the R-D function of the region of interest, D2 is the R-D function of the non-region-of-interest, and a weight represents the overall coding quality; the objective uses, for the i-th coding tree unit, its coding complexity (together with the coding complexity of the (i-1)-th coding tree unit), its number of bits per pixel and its number of pixels, and, for the j-th coding tree unit, its coding complexity (together with the coding complexity of the (j-1)-th coding tree unit), its number of bits per pixel and its number of pixels; it further uses the number of coding tree units of the non-region-of-interest, the set target code rate, the code rate of the region of interest, a first constant, a second constant, model parameters with respective initial settings, the natural logarithms of the corresponding model terms, the number of bits (or the total number of pixels) occupied by the compressed RGB image, and the number of bits actually consumed.
6. The video coding system of claim 5, wherein the detection module is further configured to:
taking a Gaussian mixture model GMM as a background model of a static scene without intruding objects; and taking pixels in the RGB image which do not match the background model as motion pixels.
7. A video coding system according to claim 5, wherein the fusion module is further configured to:
and if the pixel in the RGB image belongs to the target pixel and the motion pixel at the same time, determining the pixel as the pixel of interest.
8. The video coding system of claim 5, wherein the determining module is further configured to:
if the proportion of the interested pixel in all the pixels of the coding block exceeds or is equal to a set proportion threshold value, the coding block is an interested area, otherwise, the coding block is a non-interested area.
CN202210450047.2A 2022-04-24 2022-04-24 Video coding method and system Active CN114567778B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210450047.2A CN114567778B (en) 2022-04-24 2022-04-24 Video coding method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210450047.2A CN114567778B (en) 2022-04-24 2022-04-24 Video coding method and system

Publications (2)

Publication Number Publication Date
CN114567778A CN114567778A (en) 2022-05-31
CN114567778B (en) 2022-07-05

Family

ID=81721068

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210450047.2A Active CN114567778B (en) 2022-04-24 2022-04-24 Video coding method and system

Country Status (1)

Country Link
CN (1) CN114567778B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6061100A (en) * 1997-09-30 2000-05-09 The University Of British Columbia Noise reduction for video signals
CN101742321A (en) * 2010-01-12 2010-06-16 浙江大学 Layer decomposition-based Method and device for encoding and decoding video
CN101916448A (en) * 2010-08-09 2010-12-15 云南清眸科技有限公司 Moving object detecting method based on Bayesian frame and LBP (Local Binary Pattern)
CN107396108A (en) * 2017-08-15 2017-11-24 西安万像电子科技有限公司 Code rate allocation method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8467570B2 (en) * 2006-06-14 2013-06-18 Honeywell International Inc. Tracking system with fused motion and object detection

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Video coding acceleration and intelligent bit allocation based on deep learning; Shi Jun (石隽); Excellent Master's Dissertations; 2021-01-15; full text *

Also Published As

Publication number Publication date
CN114567778A (en) 2022-05-31

Similar Documents

Publication Publication Date Title
US10944996B2 (en) Visual quality optimized video compression
CN111918066B (en) Video encoding method, device, equipment and storage medium
US9013536B2 (en) Augmented video calls on mobile devices
CN106303157B (en) Video noise reduction processing method and video noise reduction processing device
JP6109956B2 (en) Utilize encoder hardware to pre-process video content
CN112102212B (en) Video restoration method, device, equipment and storage medium
WO2021164216A1 (en) Video coding method and apparatus, and device and medium
CN109698957B (en) Image coding method and device, computing equipment and storage medium
CN111182303A (en) Encoding method and device for shared screen, computer readable medium and electronic equipment
CN102572502B (en) Selecting method of keyframe for video quality evaluation
US20190045203A1 (en) Adaptive thresholding for computer vision on low bitrate compressed video streams
Yang et al. An objective assessment method based on multi-level factors for panoramic videos
CN112954398B (en) Encoding method, decoding method, device, storage medium and electronic equipment
US11290345B2 (en) Method for enhancing quality of media
US20150117515A1 (en) Layered Encoding Using Spatial and Temporal Analysis
CN112435244A (en) Live video quality evaluation method and device, computer equipment and storage medium
CN111524110B (en) Video quality evaluation model construction method, evaluation method and device
WO2023160617A9 (en) Video frame interpolation processing method, video frame interpolation processing device, and readable storage medium
CN113784118A (en) Video quality evaluation method and device, electronic equipment and storage medium
CN110766637A (en) Video processing method, processing device, electronic equipment and storage medium
CN113068034A (en) Video encoding method and device, encoder, equipment and storage medium
CA3182110A1 (en) Reinforcement learning based rate control
CN114554211A (en) Content adaptive video coding method, device, equipment and storage medium
CN103929640A (en) Techniques For Managing Video Streaming
CN106603885B (en) Method of video image processing and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant