CN114157870A - Encoding method, medium, and electronic device - Google Patents

Encoding method, medium, and electronic device

Info

Publication number
CN114157870A
Authority
CN
China
Prior art keywords
image
processed
block
video
coding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111453826.XA
Other languages
Chinese (zh)
Inventor
阮小飞
吴胜华
尚峰
胡春林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ARM Technology China Co Ltd
Original Assignee
ARM Technology China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ARM Technology China Co Ltd
Priority to CN202111453826.XA
Publication of CN114157870A
Legal status: Pending (current)

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/17 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
    • H04N19/176 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a block, e.g. a macroblock
    • H04N19/102 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/103 Selection of coding mode or of prediction mode
    • H04N19/107 Selection of coding mode or of prediction mode between spatial and temporal predictive coding, e.g. picture refresh
    • H04N19/124 Quantisation
    • H04N19/134 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/167 Position within a video image, e.g. region of interest [ROI]

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The application relates to the technical field of video processing, and discloses an encoding method, a medium, and an electronic device that can improve video encoding efficiency, avoid wasting computing resources, reduce video transmission delay, and reduce bandwidth and code rate. The encoding method includes the following steps: acquiring an image to be processed; dividing the image to be processed into a plurality of blocks; detecting each block to find a target area containing a target object in each block; and encoding the image within the target area of each block according to a first encoding parameter, and encoding the image in the area of each block other than the target area according to a second encoding parameter, where the image encoding quality corresponding to the first encoding parameter is higher than the image encoding quality corresponding to the second encoding parameter. The method is particularly useful for encoding video such as high-definition live video.

Description

Encoding method, medium, and electronic device
Technical Field
The present application relates to the field of video processing technologies, and in particular, to an encoding method, a medium, and an electronic device.
Background
With the development of 5G technology, ultra-high-definition video transmission is expanding rapidly in more and more industries, such as Virtual Reality (VR), Augmented Reality (AR), and live video. Immersive video applications such as VR/AR place high requirements on resolution, real-time performance, and delay: for example, resolutions of 1080P or 4K, frame rates of 60 FPS or 120 FPS, and delays of about 30 ms. Transmitting such massive data makes the demand on network bandwidth grow sharply. Current video encoding and decoding techniques such as H.265 therefore apply a series of techniques to compress video before transmission, which can save a great deal of bandwidth.
At present, video coding and decoding standards such as H.265 and AV1 introduce a Tile block coding mechanism that encodes each Tile in a video frame independently and in parallel, thereby improving parallel processing capability and coding efficiency. With Tile blocking, only the block video of the region of interest in the picture needs to be transmitted, reducing the video transmission bandwidth requirement. In addition, a coding mechanism based on the ROI (region of interest) encodes the ROI in the image with high quality and the region outside the ROI with low quality, thereby saving bandwidth and controlling the code stream. The ROI is generally obtained by running a visual algorithm on the video frame, such as a recent object detection deep learning algorithm, to infer the ROI of an object, for example a human face region. When Tile block coding and ROI coding are used together, the ROI of an object may be split across several different Tile blocks. The part of each Tile inside the ROI region is encoded with higher quality, and the parts of each Tile image in other regions (regions other than the ROI region) are encoded with lower quality.
However, each ROI region usually contains not only a foreground object in the image (i.e., an object of interest such as a human face, a human head, a human body, or an animal) but also background around the edges of the object, and when the ROI region is large, the amount of background it contains may also be large. Therefore, even if the video encoder uses a Tile block coding mechanism together with an ROI coding mechanism to encode the different blocks of each frame, when the ROI region contains a large amount of background the video codec has to spend more computing resources encoding that background at high quality, which wastes computing resources and also causes problems such as wasted video transmission bandwidth and large delay.
Disclosure of Invention
The embodiment of the application provides an encoding method, a medium and an electronic device, which can improve video encoding efficiency, avoid the waste of computing resources, reduce video transmission delay and reduce bandwidth and code rate.
In a first aspect, an embodiment of the present application provides an encoding method, including: acquiring an image to be processed; dividing an image to be processed into a plurality of blocks; detecting each block, and detecting a target area containing a target object in each block; and coding the image in the target area in each block according to a first coding parameter, and coding the image in the area except the target area in each block image according to a second coding parameter, wherein the image coding quality corresponding to the first coding parameter is higher than the image coding quality corresponding to the second coding parameter.
The blocks in the image to be processed may be Tile blocks. The detection performed on each block may specifically be region of interest (ROI) detection, and the target region containing the target object in each block is the ROI region in that block. As an example, in the case where a person is included in the image to be processed, the target object may be the head of the person; as in FIG. 1 and FIG. 2, the target object may be the head of a person including a hat. It can be understood that the target object in the image to be processed is split across the blocks, and a block "containing" the target object means that the block contains a part of the target object, that is, a segment of the target object.
In some embodiments, the quantization parameter represented by the first coding parameter is smaller than the quantization parameter represented by the second coding parameter. For example, for luma (Luma) encoding the quantization parameter (QP) takes values of 0-51, and for chroma (Chroma) encoding the QP takes values of 0-39. It can be understood that the quantization parameter QP reflects the compression of spatial detail: the smaller the value, the finer the quantization, the higher the image quality, and the longer the generated code stream. In some embodiments, the first encoding parameter (e.g., encoding parameter 1 below) is a smaller QP value corresponding to higher-quality encoding of the ROI region, and the second encoding parameter (e.g., encoding parameter 2 below) is a larger QP value corresponding to lower-quality encoding of the non-ROI region. In this way, the encoding method provided by the embodiment of the present application detects each block individually to obtain the ROI region in each block, so that the ROI region on each block contains less background beyond the target object, thereby avoiding high-quality encoding of the background in the image as much as possible, improving encoding efficiency, and reducing the waste of computing resources in the encoding process.
In a possible implementation of the first aspect, the detecting each of the blocks to detect a target area containing a target object in each of the blocks includes: detecting the image to be processed, and detecting an original region containing the target object in the image to be processed (such as the original ROI region below, e.g. ROI region A); dividing the original region according to the plurality of blocks to obtain a plurality of sub-regions (such as the sub-ROI regions below); and detecting the image in the sub-region in each block (e.g., performing ROI detection) to detect a target region (ROI region) containing the target object in each block. It can be understood that, by performing detection on the image in the sub-ROI region of each block, the background contained in the sub-ROI region beyond the target object can be further removed to obtain the final target region on each block, so that the target region on each block contains as little background as possible.
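As a purely illustrative sketch (not part of the application itself), the sub-regions can be obtained by intersecting the rectangle of the original region with the rectangle of each block; the Python names below (Rect, intersect, split_roi_by_tiles) are hypothetical helpers, and rectangles are assumed to be given as (x0, y0, x1, y1) pixel coordinates.

    from typing import List, Optional, Tuple

    Rect = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in image coordinates

    def intersect(a: Rect, b: Rect) -> Optional[Rect]:
        """Return the overlapping rectangle of a and b, or None if they do not overlap."""
        x0, y0 = max(a[0], b[0]), max(a[1], b[1])
        x1, y1 = min(a[2], b[2]), min(a[3], b[3])
        return (x0, y0, x1, y1) if x0 < x1 and y0 < y1 else None

    def split_roi_by_tiles(original_roi: Rect, tiles: List[Rect]) -> List[Tuple[Rect, Rect]]:
        """Pair each tile that overlaps the original ROI with its sub-ROI (the overlap)."""
        pairs = []
        for tile in tiles:
            sub_roi = intersect(original_roi, tile)
            if sub_roi is not None:
                pairs.append((tile, sub_roi))
        return pairs

Blocks that do not overlap the original region simply have no sub-region and can be encoded entirely with the second encoding parameter.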
In a possible implementation of the first aspect, the detecting the image to be processed and detecting an original region containing the target object in the image to be processed includes: adjusting the image to be processed to a first image size and inputting the resized image into a first object detection model (hereinafter the first-level coarse-tuning object detection model), which outputs the original region in the image to be processed. In addition, the detecting an image in a sub-region in each block to detect a target region containing the target object in each block includes: adjusting the image in the sub-ROI region in each block to a second image size and inputting it into a second object detection model (hereinafter the second-level fine-tuning object detection model), which outputs the target region in each block. The first image size is larger than the second image size. The model structures of the first object detection model and the second object detection model can adopt open-source model structures such as YOLO, MobileNet-SSD, and CornerNet, or specially designed model structures. It can be understood that the first object detection model detects the object ROI on the whole image, and the image of that ROI is divided by Tile blocks and sent to the second-level fine-tuning object detection model, which narrows the detection range of the second-level model. The images of the ROIs in the multiple blocks can be stacked into a batch and sent to the second object detection model for the inference task, so that the computing resources of the NPU hardware are fully utilized and inference efficiency is improved. In addition, a relatively accurate ROI in each block can be detected through the two-stage object detection model, so that the ROI in each block contains as little background as possible beyond the target object of interest.
In a possible implementation of the first aspect, the image to be processed is any one frame of image in a video to be processed; when the image to be processed is an I frame in the video to be processed, each block in the image to be processed is coded according to the first coding parameter and the second coding parameter based on intra-frame coding; when the image to be processed is a P frame or a B frame in the video to be processed, each block in the image to be processed is coded according to the first coding parameter and the second coding parameter based on inter-frame coding. It is understood that, for a frame of image, the image in the finally calculated target region (i.e. the ROI region) is encoded according to the first encoding parameter, and the regions other than the target region in the image are encoded according to the second encoding parameter. In this way, the video coding efficiency can be further improved.
In a possible implementation of the first aspect, the method further includes: when the video to be processed has been coded into a video code stream based on the first coding parameter and the second coding parameter, sending the video code stream to the cloud device; the cloud device is used for sending the video code stream to the receiving device, and the receiving device is used for decoding the video code stream and playing the video to be processed. For example, the video to be processed is a video in a live-broadcast scene. It is understood that encoding the video to be processed based on the first encoding parameter and the second encoding parameter means encoding the blocks in each frame of image of the video in parallel, specifically encoding the image in the ROI region (i.e., the target region) of each block according to the first encoding parameter and encoding the image outside the ROI region according to the second encoding parameter. In this way, the encoding method provided in the embodiment of the present application uses Tile parallel encoding for each frame of video image while also using a more tightly fitted object ROI region, so that the different parts of each block are encoded with suitable encoding parameters. Therefore, high-quality coding of excessive background in each frame of image can be avoided, waste of computing resources is avoided, the coding efficiency of the video is improved, and video transmission delay, bandwidth requirements, and transmission code rate are reduced. As a result, live video in a high-definition live scene is smooth and clear, with low delay.
In a second aspect, this application provides an electronic device, including: an acquisition module, configured to acquire an image to be processed; a dividing module, configured to divide the image to be processed acquired by the acquisition module into a plurality of blocks; a detection module, configured to detect each block obtained by the dividing module and detect a target area containing a target object in each block; and an encoding module, configured to encode the image in the target area of each block detected by the detection module according to a first encoding parameter, and encode the image in the area of each block other than the target area according to a second encoding parameter, where the image encoding quality corresponding to the first encoding parameter is higher than the image encoding quality corresponding to the second encoding parameter. In some embodiments, the acquisition module may be implemented by an ISP in the electronic device, the dividing module may be implemented by a CPU in the electronic device, the detection module may be implemented by an NPU in the electronic device, and the encoding module may be implemented by a VPU in the electronic device. For example, the electronic device may be the high-definition live video capture device 31 below, the detection module may be the NPU 313, and the encoding module may be the VPU 312.
In a possible implementation of the second aspect, the detecting module is specifically configured to detect an image to be processed, and detect an original region including a target object in the image to be processed; dividing an original region into a plurality of blocks to obtain a plurality of sub-regions; and detecting the image in the sub-area in each block, and detecting a target area containing the target object in each block.
In a possible implementation of the second aspect, the detection module is specifically configured to adjust the image to be processed to a first image size and input it into the first object detection model, which outputs an original region in the image to be processed; and to adjust the image in the sub-ROI area in each block to a second image size and input it into a second object detection model, which outputs the target area in each block.
In a possible implementation of the second aspect, the image to be processed is any one frame of image in a video to be processed; when the image to be processed is an I frame in the video to be processed, each block in the image to be processed is coded according to the first coding parameter and the second coding parameter based on intra-frame coding; when the image to be processed is a P frame or a B frame in the video to be processed, each block in the image to be processed is coded according to the first coding parameter and the second coding parameter based on inter-frame coding.
In a possible implementation of the second aspect, the apparatus further includes: the transmitting module is used for transmitting the video code stream to the cloud equipment under the condition that the video to be processed is coded into the video code stream based on the first coding parameter and the second coding parameter; the cloud device is used for sending a video code stream to the receiving device, the receiving device is used for decoding the video code stream and playing a video to be processed, and the video to be processed is a live video. For example, the sending module may be implemented by a network module in the electronic device, such as the network module 314 when the electronic device is the following high definition live video capture device 31.
In a third aspect, this application provides a readable medium having stored thereon instructions that, when executed on an electronic device, cause the electronic device to perform the encoding method of the first aspect and any possible implementation manner thereof.
In a fourth aspect, this application provides an electronic device, including: a memory for storing instructions for execution by one or more processors of the electronic device, and a processor, which is one of the processors of the electronic device, for performing the encoding method of the first aspect and any possible implementation thereof.
Drawings
FIG. 1 illustrates a schematic diagram of a region of interest and a partition in an image, according to some embodiments of the present application;
FIG. 2 illustrates a schematic diagram of a region of interest and a partition in an image, according to some embodiments of the present application;
FIG. 3 illustrates a schematic diagram of a video coding scene, according to some embodiments of the present application;
FIG. 4 illustrates a schematic structural diagram of a high definition live video capture device according to some embodiments of the present application;
FIG. 5 illustrates a schematic diagram of an encoding method, according to some embodiments of the present application;
FIG. 6 illustrates a schematic diagram of an encoding method, according to some embodiments of the present application.
Detailed Description
Illustrative embodiments of the present application include, but are not limited to, encoding methods, apparatus, media and devices.
For convenience of description, some nouns or terms in the embodiments of the present application will be explained first.
1. Tile (chunking):
An image can be divided into Tiles, that is, the image is divided into rectangular regions along the horizontal and vertical directions, and each rectangular region is a Tile. The main purposes of Tile division are to enhance the parallel processing capability of video coding and to control the code stream, so that only the coded data of the blocks of interest is transmitted; no complex thread synchronization is needed when it is used. Tile division does not require the horizontal and vertical boundaries to be evenly distributed, and can be flexibly controlled according to the requirements of parallel computing and error control.
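For illustration only (this layout is not mandated by the application, which allows uneven boundaries), a uniform grid of N rows and M columns over a W x H image could be computed as in the following hypothetical Python sketch:

    def make_tile_grid(width: int, height: int, rows: int, cols: int):
        """Divide a width x height image into rows x cols rectangular tiles.

        Returns a list of (x0, y0, x1, y1) rectangles that exactly cover the image.
        """
        tiles = []
        for r in range(rows):
            for c in range(cols):
                tiles.append((c * width // cols, r * height // rows,
                              (c + 1) * width // cols, (r + 1) * height // rows))
        return tiles

    # Example: a 1920x1080 frame divided into 2 rows and 2 columns yields 4 tiles.
    print(make_tile_grid(1920, 1080, 2, 2))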
2. Region of interest (ROI):
In the field of image processing, a region of interest (ROI) is an image region selected from an image; it is the focus of image analysis and is delineated so that the target object in the region (such as a human face, a human head, a human body, or an animal) can be processed further, reducing processing time and increasing accuracy. The ROI may be a region delineated from the image to be processed by a rectangle, a circle, an ellipse, an irregular polygon, or the like. For example, the ROI of an image can be obtained by various object detection algorithms commonly used in machine vision software, or detected using a deep learning algorithm, after which the image is processed further. The ROI region is usually rectangular, while the target object selected by the ROI box is usually irregular in shape, so the original rectangular ROI region contains not only the target object but also some background around the target object (i.e., the background along the inner edge of the rectangular box).
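As a small illustrative calculation (not taken from the application; the names are hypothetical), the share of background inside a rectangular ROI can be estimated from a binary object mask:

    import numpy as np

    def background_fraction(object_mask: np.ndarray, roi) -> float:
        """Fraction of pixels inside the rectangular ROI (x0, y0, x1, y1) that are
        background, given a boolean mask where True marks the target object."""
        x0, y0, x1, y1 = roi
        window = object_mask[y0:y1, x0:x1]
        return 1.0 - float(window.mean())

The tighter the rectangle fits the irregular object, the smaller this fraction, which is exactly what the per-block refinement described later aims at.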
3. Intra-frame coding:
the intra-frame coding is space domain coding, image compression is carried out by utilizing the spatial redundancy of images, and an independent image is processed and does not span multiple images. Spatial domain coding depends on the similarity between adjacent pixels in an image and the dominant spatial domain frequency of the pattern area.
4. Inter-frame coding:
Inter-frame coding is temporal coding, which exploits the temporal redundancy between a set of consecutive pictures to compress images. If the decoder already has one frame of picture, it only needs the difference between the two frames to obtain the next frame. For example, images with smooth motion are highly similar and differ little, while images with violent motion are less similar and differ more. Temporal coding relies on the similarity between successive image frames, using the already received and processed image information to "predict" the current image as much as possible.
5. Compression algorithm:
as an example, the video compression algorithm includes steps 1) -4).
1) Grouping: several frames of images are combined into a Group of Pictures (GOP); the number of frames should not be too large, so as to limit motion variation within the group. A GOP is a group of continuous pictures, that is, a sequence.
2) Frame definition: each frame image in each group is defined as one of three types, i.e., I frame, B frame, and P frame.
3) Frame prediction: the I frame is used as the base frame, the P frame is predicted from the I frame, and the B frame is predicted from the I frame and the P frame.
4) Data transmission: finally, the I frame data and the predicted difference information are stored and transmitted.
The I frame is also called an intra-frame (also called a key frame), the P frame is also called an inter-frame (forward reference frame), and the B frame is also called a bidirectional prediction frame (bidirectional reference frame).
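As a purely illustrative sketch (the application does not prescribe a particular GOP structure; the GOP length and B-frame count below are arbitrary example values), frame types within a sequence might be assigned as follows:

    def assign_frame_types(num_frames: int, gop_size: int = 8, b_frames: int = 2):
        """Assign I/P/B types in display order: each GOP starts with an I frame,
        followed by repeating runs of b_frames B frames and one P frame."""
        types = []
        for i in range(num_frames):
            pos = i % gop_size
            if pos == 0:
                types.append("I")
            elif pos % (b_frames + 1) == 0:
                types.append("P")
            else:
                types.append("B")
        return types

    # assign_frame_types(9) -> ['I', 'B', 'B', 'P', 'B', 'B', 'P', 'B', 'I']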
The coding method provided by the application can be applied to the fields of automobiles, embedded equipment, video monitoring, smart home, medical treatment, Internet of things and the like.
More specifically, the encoding method may be applied to some full high-definition video transmission scenarios, such as compression encoding of video images related to video transmission scenarios such as Virtual Reality (VR), Augmented Reality (AR), Mixed Reality (MR), 1080P, 4K, and the like.
Embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
In general, in ROI-based video coding scenes, conventional coding methods independently detect a ROI region of a complete image of a frame using an AI object detection algorithm, which may be referred to as an original ROI region. In the case of parallel encoding of different blocks (i.e., tiles) in a frame image using a Tile block encoding scheme, an original ROI region is divided into different blocks, a portion of each block that is in the original ROI region is high-quality encoded, and a portion of each block that is not in the original ROI region is low-quality encoded. It will be appreciated that the quality of the image after low quality encoding is lower, e.g. the resolution is lower, than the quality of the image after high quality encoding.
As an example, as shown in FIG. 1, image A is divided into four blocks A1, A2, A3, and A4 of two rows and two columns by the Tile block coding scheme. Assume that the target object in image A is a person's head (including the person's hat and face). The conventional encoding method identifies the complete head of the person in image A and determines the ROI region A (the original ROI region) corresponding to the complete head. That is, ROI region A frames the entire head of the person in image A. Subsequently, the conventional encoding method described above may directly divide ROI region A across the blocks A1, A2, A3, and A4 to obtain ROI regions a1, a2, a3, and a4, respectively, and then high-quality encode the images in ROI regions a1, a2, a3, and a4 in the respective blocks, that is, high-quality encode everything inside the original ROI region A in image A.
However, since the target object in an image (i.e., the object of interest) is usually irregular in shape while the ROI region usually has a regular shape such as a rectangle, the ROI region needs to include all the edge points of the target object in order for the encoded image to contain as many features of the target object as possible. For example, if the lower edge of the hat of the person shown in FIG. 1 extends further down than the person's chin, the right edge of the hat extends further than the person's right face, and ROI region A is rectangular, then for the encoded image to contain the entire head of the person, ROI region A has to contain additional background beyond the head, such as more background below the chin and more background to the right of the right face.
As an example, the object segment (i.e., the partial head) on block A4 shown in FIG. 1 is only a part of the target object (i.e., the whole human head); therefore the lower border of ROI region a4 in block A4 extends beyond the straight line n on which the chin edge point lies, and the right border of ROI region a4 extends beyond the straight line q on which the right-face edge point lies, i.e., ROI region a4 contains considerable background besides the human face.
Therefore, when the ROI of a video is coded using the conventional ROI video coding method, part of the background is included in the ROI area and is also encoded with high quality, causing problems such as wasted video bandwidth and a large transmission code rate.
In order to solve the above problem, in the encoding method provided in the embodiment of the present application, when the video encoder encodes the different blocks (i.e., Tiles) of a frame of image using the Tile block encoding mechanism, ROI detection is performed on each block, that is, an AI object detection algorithm is run on each block to obtain its ROI area. Because the ROI detection is performed on the block in which the region of the object of interest (i.e., the target object) is located, only that region needs to be detected; for example, when ROI detection is performed on block A4 in FIG. 1, only the region of the human head needs to be detected. The edge of the ROI area on each block therefore fits tightly against the edge of the person's head in that block, a more accurate ROI area in each block is obtained, and the image in the ROI area of each block contains less background beyond the target object. Furthermore, when the video encoder performs high-quality encoding on the ROI in each block, because the image in the ROI of each block contains less background, compared with the encoding scheme shown in FIG. 1 the encoding method provided by the present application can avoid performing high-quality encoding on excessive background, thereby improving encoding efficiency, avoiding waste of computing resources, and bringing certain benefits in reducing video transmission delay and in reducing bandwidth and code rate.
As an example, as shown in FIG. 2, the encoding method provided in the embodiment of the present application may divide image A into four blocks A1, A2, A3, and A4 of two rows and two columns, detect ROI regions a1, a2, a3, and a4, and then detect ROI regions b1, b2, b3, and b4 on the basis of ROI regions a1, a2, a3, and a4, respectively.
Specifically, in the embodiment of the present application, the ROI-based AI object detection algorithm employs a 2-level model mechanism, which includes a first-level coarse object detection model and a second-level fine object detection model, and implements ROI detection from coarse to fine.
It can be understood that the coding method provided by the present application designs a 2-level neural-network object detection model, which includes a first-level coarse-tuning object detection model (also called the first-level model) and a second-level fine-tuning object detection model (also called the second-level model). The first-level coarse-tuning object detection model aims to obtain a rough object bounding-box ROI on the original video, and the second-level fine-tuning object detection model refines the ROI obtained by the first-level model to obtain an accurate ROI.
In addition, the model structures in the 2-level model mechanism can adopt open-source model structures such as YOLO, MobileNet-SSD, and CornerNet, or specially designed model structures. The training data set can be an open-source data set or self-collected and labeled data. During training, the model is made to converge on the data set so that the detection precision meets the application requirement.
Specifically, as shown in FIG. 2, for image A the encoding method provided in the embodiment of the present application proceeds as follows: the original image is adjusted to a fixed size such as 300 x 300 (pixels) and fed into the first-level coarse-tuning object detection model to obtain the ROI rectangular box of the target object, i.e., ROI region A; image A is divided into different blocks according to the preset Tile block parameters, for example the four blocks A1, A2, A3, and A4 of two rows and two columns; then the images contained within each block by the ROI region detected by the first-level model (i.e., the images in ROI regions a1, a2, a3, a4) are individually adjusted to a fixed size, e.g. 48 x 48 (pixels), and fed into the second-level fine-tuning object detection model to obtain more accurate ROI regions. In this manner, ROI region b1 on block A1, ROI region b2 on block A2, ROI region b3 on block A3, and ROI region b4 on block A4 are detected.
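A minimal sketch of this coarse-to-fine flow, assuming hypothetical model callables coarse_model and fine_model that take an image array and return one ROI box in normalized coordinates of their input, and reusing the intersect helper sketched earlier; the 300 x 300 and 48 x 48 sizes follow the example in the text, and OpenCV is assumed to be available for resizing:

    import cv2  # assumption: OpenCV is available for resizing and cropping

    def coarse_to_fine_rois(image, tiles, coarse_model, fine_model):
        """Detect a coarse ROI on the whole frame, split it across tiles, then
        refine each per-tile sub-ROI with the second-level model."""
        h, w = image.shape[:2]
        nx0, ny0, nx1, ny1 = coarse_model(cv2.resize(image, (300, 300)))  # hypothetical call
        roi_a = (int(nx0 * w), int(ny0 * h), int(nx1 * w), int(ny1 * h))

        fine_rois = []
        for tile in tiles:
            sub = intersect(roi_a, tile)  # see the earlier sketch
            if sub is None:
                continue
            crop = image[sub[1]:sub[3], sub[0]:sub[2]]
            rx0, ry0, rx1, ry1 = fine_model(cv2.resize(crop, (48, 48)))  # hypothetical call
            cw, ch = sub[2] - sub[0], sub[3] - sub[1]
            fine_rois.append((sub[0] + int(rx0 * cw), sub[1] + int(ry0 * ch),
                              sub[0] + int(rx1 * cw), sub[1] + int(ry1 * ch)))
        return fine_rois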
It can be understood that when a frame of image is divided into a plurality of blocks by the Tile block coding mechanism, the division of the target object in the image onto the plurality of blocks usually generates a plurality of segments (i.e. segments of the target object).
As an example, dividing the head of the person (i.e., the target object) across the blocks A1, A2, A3, and A4 shown in FIG. 2 forms one segment on each block; for example, the right portion of the lower half of the person's face is the segment assigned to block A4.
For example, as shown in FIG. 2, the segment on block A4 is the part of the face from the nose to the chin, and ROI region b4, which frames this segment, contains little background outside the face. Specifically, in FIG. 2 the lower border of ROI region b4 extends only slightly beyond the straight line n on which the chin edge point lies, and its right border extends only slightly beyond the straight line q on which the right-face edge point lies. It is apparent that ROI region b4 shown in FIG. 2 contains less background beyond the human face than ROI region a4 shown in FIG. 1.
Similarly, comparing ROI regions b1, b2, and b3 with regions a1, a2, and a3, respectively, ROI regions b1, b2, b3, and b4 all contain less background outside the human face; that is, ROI regions b1, b2, b3, and b4 together contain less background overall than ROI region A.
Therefore, the encoding method provided by the embodiment of the application detects each block in the image individually to obtain the ROI area in each block, so that the ROI area on each block contains less background beyond the segment of the target object, thereby avoiding high-quality encoding of the background in the image as much as possible, improving encoding efficiency, and reducing the waste of computing resources in the encoding process.
In some embodiments, the encoding method provided by the embodiments of the present application may be applied to electronic devices that support video compression formats such as the h.265 standard and the AV1 standard. More specifically, in some embodiments, electronic devices suitable for use with the present application include, but are not limited to: a high definition live video capture device, a television or set-top box, a wearable device, an AR device, a VR device, an MR device, a digital direct broadcast system, a wireless communication device, a Personal Digital Assistant (PDA), a laptop or desktop computer, a cell phone, a tablet computer, a digital camera, a digital recording device, a video game device, a cellular or satellite radio telephone, and the like.
In the following embodiments, an electronic device that executes the encoding method in the application is taken as an example of a high-definition live video acquisition device.
As shown in FIG. 3, the video encoding and decoding scene provided in the embodiment of the present application includes a high-definition live video capture device 31 and a high-definition live video terminal device 32.
The high-definition live video capture device 31 captures high-definition video such as 1080P or 4K, encodes the captured video to obtain a video code stream, and distributes the video code stream to the high-definition live video terminal device 32 through a network such as the cloud; the high-definition live video terminal device 32 then decodes and displays the video code stream.
Specifically, the high-definition live video capture device 31 includes an Image Signal Processor (ISP) 311, a video codec (VPU, also called Video Processor) 312, a Neural-Network Processing Unit (NPU) 313, a network module 314, a memory 315, a display unit 316, a Central Processing Unit (CPU) 317, and a camera 318.
Wherein camera 318 is used to capture images. The object generates an optical image through the lens and projects the optical image to the photosensitive element. The light sensing elements convert the optical signals into electrical signals, which are then passed to the ISP 311 for conversion into digital image signals. It can be understood that the high-definition live video capture device 31 can implement a video shooting function through the ISP 311, the camera 318, the VPU 312, the display unit 316, and the like.
The VPU 312 is a core engine of a video processing platform, and has a hardware codec function, which can reduce the CPU load of the device. For example, in the embodiment of the present application, the VPU 312 is configured to encode a video to obtain a video code stream, so as to implement video compression encoding.
The NPU 313 is an architecture specially designed for deep learning inference and can be used to process multimedia data such as video and images. For example, in the present application the NPU is used to detect ROI regions of an object using AI techniques, including the object ROI of the complete image and the object ROIs of the respective segments (i.e., the ROI regions of the object segments).
The display unit 316 is used for displaying human-computer interaction interfaces, images, videos and the like, for example, videos collected by the camera 318.
The memory 315 is used for storing instructions, data, and the like executed during execution of the encoding method by the CPU 317, ISP 311, VPU 312, NPU 313, and the like.
The CPU 317 is used to encode video in cooperation with ISP 311, VPU 312, NPU 313, memory 315, etc. As an example, the CPU 317 is configured to control the ISP 311, VPU 312, NPU 313 to perform a video encoding method.
The network module 314 is configured to send the video code stream obtained by encoding to the cloud. The network module 314 may provide a solution for wireless communication including a Wireless Local Area Network (WLAN) (e.g., a wireless fidelity (Wi-Fi) network), a mobile communication network, and the like, which is applied to the high-definition live video capture device 31. The network module 314 may communicate with networks and other devices via wireless communication techniques. The cloud device on the cloud end can be a server and the like.
It should be understood that the schematic structure of the embodiment of the present invention does not form a specific limitation on the high definition live video capture device 31. In other embodiments of the present application, the high definition live video capture device 31 may include more or fewer components than those shown, or some components may be combined, some components may be separated, or a different arrangement of components may be used. For example, the high definition live video capture device 31 may further include a sensor module, an audio module, and the like. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
In some embodiments, as shown in fig. 4, a schematic diagram of a possible structure of the high definition video live capturing device 31 is shown. The apparatus shown in fig. 4 includes a network module 314, a video codec (VPU)312, a neural Network Processor (NPU)313, a display unit 316, a Central Processing Unit (CPU)317 and a memory 315, and a bus 319 connecting these devices.
The high-definition live video terminal device 32 includes a network module 321, a VPU 322, a memory 323, a display unit 324, and a CPU 325.
The network module 321 receives the video code stream from the high-definition live video capture device 31 through the cloud, and the video code stream is decoded by the VPU 322 and then live broadcast by the display unit 324.
And the memory 323 is used to store instructions executed during execution of the encoding method by a processor such as the CPU 325, the VPU 322, or the like. The CPU 325 is used for decoding and playing the video in cooperation with the VPU 322, the memory 323, and the like.
It is to be understood that the configuration illustrated in the embodiment of the present invention does not constitute a specific limitation to the high-definition video live terminal device 32.
In the following, according to some embodiments of the present application, in combination with the description of the above scenario and the description of the high definition live video capture device 31, a workflow of each unit in the high definition live video capture device 31 to execute an encoding method will be described as an example of an encoding process of a frame of image. The technical details described in the above scenario are still applicable in this process, and some details are not repeated here in order to avoid repetition. As shown in fig. 5, a schematic flowchart of a method for encoding a frame of image provided in the embodiments of the present application is shown, where the method specifically includes:
step 501: the high-definition live video capture device 31 pre-configures Tile blocking parameters, for example, the Tile blocking parameters are N rows and M columns, which means that an image is divided into N rows and M columns of Tile blocks.
For example, in conjunction with FIG. 2, for image A the pre-configured Tile block parameters may be two rows and two columns (N = M = 2), that is, image A may be divided into the four blocks A1, A2, A3, and A4. Each block is a sub-image, which may also be referred to as a block sub-image.
In some embodiments, for a frame of an image, the parameters of Tile tiles may include coordinates of each Tile relative to the image, such as coordinates of the four vertices of each rectangular Tile on the image.
As an example, the step 501 may be executed by the CPU 317 in the high definition live video capture device 31.
Step 502: the high-definition live video acquisition device 31 operates a primary rough adjustment object detection model through the NPU 313, inputs the whole image and deduces to obtain the ROI-A.
As can be appreciated, ROI-A is a roughly estimated ROI area over the entire image that is input into the primary coarse object detection model. Taking fig. 2 as an example, when the input image a of the object detection model is roughly adjusted at one stage, the obtained ROI area a (i.e., ROI-a) is inferred.
Taking fig. 2 as an example, before inputting the image a in the primary coarse object detection model, the NPU 313 or ISP 311 may also adjust the image a to a fixed size, such as 300 × 300.
In some embodiments, when the ROI regions detected by the NPU 313 over the entire image to be processed are rectangular, the position of an ROI region on the image can be identified by the coordinates of the four vertices of that rectangular ROI region.
Step 503: the high-definition video live broadcast acquisition device 31 divides the ROI-A into a plurality of sub ROIs according to the configured Tile blocks, and extracts the image of each sub ROI.
As an example, the step 503 may be executed by the CPU 317 in the high definition live video capture device 31 and the ISP 311.
For example, in conjunction with FIG. 2, for image A the high-definition live video capture device 31 directly divides ROI region A (i.e., ROI-A) across the blocks, obtaining the corresponding ROI regions a1, a2, a3, and a4 on blocks A1, A2, A3, and A4, respectively. At this point, ROI regions a1, a2, a3, and a4 are the plurality of sub-ROIs mentioned above. The high-definition live video capture device 31 may then extract the images in ROI regions a1, a2, a3, and a4 from image A, respectively.
ROI regions b1, b2, b3, and b4, i.e., the ROI regions in which the target object segments on each block are located, are detected on the images in ROI regions a1, a2, a3, and a4, respectively. Accordingly, the NPU 313 may treat the other regions on blocks A1, A2, A3, and A4, outside these ROI regions (i.e., ROI regions b1, b2, b3, and b4), as non-ROI regions.
In some embodiments, when the NPU 313 detects that the sub-ROIs on the respective segments of the image to be processed are rectangular, the position of an ROI region on the corresponding segment can also be identified by the coordinates of the four vertices of the ROI region of each rectangle, which are of course coordinates relative to the segment or relative to the image.
It is understood that, in the embodiment of the present application, when a plurality of target objects (e.g., a plurality of human heads) are included in one image, a plurality of ROI-A regions can be detected by the first-level coarse-tuning object detection model. Accordingly, some blocks of the image may contain segments of different target objects, and the division yields sub-ROIs on those blocks belonging to different ROI-A regions, i.e., the sub-ROIs in which the segments of the different target objects are located. For example, if one image contains person 1 and person 2, the ROI-A in which person 1 is located and the ROI-A in which person 2 is located can both be detected by the first-level coarse-tuning object detection model. In this case, if one block of the image contains both a segment of person 1 and a segment of person 2, sub-ROIs of the two segments in the different ROI-A regions can be obtained by the division.
Step 504: the high-definition video live broadcast acquisition device 31 adjusts the image of each sub-ROI to the same size and stacks the image as the input of a secondary fine-tuning object detection model, the NPU 313 operates the secondary fine-tuning object detection model, and the accurate ROI-B on the image of each sub-ROI is obtained through reasoning.
Taking fig. 2 as an example, the high definition live video capture device 31 may adjust the images in the ROI areas a1, a2, a3, and a4 in the image a to the same size, such as 48 × 48, respectively. Also, the ROI regions B1, B2, B3 and B4 corresponding to the patches a1, a2, A3 and a4, respectively, i.e., the precise ROI-B on each patch, can be detected on the basis of the images in the ROI regions a1, a2, A3 and a4 by the two-level fine-tuning object detection model.
Wherein, the region except ROI-B on each block of the image to be processed is called non-ROI region. Then the regions in the image to be processed other than ROI-B on all patches are non-ROI regions in the image.
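An illustrative sketch of stacking the resized sub-ROI crops into one batch for a single second-level inference call (the callable fine_model_batch and its normalized box outputs are hypothetical assumptions, not an API defined by the application):

    import numpy as np
    import cv2  # assumption: OpenCV is available for resizing

    def refine_rois_batched(image, sub_rois, fine_model_batch, size=48):
        """Crop each sub-ROI, resize to size x size, stack into one batch, and run
        the second-level model once so the NPU processes all crops together."""
        crops = [cv2.resize(image[y0:y1, x0:x1], (size, size))
                 for (x0, y0, x1, y1) in sub_rois]
        batch = np.stack(crops, axis=0)            # shape: (N, size, size, C)
        boxes = fine_model_batch(batch)            # hypothetical: N boxes, normalized to the crop
        refined = []
        for (x0, y0, x1, y1), (bx0, by0, bx1, by1) in zip(sub_rois, boxes):
            w, h = x1 - x0, y1 - y0
            refined.append((x0 + int(bx0 * w), y0 + int(by0 * h),
                            x0 + int(bx1 * w), y0 + int(by1 * h)))
        return refined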
In some embodiments, the present application embodiment trains the deep learning model with a pre-collected and labeled training data set before using the first-level coarse object detection model and the second-level fine object detection model, so that the model can perform the above steps. Wherein the images in the training dataset may be of a predetermined size (e.g., 100 x 100, or 48 x 48) and pre-marked with the ROI.
Specifically, in some embodiments, images in the training data set are input to the object detection deep learning models (e.g., the first-level coarse object detection model and the second-level fine object detection model), which output ROI position and score (confidence) information by inference. The loss function of the model can be the sum of the localization loss and the classification loss (see, for example, MobileNet-SSD). The model is continuously fed new training data; after many iterations the model converges, the loss function stabilizes at a low value, and the accuracy on the evaluation data set reaches the expected target.
As an example, the object detection deep learning model can employ an open-source model such as YOLO, MobileNet-SSD, or CornerNet, or a specially designed network architecture. The training data set of the first-level coarse object detection model may be an open-source data set, such as the WIDER FACE data set for face detection; a self-collected data set labeled with object-box ROIs can also be used. The training data set of the second-level fine object detection model can be derived from that of the first-level model: the whole image is cut into pieces by random Tile blocking, and the labeled ROI is clipped according to the Tile as well. In addition, to obtain a more accurate ROI, the labeled ROI can be re-labeled manually after segmentation, so as to obtain an accurate ROI region.
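Purely as an illustration of the random-Tile cropping just described (all names are hypothetical, and make_tile_grid / intersect refer to the earlier sketches), a training sample for the second-level model could be derived from a labeled full image like this:

    import random

    def make_level2_sample(image, labeled_roi, rows=2, cols=2):
        """Pick a random tile, clip the labeled ROI to that tile, and return the
        tile crop with its clipped label in tile-local coordinates (to be
        re-checked manually when a tighter box is wanted, as the text suggests)."""
        h, w = image.shape[:2]
        tile = random.choice(make_tile_grid(w, h, rows, cols))
        clipped = intersect(labeled_roi, tile)
        if clipped is None:
            return None                    # this tile contains no part of the object
        x0, y0, x1, y1 = tile
        crop = image[y0:y1, x0:x1]
        label = (clipped[0] - x0, clipped[1] - y0, clipped[2] - x0, clipped[3] - y0)
        return crop, label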
Step 505: the high-definition live video capture device 31 encodes the image in the ROI-B in each Tile block in the image according to the encoding parameter 1 through the VPU 312, and encodes the image in the non-ROI area in each Tile block according to the encoding parameter 2.
Wherein, the non-ROI area in each Tile block is the area except the ROI-B area in the image.
It is understood that the encoding parameter 1 may be an encoding parameter corresponding to the ROI region for high quality encoding of the image data, i.e., the encoding parameter 1 corresponds to a higher image encoding quality. And the encoding parameter 2 may be an encoding parameter corresponding to a non-ROI region for performing lower quality encoding on the image data, i.e., the encoding parameter 2 corresponds to lower image encoding quality.
It can be understood that adjusting the encoding parameters mainly means adjusting the Quantization Parameter (QP): for luminance (Luma) encoding the QP takes values of 0-51, and for chrominance (Chroma) encoding the QP takes values of 0-39. The quantization parameter QP reflects the compression of spatial detail; the smaller the value, the finer the quantization, the higher the image quality, and the longer the generated code stream. If the QP is small, most details in the image are preserved; as the QP increases, some details are lost and the code rate decreases, but image distortion increases and quality degrades. For example, for luma coding, the minimum QP value of 0 represents the finest quantization; conversely, the maximum QP value of 51 indicates the coarsest quantization. Of course, the quantization parameters represented by the coding parameters provided in the embodiment of the present application include, but are not limited to, the above examples.
In some embodiments, coding parameter 1 is a smaller QP value corresponding to higher-quality coding of the ROI region, and coding parameter 2 is a larger QP value corresponding to lower-quality coding of the non-ROI region. That is, the quantization parameter represented by coding parameter 1 is smaller than the quantization parameter represented by coding parameter 2. The QP values can be selected according to the actual code rate requirement and the system bandwidth requirement.
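As a hedged sketch of choosing the two coding parameters (the concrete QP values and bitrate thresholds below are arbitrary examples, not values specified by the application):

    def select_qps(target_bitrate_kbps: int):
        """Pick a lower QP for ROI blocks (finer quantization, higher quality) and a
        higher QP for non-ROI blocks; tighter bitrate budgets push both values up."""
        if target_bitrate_kbps >= 8000:        # generous bandwidth
            roi_qp, non_roi_qp = 22, 32
        elif target_bitrate_kbps >= 4000:
            roi_qp, non_roi_qp = 26, 38
        else:                                  # constrained bandwidth
            roi_qp, non_roi_qp = 30, 45
        assert 0 <= roi_qp < non_roi_qp <= 51  # luma QP range mentioned in the text
        return roi_qp, non_roi_qp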
It can be understood that the first-stage coarse object detection model is used for detecting an object ROI on the whole image, and the image of the ROI is partitioned by Tile and then sent to the second-stage fine object detection model, so that the detection range of the second-stage fine object detection model can be reduced. Images of the ROI in the partitioned tiles can be stacked into batch, and the batch is sent into a secondary fine-tuning object detection model to perform inference tasks, so that the calculation resources of NPU hardware can be fully utilized, and the inference efficiency is improved. In addition, a relatively accurate ROI in each Tile can be detected through a two-stage object detection model, so that the ROI in each Tile contains as little background as possible except a target object of interest. Therefore, high-quality coding on excessive background can be avoided, so that the coding efficiency can be improved, the waste of computing resources can be avoided, and certain benefits can be realized in the aspects of reducing video transmission delay and reducing bandwidth and code rate.
In general, video coding mainly includes intra-frame coding (I frames) and inter-frame coding (P/B frames). The first-level and second-level object detection models may act on each original frame to obtain the accurate ROI corresponding to that frame. If the current frame is to be processed as an I frame, it is coded as an I frame using the ROI corresponding to the current frame; if the current frame is to be processed as a P/B frame, it is coded as a P/B frame using the ROI corresponding to the current frame.
Further, referring to the encoding method shown in fig. 5, the following describes a workflow of each unit in the high definition live video capture device 31 executing the encoding method, taking an encoding process of a plurality of frames of video images in a video as an example. As shown in fig. 6, a schematic flow chart of a method for encoding multiple frames of video images provided in the embodiment of the present application is shown, where the method specifically includes:
step 601: the high-definition video live broadcast acquisition device 31 inputs video.
Wherein, the video is collected in real time by the high-definition live broadcast collecting device through the camera 318. As an example, the video may be a live video.
Step 602: the high-definition video live broadcast acquisition device 31 acquires a frame of image.
Step 603: the high-definition live video capture device 31 pre-configures Tile blocking parameters, for example, the Tile blocking parameters are N rows and M columns, which means that an image is divided into N rows and M columns of Tile blocks.
Step 604: the NPU 313 operates a primary coarse adjustment object detection model, inputs the whole image and deduces to obtain ROI-A.
Step 605: the high-definition video live broadcast acquisition device 31 divides the ROI-A into a plurality of sub ROIs according to the configured Tile blocks, and extracts the image of each sub ROI.
Step 606: The high-definition live video capture device 31 resizes the image of each sub-ROI to the same size and stacks them as the input of the second-stage fine object detection model; the NPU 313 runs the second-stage fine object detection model and infers the accurate ROI-B in each sub-ROI image.
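A sketch of step 606 is shown below, assuming the frame is held as a NumPy array and OpenCV is available for resizing; fine_model stands in for the NPU inference call and is not an API defined by this application.

```python
import numpy as np
import cv2  # assumed available only for resizing; any resize routine would do


def batch_fine_detect(frame, sub_rois, fine_model, input_size=(224, 224)):
    """Crop every sub-ROI from the frame, resize all crops to one common size,
    stack them into a batch, and run the second-stage model once on the batch."""
    crops = []
    for x, y, w, h in sub_rois:
        crop = frame[y:y + h, x:x + w]            # frame is an H x W x C array
        crops.append(cv2.resize(crop, input_size))
    if not crops:
        return []
    batch = np.stack(crops, axis=0)               # (num_sub_rois, H, W, C)
    # fine_model stands in for the NPU 313 inference call; it is assumed to
    # return one refined ROI-B box per crop, in crop-local coordinates.
    return fine_model(batch)
```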
Step 607: The high-definition live video capture device 31 performs intra-frame or inter-frame coding on the current frame through the VPU 312 according to the coding standard, coding the image inside ROI-B in each Tile block according to coding parameter 1 and the image in the non-ROI area of each Tile block according to coding parameter 2.
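Step 607 can be pictured as building a per-macroblock QP map for each Tile block, as in the sketch below; the 16x16 granularity and the encoder.encode_block interface are hypothetical, since the actual VPU 312 interface is not described here.

```python
def encode_tile(encoder, tile_image, roi_b_boxes, qp_roi, qp_background):
    """Encode one Tile block: pixels inside ROI-B with coding parameter 1
    (qp_roi) and the remaining pixels with coding parameter 2 (qp_background)."""
    mb = 16                                    # assumed macroblock/CTU granularity
    h, w = tile_image.shape[:2]
    rows, cols = (h + mb - 1) // mb, (w + mb - 1) // mb
    qp_map = [[qp_background] * cols for _ in range(rows)]
    for x, y, bw, bh in roi_b_boxes:           # boxes in tile-local coordinates
        for r in range(y // mb, min(rows, (y + bh + mb - 1) // mb)):
            for c in range(x // mb, min(cols, (x + bw + mb - 1) // mb)):
                qp_map[r][c] = qp_roi          # lower QP -> higher quality
    # encoder.encode_block is a hypothetical VPU interface that accepts a
    # per-macroblock QP map; real encoder APIs differ.
    return encoder.encode_block(tile_image, qp_map)
```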
For the details of steps 603-607, refer to steps 501-506 in the above embodiment; the same parts are not repeated. The difference is that, during encoding, intra-frame encoding is used if the current frame is an I-frame, and inter-frame encoding is used if the current frame is a P-frame or B-frame. More specifically, intra-frame coding is applied to Tile blocks belonging to I-frames, and inter-frame coding is applied to Tile blocks belonging to P-frames or B-frames.
Step 608: The high-definition live video capture device 31 acquires the next frame of image and repeats steps 603 to 607.
Further, in a live video scene, after the high-definition live video capture device 31 encodes the captured video, the network module 314 may send the encoded video code stream to the cloud device. The cloud device may then send the received video code stream to the high-definition live video terminal device 32. The high-definition live video terminal device 32 receives the video code stream through the network module 321, decodes it through the VPU 322, and then plays the decoded video through the display unit 324 to complete the live video broadcast.
In the encoding method provided by the embodiments of the present application, Tile parallel encoding is used in the encoding of each frame of video image, and at the same time the more accurate ROI of the object of interest allows the different parts of each block to be encoded with suitable encoding parameters. High-quality coding of excessive background in each frame is thereby avoided, computing resources are not wasted, the coding efficiency of the video is improved, and video transmission delay, bandwidth requirements and transmission code rate are reduced. As a result, the live video broadcast in the high-definition live video scene is smooth and clear, with low delay.
Embodiments of the mechanisms disclosed herein may be implemented in hardware, software, firmware, or a combination of these implementations. Embodiments of the application may be implemented as computer programs or program code executing on programmable systems comprising at least one processor, a storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.
Program code may be applied to input instructions to perform the functions described herein and generate output information. The output information may be applied to one or more output devices in a known manner. For purposes of this application, a processing system includes any system having a processor such as, for example, a Digital Signal Processor (DSP), a microcontroller, an Application Specific Integrated Circuit (ASIC), or a microprocessor.
The program code may be implemented in a high level procedural or object oriented programming language to communicate with a processing system. The program code can also be implemented in assembly or machine language, if desired. Indeed, the mechanisms described in this application are not limited in scope to any particular programming language. In any case, the language may be a compiled or interpreted language.
In some cases, the disclosed embodiments may be implemented in hardware, firmware, software, or any combination thereof. The disclosed embodiments may also be implemented as instructions carried by or stored on one or more transitory or non-transitory machine-readable (e.g., computer-readable) storage media, which may be read and executed by one or more processors. For example, the instructions may be distributed via a network or via other computer-readable media. Thus, a machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer), including, but not limited to, floppy diskettes, optical disks, compact disc read-only memories (CD-ROMs), magneto-optical disks, read-only memories (ROMs), random access memories (RAMs), erasable programmable read-only memories (EPROMs), electrically erasable programmable read-only memories (EEPROMs), magnetic or optical cards, flash memory, or tangible machine-readable storage used in transmitting information over the Internet via electrical, optical, acoustical or other forms of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.). Thus, a machine-readable medium includes any type of machine-readable medium suitable for storing or transmitting electronic instructions or information in a form readable by a machine (e.g., a computer).
In the drawings, some structural or methodical features may be shown in a particular arrangement and/or order. However, it is to be understood that such specific arrangement and/or ordering may not be required. Rather, in some embodiments, the features may be arranged in a manner and/or order different from that shown in the illustrative figures. In addition, the inclusion of a structural or methodical feature in a particular figure does not imply that such feature is required in all embodiments; in some embodiments, the feature may not be included or may be combined with other features.
It should be noted that, in the apparatus embodiments of the present application, each unit/module is a logical unit/module. Physically, one logical unit/module may be one physical unit/module, may be a part of one physical unit/module, or may be implemented by a combination of multiple physical units/modules; the physical implementation of the logical unit/module itself is not the most important point, and the combination of the functions implemented by these logical units/modules is the key to solving the technical problem addressed by the present application. Furthermore, in order to highlight the innovative part of the present application, the above apparatus embodiments do not introduce units/modules that are not closely related to solving the technical problem addressed by the present application; this does not mean that no other units/modules exist in the above apparatus embodiments.
It is noted that, in the examples and descriptions of this patent, relational terms such as first and second are used solely to distinguish one entity or action from another and do not necessarily require or imply any actual such relationship or order between these entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprises a" does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
While the present application has been shown and described with reference to certain preferred embodiments thereof, it will be understood by those of ordinary skill in the art that various changes in form and details may be made therein without departing from the spirit and scope of the present application.

Claims (12)

1. A method of encoding, comprising:
acquiring an image to be processed;
dividing the image to be processed into a plurality of blocks;
detecting each block, and detecting a target area containing a target object in each block;
and coding the image in the target area in each block according to a first coding parameter, and coding the image in the area except the target area in each block according to a second coding parameter, wherein the image coding quality corresponding to the first coding parameter is higher than the image coding quality corresponding to the second coding parameter.
2. The method according to claim 1, wherein the detecting each of the blocks to detect a target area containing a target object in each of the blocks comprises:
detecting the image to be processed, and detecting an original region containing the target object in the image to be processed;
dividing the original region into a plurality of blocks to obtain a plurality of sub-regions;
and detecting the image in the sub-area in each block, and detecting a target area containing the target object in each block.
3. The method of claim 2,
the detecting the image to be processed and detecting an original region containing the target object in the image to be processed includes:
adjusting the image to be processed to a first image size, inputting the adjusted image into a first object detection model, and outputting, by the first object detection model, the original region in the image to be processed;
the detecting an image in a sub-region in each of the blocks to detect a target region including the target object in each of the blocks includes:
and adjusting the image in the sub-area in each block to a second image size, inputting the adjusted images into a second object detection model, and outputting, by the second object detection model, the target area in each block.
4. The method according to claim 3, wherein the image to be processed is any one frame of image in the video to be processed;
when the image to be processed is an I frame in a video to be processed, each block in the image to be processed is coded according to the first coding parameter and the second coding parameter based on intra-frame coding;
and when the image to be processed is a P frame or a B frame in the video to be processed, each block in the image to be processed is coded according to the first coding parameter and the second coding parameter based on inter-frame coding.
5. The method of claim 4, further comprising:
under the condition that the video to be processed is coded into a video code stream based on the first coding parameter and the second coding parameter, the video code stream is sent to a cloud device;
the cloud device is used for sending the video code stream to a receiving device, and the receiving device is used for decoding the video code stream and playing the video to be processed.
6. An electronic device, comprising:
the acquisition module is used for acquiring an image to be processed;
the dividing module is used for dividing the image to be processed acquired by the acquiring module into a plurality of blocks;
the detection module is used for detecting each block obtained by the division of the division module and detecting a target area containing a target object in each block;
and the encoding module is used for encoding the image in the target area in each block detected by the detection module according to a first encoding parameter and encoding the image in the area except the target area in each block according to a second encoding parameter, wherein the image encoding quality corresponding to the first encoding parameter is higher than the image encoding quality corresponding to the second encoding parameter.
7. The apparatus of claim 6,
the detection module is specifically configured to detect the image to be processed, and detect an original region including the target object in the image to be processed;
dividing the original region into a plurality of blocks to obtain a plurality of sub-regions;
and detecting the image in the sub-area in each block, and detecting a target area containing the target object in each block.
8. The apparatus of claim 7,
the detection module is specifically configured to adjust the image to be processed to a first image size, input the adjusted image into a first object detection model, and output, by the first object detection model, the original region in the image to be processed; and to adjust the image in the sub-area in each block to a second image size, input the adjusted images into a second object detection model, and output, by the second object detection model, the target area in each block.
9. The apparatus according to claim 8, wherein the image to be processed is any frame of image in a video to be processed;
when the image to be processed is an I frame in a video to be processed, each block in the image to be processed is coded according to the first coding parameter and the second coding parameter based on intra-frame coding;
and when the image to be processed is a P frame or a B frame in the video to be processed, each block in the image to be processed is coded according to the first coding parameter and the second coding parameter based on inter-frame coding.
10. The apparatus of claim 9, further comprising:
the sending module is used for sending the video code stream to a cloud device under the condition that the video to be processed is coded into the video code stream based on the first coding parameter and the second coding parameter;
the cloud device is used for sending the video code stream to a receiving device, the receiving device is used for decoding the video code stream and playing the video to be processed, and the video to be processed is live video.
11. A readable medium, characterized in that it has stored thereon instructions which, when executed on an electronic device, cause the electronic device to carry out the encoding method of any one of claims 1 to 5.
12. An electronic device, comprising: a memory for storing instructions for execution by one or more processors of the electronic device, and a processor, which is one of the processors of the electronic device, for performing the encoding method of any one of claims 1 to 5.
CN202111453826.XA 2021-12-01 2021-12-01 Encoding method, medium, and electronic device Pending CN114157870A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111453826.XA CN114157870A (en) 2021-12-01 2021-12-01 Encoding method, medium, and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111453826.XA CN114157870A (en) 2021-12-01 2021-12-01 Encoding method, medium, and electronic device

Publications (1)

Publication Number Publication Date
CN114157870A true CN114157870A (en) 2022-03-08

Family

ID=80455569

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111453826.XA Pending CN114157870A (en) 2021-12-01 2021-12-01 Encoding method, medium, and electronic device

Country Status (1)

Country Link
CN (1) CN114157870A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114401404A (en) * 2022-03-24 2022-04-26 深圳比特微电子科技有限公司 VPU (virtual private Unit) coding strategy implementation method and device
CN114401404B (en) * 2022-03-24 2022-07-26 深圳比特微电子科技有限公司 VPU (virtual private Unit) coding strategy implementation method and device
CN116074464A (en) * 2023-01-16 2023-05-05 石家庄纵宇科技有限公司 Online video communication system and communication method thereof
CN116074464B (en) * 2023-01-16 2023-12-15 深圳普汇智为科技有限公司 Online video communication system and communication method thereof

Similar Documents

Publication Publication Date Title
CN108737837B (en) Method for forming video stream and image processing unit
US9402034B2 (en) Adaptive auto exposure adjustment
Babu et al. A survey on compressed domain video analysis techniques
US9215467B2 (en) Analytics-modulated coding of surveillance video
Gao et al. The IEEE 1857 standard: Empowering smart video surveillance systems
US10506248B2 (en) Foreground detection for video stabilization
CN114157870A (en) Encoding method, medium, and electronic device
JP2008048243A (en) Image processor, image processing method, and monitor camera
US11263261B2 (en) Method and system for characteristic-based video processing
KR102161888B1 (en) Method and encoder for encoding a video stream in a video coding format supporting Auxiliary frames
CN112543330B (en) Encoding method, system and storage medium for fuzzy privacy masking
US20150062371A1 (en) Encoding apparatus and method
US9736485B2 (en) Encoding apparatus, encoding method, and image capture apparatus
CN111683248B (en) ROI-based video coding method and video coding system
CN111542858B (en) Dynamic image analysis device, system, method, and storage medium
CN114245114B (en) Method and apparatus for intra-coding image frames
US20140362927A1 (en) Video codec flashing effect reduction
CN114422788A (en) Digital retina video joint coding method, decoding method, device and electronic equipment
CN113810692A (en) Method for framing changes and movements, image processing apparatus and program product
CN112165619A (en) Method for compressed storage of surveillance video
US20190306525A1 (en) Method, device and system for method of encoding a sequence of frames in a video stream
JP5173946B2 (en) Encoding preprocessing device, encoding device, decoding device, and program
JP2000078572A (en) Object encoding device, frame omission control method for object encoding device and storage medium recording program
Rajakaruna et al. Application-aware video coding architecture using camera and object motion-models
CN116708933B (en) Video coding method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination