WO2023005740A1 - Image encoding, decoding, reconstruction, and analysis methods, system, and electronic device

Info

Publication number: WO2023005740A1
Application number: PCT/CN2022/106507
Authority: WO
Inventors: 张翠姗; 李正光; 林建良; 金鉴; 孟丽丽; 陈思恩; 林维斯; 陈卓
Original assignee: 阿里巴巴（中国）有限公司; 南洋理工大学
Priority date: 2021-07-28
Filing date: 2022-07-19
Publication date: 2023-02-02
Also published as: CN113660486A

Abstract

Provided in the embodiments of the present application are image encoding, decoding, reconstruction, and analysis methods, a system, and an electronic device, the image encoding method comprising: acquiring an original image; extracting high-level semantic information of the original image, the high-level semantic information being used for an image analysis task of the original image; extracting a low-level feature of the original image, the low-level feature and the high-level semantic information being used for an image reconstruction task of the original image; encoding the high-level semantic information to generate a first code stream; and encoding the low-level feature to generate a second code stream. In the embodiments of the present application, an encoding scheme is used for fusing user-vision-oriented image reconstruction tasks and machine-vision-oriented image analysis tasks.

Description

Image coding, decoding, reconstruction, analysis method, system and electronic equipment

This application claims priority to Chinese patent application No. 202110860294.5, titled "Image coding, decoding, reconstruction, analysis method, system and electronic equipment" and filed on July 28, 2021, the entire content of which is incorporated herein by reference.

Technical field

The embodiments of the present application relate to the field of image technology, and specifically relate to an image encoding, decoding, reconstruction, and analysis method, system, and electronic equipment.

Background

As a kind of data with visual effects, images can be divided into static images and image frames of dynamic videos. After the image is generated (for example, after the image is collected or produced), in order to facilitate the digital transmission of the image, the image may be encoded using image coding technology. Image coding, also known as image compression, refers to the technology of representing the information contained in the image with fewer bits under the condition of satisfying a certain image quality. After the image is encoded and transmitted to the receiving end, the receiving end can decode the image to realize image reconstruction. Based on the reconstructed image, the user can watch the image at the receiving end to meet the user's image viewing needs; its typical application scenarios are image display, video playback, etc.

With the rise of technologies and requirements such as the Internet of Things, smart cities, and smart offices, images need to meet the image analysis needs of machine vision (also known as computer vision) in addition to traditional user viewing needs. However, there is currently no image coding scheme that is compatible with both user-vision-oriented image reconstruction tasks and machine-vision-oriented image analysis tasks. Therefore, how to provide a new image coding scheme that integrates image reconstruction tasks and image analysis tasks has become a technical problem that those skilled in the art urgently need to solve.

Summary of the invention

In view of this, the embodiments of the present application provide an image encoding, decoding, reconstruction, analysis method, system, and electronic device to integrate image reconstruction tasks oriented to user vision and image analysis tasks oriented to machine vision.

To achieve the above purpose, the embodiments of the present application provide the following technical solutions.

In the first aspect, the embodiment of the present application provides an image coding method, including:

acquiring an original image;

extracting high-level semantic information of the original image, the high-level semantic information being used for an image analysis task of the original image;

extracting low-level features of the original image, the low-level features and the high-level semantic information being used for an image reconstruction task of the original image;

encoding the high-level semantic information to generate a first code stream; and

encoding the low-level features to generate a second code stream.

In a second aspect, an embodiment of the present application provides an image coding system, including:

a semantic extractor, configured to extract high-level semantic information of an original image, the high-level semantic information being used for an image analysis task of the original image;

a feature extractor, configured to extract low-level features of the original image, the low-level features and the high-level semantic information being used for an image reconstruction task of the original image;

a first encoder, configured to encode the high-level semantic information to generate a first code stream; and

a second encoder, configured to encode the low-level features to generate a second code stream.

In a third aspect, the embodiment of the present application provides an image decoding method, including:

obtaining a first code stream corresponding to high-level semantic information of an original image, and a second code stream corresponding to low-level features of the original image;

decoding the first code stream to obtain the high-level semantic information, the high-level semantic information being used to perform an image analysis task of the original image;

decoding the second code stream to obtain the low-level features; and

performing image reconstruction on the original image according to the low-level features and the high-level semantic information to obtain a predicted image.

In a fourth aspect, the embodiment of the present application provides an image decoding system, including:

a first decoder, configured to decode a first code stream corresponding to high-level semantic information of an original image to obtain the high-level semantic information, the high-level semantic information being used to perform an image analysis task of the original image;

a second decoder, configured to decode a second code stream corresponding to low-level features of the original image to obtain the low-level features; and

a predictor, configured to perform image reconstruction on the original image according to the low-level features and the high-level semantic information to obtain a predicted image.

In the fifth aspect, the embodiment of the present application provides an image reconstruction method, including:

obtaining high-level semantic information and low-level features of an original image;

performing image reconstruction on the original image according to the low-level features and the high-level semantic information to obtain a predicted image; and

acquiring difference information between the predicted image and the original image, the difference information being used to enhance the predicted image.
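
As an illustrative sketch only, the difference information of the fifth aspect can be pictured as a per-pixel residual between the original image and the predicted image. The names `difference_info` and `enhance`, and the use of flat integer lists as images, are assumptions made for this example and are not taken from the application.

```python
from typing import List

Image = List[int]  # hypothetical: a flat list of 8-bit pixel values


def difference_info(original: Image, predicted: Image) -> List[int]:
    """Per-pixel residual between the original and the predicted image."""
    if len(original) != len(predicted):
        raise ValueError("images must have the same number of pixels")
    return [o - p for o, p in zip(original, predicted)]


def enhance(predicted: Image, residual: List[int]) -> Image:
    """Add the residual back to the predicted image, clamped to 0..255."""
    return [max(0, min(255, p + r)) for p, r in zip(predicted, residual)]


original = [12, 200, 37, 255]
predicted = [10, 205, 37, 250]
residual = difference_info(original, predicted)
enhanced = enhance(predicted, residual)
```

In this toy setting the enhanced image equals the original because the residual is kept exactly; a practical system would likely compress the residual, so the enhancement would only approximate the original.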

In a sixth aspect, the embodiment of the present application provides an image analysis method, including:

based on an image analysis task of an original image, obtaining a target code stream from a plurality of code streams; the plurality of code streams including at least one first code stream corresponding to at least one piece of high-level semantic information of the original image, and a second code stream corresponding to low-level features of the original image; wherein each piece of high-level semantic information of the original image corresponds to one first code stream, and the target code stream is the first code stream, among the at least one first code stream, whose high-level semantic information is applicable to the image analysis task;

decoding the target code stream to obtain the high-level semantic information applicable to the image analysis task; and

performing the image analysis task according to the decoded high-level semantic information.
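
The selection step of the sixth aspect can be sketched as a lookup keyed by task. This is a minimal illustration under stated assumptions: the stream keys (`first/classification`, `first/detection`, `second/low_level`), the JSON payloads, and the use of `zlib` as the codec are all hypothetical stand-ins, not details from the application.

```python
import json
import zlib
from typing import Dict

# Hypothetical container: one first code stream per kind of high-level
# semantic information, plus the second code stream for low-level features.
streams: Dict[str, bytes] = {
    "first/classification": zlib.compress(json.dumps({"label": "cat"}).encode()),
    "first/detection": zlib.compress(json.dumps({"boxes": [[4, 4, 20, 20]]}).encode()),
    "second/low_level": zlib.compress(bytes(range(16))),
}


def select_target_stream(task: str, code_streams: Dict[str, bytes]) -> bytes:
    """Pick the first code stream whose semantic information suits the task."""
    key = f"first/{task}"
    if key not in code_streams:
        raise KeyError(f"no first code stream for task {task!r}")
    return code_streams[key]


def decode_semantics(stream: bytes) -> dict:
    """Decode a first code stream back into semantic information."""
    return json.loads(zlib.decompress(stream).decode())


# Only the target stream is decoded; the second code stream is left untouched.
semantics = decode_semantics(select_target_stream("detection", streams))
```

The point of the sketch is that an analysis task touches only the one stream it needs, so the low-level stream never has to be decoded for machine-vision use.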

In a seventh aspect, an embodiment of the present application provides an electronic device, including at least one memory and at least one processor, where the memory stores one or more computer-executable instructions, and the processor invokes the one or more computer-executable instructions to execute the image coding method described in the first aspect, or the image decoding method described in the third aspect, or the image reconstruction method described in the fifth aspect, or the image analysis method described in the sixth aspect.

In an eighth aspect, an embodiment of the present application provides a storage medium storing one or more computer-executable instructions which, when executed, implement the image coding method described in the first aspect, or the image decoding method described in the third aspect, or the image reconstruction method described in the fifth aspect, or the image analysis method described in the sixth aspect.

The image coding method provided in the embodiments of the present application extracts high-level semantic information and low-level features from the original image separately. Since the high-level semantic information expresses the semantics of the original image at the conceptual level, it can be used for the image analysis task of the original image, while the high-level semantic information and the low-level features can be combined for the image reconstruction task of the original image. Furthermore, the high-level semantic information and the low-level features are encoded separately to generate the first code stream and the second code stream, so that they can be transmitted to the receiving end in the form of code streams. The embodiments of the present application thus use a single coding scheme to integrate image reconstruction tasks oriented to user vision and image analysis tasks oriented to machine vision.

Brief description of the drawings

In order to illustrate the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only embodiments of the present application, and those skilled in the art can obtain other drawings from the provided drawings without creative work.

FIG. 1 is a block diagram of an image transmission system provided by an embodiment of the present application.

FIG. 2A is a flow chart of the image encoding and decoding method provided by the embodiment of the present application.

FIG. 2B is a schematic structural diagram of an image coding system provided by an embodiment of the present application.

FIG. 2C is a schematic structural diagram of an image encoding and decoding system provided by an embodiment of the present application.

FIG. 3A is another flow chart of the image encoding and decoding method provided by the embodiment of the present application.

FIG. 3B is another schematic structural diagram of an image coding system provided by an embodiment of the present application.

FIG. 3C is another schematic structural diagram of the image encoding and decoding system provided by the embodiment of the present application.

FIG. 4A is a flowchart of a method for obtaining a predicted image according to an embodiment of the present application.

FIG. 4B is a schematic structural diagram of the predictor provided by the embodiment of the present application.

FIG. 4C is a schematic structural diagram of a convolutional network provided by an embodiment of the present application.

FIG. 4D is another structural schematic diagram of the predictor provided by the embodiment of the present application.

FIG. 5A is a schematic structural diagram of a feature extractor provided in an embodiment of the present application.

FIG. 5B is another schematic structural diagram of the feature extractor provided by the embodiment of the present application.

FIG. 6 is another schematic structural diagram of the image encoding and decoding system provided by the embodiment of the present application.

FIG. 7 is another schematic structural diagram of the image encoding and decoding system provided by the embodiment of the present application.

FIG. 8 is a flow chart of the image analysis method provided by the embodiment of the present application.

FIG. 9A is a schematic diagram of effect comparison provided by the embodiment of the present application.

FIG. 9B is a schematic diagram of another effect comparison provided by the embodiment of the present application.

FIG. 9C is a schematic diagram of yet another effect comparison provided by the embodiment of the present application.

FIG. 9D is a schematic diagram of another effect comparison provided by the embodiment of the present application.

FIG. 10 is a diagram of an application example provided by the embodiment of the present application.

FIG. 11 is a block diagram of an electronic device provided by an embodiment of the present application.

Detailed description

The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the application. All other embodiments obtained by persons of ordinary skill in the art based on the embodiments in this application, without creative effort, fall within the scope of protection of this application.

FIG. 1 exemplarily shows a block diagram of an image transmission system 100 provided by an embodiment of the present application. As shown in FIG. 1, the image transmission system 100 may include an image sending end 110 and an image receiving end 120. The sending end 110 may be provided with an image coding system 111, which implements image coding using the image coding scheme provided by the embodiments of the present application; the receiving end 120 may be provided with an image decoding system 121, which implements image decoding using the image decoding scheme provided by the embodiments of the present application.

The image transmission system 100 shown in FIG. 1 can be applied to image transmission between any devices, including but not limited to image transmission between terminals, between terminals and servers, and between servers. The terminal may include smart hardware (for example, smart hardware with image acquisition capability such as a smart camera) and user equipment such as a mobile phone or a computer. In the embodiments of the present application, a device is not fixed as the sending end or the receiving end; its role depends on whether it is sending or receiving the image. For example, a device acts as the sending end when it sends an image and as the receiving end when it receives an image. In some embodiments, whichever of the terminal and the server sends an image acts as the sending end and implements image coding using the image coding scheme provided by the embodiments of the present application; whichever of them receives the image acts as the receiving end and implements image decoding using the image decoding scheme provided by the embodiments of the present application.

In the embodiments of the present application, when encoding an image, the sending end 110 needs to support both the user-vision-oriented image reconstruction task and the machine-vision-oriented image analysis task of the receiving end 120. The user-vision-oriented image reconstruction task can be understood as reconstructing the image at the receiving end 120 so that the user can view the image there. The machine-vision-oriented image analysis task can be understood as analyzing and processing the image from the perspective of a computer at the receiving end; for example, the receiving end performs image classification, object detection, instance segmentation, and the like on the image. In some embodiments, the application scenarios of machine-vision-oriented image analysis tasks include, but are not limited to: license plate recognition and road planning in intelligent transportation systems, object detection and road tracking in automatic driving systems, face recognition and facial expression detection and analysis in smart medical systems, abnormal behavior detection, etc.

In the embodiments of the present application, the sending end 110 can use the image coding system 111 to implement image encoding that supports both user-vision-oriented image reconstruction tasks and machine-vision-oriented image analysis tasks; the receiving end 120 can use the image decoding system 121 to implement image decoding, specifically performing the user-vision-oriented image reconstruction task and providing processing data for the receiving end 120 to perform the image analysis task. In some embodiments, FIG. 2A exemplarily shows a flow chart of the image encoding and decoding method provided by the embodiment of the present application. In this method, the image encoding system 111 may perform the image encoding process, and the image decoding system 121 may perform the image decoding process. As shown in FIG. 2A, the method may include the following steps.

In step S20, the image coding system acquires the original image.

The original image can be regarded as an image that needs to be encoded, for example, an image collected by the sending end 110 using a camera, or an image created by drawing or programming. The original image can be input into the image encoding system 111, and the original image is encoded by the image encoding system 111.

In step S21, the image coding system extracts high-level semantic information of the original image.

In step S22, the image coding system extracts low-level features of the original image.

The high-level semantic information of an image may be the semantics expressed by the image, for example, the objects shown in the image (such as people, plants, animals, manufactured objects, etc.), the types of the objects, the relationships between the objects, and so on. The low-level features of an image can be understood as detailed feature information of the image, such as its color and texture.

In some embodiments, the low-level features of an image express the rich visual detail information of the image but carry little semantic information at the conceptual level, whereas the high-level semantic information expresses the semantics of the image at the conceptual level. As an optional implementation, the low-level features may be features of the visual layer of the image, and the high-level semantic information may be information of the conceptual layer of the image.

In the embodiment of the present application, the high-level semantic information can express the semantics of the original image at the conceptual level, so the high-level semantic information can be used for the image analysis task of the original image. Since the low-level features express the rich visual detail information of the original image, and the high-level semantic information expresses the semantics of the original image at the conceptual level, the low-level features and high-level semantic information can be combined for the image reconstruction task of the original image.

In step S23, the image encoding system encodes the high-level semantic information to generate a first code stream.

In step S24, the image encoding system encodes the low-level features to generate a second code stream.

In the embodiments of the present application, the image analysis task and the image reconstruction task of the original image can be performed by the receiving end 120. Therefore, after the image coding system obtains the high-level semantic information and the low-level features, it can encode them separately to obtain the first code stream corresponding to the high-level semantic information and the second code stream corresponding to the low-level features. The first code stream and the second code stream can be transmitted from the sending end 110 to the receiving end 120.

In some embodiments, high-level semantic information and low-level features can be encoded using the same encoding method, for example, high-level semantic information and low-level features can be encoded using the same encoder in an image encoding system. The same encoder may be a lossless encoder to achieve lossless encoding of high-level semantic information and lossless encoding of low-level features.
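
Steps S20 to S24 can be sketched as a small pipeline. This is only a toy illustration under stated assumptions: the extractors below (`extract_semantics`, `extract_low_level`) are hypothetical placeholders rather than the extractors of the embodiments, and `zlib` over JSON stands in for a real lossless codec shared by both streams.

```python
import json
import zlib
from typing import Dict, List, Tuple

Image = List[List[int]]  # hypothetical grayscale image, rows of 8-bit pixels


def extract_semantics(image: Image) -> Dict[str, str]:
    """Placeholder for step S21: a toy 'bright'/'dark' scene label."""
    pixels = [p for row in image for p in row]
    mean = sum(pixels) / len(pixels)
    return {"scene": "bright" if mean >= 128 else "dark"}


def extract_low_level(image: Image) -> List[int]:
    """Placeholder for step S22: horizontal gradients as detail features."""
    return [row[i + 1] - row[i] for row in image for i in range(len(row) - 1)]


def lossless_encode(obj) -> bytes:
    """Shared lossless encoder (zlib over JSON), a stand-in for a real codec."""
    return zlib.compress(json.dumps(obj).encode("utf-8"))


def lossless_decode(stream: bytes):
    """Inverse of lossless_encode."""
    return json.loads(zlib.decompress(stream).decode("utf-8"))


def encode_image(image: Image) -> Tuple[bytes, bytes]:
    semantics = extract_semantics(image)        # S21
    low_level = extract_low_level(image)        # S22
    first_stream = lossless_encode(semantics)   # S23
    second_stream = lossless_encode(low_level)  # S24
    return first_stream, second_stream


original = [[10, 20, 30], [200, 210, 220]]  # S20: acquire the original image
first_stream, second_stream = encode_image(original)
```

Because the same lossless encoder is used for both streams, each stream round-trips exactly through `lossless_decode`, which mirrors the lossless-encoding option described above.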

In step S30, the image decoding system decodes the first code stream to obtain high-level semantic information, and the high-level semantic information is used to perform the image analysis task of the original image.

After the image decoding system acquires the first code stream transmitted by the sending end 110, it may decode the first code stream to obtain the high-level semantic information. In some embodiments, the high-level semantic information obtained by the image decoding system provides processing data for the receiving end 120 to perform image analysis tasks. In some further embodiments, the receiving end 120 may be provided with image analysis logic (such as an image analysis model) for performing image analysis tasks; the image decoding system imports the decoded high-level semantic information into the image analysis logic, and the image analysis logic performs the specific image analysis task based on the high-level semantic information, thereby meeting the machine-vision-oriented image analysis needs for the original image. As an optional implementation, the image analysis logic can perform image analysis tasks such as image classification, object detection, and instance segmentation. As a possible alternative implementation, the image analysis logic may also be provided in an external device communicatively connected with the receiving end 120.

In step S31, the image decoding system decodes the second code stream to obtain low-level features.

In step S32, the image decoding system performs image reconstruction on the original image according to the low-level features and high-level semantic information to obtain a predicted image.

Since neither the low-level features nor the high-level semantic information alone can fully express the information of the original image, in the embodiments of the present application the low-level features and the high-level semantic information are combined for the image reconstruction task of the original image. On this basis, after decoding the first code stream and the second code stream, the image decoding system can reconstruct the original image from the decoded high-level semantic information and low-level features, thereby realizing the image reconstruction task of the original image.

In some embodiments, the image decoding system uses the high-level semantic information as guidance information for image reconstruction, and reconstructs the specific image details expressed by the low-level features to obtain a predicted image. In the embodiments of the present application, the reconstruction of the original image combines its high-level semantic information and low-level features. Since the high-level semantic information can accurately express the boundaries of objects in the original image and the relationships between objects (such as occlusion relationships), performing image reconstruction based on the high-level semantic information ensures that the basic structure of the predicted image is similar to, or even consistent with, that of the original image. At the same time, the reconstruction incorporates the rich image details expressed by the low-level features, so that specific details of the original image (such as the color and texture of objects) can be accurately reconstructed, ensuring the accuracy of the local details of the predicted image. Therefore, by combining the high-level semantic information and low-level features of the original image for reconstruction, the embodiments of the present application allow the reconstructed predicted image to have higher accuracy.
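
The idea of semantics guiding structure while low-level features supply detail can be pictured with a deliberately simple example. Here a per-pixel region label map plays the role of the high-level semantic information and a per-region mean intensity plays the role of the low-level features; both choices, and the names `region_means` and `predict`, are assumptions for illustration and are not the predictor of the embodiments.

```python
from typing import Dict, List

LabelMap = List[List[int]]  # hypothetical semantic map: one region id per pixel
Image = List[List[int]]


def region_means(image: Image, labels: LabelMap) -> Dict[int, int]:
    """Encoder-side toy low-level feature: mean intensity of each region."""
    sums: Dict[int, int] = {}
    counts: Dict[int, int] = {}
    for img_row, lab_row in zip(image, labels):
        for pixel, label in zip(img_row, lab_row):
            sums[label] = sums.get(label, 0) + pixel
            counts[label] = counts.get(label, 0) + 1
    return {label: round(sums[label] / counts[label]) for label in sums}


def predict(labels: LabelMap, means: Dict[int, int]) -> Image:
    """Decoder-side toy predictor: structure from labels, detail from means."""
    return [[means[label] for label in row] for row in labels]


original = [[10, 12, 200], [10, 12, 204]]
labels = [[0, 0, 1], [0, 0, 1]]  # region 0 = background, region 1 = object
predicted = predict(labels, region_means(original, labels))
```

The predicted image reproduces the object boundary exactly (it comes from the label map) but only approximates the pixel values inside each region, which is the kind of gap that difference information between the predicted and original images could later close.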

As can be seen from the image coding scheme provided by the embodiments of the present application, high-level semantic information and low-level features are extracted from the original image separately. Since the high-level semantic information expresses the semantics of the original image at the conceptual level, it can be used for the image analysis task of the original image, while the high-level semantic information and the low-level features can be combined for the image reconstruction task of the original image. Furthermore, the high-level semantic information and the low-level features are encoded separately to generate the first code stream and the second code stream, so that they can be transmitted to the receiving end in the form of code streams. The embodiments of the present application thus use a single coding scheme to integrate image reconstruction tasks oriented to user vision and image analysis tasks oriented to machine vision.

Based on the method flow shown in FIG. 2A, from the perspective of the image coding system 111, FIG. 2B exemplarily shows a schematic structural diagram of the image coding system 111 provided by the embodiment of the present application. As shown in FIG. 2B, the image encoding system 111 may include: a semantic extractor 210, a feature extractor 211, a first encoder 212 and a second encoder 213.

As shown in FIG. 2B, when the original image is input into the image coding system 111, the semantic extractor 210 is used to extract high-level semantic information of the original image, and the feature extractor 211 is used to extract low-level features of the original image.

In some embodiments, the semantic extractor 210 and the feature extractor 211 may be different network layers of a convolutional neural network, with the semantic extractor 210 located at a higher layer than the feature extractor 211. The higher a network layer is in the convolutional neural network, the more its processing result tends toward the semantic information of the image; conversely, the lower the network layer, the more its processing result tends toward the low-level detail features of the image.

As an optional implementation, the convolutional neural network can include a backbone network, and the backbone network can include a standard convolutional layer at a low level and a high-level semantic information extraction layer at a high level. When an image is input to the convolutional neural network for processing, the embodiments of the present application can use the processing result of the standard convolutional layer as the low-level features of the image, and the processing result of the high-level semantic information extraction layer as the high-level semantic information of the image. It should be noted that the structure described in this paragraph is only optional; the embodiments of the present application can also use convolutional neural networks with other structures, taking the output of a low-level network layer as the low-level features and the output of a high-level network layer as the high-level semantic information.
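
Taking outputs from different depths of one network can be sketched with a tiny pure-Python layer stack. This is a minimal illustration, assuming a 1-D signal in place of an image and a hypothetical four-layer `backbone`; it shows only the tapping mechanism, not a network capable of real semantic extraction.

```python
from typing import Callable, List, Sequence, Tuple

Layer = Callable[[List[float]], List[float]]


def conv1d(kernel: Sequence[float]) -> Layer:
    """Return a layer applying a valid 1-D sliding-window filter with `kernel`."""
    def layer(x: List[float]) -> List[float]:
        k = len(kernel)
        return [sum(kernel[j] * x[i + j] for j in range(k))
                for i in range(len(x) - k + 1)]
    return layer


def relu(x: List[float]) -> List[float]:
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in x]


def global_mean(x: List[float]) -> List[float]:
    """Collapse the whole signal to one value, discarding spatial detail."""
    return [sum(x) / len(x)]


# Hypothetical backbone: a low-level convolution followed by higher layers.
backbone: List[Layer] = [conv1d([-1.0, 1.0]), relu, conv1d([0.5, 0.5]), global_mean]


def run_backbone(x: List[float], low_tap: int, high_tap: int) -> Tuple[List[float], List[float]]:
    """Run all layers and return the outputs of layers `low_tap` and `high_tap`."""
    outputs = []
    for layer in backbone:
        x = layer(x)
        outputs.append(x)
    return outputs[low_tap], outputs[high_tap]


signal = [0.0, 0.0, 4.0, 4.0, 0.0]
low_level, high_level = run_backbone(signal, low_tap=0, high_tap=3)
```

The early tap keeps position-by-position detail (here, signed edge responses), while the late tap has been reduced to a single summary value, loosely mirroring how lower layers retain detail and higher layers retain abstraction.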

In the embodiments of the present application, the high-level semantic information expresses the semantics of the original image at the conceptual level, so it can be used for the image analysis task of the original image. That is to say, after the receiving end 120 obtains the high-level semantic information of the original image, it can analyze and process the original image according to the high-level semantic information, so as to realize image analysis tasks such as image classification, object detection, and instance segmentation. Since neither the low-level features nor the high-level semantic information alone can fully express the information of the original image, the low-level features and the high-level semantic information are combined for the image reconstruction task of the original image. In some embodiments, since the low-level features express the rich visual detail information of the original image and the high-level semantic information expresses its semantics at the conceptual level, the embodiments of the present application can reconstruct the original image based on its low-level features and high-level semantic information, thereby realizing the image reconstruction task of the original image. For example, after the receiving end 120 obtains the low-level features and high-level semantic information of the original image, it may reconstruct the original image according to them.

Referring back to FIG. 2B, after the semantic extractor 210 extracts the high-level semantic information of the original image, the first encoder 212 may encode the high-level semantic information to generate a first code stream. The first code stream can be transmitted to the receiving end 120. After the feature extractor 211 extracts the low-level features of the original image, the second encoder 213 may encode the low-level features to generate a second code stream. The second code stream can be transmitted to the receiving end 120.

In some embodiments, the first encoder 212 and the second encoder 213 may be the same encoder; that is, the embodiments of the present application may use the same encoder to encode the high-level semantic information and the low-level features separately. For example, the first encoder 212 and the second encoder 213 may be the same lossless encoder, such as a FLIF (Free Lossless Image Format) encoder in one possible implementation. It should be noted that a lossless encoder is only one form the shared encoder may take; the embodiments of the present application also support the first encoder 212 and the second encoder 213 being the same encoder of another form. In some other embodiments, the first encoder 212 and the second encoder 213 may be different encoders.

In some further embodiments, after the image coding system 111 generates the first code stream and the second code stream, the sending end 110 may transmit the first code stream and the second code stream to the receiving end 120. The receiving end 120 can then decode the first code stream and the second code stream, use the high-level semantic information to realize the image analysis task, and use the high-level semantic information and low-level features to realize the image reconstruction task.

Based on the method flow shown in FIG. 2A , FIG. 2C further shows a schematic structural diagram of the image encoding and decoding system provided by the embodiment of the present application on the basis of FIG. 2B . It can be understood that the decoding process of the image decoding system 121 shown in FIG. 2C may be an inverse process of the encoding process of the image encoding system 111 . As shown in FIG. 2C , the image decoding system 121 may include: a first decoder 220 , a second decoder 221 and a predictor 222 .

Wherein, the first decoder 220 is used to decode the first code stream to obtain high-level semantic information of the original image. The high-level semantic information obtained by the first decoder 220 can be used for the image analysis task of the original image. In some further embodiments, the high-level semantic information output by the first decoder 220 can be imported into image analysis logic (such as an image analysis model) for performing image analysis tasks, and the image analysis logic performs specific image analysis tasks based on the high-level semantic information. The image analysis logic can be set at the receiving end 120 , or can be set at an external device communicatively connected with the receiving end 120 .

The second decoder 221 is used to decode the second code stream to obtain low-level features of the original image.

In some embodiments, the first decoder 220 and the second decoder 221 may be the same decoder, for example, the same lossless decoder, such as a FLIF decoder.

The high-level semantic information obtained by the first decoder 220 and the low-level features obtained by the second decoder 221 can be imported into the predictor 222 . The predictor 222 is used to perform image reconstruction on the original image according to low-level features and high-level semantic information to obtain a predicted image. The predicted image obtained by the predictor 222 can be displayed to the user at the receiving end 120, so as to meet the viewing requirement of the original image oriented to the user's vision.

The image encoding and decoding system provided by the embodiment of the present application can use a set of corresponding encoding framework and decoding framework to realize the fusion of user vision-oriented image reconstruction tasks and machine vision-oriented image analysis tasks.

In some further embodiments, if the image decoding system 121 uses only the high-level semantic information and the low-level features to reconstruct the original image, there may be a large deviation between the reconstructed image and the original image. Therefore, in the embodiment of the present application, the image coding system 111 may first use the high-level semantic information and the low-level features to perform image reconstruction to obtain a predicted image of the original image, then determine the difference information between the predicted image and the original image, and transmit the difference information to the image decoding system 121 . The image decoding system 121 can then, on the basis of combining the high-level semantic information and the low-level features for image reconstruction, further use the difference information to perform image enhancement on the reconstructed image, so as to make the final reconstructed image more accurate. Based on this, FIG. 3A exemplarily shows another flowchart of the image encoding and decoding method provided by the embodiment of the present application. As shown in FIG. 2A and FIG. 3A , the method flow shown in FIG. 3A further includes the following steps on the basis of the method flow shown in FIG. 2A .

In step S25, the image coding system performs image reconstruction on the original image according to the low-level features and high-level semantic information to obtain a predicted image.

In step S26, the image coding system determines difference information between the predicted image and the original image, and the difference information is used to enhance the predicted image in the image reconstruction task.

In step S27, the image encoding system encodes the difference information to generate a third code stream.

In the embodiment of the present application, in addition to performing steps S20 to S24, the image coding system further performs steps S25 to S27.

In some embodiments, the image coding system can use the high-level semantic information as guidance information for image reconstruction to reconstruct the specific image details expressed by the low-level features, so as to obtain a predicted image. The image coding system can compare the predicted image with the original image to determine the difference information between the predicted image and the original image. In the image reconstruction task at the receiving end, the difference information can be used to perform image enhancement on the predicted image reconstructed at the receiving end, so as to make the enhanced image closer to the original image. In order to enable the difference information to be transmitted to the receiving end, after the image coding system obtains the difference information, it may encode the difference information to generate a third code stream. The third code stream may be transmitted from the sending end to the receiving end.

In some embodiments, the above difference information may be residual information between the predicted image and the original image.
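As a minimal sketch of residual information, assuming single-channel 8-bit images represented as nested lists: the residual is the per-pixel difference between the original image and the predicted image, and adding it back to the predicted image yields the enhanced image. The function names `residual` and `enhance` and the clipping range are illustrative assumptions, not names from the embodiment.

```python
def residual(original, predicted):
    """Per-pixel difference: original minus predicted."""
    return [[o - p for o, p in zip(row_o, row_p)]
            for row_o, row_p in zip(original, predicted)]


def enhance(predicted, diff, lo=0, hi=255):
    """Add the difference back to the predicted image, clipped to [lo, hi]."""
    return [[min(hi, max(lo, p + d)) for p, d in zip(row_p, row_d)]
            for row_p, row_d in zip(predicted, diff)]


original = [[120, 130], [140, 150]]
predicted = [[118, 133], [140, 145]]

diff = residual(original, predicted)    # what the sending end would encode
enhanced = enhance(predicted, diff)     # what the receiving end would display
```

When the residual is carried without loss, the enhanced image equals the original image; with lossy coding of the residual, the enhanced image only approximates it.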

In some embodiments, the image coding system may perform lossy coding on the difference information to generate the third code stream. The lossy coding is, for example, VVC (Versatile Video Coding). As an optional implementation, since lossy coding involves a certain coding loss, the embodiment of the present application can determine the coding loss of the difference information based on the negative correlation between the image quality requirement (QP) of the reconstructed image and the coding loss of the difference information, so that a third code stream suited to the image quality requirement is obtained.

As an optional implementation, in this embodiment of the present application, the coding loss of the difference information may be controlled based on the image quality requirement of the reconstructed image. For example, if the image quality requirement of the reconstructed image is higher, the embodiment of the present application can control the coding loss of the difference information to be lower; if the image quality requirement is lower, the embodiment of the present application can control the coding loss of the difference information to be higher. In some further embodiments, the embodiment of the present application can set the image quality requirement based on the network bandwidth. For example, the network bandwidth and the image quality requirement are positively correlated, that is, the higher the network bandwidth, the higher the image quality requirement. Thus, in the embodiment of the present application, third code streams with different data sizes can be obtained by encoding according to different image quality requirements, so as to adapt to different network bandwidth conditions.
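The two correlations above (bandwidth positively correlated with the image quality requirement, and the image quality requirement negatively correlated with the coding loss) can be sketched as follows. This is an illustration under stated assumptions only: uniform quantization stands in for a real lossy coder such as VVC, and the linear mappings, the 20 Mbit/s ceiling, the step range of 1 to 16, and all function names are hypothetical.

```python
def quality_from_bandwidth(bandwidth_mbps, max_bandwidth_mbps=20.0):
    """Map bandwidth to a quality requirement in [0, 1] (positive correlation)."""
    return max(0.0, min(1.0, bandwidth_mbps / max_bandwidth_mbps))


def quant_step(quality, min_step=1, max_step=16):
    """Map quality to a quantization step: higher quality, smaller step,
    and therefore lower coding loss (negative correlation)."""
    return round(max_step - quality * (max_step - min_step))


def quantize(diff, step):
    """Lossy step: keep only the nearest multiple of `step` for each value."""
    return [[round(d / step) for d in row] for row in diff]


def dequantize(levels, step):
    """Reconstruct approximate difference values from quantization levels."""
    return [[v * step for v in row] for row in levels]


diff = [[2, -3], [0, 5]]

high_step = quant_step(quality_from_bandwidth(20.0))  # good network
low_step = quant_step(quality_from_bandwidth(0.0))    # poor network

decoded_high = dequantize(quantize(diff, high_step), high_step)
decoded_low = dequantize(quantize(diff, low_step), low_step)
```

In this sketch a larger step discards more of the difference information, so the levels to be entropy-coded are smaller and more repetitive, which corresponds to a third code stream of smaller data size at the cost of a higher coding loss.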

In step S33, the image decoding system decodes the third code stream to obtain difference information.

In step S34, the image decoding system performs image enhancement processing on the predicted image according to the difference information, so as to obtain an enhanced image for display.

In the embodiment of the present application, the image decoding system further performs steps S33 to S34 in addition to steps S30 to S32.

In some embodiments, after the image decoding system performs image reconstruction based on the low-level features and the high-level semantic information to obtain the predicted image, it can further use the difference information obtained by decoding the third code stream to perform image enhancement processing on the predicted image to obtain an enhanced image. The enhanced image can be used as the final reconstructed image in the embodiment of the present application, and displayed to the user for viewing.

In some embodiments, the image decoding system can perform lossy decoding on the third code stream. For example, VVC decoding is performed on the third code stream.

The image encoding and decoding method provided by the embodiment of the present application can determine the difference information between the predicted image and the original image, and then use the difference information to enhance the predicted image reconstructed at the receiving end, so that the final enhanced image is closer to the original image, improving the accuracy of image reconstruction.

In some further embodiments, the image encoding system 111 may decide whether to encode and transmit the difference information based on the network bandwidth of the sending end 110 .

If the network bandwidth is lower than the set bandwidth threshold, the network bandwidth is limited, and in order to ensure that the user can obtain a continuous image playback experience at the receiving end, the image encoding system 111 may not perform encoding and transmission of the difference information. For example, after the image encoding system 111 obtains the difference information, the execution of step S27 can be cancelled, and the difference information is not encoded. Furthermore, the image coding system 111 may wait until the network bandwidth of the sending end is higher than the bandwidth threshold, and then perform encoding and transmission of the difference information, so as to transmit the difference information to the receiving end.

It should be noted that if the network bandwidth of the sending end is higher than the bandwidth threshold, the better network bandwidth allows more code stream information to be transmitted between the sending end and the receiving end, so that the receiving end can continuously play images while also providing a high-definition image playback experience. In this case, the embodiment of the present application can perform encoding and transmission of the difference information. For example, after obtaining the difference information, the image coding system 111 may directly encode the difference information to generate a third code stream, and the sending end transmits the third code stream to the receiving end.

As an implementation example, the receiving end may play video images based on the information continuously transmitted by the sending end. If the current network bandwidth is low, the embodiment of the present application can reduce the definition of the video images played by the receiving end. In this case, the sending end may not encode and transmit the difference information, but only transmit the first code stream and the second code stream to the receiving end; the receiving end then performs image reconstruction based on the low-level features and the high-level semantic information and displays the reconstructed image, which ensures that the receiving end can continuously play video images at a reduced definition. If the current network bandwidth is relatively high, the embodiment of the present application can improve the definition of the video images played by the receiving end. In this case, the sending end can encode the difference information and transmit the first code stream, the second code stream and the third code stream to the receiving end; the receiving end can then perform image reconstruction based on the low-level features and the high-level semantic information, and use the difference information to perform image enhancement on the reconstructed image, so that the enhanced image has higher definition and the receiving end can continuously play high-definition video images.
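The bandwidth-based decision can be sketched as a small selection function. The names `streams_to_send` and `make_third`, the byte-string placeholders, and the handling of a bandwidth exactly equal to the threshold (treated here as not exceeding it) are assumptions made for illustration; the text above only specifies the cases below and above the threshold.

```python
def streams_to_send(bandwidth, threshold, first, second, make_third):
    """Choose the code streams to transmit for the current bandwidth.

    `make_third` is a zero-argument callable that encodes the difference
    information on demand, so that the encoding step (step S27) is skipped
    entirely when the bandwidth does not exceed the threshold.
    """
    streams = [first, second]
    if bandwidth > threshold:
        streams.append(make_third())
    return streams


encode_calls = []


def make_third():
    """Hypothetical stand-in for encoding the difference information."""
    encode_calls.append("S27")
    return b"third-code-stream"


low = streams_to_send(2.0, 5.0, b"first", b"second", make_third)
high = streams_to_send(8.0, 5.0, b"first", b"second", make_third)
```

Passing the third encoder as a callable, rather than passing an already encoded stream, mirrors the option of cancelling step S27 instead of encoding the difference information and then discarding it.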

Based on the method flow shown in FIG. 3A , from the perspective of the image coding system 111 , FIG. 3B exemplarily shows another structural diagram of the image coding system 111 provided by the embodiment of the present application. As shown in FIG. 2B and FIG. 3B , the image encoding system 111 may further include: a predictor 222 , a comparator 214 and a third encoder 215 .

In the embodiment of the present application, the high-level semantic information extracted by the semantic extractor 210 and the low-level features extracted by the feature extractor 211 can be imported into the predictor 222 . The predictor 222 can be used to perform image reconstruction on the original image according to low-level features and high-level semantic information to obtain a predicted image.

The predicted image obtained by the predictor 222 can be input into the comparator 214, and the comparator 214 can be used to compare the predicted image with the original image to obtain difference information between the predicted image and the original image. The difference information can be used to enhance the predicted image in the image reconstruction task at the receiving end. In some embodiments, comparator 214 may include a subtractor. The subtractor can perform residual processing on the predicted image and the original image to obtain residual information between the predicted image and the original image, and the residual information can be used as the above-mentioned difference information.

The difference information obtained by the comparator 214 can be input into the third encoder 215, and the third encoder 215 can be used to encode the difference information to generate a third code stream. In some embodiments, the third encoder 215 may comprise a lossy encoder. In a possible example implementation, the lossy encoder is, for example, a VVC (Versatile Video Coding) encoder. In the optional implementation of encoding the difference information using a lossy encoding method, the embodiment of the present application can determine the coding loss of the difference information based on the negative correlation between the image quality requirement (QP) and the coding loss of the difference information, so as to obtain a third code stream adapted to the image quality requirement.

In some further embodiments, after the image coding system 111 generates the third code stream, the sending end 110 may transmit the third code stream to the receiving end 120 .

Based on the method flow shown in FIG. 3A , FIG. 3C further shows another schematic structural diagram of the image encoding and decoding system provided by the embodiment of the present application on the basis of FIG. 3B . After the image coding system 111 generates the first code stream, the second code stream and the third code stream, the first code stream, the second code stream and the third code stream may be transmitted to the image decoding system 121 of the receiving end 120 . Therefore, the image decoding system 121 can realize image decoding based on the structure shown in FIG. 3C , so as to be compatible with the image analysis task and image reconstruction task of the original image. Referring to FIG. 3B , FIG. 2C and FIG. 3C , the image decoding system 121 shown in FIG. 3C further includes: a third decoder 223 and an image enhancer 224 .

Wherein, the third decoder 223 is used to decode the third code stream to obtain the difference information (for example, residual information) between the original image and the predicted image. In some embodiments, the third decoder 223 may be different from the first decoder 220 and the second decoder 221 . For example, the third decoder 223 may include a lossy decoder, such as a VVC decoder.

The image enhancer 224 may be configured to perform image enhancement processing on the predicted image obtained by the predictor 222 according to the difference information, so as to obtain an enhanced image. Since the difference information expresses the difference between the predicted image and the original image, the embodiment of the present application further introduces the difference information to perform image enhancement processing on the predicted image, which can make the enhanced image closer to the original image, thereby improving the accuracy of the image reconstruction task. In the architecture shown in FIG. 3C , the enhanced image obtained by the image enhancer 224 can be used as the final image displayed to the user, so as to meet the viewing requirements of the original image oriented to the user's vision.

In some embodiments, after the image coding system 111 generates the first code stream, the second code stream and the third code stream, the sending end 110 can transmit the first code stream, the second code stream and the third code stream to the receiving end 120 . Furthermore, the image decoding system 121 can determine, based on the task instruction of the receiving end 120 , whether to decode the first code stream to support the image analysis task, or to decode the first code stream, the second code stream and the third code stream to perform the image reconstruction task.

In some further embodiments, if the task instruction indicates an image analysis task, the image decoding system 121 can decode the first code stream to obtain high-level semantic information, and transmit the high-level semantic information to the image analysis logic to support the execution of the image analysis tasks.

If the task instruction indicates an image reconstruction task, the image decoding system 121 can respectively decode the first code stream, the second code stream and the third code stream to obtain high-level semantic information, low-level features and difference information; thus, the image decoding system 121 The high-level semantic information and low-level features can be used for image reconstruction to obtain a predicted image, and the difference information can be further used to perform image enhancement processing on the predicted image to obtain an enhanced image.

If the task instruction indicates both the image analysis task and the image reconstruction task, the image decoding system 121 can respectively decode the first code stream, the second code stream and the third code stream, use the high-level semantic information to perform the image analysis task, and use the high-level semantic information, the low-level features and the difference information to perform the image reconstruction task.

In some further embodiments, the task instruction is adapted to the user's current needs; for example, the user may trigger different task instructions at the receiving end. In some other embodiments, the task instruction may also be set by default by the system. It can be understood that the selection of code streams for decoding based on the task instruction described in this paragraph is also applicable to the situation, shown in FIG. 2C , where the image coding system transmits only the first code stream and the second code stream. In this case, the image decoding system 121 may not perform the decoding of the third code stream and the image enhancement processing; for the other processes, reference may be made to the corresponding descriptions above, which will not be expanded here.
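The selection of code streams according to the task instruction can be sketched as follows. The `Task` flag type, the function name `streams_to_decode` and the `third_available` parameter are hypothetical names introduced for illustration; they model the three cases described above plus the two-stream situation of FIG. 2C.

```python
from enum import Flag, auto


class Task(Flag):
    """Hypothetical encoding of the task instruction at the receiving end."""
    ANALYSIS = auto()
    RECONSTRUCTION = auto()


def streams_to_decode(task, third_available=True):
    """Return the names of the code streams to decode for a task instruction.

    The first code stream (high-level semantic information) serves both
    tasks; the second and third code streams are only needed for image
    reconstruction. `third_available=False` models the case where only the
    first and second code streams were transmitted.
    """
    needed = []
    if task & (Task.ANALYSIS | Task.RECONSTRUCTION):
        needed.append("first")
    if task & Task.RECONSTRUCTION:
        needed.append("second")
        if third_available:
            needed.append("third")
    return needed


analysis_only = streams_to_decode(Task.ANALYSIS)
reconstruction_only = streams_to_decode(Task.RECONSTRUCTION)
both_tasks = streams_to_decode(Task.ANALYSIS | Task.RECONSTRUCTION)
two_stream_case = streams_to_decode(Task.RECONSTRUCTION, third_available=False)
```

For a pure image analysis task only the first code stream is decoded, so the low-level features and the difference information never need to be decoded at the receiving end.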

The image encoding and decoding system provided by the embodiment of the present application can use a set of corresponding encoding framework and decoding framework to realize the fusion of image reconstruction tasks oriented to user vision and image analysis tasks oriented to machine vision. In addition, the embodiment of the present application uses, in the image reconstruction task, the difference information between the original image and the initially reconstructed predicted image, which can further improve the accuracy of image reconstruction.

Based on the image reconstruction task of the original image, the embodiment of the present application provides an optional image reconstruction method, including the following steps: obtaining high-level semantic information and low-level features of the original image; performing image reconstruction on the original image according to the low-level features and the high-level semantic information to obtain a predicted image; and acquiring difference information between the predicted image and the original image, the difference information being used to enhance the predicted image.

In some embodiments, in the optional implementation of obtaining the predicted image, the embodiment of the present application may use high-level semantic information as guidance information for image reconstruction, and reconstruct specific image details expressed by low-level features to obtain the predicted image. As an optional implementation, FIG. 4A exemplarily shows a flowchart of a method for obtaining a predicted image according to an embodiment of the present application. The flow of the method can be implemented by the predictor 222, and the predictor 222 referred to here can be set in the image encoding system, and can also be set in the image decoding system. Referring to FIG. 4A , the method flow may include the following steps.

In step S40, low-level features are processed to obtain target features for further processing with high-level semantic information.

In some embodiments, the embodiment of the present application may upsample the low-level features to obtain target features for further processing (for example, stacking processing) with the high-level semantic information. In an optional implementation, the upsampling may be nearest neighbor upsampling with a set multiple, for example, nearest neighbor upsampling by 8 times. As an optional implementation, the embodiment of the present application can process (for example, upsample) the low-level features based on the resolution of the high-level semantic information, so that the resolution of the target features obtained by processing the low-level features is the same as the resolution of the high-level semantic information. Of course, the embodiment of the present application may also support the resolution of the target features being different from that of the high-level semantic information, and is not limited to the two resolutions being the same.
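Nearest neighbor upsampling by a set multiple can be sketched in a few lines, assuming a single-channel feature map represented as a nested list. The function name `upsample_nearest` is an illustrative assumption. Each value is repeated `factor` times horizontally, and each resulting row is repeated `factor` times vertically.

```python
def upsample_nearest(feature, factor):
    """Nearest neighbor upsampling of a 2D feature map by an integer factor."""
    upsampled = []
    for row in feature:
        # Repeat every value `factor` times along the width.
        wide_row = [value for value in row for _ in range(factor)]
        # Repeat the widened row `factor` times along the height.
        upsampled.extend(list(wide_row) for _ in range(factor))
    return upsampled


small = [[1, 2],
         [3, 4]]
doubled = upsample_nearest(small, 2)
target = upsample_nearest(small, 8)   # the 8-times example mentioned above
```

With a factor of 8, a 2x2 feature map becomes a 16x16 target feature map in which each original value fills an 8x8 block.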

In step S41, convolution processing is performed after stacking target features and high-level semantic information to obtain convolution features.

In some embodiments, the target features obtained in step S40 can be stacked with the high-level semantic information to obtain stacked features, and the stacked features can be input into a convolutional network for convolution processing to obtain the convolutional features output by the convolutional network. In some further embodiments, the convolutional network may first perform multiple convolutions on the stacked features to obtain first convolutional features; the convolutional network may then perform multiple filtering processes on the first convolutional features to obtain filtering features; the filtering features can be further subjected to multiple convolutions by the convolutional network to obtain second convolutional features; finally, the second convolutional features can be output as the convolutional features through the activation function of the convolutional network.

In step S42, the convolutional features are combined with low-level features to obtain a predicted image.

After the convolutional features are obtained, since the convolutional features can express richer image information, the embodiment of the present application can combine the convolutional features with low-level features to obtain a predicted image. In some embodiments, convolutional features may be added to low-level features to obtain a predicted image. Of course, the combination of convolutional features and low-level features is not limited to addition, and the embodiments of the present application may also support other combinations.

As an optional implementation for the predictor 222 to implement the process shown in FIG. 4A , FIG. 4B exemplarily shows a schematic structural diagram of the predictor 222 provided in the embodiment of the present application. As shown in FIG. 4B , the predictor 222 may include: an upsampler 410 , a stacker 420 , a convolutional network 430 and an adder 440 .

The upsampler 410 can be used to upsample low-level features to obtain target features for further processing with high-level semantic information. In some embodiments, the upsampler 410 may be used to upsample low-level features to obtain target features with the same resolution as high-level semantic information. Of course, the embodiment of the present application may also support that the resolution of the target feature obtained by upsampling the low-level feature is different from that of the high-level semantic information. In some embodiments, the upsampler may perform nearest neighbor upsampling by a set factor (for example, 8 times nearest neighbor upsampling) on low-level features.

The stacker 420 can stack the high-level semantic information and the target features obtained by the upsampler 410 to obtain stacked features. The stacked features may be input into a convolutional network 430 . The convolutional network 430 can perform convolution processing on the stacked features to obtain convolutional features.

The adder 440 can be used to add the low-level features and the convolutional features output by the convolutional network 430 to obtain a predicted image.
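The data flow through the upsampler 410, the stacker 420, the convolutional network 430 and the adder 440 can be sketched as below. This is a structural illustration only: `toy_conv_net` is a hypothetical stand-in that performs no real convolution, the feature maps are single-channel nested lists, and the sketch assumes that the adder operates on the upsampled low-level features so that both operands have the same resolution, which the description above does not state explicitly.

```python
def upsample_nearest(channel, factor):
    """Nearest neighbor upsampling of a 2D map by an integer factor."""
    return [[v for v in row for _ in range(factor)]
            for row in channel for _ in range(factor)]


def predict(low_level, semantic, conv_net, factor):
    """Sketch of the predictor: upsample, stack, convolve, add."""
    target = upsample_nearest(low_level, factor)   # upsampler 410
    stacked = [target, semantic]                   # stacker 420 (channel stack)
    conv_feature = conv_net(stacked)               # convolutional network 430
    # adder 440: element-wise sum (assumed to use the upsampled features)
    return [[t + c for t, c in zip(row_t, row_c)]
            for row_t, row_c in zip(target, conv_feature)]


def toy_conv_net(stacked):
    """Hypothetical stand-in: returns the semantic channel as a correction."""
    _target, semantic = stacked
    return [list(row) for row in semantic]


low_level = [[10, 20]]                  # 1x2 low-level feature map
semantic = [[0, 1, 0, 1],
            [1, 0, 1, 0]]               # 2x4 semantic map
predicted = predict(low_level, semantic, toy_conv_net, factor=2)
```

The additive structure means the convolutional network only has to produce a correction on top of the coarse upsampled features, guided by the semantic information, rather than the whole image.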

The predictor 222 provided in the embodiment of the present application can combine high-level semantic information and low-level features of the original image to reconstruct the original image, so that the reconstructed predicted image has higher accuracy.

As an optional implementation of the convolutional network 430, FIG. 4C exemplarily shows a schematic structural diagram of the convolutional network 430 provided by the embodiment of the present application. As shown in FIG. 4C , the convolutional network 430 may include: a first set of convolutional layers 431 , a plurality of convolutional filter banks 432 , a second set of convolutional layers 433 , and an activation function 434 .

Wherein, the first group of convolutional layers 431 may include multiple layers of convolutional layers connected in sequence, and the convolution configuration of each convolutional layer in the first group of convolutional layers 431 may be different.

A convolution filter bank 432 may include multiple layers of convolution filters, and the filtering configuration of each layer of convolution filters in a convolution filter bank 432 may be the same.

The second group of convolutional layers 433 may include multiple layers of convolutional layers connected in sequence, and the convolution configuration of each convolutional layer in the second group of convolutional layers 433 may be different.

In some embodiments, the convolution configuration of a convolutional layer may include one or more of the following items: the number of output filters, the size of the convolution kernel, the convolution step size, whether to set a normalization function, whether to set a linear rectification unit, and so on. Of course, this embodiment of the present application can also support some of the convolutional layers in the first group of convolutional layers 431 and the second group of convolutional layers 433 having the same convolution configuration, or even all of the convolutional layers having the same convolution configuration.

In the embodiment of the present application, the stacked features after stacking the target features and high-level semantic information can be input to the first set of convolutional layers 431, and the first set of convolutional layers 431 can perform convolution processing on the stacked features through multi-layer convolutional layers , to get the first convolution feature. The first set of convolutional layers 431 can output the first convolutional features to multiple convolutional filter banks 432 , and the multiple convolutional filter banks 432 can filter the first convolutional features to obtain filtered features. A plurality of convolutional filter banks 432 can output filtering features to the second set of convolutional layers 433, and the second set of convolutional layers 433 can perform convolution processing on the filtering features through multiple convolutional layers to obtain the second convolutional features . The second set of convolutional layers 433 can output the second convolutional features to the activation function 434 , and the activation function 434 outputs the corresponding convolutional features.

In some further embodiments, the number of convolutional layers set in the first group of convolutional layers 431 and the second group of convolutional layers 433, and the convolution configuration of each convolutional layer, can be determined according to actual conditions, which are not limited by this embodiment of the present application. Similarly, the number of convolution filters set in one convolution filter bank 432 and the filtering configuration of each convolution filter may be determined according to actual conditions, which are not limited by this embodiment of the present application.

As an optional implementation of configuring the convolutional network 430, the embodiment of the present application can define the number of convolutional layers in the first set of convolutional layers 431 and the second set of convolutional layers 433, as well as, for each convolutional layer, the number of output filters, the size of the convolution kernel, the convolution step size, whether to set a normalization function, and whether to set a linear rectification unit, so as to configure the first set of convolutional layers 431 and the second set of convolutional layers 433. As an optional implementation, FIG. 4D exemplarily shows another schematic structural diagram of the predictor 222 provided in the embodiment of the present application. As shown in FIG. 4C and FIG. 4D , the predictor 222 shown in FIG. 4D specifies the structure of the convolutional network 430 in detail.

As shown in FIG. 4D, the first set of convolutional layers 431 may include 4 sequentially connected convolutional layers, namely: Conv60 7x7 s1 Norm ReLU, Conv120 3x3 s2 Norm ReLU, Conv240 3x3 s2 Norm ReLU, Conv480 3x3 s2 Norm ReLU. For the first convolutional layer, Conv60 indicates that the number of output filters of the convolutional layer is 60, 7x7 indicates the size of the convolution kernel of each output filter, s1 indicates that the convolution step size of the output filter is 1 (correspondingly, s2 means that the convolution step size is 2, and so on), Norm means that a normalization function is set (correspondingly, if there is no Norm in the convolution configuration, the normalization function is not set in the convolutional layer), and ReLU means that a linear rectification unit is set (correspondingly, if there is no ReLU in the convolution configuration, the linear rectification unit is not set in the convolutional layer). That is to say, based on the configuration of the first convolutional layer above, the first convolutional layer includes 60 output filters, a normalization function and a linear rectification unit, and each output filter has a convolution kernel size of 7x7 and a convolution step size of 1. The specific configurations of the other convolutional layers in the first group of convolutional layers 431 can be explained in the same way.

The plurality of convolutional filter banks 432 may include nine sequentially connected convolutional filter banks. A convolution filter bank can include 2 layers of convolution filters, each layer of convolution filters can include 480 convolution filters, the convolution kernel size of each convolution filter is 3x3, and the convolution step size is 1 .

The second group of convolutional layers 433 may include 4 sequentially connected convolutional layers, namely: ConvT240 3x3 s2 Norm ReLU, ConvT120 3x3 s2 Norm ReLU, ConvT60 3x3 s2 Norm ReLU, Conv3 7x7 s1. Among them, ConvT indicates deconvolution, and ConvT240 indicates that the number of output filters of the convolutional layer performing deconvolution is 240. It should be noted that, for the last convolutional layer Conv3 7x7 s1 in the second group of convolutional layers 433, since Norm and ReLU do not exist in its convolution configuration, the last convolutional layer is not provided with a normalization function or a linear rectification unit. The configuration meaning of each convolutional layer in the second group of convolutional layers 433 can be explained similarly with reference to the previous description, and will not be expanded here.
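The layer notation used for FIG. 4D can be read mechanically. The sketch below parses strings such as "Conv60 7x7 s1 Norm ReLU" into a configuration record and traces the spatial size through the network. The names `LayerSpec`, `parse_layer`, `output_size` and `trace` are hypothetical, and the padding behaviour is an assumption not stated in the description: a stride-s convolution is taken to map a size n to ceil(n / s), and a stride-s deconvolution to map n to n * s. The filter banks are written without Norm or ReLU because the description does not specify them.

```python
import math
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerSpec:
    transposed: bool   # True for ConvT (deconvolution)
    filters: int       # number of output filters
    kernel: int        # side length of the square convolution kernel
    stride: int        # convolution step size
    norm: bool         # normalization function present
    relu: bool         # linear rectification unit present


_SPEC = re.compile(r"(ConvT|Conv)(\d+) (\d+)x(\d+) s(\d+)( Norm)?( ReLU)?")


def parse_layer(text):
    """Parse one layer description in the notation of FIG. 4D."""
    match = _SPEC.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognized layer description: {text!r}")
    kind, filters, k_h, k_w, stride, norm, relu = match.groups()
    if k_h != k_w:
        raise ValueError("only square kernels are handled in this sketch")
    return LayerSpec(kind == "ConvT", int(filters), int(k_h), int(stride),
                     norm is not None, relu is not None)


def output_size(size, layer):
    """Spatial size after one layer, under the assumed padding."""
    if layer.transposed:
        return size * layer.stride
    return math.ceil(size / layer.stride)


def trace(size, descriptions):
    """Return the spatial size before the first layer and after every layer."""
    sizes = [size]
    for text in descriptions:
        sizes.append(output_size(sizes[-1], parse_layer(text)))
    return sizes


FIRST_GROUP = ["Conv60 7x7 s1 Norm ReLU", "Conv120 3x3 s2 Norm ReLU",
               "Conv240 3x3 s2 Norm ReLU", "Conv480 3x3 s2 Norm ReLU"]
FILTER_BANKS = ["Conv480 3x3 s1"] * (9 * 2)   # 9 banks of 2 layers each
SECOND_GROUP = ["ConvT240 3x3 s2 Norm ReLU", "ConvT120 3x3 s2 Norm ReLU",
                "ConvT60 3x3 s2 Norm ReLU", "Conv3 7x7 s1"]

sizes = trace(256, FIRST_GROUP + FILTER_BANKS + SECOND_GROUP)
```

Under these assumptions, the three stride-2 convolutions reduce the spatial size by a factor of 8, the filter banks keep it unchanged, and the three stride-2 deconvolutions restore it, so the 3-filter output has the same spatial size as the stacked input.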

It should be noted that the specific structure of the convolutional network 430 shown in FIG. 4D is only an optional structure. In the embodiment of the present application, the structure of the convolutional network 430 shown in FIG. 4D can also be modified, adjusted or replaced, as long as the convolutional network 430 can perform convolution processing on the stacked features of the target features and the high-level semantic information, and the resulting convolutional features can express richer image information.

In some further embodiments, the feature extractor 211 of the image coding system 111 may use multiple convolutional layers and an activation function to extract low-level features of the original image. FIG. 5A shows a schematic structural diagram of the feature extractor 211 provided by the embodiment of the present application. As shown in FIG. 5A, the feature extractor 211 may include sequentially connected multi-layer convolutional layers and an activation function. The convolution configurations of the convolutional layers may differ from one another or be partially the same. As an optional implementation, after the original image is input to the feature extractor 211, the multi-layer convolutional layers extract features from the original image, and the extracted features are output as low-level features through the activation function.

In an implementation example, FIG. 5B shows another schematic structural diagram of the feature extractor 211 provided by the embodiment of the present application. With reference to FIG. 5A and FIG. 5B, in the feature extractor 211 shown in FIG. 5B, the multi-layer convolutional layers may include 5 convolutional layers, namely: Conv60 7x7 s1 Norm ReLU, Conv120 3x3 s2 Norm ReLU, Conv240 3x3 s2 Norm ReLU, Conv480 3x3 s2 Norm ReLU, and Conv3 3x3 s1 Norm ReLU. It should be noted that the specific number of convolutional layers and the specific configuration of each convolutional layer shown in FIG. 5B are only an optional example, and are not limited by this embodiment of the present application.

In some further embodiments, the low-level features can compactly express the detailed features of the original image; for example, the low-level features can be compact features of the original image at a set scale. In an implementation example, the low-level features may be compact features with a size of 1/64 of the original image (e.g., the low-level features may be 1/8 of the original image in each of the width and height dimensions).
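The 1/8-per-dimension figure can be checked against the strides of the five layers listed for FIG. 5B. The following is a minimal sketch of that arithmetic, not the applicant's implementation; it assumes each layer is padded by half its kernel size so that a stride-1 layer preserves the spatial size, and the function names `conv_out` and `feature_size` are hypothetical.

```python
def conv_out(size, kernel, stride, pad):
    """Output length of one convolution along a single spatial dimension."""
    return (size + 2 * pad - kernel) // stride + 1


# (kernel, stride) pairs of the five layers listed for FIG. 5B.
LAYERS = [(7, 1), (3, 2), (3, 2), (3, 2), (3, 1)]


def feature_size(height, width):
    """Spatial size of the low-level feature for a given input size."""
    for kernel, stride in LAYERS:
        pad = kernel // 2  # assumption: padding that keeps the size at stride 1
        height = conv_out(height, kernel, stride, pad)
        width = conv_out(width, kernel, stride, pad)
    return height, width


h, w = feature_size(256, 512)
print(h, w)                    # 32 64
print(h * w / (256 * 512))     # 0.015625, i.e. 1/64
```

The three stride-2 layers each halve the width and height, which gives 1/8 per dimension and 1/64 in area.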

As an optional implementation, to enable the low-level features to compactly express the detailed features of the original image while ensuring the feature size of the low-level features, the embodiment of the present application may apply a ReflectionPad (mirror padding) of a first set size before the first convolutional layer of the feature extractor and before the final activation function, for example a ReflectionPad of size 3; and may apply a ReflectionPad of a second set size before the remaining network layers of the feature extractor (such as the remaining convolutional layers), for example a ReflectionPad whose length and width are both 1.

In some embodiments, when the feature extractor 211 extracts low-level features of the original image, this embodiment of the present application may use the ChannelNorm (channel normalization) technology to obtain compact low-level features that express fine details of the image. As an optional implementation, the embodiment of the present application can determine, based on the channel normalization technology, the low-level features of the original image corresponding to each channel, and then determine the low-level features of the original image from these per-channel low-level features. For example, the feature extractor 211 can use the multi-layer convolutional layers, based on the channel normalization technology, to determine the low-level features of the original image corresponding to each channel; the per-channel low-level features can then be combined at the activation function to obtain the low-level features of the original image.

In an implementation example, taking the original image as a two-dimensional image with two-dimensional positions, to calculate the low-level feature of the original image corresponding to the current channel, the embodiment of the present application can determine the unit pixel of the current channel of the original image at a two-dimensional position, the mean value over all channels of the original image at that two-dimensional position, and the mean square error over all channels at that two-dimensional position. The embodiment of the present application can then determine the low-level feature of the original image corresponding to the current channel based on the unit pixel of the current channel at the two-dimensional position, the mean value and mean square error over all channels of the original image at the two-dimensional position, and the set offsets of the current channel.

Taking the total number of channels of the original image as M, and taking the calculation of the low-level feature corresponding to the c-th channel of the original image as an example, the embodiment of the present application may use the following formula to calculate, based on the channel normalization technology, the low-level feature of the original image corresponding to the c-th channel:

$$f'_{chw} = \frac{f_{chw} - \mu_{hw}}{\sigma_{hw}} \, \alpha_c + \beta_c$$

where $f'_{chw}$ denotes the low-level feature corresponding to the c-th channel of the original image, c denotes the c-th channel of the original image and takes values from 1 to M, $f_{chw}$ denotes the unit pixel of the c-th channel of the original image at position (h, w), $\mu_{hw}$ denotes the mean value of the M channels of the original image at position (h, w), $\sigma_{hw}^2$ denotes the mean square error of the M channels of the original image at position (h, w), and $\alpha_c$ and $\beta_c$ denote the learned offsets of the c-th channel.
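As an illustration of the channel normalization described above, the following is a minimal sketch in plain Python. It assumes the common form in which each unit pixel has the per-position mean over channels subtracted, is divided by the square root of the per-position mean square error, and is then scaled and shifted by the two learned per-channel values. The function name `channel_norm`, the nested-list layout `f[c][h][w]` and the small constant `eps` are assumptions made for the example and do not come from the application.

```python
import math


def channel_norm(f, alpha, beta, eps=1e-6):
    """Normalize f[c][h][w] across the channel axis at every (h, w) position.

    alpha and beta hold one learned value per channel; eps is an assumed
    small constant that avoids division by zero.
    """
    M, H, W = len(f), len(f[0]), len(f[0][0])
    out = [[[0.0] * W for _ in range(H)] for _ in range(M)]
    for h in range(H):
        for w in range(W):
            column = [f[c][h][w] for c in range(M)]
            mu = sum(column) / M                            # mean over the M channels
            var = sum((v - mu) ** 2 for v in column) / M    # mean square error
            sigma = math.sqrt(var + eps)
            for c in range(M):
                out[c][h][w] = (f[c][h][w] - mu) / sigma * alpha[c] + beta[c]
    return out


# Three channels, one row, two columns.
features = [[[1.0, 2.0]], [[3.0, 2.0]], [[5.0, 2.0]]]
normalized = channel_norm(features, alpha=[1.0, 1.0, 1.0], beta=[0.0, 0.0, 0.0])
```

With unit scale and zero shift, the values at each position sum to approximately zero across channels, and a position where all channels are equal maps to zero.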

The embodiment of the present application uses the channel normalization technology to determine the low-level features of the original image corresponding to each channel, and then combines the per-channel low-level features to obtain the low-level features of the original image. In this way, the size of the low-level features can be significantly reduced while the low-level features still retain rich detail information of the original image. Further, the embodiment of the present application can reduce the encoding and decoding overhead of the low-level features, as well as the overhead of transmitting the second code stream corresponding to the low-level features between the sending end and the receiving end.

In some further embodiments, when extracting the high-level semantic information, the image coding system 111 may extract corresponding high-level semantic information based on the type of the image analysis task. The image coding system 111 may set a semantic extractor corresponding to the type of the image analysis task, so as to extract high-level semantic information corresponding to the type of the image analysis task from the original image. As an optional implementation, one high-level semantic information can support at least one type of image analysis tasks, that is, one high-level semantic information can support one type or multiple types of image analysis tasks at the same time. For example, high-level semantic information can include any one of instance segmentation map information supporting instance segmentation tasks, image classification tasks, and object detection tasks, and stick figure information supporting human body pose recognition tasks.

FIG. 6 shows another schematic structural diagram of the image encoding and decoding system provided by the embodiment of the present application. As shown in FIG. 6 , the semantic extractor 210 set in the image coding system 111 may be an instance segmenter 610 , and the instance segmenter 610 may extract instance segmentation map information from an original image based on an instance segmentation technique. In the embodiment of the present application, the instance segmentation map information may be used as high-level semantic information of the original image.

The instance segmentation map information can be imported into the predictor 222 and the FLIF encoder 620, where the FLIF encoder 620 can be the same lossless encoder used by the first encoder 212 and the second encoder 213. The FLIF encoder 620 can perform lossless encoding on the instance segmentation map information to generate a first code stream. The first code stream can be transmitted to the image decoding system 121.

The feature extractor 211 extracts low-level features of the original image, and the low-level features can be imported into the predictor 222 and the FLIF encoder 620. The FLIF encoder 620 can perform lossless encoding on the low-level features to generate a second code stream. The second code stream can be transmitted to the image decoding system 121.

In the image coding system 111, the predictor 222 can perform image reconstruction according to instance segmentation map information and low-level features to obtain a predicted image. The subtracter 630 may determine residual information of the predicted image and the original image. The residual information can be imported into the VVC encoder 640 . The VVC encoder 640 can perform lossy encoding on the residual information to generate a third code stream. The third code stream can be transmitted to the image decoding system 121 . Wherein, the subtractor 630 may be an optional form of the comparator 214 , and the VVC encoder 640 may be a lossy encoder used by the third encoder 215 .

In the image decoding system 121, the FLIF decoder 650 can decode the first code stream and the second code stream to obtain instance segmentation map information and low-level features. The FLIF decoder 650 may be the same lossless decoder used by the first decoder 220 and the second decoder 221 . The instance segmentation map information obtained by the FLIF decoder 650 can be imported into instance segmentation logic, image classification logic, and object detection logic to implement image analysis tasks such as instance segmentation, image classification, and object detection of the original image. Meanwhile, the instance segmentation map information and low-level features obtained by the FLIF decoder 650 can be imported into the predictor 222 in the image decoding system 121 . The predictor 222 may perform image reconstruction on the original image according to the instance segmentation map information and low-level features to obtain a predicted image. The predicted image obtained by the predictor 222 can be imported into the image enhancer 224 .

In the image decoding system 121, the VVC decoder 660 can decode the third code stream to obtain residual information of the predicted image and the original image. The VVC decoder 660 may be a lossy decoder used by the third decoder 223 . The residual information may be introduced into the image enhancer 224 . The image enhancer 224 may perform image enhancement processing on the predicted image according to the residual information to obtain an enhanced image. The enhanced image can be displayed to the user to meet the user's vision-oriented image reconstruction requirements.
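The data flow of FIG. 6 (lossless first and second code streams, a predicted image, and a lossily coded residual used for enhancement) can be sketched as follows. This is a toy illustration under stated assumptions, not the described codecs: `zlib` stands in for the FLIF lossless codec, uniform quantization with the assumed step `STEP` stands in for the lossy VVC codec, and the per-segment mean used as the "low-level feature", together with the `predict` function, is hypothetical.

```python
import zlib

STEP = 8  # assumed quantization step of the stand-in lossy residual codec


def lossless_encode(values):
    """Stand-in for the lossless (FLIF) encoder: zlib over raw bytes."""
    return zlib.compress(bytes(values))


def lossless_decode(stream):
    """Stand-in for the lossless (FLIF) decoder."""
    return list(zlib.decompress(stream))


def predict(semantic, low_level):
    """Hypothetical predictor: paint each pixel with its segment's feature."""
    return [low_level[label] for label in semantic]


def encode(original, semantic):
    """Return the first, second and third code streams for a 1-D toy image."""
    low_level = []
    for label in range(max(semantic) + 1):  # assumes every label occurs
        pixels = [p for p, s in zip(original, semantic) if s == label]
        low_level.append(round(sum(pixels) / len(pixels)))
    predicted = predict(semantic, low_level)
    residual = [o - p for o, p in zip(original, predicted)]   # subtractor
    quantized = [round(r / STEP) + 128 for r in residual]     # lossy step
    return (lossless_encode(semantic),        # first code stream
            lossless_encode(low_level),       # second code stream
            zlib.compress(bytes(quantized)))  # third code stream


def decode(first, second, third):
    """Return the semantic information and the enhanced image."""
    semantic = lossless_decode(first)
    low_level = lossless_decode(second)
    predicted = predict(semantic, low_level)
    residual = [(q - 128) * STEP for q in zlib.decompress(third)]
    enhanced = [min(255, max(0, p + r)) for p, r in zip(predicted, residual)]
    return semantic, enhanced


original = [10, 12, 11, 200, 210, 190]
segments = [0, 0, 0, 1, 1, 1]
semantic, enhanced = decode(*encode(original, segments))
print(enhanced)  # [11, 11, 11, 200, 208, 192]
```

In this sketch the semantic information survives exactly, so an analysis task can use it without decoding the other two streams, while the enhanced image differs from the original by at most half the quantization step.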

FIG. 6 uses the instance segmentation map information as high-level semantic information, and specifically illustrates the implementation process of the image encoding system and the image decoding system provided by the embodiment of the present application in the image reconstruction task and the instance segmentation task. It can be understood that, on the basis of the image coding system shown in FIG. 2B and the image decoding system shown in FIG. 2C , the embodiment of the present application can also use instance segmentation map information as high-level semantic information.

In other possible implementations, the high-level semantic information of the original image may also be stick figure information of the original image, which is used to support the image analysis task of human body pose recognition. In this case, the embodiment of the present application may replace the instance segmenter 610 in the architecture shown in FIG. 6 with a semantic extractor that supports the extraction of stick figure information, and replace the instance segmentation map information with the stick figure information. For the other processes, reference may be made in the same way to the description of FIG. 6. Of course, the embodiment of the present application can also use stick figure information as high-level semantic information on the basis of the image coding system shown in FIG. 2B and the image decoding system shown in FIG. 2C.

In some further embodiments, the image coding system 111 may set multiple semantic extractors 210 to extract multiple pieces of high-level semantic information from the original image, so as to support different types of image analysis tasks. FIG. 7 shows another schematic structural diagram of the image encoding and decoding system provided by the embodiment of the present application. With reference to FIG. 6 and FIG. 7, the image encoding system 111 shown in FIG. 7 may be provided with multiple semantic extractors (for example, the n semantic extractors 2101 to 210n shown in FIG. 7).

Multiple semantic extractors are used to extract high-level semantic information from the original image one by one to obtain multiple high-level semantic information (for example, high-level semantic information 1 to n). For example, the semantic extractor 2101 can extract the high-level semantic information 1 of the original image, and so on, the semantic extractor 210n can extract the high-level semantic information n of the original image. A high-level semantic information extracted by a semantic extractor can be used to support at least one type of image analysis task of the original image, for example, any high-level semantic information in the high-level semantic information 1 to n can support one or more types of image analysis of the original image Task.

As an implementation example, the plurality of high-level semantic information may include instance segmentation map information and stick figure information, wherein the instance segmentation map information may support one or more image analysis tasks of the original image such as instance segmentation, image classification, and object detection, and the stick figure information may support the human body pose recognition task of the original image.

After multiple semantic extractors extract multiple high-level semantic information from the original image, the FLIF encoder 620 may perform lossless encoding on the multiple high-level semantic information to obtain multiple first code streams. For example, after the semantic extractors 2101 to 210n extract high-level semantic information from the original image one by one to obtain the high-level semantic information 1 to n, the FLIF encoder 620 can perform lossless encoding on the high-level semantic information 1 to n respectively to generate the first code stream 1 to n.

In the image coding system 111, one piece of high-level semantic information among the multiple pieces of high-level semantic information (for example, one of the instance segmentation map information, the stick figure information, etc.) can be imported into the predictor 222 (that is, one of the high-level semantic information 1 to n can be imported into the predictor 222), and the low-level features of the original image extracted by the feature extractor 211 can also be imported into the predictor 222. Therefore, the predictor 222 can perform image reconstruction according to the low-level features and one piece of high-level semantic information among the multiple pieces of high-level semantic information, so as to obtain a predicted image. For the other implementation processes on the image coding system 111 side, reference may similarly be made to the description of the corresponding part above, and no further description is given here.

In some embodiments, the predictor 222 may also perform image reconstruction according to the low-level features and at least two pieces of high-level semantic information among the multiple pieces of high-level semantic information, so as to obtain a predicted image. For example, when performing image reconstruction according to the low-level features and the instance segmentation map information, the predictor 222 may further introduce at least one other piece of high-level semantic information, so as to make the reconstructed predicted image more accurate. As an optional implementation, the embodiment of the present application can import into the predictor 222 the one piece, or the at least two pieces, of high-level semantic information with the highest semantic level among the multiple pieces of high-level semantic information; the predictor 222 can thus use the one or more pieces of high-level semantic information with the highest semantic level for image reconstruction to obtain a more accurate predicted image. In other possible implementations, the embodiment of the present application can also combine any at least two pieces of high-level semantic information among the multiple pieces of high-level semantic information with the low-level features for image reconstruction of the original image, without limiting the semantic level of the high-level semantic information used for image reconstruction.

In the image coding system 111 shown in FIG. 7, the plurality of first code streams generated by encoding the high-level semantic information 1 to n, the second code stream generated by encoding the low-level features, and the third code stream generated by encoding the residual information of the predicted image and the original image can be transmitted to the image decoding system 121.

In the image decoding system 121, for the image reconstruction task, the FLIF decoder 650 can decode the second code stream and any one of the multiple first code streams, so as to obtain one piece of high-level semantic information and the low-level features. The high-level semantic information and the low-level features can be imported into the predictor 222, and the predictor 222 can perform image reconstruction according to the high-level semantic information and the low-level features to obtain a predicted image. The image enhancer 224 can then use the residual information obtained by decoding the third code stream to perform image enhancement on the predicted image, so as to obtain an enhanced image and complete the image reconstruction task.

For the image analysis task, the FLIF decoder 650 can select, based on the current image analysis task, a corresponding first code stream from the multiple first code streams for decoding, so as to obtain high-level semantic information suitable for the current image analysis task and support the execution of the current image analysis task. As an optional implementation, the receiving end 120 can obtain a task instruction given by the user, and the task instruction can indicate the current image analysis task that the receiving end needs to perform, so that the image decoding system 121 can select, from the multiple first code streams, the first code stream corresponding to the high-level semantic information used by the current image analysis task, and decode it to obtain the high-level semantic information suitable for the current image analysis task. In some further embodiments, the task instruction may indicate multiple current image analysis tasks, so that the image decoding system 121 may select, from the multiple first code streams, a first code stream corresponding to each current image analysis task for decoding, in order to obtain high-level semantic information suitable for each current image analysis task. In an example, if the current image analysis task is any one of instance segmentation, image classification, object detection, etc., the embodiment of the present application can decode the first code stream corresponding to the instance segmentation map information; if the current image analysis task is human body pose recognition, the embodiment of the present application can decode the first code stream corresponding to the stick figure information. The current image analysis tasks can be of one type or of multiple types, as determined by user requirements or system settings.

Based on multiple types of current image analysis tasks, the FLIF decoder 650 can select at least two first code streams from multiple first code streams to decode, so as to obtain high-level semantic information applicable to multiple types of current image analysis tasks . In some further embodiments, the FLIF decoder 650 can also decode multiple first code streams, so as to simultaneously support multiple types of image analysis tasks.
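The selection of first code streams for one or more current image analysis tasks can be sketched as a simple lookup. The task names, the semantic-information names and the function `select_first_streams` below are hypothetical identifiers chosen for illustration; only the task-to-semantic-information pairing follows the example given in the text.

```python
# Hypothetical task names and semantic-information names, for illustration only.
TASK_TO_SEMANTIC = {
    "instance_segmentation": "instance_segmentation_map",
    "image_classification": "instance_segmentation_map",
    "object_detection": "instance_segmentation_map",
    "human_pose_recognition": "stick_figure",
}


def select_first_streams(tasks, first_streams):
    """Return the first code streams needed by the requested tasks.

    first_streams maps a semantic-information name to its encoded bytes.
    A stream shared by several tasks is selected only once.
    """
    selected = {}
    for task in tasks:
        name = TASK_TO_SEMANTIC[task]
        selected[name] = first_streams[name]
    return selected


streams = {"instance_segmentation_map": b"\x01", "stick_figure": b"\x02"}
one = select_first_streams(["object_detection", "image_classification"], streams)
two = select_first_streams(["object_detection", "human_pose_recognition"], streams)
print(list(one))  # ['instance_segmentation_map']
print(list(two))  # ['instance_segmentation_map', 'stick_figure']
```

Because several tasks can share one piece of high-level semantic information, the number of streams to decode can be smaller than the number of requested tasks.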

The image coding system and the image decoding system provided by the embodiments of the present application can extract multiple pieces of high-level semantic information from the original image through multiple semantic extractors, so as to support multiple types of image analysis tasks on the original image; and any piece of high-level semantic information among them can be combined with the low-level features of the original image to support the image reconstruction task of the original image. It can be seen that, in the embodiment of the present application, a single image encoding system and its corresponding image decoding system can be used to fuse the image reconstruction task with multiple types of image analysis tasks.

Aiming at the image analysis level of the receiving end, the embodiment of the present application provides an image analysis method. As an optional implementation, FIG. 8 exemplarily shows a flow chart of the image analysis method provided by the embodiment of the present application. The method flow can be implemented by the receiver. As shown in FIG. 8 , the method flow may include the following steps.

In step S80, based on the image analysis task of the original image, a target code stream is obtained from multiple code streams.

Based on the image coding scheme provided by the embodiment of the present application, the sending end can transmit multiple code streams to the receiving end. In some embodiments, the multiple code streams may include a first code stream corresponding to high-level semantic information of the original image, and a second code stream corresponding to low-level features of the original image. In some further embodiments, the plurality of code streams may further include a third code stream corresponding to difference information between the predicted image and the original image. Wherein, the number of first code streams in the plurality of code streams may be at least one (that is, one or more first code streams), and the at least one first code stream may correspond to at least one high-level semantic information of the original image, and One high-level semantic information corresponds to one first code stream.

After obtaining the multiple code streams, the receiving end can obtain, from the multiple code streams, a target code stream suitable for the image analysis task, based on the image analysis task currently to be performed on the original image. Since, in the embodiment of the present application, the high-level semantic information carried by the first code stream is used for the image analysis task of the original image, the embodiment of the present application can specifically obtain the target code stream from the at least one first code stream among the multiple code streams. The target code stream may be the first code stream, among the at least one first code stream, whose high-level semantic information is applicable to the image analysis task.

In some embodiments, if there are at least two first code streams among the multiple code streams, the embodiment of the present application may obtain, from the at least two first code streams and based on the image analysis task currently to be performed, the target code stream applicable to the image analysis task. For example, the at least two first code streams include the first code stream corresponding to the instance segmentation map information and the first code stream corresponding to the stick figure information. If the image analysis task currently to be performed is any one of instance segmentation, image classification, and object detection, the embodiment of the present application can obtain the first code stream corresponding to the instance segmentation map information from the at least two first code streams as the target code stream; if the image analysis task currently to be performed is human body pose recognition, the embodiment of the present application may obtain the first code stream corresponding to the stick figure information from the at least two first code streams as the target code stream.

In some embodiments, if there is only one first code stream among the multiple code streams, the image analysis task currently to be performed may be a preset fixed image analysis task, such as instance segmentation, image classification, object detection, Any task in human gesture recognition is set as an image analysis task fixedly performed by the receiving end. In this embodiment of the present application, only one first code stream among the multiple code streams can be used as the target code stream.

In step S81, the target code stream is decoded to obtain high-level semantic information suitable for image analysis tasks.

In some embodiments, the embodiments of the present application may perform lossless decoding on the target code stream to obtain high-level semantic information suitable for image analysis tasks.

In step S82, an image analysis task is performed according to the decoded high-level semantic information.

In some embodiments, based on the decoded high-level semantic information, the receiver can perform image analysis tasks on the original image through specific image analysis logic (such as image analysis logic that performs tasks such as image classification, object detection, and instance segmentation).

In some embodiments, steps S80 to S81 can be implemented by an image decoding system at the receiving end, and step S82 can be implemented by an image analysis logic configured at the receiving end to perform an image analysis task.
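Steps S80 to S82 can be sketched end to end as follows. This is an illustrative outline rather than the described system: `zlib` plus JSON stands in for the lossless codec of the first code stream, and the function `analyze`, the dictionary layouts and the two toy analysis functions are assumptions made for the example.

```python
import json
import zlib
from collections import Counter


def analyze(task, first_streams, stream_for_task, analyzers):
    """Run one image analysis task from encoded first code streams."""
    # S80: obtain the target code stream for the task.
    if len(first_streams) == 1:
        # Only one first code stream: the task is the fixed, preset one.
        target = next(iter(first_streams.values()))
    else:
        target = first_streams[stream_for_task[task]]
    # S81: losslessly decode the target code stream (zlib + JSON stand-in).
    semantic = json.loads(zlib.decompress(target))
    # S82: perform the task with the decoded high-level semantic information.
    return analyzers[task](semantic)


# Toy instance segmentation information: one entry per detected instance.
instances = [
    {"label": "car", "box": [0, 0, 10, 10]},
    {"label": "car", "box": [20, 5, 30, 15]},
    {"label": "person", "box": [40, 0, 45, 20]},
]
first_streams = {
    "instance_segmentation_map": zlib.compress(json.dumps(instances).encode()),
    "stick_figure": zlib.compress(json.dumps([]).encode()),
}
stream_for_task = {
    "object_detection": "instance_segmentation_map",
    "image_classification": "instance_segmentation_map",
    "human_pose_recognition": "stick_figure",
}
analyzers = {
    "object_detection": lambda info: [item["box"] for item in info],
    "image_classification": lambda info: Counter(
        item["label"] for item in info).most_common(1)[0][0],
}

boxes = analyze("object_detection", first_streams, stream_for_task, analyzers)
label = analyze("image_classification", first_streams, stream_for_task, analyzers)
print(len(boxes), label)  # 3 car
```

Neither task touches the second or third code stream, which reflects that analysis can proceed from the high-level semantic information alone.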

The image analysis method provided by the embodiment of the present application can perform specific image analysis tasks under the encoding and decoding scheme and framework that fuses the image reconstruction task and the image analysis task, thereby providing technical support for the image analysis task at the receiving end, and can be applied to different types of image analysis tasks.

Compared with other image encoding and decoding schemes such as VTM (VVC Test Model), the image encoding and decoding scheme provided by the embodiment of the present application has higher image reconstruction quality and better image analysis quality. In the VCM (Video Coding for Machines) standard test environment, the image encoding and decoding scheme provided by the embodiment of the present application was compared with the VTM scheme, yielding the effect comparison diagrams shown in FIG. 9A, FIG. 9B, FIG. 9C and FIG. 9D. With the same images and the same QP (quantization parameter) used for quantization, the comparison between the image encoding and decoding scheme provided by the embodiment of the present application and the VTM scheme proceeds as follows.

Comparison of image reconstruction quality: the scheme provided by the embodiment of the present application and the VTM scheme are each used to perform image compression and reconstruction, to obtain the SSIM (Structural Similarity) and the corresponding BPP (Bits Per Pixel), and the PSNR (Peak Signal to Noise Ratio) and the corresponding BPP. The SSIM and the corresponding BPP are used to draw an effect comparison diagram between the scheme provided by the embodiment of the present application and the VTM scheme, as shown in FIG. 9A. The PSNR and the corresponding BPP are used to draw an effect comparison diagram between the scheme provided by the embodiment of the present application and the VTM scheme, as shown in FIG. 9B.
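For reference, the two simplest of these metrics can be computed as in the sketch below, using the standard definitions of bits per pixel and PSNR. Counting the bytes of all three code streams toward the BPP is an assumption about how the rate is measured, and the function names are hypothetical; SSIM is omitted for brevity.

```python
import math


def bits_per_pixel(streams, height, width):
    """Total bits of all transmitted code streams divided by the pixel count."""
    return 8 * sum(len(s) for s in streams) / (height * width)


def psnr(original, reconstructed, peak=255):
    """Peak signal-to-noise ratio in dB for two equal-length pixel sequences."""
    mse = sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / len(original)
    return math.inf if mse == 0 else 10 * math.log10(peak ** 2 / mse)


# Assumed stream sizes in bytes for a 64x64 image: 100 + 200 + 724 = 1024 bytes.
streams = [bytes(100), bytes(200), bytes(724)]
print(bits_per_pixel(streams, 64, 64))  # 2.0
print(psnr([10, 20], [11, 19]))         # about 48.13 dB (MSE of 1)
```

Plotting such (BPP, PSNR) pairs for several QP values gives a rate-distortion curve of the kind described for FIG. 9B.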

Comparison of performance on machine-vision image analysis tasks: taking the object detection task as the image analysis task and mAP (mean average precision, a common indicator for object detection tasks) as the performance index, the performance of the object detection task performed using the embodiment of the present application is compared with the performance of the object detection task performed on images compressed by VTM, to obtain the mAP and the corresponding BPP of each scheme; the mAP and the corresponding BPP are used to draw an effect comparison diagram between the scheme provided by the embodiment of the present application and the VTM scheme, as shown in FIG. 9C. Likewise, taking the instance segmentation task as the image analysis task and mAP as the performance index, the performance of the instance segmentation task performed using the embodiment of the present application is compared with the performance of the instance segmentation task performed on images compressed by VTM, to obtain the mAP and the corresponding BPP of each scheme; the mAP and the corresponding BPP are used to draw an effect comparison diagram between the scheme provided by the embodiment of the present application and the VTM scheme, as shown in FIG. 9D.

It can be seen from FIG. 9A, FIG. 9B, FIG. 9C and FIG. 9D that, at the same BPP, the image encoding and decoding scheme provided by the embodiment of the present application achieves higher image reconstruction quality than the VTM scheme of VVC, which is currently among the better-performing schemes (for example, the scheme provided by the embodiment of the present application is 4 dB higher than VTM in PSNR); and, at the same BPP, the image encoding and decoding scheme provided by the embodiment of the present application also achieves better image analysis quality than VTM (for example, in terms of mAP, the scheme provided by the embodiment of the present application surpasses VTM across the board).

The image encoding and decoding scheme provided in the embodiment of the present application proposes a reasonable and efficient encoding and decoding framework, which can effectively deal with the fusion problem of VCM in user vision-oriented image reconstruction tasks and machine vision-oriented image analysis tasks.

In some application examples, the image encoding and decoding solutions provided by the embodiments of the present application can effectively fuse image reconstruction tasks and image analysis tasks for images generated by intelligent hardware. It is understandable that, with the rise of smart cities and deep learning, smart hardware generates massive numbers of images every day to better meet various information interaction needs. The images generated by smart hardware can be presented to users for viewing, for purposes such as image and video surveillance, and can also be passed to machine vision analysis systems for corresponding analysis and decision-making tasks (for example, license plate recognition and road planning in intelligent transportation systems; object detection and road tracking in automatic driving systems; face and expression detection and analysis, and abnormal behavior detection, in smart medical systems). Therefore, a set of encoding and decoding schemes and frameworks is needed to fuse image reconstruction tasks and image analysis tasks for images generated by intelligent hardware.

Taking the case where the intelligent hardware is a camera in an intelligent transportation system as an example, an application example of the image encoding and decoding solution provided by the embodiment of the present application is introduced below. Fig. 10 shows an application example diagram provided by the embodiment of the present application. As shown in Fig. 10, the camera 910 is a piece of intelligent hardware capable of collecting video, and can collect traffic video images. The traffic video images collected by the camera 910 can be encoded by the image encoding system 111 provided in the embodiment of the present application, so as to output the first code stream, the second code stream and the third code stream to the traffic command center 920.

The traffic command center 920 can use the image decoding system 121 provided by the embodiment of the present application to perform decoding processing, so as to reconstruct the traffic video image and output the high-level semantic information of the traffic video image. The traffic video image reconstructed by the image decoding system 121 can be displayed on the monitoring screen of the traffic command center 920 to realize traffic video monitoring. The high-level semantic information of the traffic video image output by the image decoding system 121 can be imported into the license plate recognition system of the traffic command center 920 to realize the license plate recognition of vehicles in the traffic video image.
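The flow from the camera side to the command-center side can be illustrated with a toy round trip. This is only a sketch under stated assumptions: the "semantic extractor" is a simple threshold mask, the "feature extractor" is 2x2 average pooling, the "predictor" is nearest-neighbour upsampling guided by the mask, `zlib` stands in for the lossless coders, and uniform quantization stands in for the lossy residual coder. None of these stand-ins, nor the names `encode`, `decode` and `STEP`, come from the application, which uses learned components. Image sides are assumed to be even.

```python
import zlib

STEP = 4  # quantization step of the lossy residual coder (assumed value)

def extract_semantics(img):
    # Stand-in for a segmentation network: a binary foreground mask.
    return [[1 if p >= 128 else 0 for p in row] for row in img]

def extract_low_level(img):
    # Stand-in for a learned feature extractor: 2x2 average pooling.
    return [[(img[y][x] + img[y][x + 1] + img[y + 1][x] + img[y + 1][x + 1]) // 4
             for x in range(0, len(img[0]), 2)]
            for y in range(0, len(img), 2)]

def predict(low, sem):
    # Upsample the low-level features, then let the mask push each pixel
    # to the side of the threshold that the semantics say it belongs to.
    out = []
    for y, sem_row in enumerate(sem):
        row = []
        for x, s in enumerate(sem_row):
            p = low[y // 2][x // 2]
            row.append(max(p, 128) if s else min(p, 127))
        out.append(row)
    return out

def pack(rows):
    # Lossless coding of a 2-D map of byte values.
    return zlib.compress(bytes(v for row in rows for v in row))

def unpack(stream, width):
    flat = zlib.decompress(stream)
    return [list(flat[i:i + width]) for i in range(0, len(flat), width)]

def encode(img):
    sem, low = extract_semantics(img), extract_low_level(img)
    pred = predict(low, sem)
    # Quantized residual, offset by 128 so that every value fits in one byte.
    res = [[round((o - p) / STEP) + 128 for o, p in zip(orow, prow)]
           for orow, prow in zip(img, pred)]
    return pack(sem), pack(low), pack(res)  # first, second, third code stream

def decode(first, second, third, width):
    sem = unpack(first, width)            # enough on its own for analysis
    low = unpack(second, width // 2)
    pred = predict(low, sem)              # predicted image
    res = unpack(third, width)
    enhanced = [[min(255, max(0, p + (r - 128) * STEP)) for p, r in zip(prow, rrow)]
                for prow, rrow in zip(pred, res)]
    return sem, enhanced

image = [[10, 20, 200, 210],
         [30, 40, 220, 230],
         [15, 140, 90, 100],
         [25, 35, 110, 250]]
streams = encode(image)
semantics, shown = decode(*streams, width=4)
max_err = max(abs(a - b) for ra, rb in zip(image, shown) for a, b in zip(ra, rb))
print(semantics, max_err)
```

In this sketch the semantic map is recovered exactly, because its stream is coded losslessly, while the displayed image differs from the original by at most half a quantization step per pixel.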

Using the image encoding and decoding scheme provided by the embodiment of the present application, traffic video images can be reconstructed in the intelligent transportation system scenario to meet video surveillance needs, and at the same time image analysis can be performed on the traffic video images to meet analysis needs such as vehicle license plate recognition. Of course, Fig. 10 is only an optional application example of the image encoding and decoding scheme provided by the embodiment of the present application, and the embodiment of the present application can be applied in any scenario requiring image reconstruction tasks and image analysis tasks.
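Since analysis relies only on the high-level semantic information, a receiver that performs analysis alone can fetch and decode just the first code stream that suits its task, as the image analysis method in the claims describes. The following is a hypothetical sketch of that selection step. The task names, the dictionary keys and the use of `zlib` as the decoder are assumptions made for illustration. The pairing of tasks with semantic types follows the claims, where instance segmentation maps serve instance segmentation, image classification and object detection, and stick figure maps serve human pose recognition.

```python
import zlib

# Hypothetical mapping from an analysis task to the semantic information it needs.
TASK_TO_SEMANTICS = {
    "instance_segmentation": "instance_map",
    "image_classification": "instance_map",
    "object_detection": "instance_map",
    "pose_recognition": "stick_figure",
}

def select_target_stream(task, first_streams):
    """Pick the first code stream whose semantic information suits the task."""
    return first_streams[TASK_TO_SEMANTICS[task]]

def analyse(task, first_streams):
    # Only the target stream is decoded; the other streams are left untouched.
    payload = zlib.decompress(select_target_stream(task, first_streams))
    return task, payload  # a real system would pass `payload` to the analysis model

# Placeholder payloads standing in for encoded semantic maps.
first_streams = {
    "instance_map": zlib.compress(b"instance-map-bytes"),
    "stick_figure": zlib.compress(b"stick-figure-bytes"),
}
print(analyse("pose_recognition", first_streams))
```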

The embodiment of the present application provides image coding that is compatible with both user vision and machine vision tasks. The encoding and decoding scheme provided by the embodiment of the present application can be widely used in the related processing of image data in smart city systems, thereby effectively improving the compression efficiency of image data, reducing the burden on network bandwidth, reducing the workload of cloud services, reducing the storage consumption of image data, etc., and thus reducing the operating cost of smart cities.

The embodiment of the present application also provides an electronic device, such as the sending end 110 or the receiving end 120. Fig. 11 shows a block diagram of an electronic device provided by an embodiment of the present application. As shown in Fig. 11, the electronic device may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4.

In the embodiment of the present application, the number of each of the processor 1, the communication interface 2, the memory 3 and the communication bus 4 is at least one, and the processor 1, the communication interface 2 and the memory 3 communicate with each other through the communication bus 4.

Optionally, the communication interface 2 may be an interface of a communication module for network communication.

Optionally, the processor 1 may be a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), an NPU (embedded neural network processor), an FPGA (Field Programmable Gate Array), a TPU (Tensor Processing Unit), an AI chip, an ASIC (Application Specific Integrated Circuit), or one or more integrated circuits configured to implement the embodiments of the present application, etc.

The memory 3 may include a high-speed RAM, and may also include a non-volatile memory, such as at least one magnetic disk memory.

The memory 3 stores one or more computer-executable instructions, and the processor 1 invokes the one or more computer-executable instructions to execute the image encoding method provided by the embodiment of the present application, or the image decoding method provided by the embodiment of the present application, or the image reconstruction method provided by the embodiment of the present application, or the image analysis method provided by the embodiment of the present application.

The embodiment of the present application also provides a storage medium, which can store one or more computer-executable instructions. When the one or more computer-executable instructions are executed, the image encoding method provided by the embodiment of the present application is implemented, or the image decoding method provided by the embodiment of the present application, or the image reconstruction method provided by the embodiment of the present application, or the image analysis method provided by the embodiment of the present application.

The above describes multiple embodiment solutions provided by the present application. Where no conflict arises, the optional modes introduced in the embodiment solutions can be combined and cross-referenced with one another, thereby extending to a variety of possible embodiment solutions, all of which can be regarded as embodiment solutions disclosed by the present application.

Although the embodiments of the present application are disclosed above, the present application is not limited thereto. Any person skilled in the art can make various changes and modifications without departing from the spirit and scope of the present application. Therefore, the protection scope of the present application should be based on the scope defined in the claims.

Claims

1. An image coding method, comprising:

acquiring an original image;

extracting high-level semantic information of the original image, the high-level semantic information being used for an image analysis task of the original image;

extracting low-level features of the original image, the low-level features and the high-level semantic information being used for an image reconstruction task of the original image;

encoding the high-level semantic information to generate a first code stream; and

encoding the low-level features to generate a second code stream.
2. The image coding method according to claim 1, further comprising:

performing image reconstruction on the original image according to the low-level features and the high-level semantic information to obtain a predicted image;

determining difference information between the predicted image and the original image, the difference information being used to enhance the predicted image in the image reconstruction task; and

encoding the difference information to generate a third code stream.
3. The image coding method according to claim 2, wherein performing image reconstruction on the original image according to the low-level features and the high-level semantic information to obtain a predicted image comprises:

using the high-level semantic information as guidance information for image reconstruction, and reconstructing the specific image details expressed by the low-level features to obtain the predicted image.
4. The image coding method according to claim 3, wherein using the high-level semantic information as guidance information for image reconstruction, and reconstructing the specific image details expressed by the low-level features to obtain the predicted image, comprises:

processing the low-level features to obtain target features for further processing with the high-level semantic information;

stacking the target features and the high-level semantic information and then performing convolution processing to obtain convolution features; and

combining the convolution features with the low-level features to obtain the predicted image.
5. The image coding method according to claim 4, wherein processing the low-level features comprises: upsampling the low-level features;

stacking the target features and the high-level semantic information and then performing convolution processing to obtain convolution features comprises: stacking the target features and the high-level semantic information to obtain stacked features; performing multiple convolution processing on the stacked features to obtain a first convolution feature; performing multiple filtering processing on the first convolution feature to obtain a filtering feature; performing multiple convolution processing on the filtering feature to obtain a second convolution feature; and outputting the second convolution feature as the convolution features via an activation function; and

combining the convolution features with the low-level features to obtain the predicted image comprises: adding the convolution features to the low-level features to obtain the predicted image.
6. The image coding method according to claim 2, wherein determining the difference information between the predicted image and the original image comprises: performing residual processing on the predicted image and the original image to obtain residual information between the predicted image and the original image;

encoding the difference information to generate the third code stream comprises: performing lossy encoding on the residual information to generate the third code stream;

encoding the high-level semantic information to generate the first code stream comprises: performing lossless encoding on the high-level semantic information to generate the first code stream; and

encoding the low-level features to generate the second code stream comprises: performing lossless encoding on the low-level features to generate the second code stream.
7. The image coding method according to claim 2, wherein extracting the low-level features of the original image comprises:

determining, based on a channel normalization technique, the low-level features of the original image corresponding to each channel; and

determining the low-level features of the original image according to the low-level features of the original image corresponding to each channel.
8. The image coding method according to any one of claims 2-7, wherein extracting the high-level semantic information of the original image comprises: extracting multiple pieces of high-level semantic information of the original image, each piece of high-level semantic information being used for one or more types of image analysis tasks of the original image;

wherein the multiple pieces of high-level semantic information are encoded separately to obtain multiple first code streams; and the low-level features are combined with one piece, or at least two pieces, of the multiple pieces of high-level semantic information to perform image reconstruction on the original image.
9. The image coding method according to claim 2, wherein the high-level semantic information comprises instance segmentation map information or stick figure map information; the instance segmentation map information is used for at least one of an instance segmentation task, an image classification task and an object detection task of the original image, and the stick figure map information is used for a human body pose recognition task of the original image.
10. An image coding system, comprising:

a semantic extractor, configured to extract high-level semantic information of an original image, the high-level semantic information being used for an image analysis task of the original image;

a feature extractor, configured to extract low-level features of the original image, the low-level features and the high-level semantic information being used for an image reconstruction task of the original image;

a first encoder, configured to encode the high-level semantic information to generate a first code stream; and

a second encoder, configured to encode the low-level features to generate a second code stream.
11. The image coding system according to claim 10, further comprising:

a predictor, configured to perform image reconstruction on the original image according to the low-level features and the high-level semantic information, so as to obtain a predicted image;

a comparator, configured to determine difference information between the predicted image and the original image, the difference information being used to enhance the predicted image in the image reconstruction task; and

a third encoder, configured to encode the difference information to generate a third code stream.
12. The image coding system according to claim 11, wherein the predictor comprises:

an upsampler, configured to upsample the low-level features to obtain target features for further processing with the high-level semantic information;

a stacker, configured to stack the high-level semantic information and the target features to obtain stacked features;

a convolutional network, configured to perform convolution processing on the stacked features to obtain convolution features; and

an adder, configured to add the low-level features and the convolution features to obtain the predicted image;

wherein the convolutional network comprises: a first set of convolutional layers, a plurality of convolutional filter banks, a second set of convolutional layers, and an activation function; the first set of convolutional layers comprises multiple sequentially connected convolutional layers, each convolutional filter bank comprises multiple layers of convolutional filters, and the second set of convolutional layers comprises multiple sequentially connected convolutional layers.
13. The image coding system according to claim 11, wherein the high-level semantic information comprises instance segmentation map information or stick figure map information; the instance segmentation map information is used for at least one of an instance segmentation task, an image classification task and an object detection task of the original image, and the stick figure map information is used for a human body pose recognition task of the original image.
14. An image decoding method, comprising:

acquiring a first code stream corresponding to high-level semantic information of an original image, and a second code stream corresponding to low-level features of the original image;

decoding the first code stream to obtain the high-level semantic information, the high-level semantic information being used to perform an image analysis task of the original image;

decoding the second code stream to obtain the low-level features; and

performing image reconstruction on the original image according to the low-level features and the high-level semantic information to obtain a predicted image.
15. The image decoding method according to claim 14, further comprising:

acquiring a third code stream corresponding to difference information between the predicted image and the original image;

decoding the third code stream to obtain the difference information; and

performing image enhancement processing on the predicted image according to the difference information to obtain an enhanced image for display.
16. An image decoding system, comprising:

a first decoder, configured to decode a first code stream corresponding to high-level semantic information of an original image to obtain the high-level semantic information, the high-level semantic information being used to perform an image analysis task of the original image;

a second decoder, configured to decode a second code stream corresponding to low-level features of the original image to obtain the low-level features; and

a predictor, configured to perform image reconstruction on the original image according to the low-level features and the high-level semantic information to obtain a predicted image.
17. The image decoding system according to claim 16, further comprising:

a third decoder, configured to decode a third code stream corresponding to difference information between the predicted image and the original image, to obtain the difference information; and

an image enhancer, configured to perform image enhancement processing on the predicted image according to the difference information, so as to obtain an enhanced image for display.
18. An image reconstruction method, comprising:

acquiring high-level semantic information and low-level features of an original image;

performing image reconstruction on the original image according to the low-level features and the high-level semantic information to obtain a predicted image; and

acquiring difference information between the predicted image and the original image, the difference information being used to enhance the predicted image.
19. An image analysis method, comprising:

acquiring, based on an image analysis task of an original image, a target code stream from multiple code streams, the multiple code streams comprising at least one first code stream corresponding to at least one piece of high-level semantic information of the original image, and a second code stream corresponding to low-level features of the original image; wherein each piece of high-level semantic information of the original image corresponds to one first code stream, and the target code stream is the first code stream, among the at least one first code stream, whose high-level semantic information is applicable to the image analysis task;

decoding the target code stream to obtain the high-level semantic information applicable to the image analysis task; and

performing the image analysis task according to the decoded high-level semantic information.
20. An electronic device, comprising at least one memory and at least one processor, wherein the memory stores one or more computer-executable instructions, and the processor invokes the one or more computer-executable instructions to perform the image coding method according to any one of claims 1-9, or the image decoding method according to any one of claims 14-15, or the image reconstruction method according to claim 18, or the image analysis method according to claim 19.
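As an informal illustration of the predictor structure recited in the claims (upsampler, stacker, convolutional network, adder), the data flow can be sketched as below. This is a simplified sketch and not the claimed network: a single zero-padded 3x3 convolution stands in for the first set of convolutional layers, the filter banks and the second set of convolutional layers; `tanh` is an assumed activation; the kernel weights are placeholders that a real system would learn; and the convolution features are added to the upsampled low-level features so that the shapes match, which is also an assumption.

```python
import math

def upsample(feat, factor=2):
    # Nearest-neighbour upsampling of a 2-D feature map.
    return [[feat[y // factor][x // factor] for x in range(len(feat[0]) * factor)]
            for y in range(len(feat) * factor)]

def conv3x3(channels, kernels, bias=0.0):
    # Zero-padded 3x3 convolution from several input channels to one output map.
    h, w = len(channels[0]), len(channels[0][0])
    out = [[bias] * w for _ in range(h)]
    for chan, k in zip(channels, kernels):
        for y in range(h):
            for x in range(w):
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        yy, xx = y + dy, x + dx
                        if 0 <= yy < h and 0 <= xx < w:
                            out[y][x] += chan[yy][xx] * k[dy + 1][dx + 1]
    return out

def predict(low, sem, kernels):
    target = upsample(low)                    # upsampler: target features
    stacked = [target, sem]                   # stacker: channel-wise stacking
    conv = conv3x3(stacked, kernels)          # stand-in for the convolutional network
    conv = [[math.tanh(v) for v in row] for row in conv]  # activation function
    # Adder: convolution features plus (upsampled) low-level features.
    return [[t + c for t, c in zip(trow, crow)] for trow, crow in zip(target, conv)]

zero = [[0.0] * 3 for _ in range(3)]
ident = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
low = [[1.0]]                # one low-level feature value
sem = [[1, 0], [0, 0]]       # semantic map at full resolution
print(predict(low, sem, [zero, ident]))
```

With all-zero kernels the sketch reduces to plain upsampling, which makes the role of the added convolution features easy to see: they carry the semantics-guided correction on top of the coarse low-level features.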